Stochastic Approximation with Markov Noise: Analysis and Applications in Reinforcement Learning
Abstract
Stochastic approximation algorithms are sequential non-parametric methods for finding a zero
or minimum of a function in situations where only noisy observations of the function
values are available. Two time-scale stochastic approximation algorithms consist of two coupled
recursions that are updated with different step sizes (one considerably smaller than the other),
which in turn facilitates convergence of such algorithms.
We present, for the first time, an asymptotic convergence analysis of two time-scale stochastic
approximation driven by 'controlled' Markov noise. In particular, the faster and slower
recursions have non-additive controlled Markov noise components in addition to martingale
difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting
differential inclusions in both time scales that are defined in terms of the ergodic occupation
measures associated with the controlled Markov processes.
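A minimal sketch of such coupled recursions, in illustrative notation (the driving maps $h$, $g$, the controlled Markov processes $Z^{(1)}_n$, $Z^{(2)}_n$, the martingale difference sequences $M^{(1)}_{n+1}$, $M^{(2)}_{n+1}$, and the step sizes $a(n)$, $b(n)$ are placeholders, not fixed by this abstract):
\[
x_{n+1} = x_n + a(n)\big[h(x_n, y_n, Z^{(1)}_n) + M^{(1)}_{n+1}\big], \qquad
y_{n+1} = y_n + b(n)\big[g(x_n, y_n, Z^{(2)}_n) + M^{(2)}_{n+1}\big],
\]
where $b(n)/a(n) \to 0$, so that the $x$-recursion evolves on the faster time scale and the $y$-recursion on the slower one.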
Using a special case of our results, we present a solution to the off-policy convergence
problem for temporal-difference learning with linear function approximation.
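As a concrete illustration of this setting (a standard off-policy TD(0) update with importance sampling; the feature map $\phi$, discount factor $\gamma$, and target and behavior policies $\pi$ and $\mu$ are generic placeholders, not the exact algorithm analyzed here):
\[
\delta_n = r_{n+1} + \gamma\,\theta_n^\top \phi(s_{n+1}) - \theta_n^\top \phi(s_n), \qquad
\theta_{n+1} = \theta_n + a(n)\,\rho_n\,\delta_n\,\phi(s_n),
\]
where $\rho_n = \pi(a_n \mid s_n)/\mu(a_n \mid s_n)$ is the importance sampling ratio that corrects for transitions being sampled under the behavior policy $\mu$.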
One of the important assumptions in the earlier analysis is the point-wise boundedness (also
called 'stability') of the iterates. However, finding sufficient verifiable conditions for this is
very hard when the noise is Markov, as well as when there are multiple timescales. We compile
several aspects of the dynamics of stochastic approximation algorithms with Markov iterate-dependent
noise when the iterates are not known to be stable beforehand. We achieve this
by extending the lock-in probability framework (i.e., the probability of convergence to a specific attractor
of the limiting o.d.e. given that the iterates are in its domain of attraction after a sufficiently
large number of iterations, say $n_0$) to such recursions. Specifically, under the more
restrictive assumption of Markov iterate-dependent noise supported on a bounded subset of
Euclidean space, we give a lower bound for the lock-in probability. We use these results to prove
almost sure convergence of the iterates to the specified attractor when the iterates satisfy an
'asymptotic tightness' condition. This, in turn, is shown to be useful in analyzing the tracking
ability of general 'adaptive' algorithms. Additionally, we show that our results can be used to
derive a sample complexity estimate for such recursions, which can then be used for step-size
selection.
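In symbols, with $A$ an attractor of the limiting o.d.e., $B$ an open set contained in its domain of attraction, and $x_n$ the iterates (notation ours, for illustration only), the lock-in probability is
\[
P\big(x_n \to A \text{ as } n \to \infty \;\big|\; x_{n_0} \in B\big),
\]
and almost sure convergence follows when such lower bounds approach $1$ as $n_0 \to \infty$, in combination with the asymptotic tightness condition.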
Finally, we obtain the first informative error bounds on function approximation for the
policy evaluation algorithm proposed by Basu et al. when the aim is to find the risk-sensitive
cost represented using exponential utility. We also give examples where all our bounds achieve
the 'actual error', whereas the earlier bound given by Basu et al. is much weaker in comparison.
We show that this happens due to the absence of a difference term in the earlier bound, a term that is
always present in all our bounds when the state space is large. Additionally, we discuss how all our bounds compare with each other.
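For reference, the risk-sensitive cost under exponential utility referred to above is typically of the form (notation illustrative; $c$ denotes the per-stage cost and $X_m$ the state process):
\[
\limsup_{n \to \infty} \frac{1}{n} \log E\Big[\exp\Big(\sum_{m=0}^{n-1} c(X_m)\Big)\Big].
\]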