Stochastic approximation deals with the problem of finding the zeros of a function expressed as an expectation
of a random variable. In this thesis we propose convergent algorithms for problems in
optimization, statistical estimation and reinforcement learning. We first formulate these problems
in the stochastic approximation setting. Subsequently, we utilize the ordinary differential equation (ODE)
based analysis of stochastic approximation algorithms to prove the convergence of our proposed algorithms.
Additionally, we explore second-order methods in the context of Markov Decision Processes.
Through experimental evaluations, we validate our theoretical results.
First, we consider a Stochastic Optimization (SO) problem where the aim is to optimize a given
objective function in the presence of noise. Most solution techniques in SO estimate gradients
from noise-corrupted observations of the objective and adjust the parameters of the objective along the
direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in
SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic
Approximation (SPSA), obtain noisy gradient estimates by randomly perturbing all the parameters
simultaneously. This forces the search direction to be random in these algorithms and causes them
to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to
this additional noise, the idea of using deterministic perturbations instead of random perturbations for
gradient estimation has also been studied in the past. Specifically, two constructions of the deterministic
perturbation sequence using lexicographical ordering and Hadamard matrices have been explored
in the context of SPSA algorithms and encouraging results have been reported in the literature. In
our work, we characterize the class of deterministic perturbation sequences that can be utilized in the
RDKW algorithm. This class expands the set of known deterministic perturbation sequences available
in the literature. Using our characterization, we propose a construction of a deterministic perturbation
sequence that has the least possible cycle length among all deterministic perturbations. We establish
the convergence of the RDKW algorithm for the generalized class of deterministic perturbations utilizing
stochastic approximation techniques. Through simulations, we illustrate the performance gain
of the proposed deterministic perturbation sequence in the RDKW algorithm over the Hadamard and
the random perturbation counterparts.
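As an illustration of this setting, the following minimal Python sketch implements a two-measurement RDKW-style gradient estimate driven by a cyclic deterministic perturbation sequence built from the rows of a Sylvester Hadamard matrix. The toy objective f_noisy, the step sizes, and the Hadamard-based sequence are illustrative assumptions corresponding to the existing constructions discussed above; they are not the new minimal-cycle-length sequence proposed in the thesis.

```python
# Illustrative sketch (not the thesis construction): RDKW-style gradient
# estimation with a cyclic deterministic perturbation sequence.
import numpy as np

def hadamard_perturbations(dim):
    """Rows of a {+1, -1} Sylvester Hadamard matrix of order >= dim, truncated to dim columns."""
    H = np.array([[1]])
    while H.shape[0] < dim:
        H = np.block([[H, H], [H, -H]])
    return H[:, :dim]

def rdkw_step(theta, f_noisy, delta, c=0.1, a=0.01):
    """One RDKW-style update: two noisy measurements give a gradient estimate along delta."""
    y_plus = f_noisy(theta + c * delta)                 # noisy measurement at theta + c*delta
    y_minus = f_noisy(theta - c * delta)                # noisy measurement at theta - c*delta
    g_hat = delta * (y_plus - y_minus) / (2.0 * c)      # gradient estimate along delta
    return theta - a * g_hat                            # descent step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_noisy = lambda x: float(np.sum(x ** 2)) + 0.1 * rng.standard_normal()  # toy noisy objective
    theta = np.ones(4)
    perturbations = hadamard_perturbations(4)
    for k in range(400):
        delta = perturbations[k % len(perturbations)]   # deterministic, cyclic perturbation
        theta = rdkw_step(theta, f_noisy, delta)
    print(theta)   # moves toward the minimizer (the origin) of the toy objective
```

Replacing the cyclic deterministic sequence with independent random sign vectors recovers the usual randomized RDKW/SPSA gradient estimate.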
Next, we consider the problem of estimating the mode of a density function utilizing only the
samples from this density. The mode is a popular measure of central tendency that often provides a better
representation of, and more interesting insights into, the data than other measures such as the mean and
the median. If the analytical form of the density function is known, the mode is a point at which the density
attains its maximum value, and one can apply optimization techniques to find it.
In many practical applications, however, the analytical form of the density is not known and only
samples from the distribution are available. Most of the techniques proposed in the literature for
estimating the mode from data samples assume that all the samples are available beforehand. Moreover,
some of the techniques employ computationally expensive operations such as sorting. In our work,
we provide a computationally efficient, online, iterative algorithm that estimates the mode of a unimodal
smooth density given only the samples generated from the density. We prove the asymptotic convergence of
the proposed algorithm using an ODE-based analysis. We
also prove the stability of the estimates by utilizing the concept of regularization. Experimental results
further demonstrate the effectiveness of our proposed algorithm.
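For concreteness, the sketch below illustrates one simple online, stochastic-approximation style mode-seeking update in the scalar case: each incoming sample nudges the current estimate along the gradient of a kernel-smoothed density. The Gaussian kernel, the fixed bandwidth, and the step-size schedule are illustrative assumptions; this is not necessarily the exact update analyzed in the thesis.

```python
# Illustrative sketch of an online mode estimate driven by samples only.
import math
import random

def mode_estimate(samples, h=0.3, a0=0.5):
    """Online SA-style mode estimate: stochastic gradient ascent on a kernel-smoothed density."""
    samples = iter(samples)
    theta = next(samples)                                   # initialise at the first sample
    for k, x in enumerate(samples, start=1):
        a_k = a0 / (k ** 0.7)                               # diminishing steps: sum a_k = inf, sum a_k^2 < inf
        w = math.exp(-((theta - x) ** 2) / (2.0 * h * h))   # Gaussian kernel weight
        theta += a_k * w * (x - theta) / (h * h)            # move toward high-density regions
    return theta

if __name__ == "__main__":
    random.seed(1)
    data = (random.gauss(2.0, 1.0) for _ in range(50000))   # unimodal toy density, mode at 2.0
    print(mode_estimate(data))                              # typically close to 2.0
```

The averaged drift of this update is the derivative of the kernel-smoothed density, so the associated ODE has the smoothed mode as its stable equilibrium, in the spirit of the ODE-based analysis mentioned above.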
In the third part of our thesis, we propose a learning algorithm to solve a reinforcement learning
problem. In a discounted reward Markov Decision Process (MDP), the objective is to find the optimal
value function, i.e., the value function corresponding to an optimal policy. This problem reduces to
solving a functional equation known as the Bellman equation, and a fixed-point iteration scheme known
as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation based
value iteration scheme has been proposed to speed up the computation of the optimal value function. The
speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to
the optimal value function. However, in many practical applications, the model information is not
known and we resort to reinforcement learning algorithms to estimate the optimal policy and value
function. One such popular algorithm is Q-learning. In our work, we propose a novel Successive
Over-Relaxation (SOR) Q-learning algorithm. We first derive a modified fixed-point iteration for
the SOR Q-values and utilize stochastic approximation to obtain a learning algorithm that computes the
optimal value function and an optimal policy. We then prove the almost sure convergence of the SOR
Q-learning algorithm to the SOR Q-values. Through numerical experiments, we show that our proposed
SOR Q-learning algorithm converges faster than the standard Q-learning algorithm.
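For illustration, the following tabular sketch shows an SOR-style Q-learning update with a relaxation factor w; setting w = 1 recovers the standard Q-learning target. The toy MDP, the uniform exploration scheme, and the choice w = 1.2 are illustrative assumptions (in the SOR setting, the admissible range of w depends on the self-transition probabilities of the MDP), not the exact experimental setup of the thesis.

```python
# Illustrative sketch of a tabular SOR-style Q-learning update.
import numpy as np

def sor_q_update(Q, s, a, r, s_next, alpha, gamma=0.9, w=1.2):
    """One SOR-style Q-learning step; w = 1 gives the standard Q-learning target."""
    target = w * (r + gamma * np.max(Q[s_next])) + (1.0 - w) * np.max(Q[s])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a = 2, 2
    # Toy MDP with sizeable self-transition probabilities (the regime where SOR helps).
    P = np.array([[[0.9, 0.1], [0.6, 0.4]],
                  [[0.3, 0.7], [0.2, 0.8]]])               # P[s, a, s']
    R = np.array([[1.0, 0.0], [0.5, 2.0]])                 # R[s, a]
    Q, s = np.zeros((n_s, n_a)), 0
    for k in range(1, 50001):
        a = int(rng.integers(n_a))                         # uniform exploratory behaviour policy
        s_next = int(rng.choice(n_s, p=P[s, a]))
        Q = sor_q_update(Q, s, a, R[s, a], s_next, alpha=1.0 / k ** 0.6)
        s = s_next
    print(np.argmax(Q, axis=1))                            # greedy policy w.r.t. the learned Q-values
```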
Finally, we explore second-order methods in Markov Decision Processes (MDPs). Value iteration
is a fixed-point iteration technique utilized to obtain the optimal value function and policy in a discounted
reward MDP. Here, a contraction operator is constructed and applied recursively to arrive at
the optimal solution. Value iteration is a first-order method and may therefore take a large number of
iterations to converge to the optimal solution. As discussed above, successive relaxation is a popular
technique that can be applied to solve a fixed-point equation. It has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal
value function faster than standard value iteration. In our work, we propose a second-order value iteration
procedure that is obtained by applying the Newton-Raphson method to the successive relaxation
value iteration scheme. We prove the global asymptotic convergence of our algorithm to the optimal solution
and establish its second-order convergence. Through experiments, we then demonstrate
the effectiveness of our proposed approach.
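To illustrate the flavour of such a second-order scheme, the sketch below applies a Newton-Raphson step to the Bellman fixed-point equation F(V) = T(V) - V = 0. For simplicity it uses the standard Bellman operator rather than the successive relaxation operator treated in the thesis; in this plain form the Newton iteration coincides with policy iteration. The toy transition array P[a, s, s'] and reward array R[s, a] are hypothetical.

```python
# Illustrative sketch: Newton-Raphson applied to the Bellman fixed-point equation.
import numpy as np

def newton_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=100):
    n_a, n_s, _ = P.shape
    V = np.zeros(n_s)
    for _ in range(max_iter):
        Q = R + gamma * np.einsum("asx,x->sa", P, V)       # Q(s, a) under the current V
        pi = np.argmax(Q, axis=1)                          # greedy policy (active maximiser)
        residual = Q[np.arange(n_s), pi] - V               # F(V) = T(V) - V
        if np.max(np.abs(residual)) < tol:
            break
        P_pi = P[pi, np.arange(n_s), :]                    # transition matrix of the greedy policy
        # Newton step: V <- V - (gamma*P_pi - I)^{-1} F(V) = V + (I - gamma*P_pi)^{-1} F(V)
        V = V + np.linalg.solve(np.eye(n_s) - gamma * P_pi, residual)
    return V, pi

if __name__ == "__main__":
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],                # P[a, s, s']
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[1.0, 0.0], [0.0, 2.0]])                 # R[s, a]
    V, pi = newton_value_iteration(P, R)
    print(V, pi)                                           # optimal value function and a greedy policy
```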