Stochastic approximation deals with the problem of finding the zeros of a function expressed as an expectation
of a random variable. In this thesis we propose convergent algorithms for problems in
optimization, statistical estimation and reinforcement learning. We first formulate these problems
in the stochastic approximation setting. Subsequently, we utilize the ordinary differential equation (ODE)
based analysis of stochastic approximation algorithms to prove the convergence of our proposed algorithms.
Additionally, we explore second-order methods in the context of Markov Decision Processes.
Through experimental evaluations, we validate our theoretical results.
First, we consider a Stochastic Optimization (SO) problem where the aim is to optimize a given
objective function in the presence of noise. Most solution techniques in SO estimate gradients
from noise-corrupted observations of the objective and adjust the parameters of the objective along the
direction of the estimated gradients to obtain locally optimal solutions. Two prominent algorithms in
SO, namely Random Direction Kiefer-Wolfowitz (RDKW) and Simultaneous Perturbation Stochastic
Approximation (SPSA), obtain noisy gradient estimates by randomly perturbing all the parameters
simultaneously. This forces the search direction to be random in these algorithms and causes them
to suffer additional noise on top of the noise incurred from the samples of the objective. Owing to
this additional noise, the idea of using deterministic perturbations instead of random perturbations for
gradient estimation has also been studied in the past. Specifically, two constructions of the deterministic
perturbation sequence using lexicographical ordering and Hadamard matrices have been explored
in the context of SPSA algorithms and encouraging results have been reported in the literature. In
our work, we characterize the class of deterministic perturbation sequences that can be utilized in the
RDKW algorithm. This class expands the set of known deterministic perturbation sequences available
in the literature. Using our characterization, we propose a construction of a deterministic perturbation
sequence that has the least possible cycle length among all deterministic perturbations. We establish
the convergence of the RDKW algorithm for the generalized class of deterministic perturbations utilizing
stochastic approximation techniques. Through simulations, we illustrate the performance gain
of the proposed deterministic perturbation sequence in the RDKW algorithm over the Hadamard and
the random perturbation counterparts.
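As an illustration of this setting, the following minimal Python sketch implements a two-measurement RDKW-style gradient estimate driven by a cyclic deterministic perturbation sequence built from the rows of a Sylvester Hadamard matrix. The toy objective f_noisy, the step sizes, and the Hadamard-based sequence are illustrative assumptions corresponding to the existing constructions discussed above; they are not the new minimal-cycle-length sequence proposed in the thesis.

```python
# Illustrative sketch (not the thesis construction): RDKW-style gradient
# estimation with a cyclic deterministic perturbation sequence.
import numpy as np

def hadamard_perturbations(dim):
    """Rows of a {+1, -1} Sylvester Hadamard matrix of order >= dim, truncated to dim columns."""
    H = np.array([[1]])
    while H.shape[0] < dim:
        H = np.block([[H, H], [H, -H]])
    return H[:, :dim]

def rdkw_step(theta, f_noisy, delta, c=0.1, a=0.01):
    """One RDKW-style update: two noisy measurements give a gradient estimate along delta."""
    y_plus = f_noisy(theta + c * delta)                 # noisy measurement at theta + c*delta
    y_minus = f_noisy(theta - c * delta)                # noisy measurement at theta - c*delta
    g_hat = delta * (y_plus - y_minus) / (2.0 * c)      # gradient estimate along delta
    return theta - a * g_hat                            # descent step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_noisy = lambda x: float(np.sum(x ** 2)) + 0.1 * rng.standard_normal()  # toy noisy objective
    theta = np.ones(4)
    perturbations = hadamard_perturbations(4)
    for k in range(400):
        delta = perturbations[k % len(perturbations)]   # deterministic, cyclic perturbation
        theta = rdkw_step(theta, f_noisy, delta)
    print(theta)   # moves toward the minimizer (the origin) of the toy objective
```

Replacing the cyclic deterministic sequence with independent random sign vectors recovers the usual randomized RDKW/SPSA gradient estimate.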
Next, we consider the problem of estimating the mode of a density function utilizing only the
samples from this density. The mode is a popular measure of central tendency that often provides a better
representation of, and more interesting insights into, the data than other measures such as the mean and
the median. If the analytical form of the density function is known, the mode is a point at which the density
attains its maximum value, and one can apply optimization techniques to find it.
In many practical applications, however, the analytical form of the density is not known and only
samples from the distribution are available. Most of the techniques proposed in the literature for
estimating the mode from data samples assume that all the samples are available beforehand. Moreover,
some of the techniques employ computationally expensive operations such as sorting. In our work,
we provide a computationally efficient, online, iterative algorithm that estimates the mode of a unimodal
smooth density given only the samples generated from the density. We prove the asymptotic convergence of
the proposed algorithm using an ODE-based analysis. We
also prove the stability of the estimates by utilizing the concept of regularization. Experimental results
further demonstrate the effectiveness of our proposed algorithm.
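For concreteness, the sketch below illustrates one simple online, stochastic-approximation style mode-seeking update in the scalar case: each incoming sample nudges the current estimate along the gradient of a kernel-smoothed density. The Gaussian kernel, the fixed bandwidth, and the step-size schedule are illustrative assumptions; this is not necessarily the exact update analyzed in the thesis.

```python
# Illustrative sketch of an online mode estimate driven by samples only.
import math
import random

def mode_estimate(samples, h=0.3, a0=0.5):
    """Online SA-style mode estimate: stochastic gradient ascent on a kernel-smoothed density."""
    samples = iter(samples)
    theta = next(samples)                                   # initialise at the first sample
    for k, x in enumerate(samples, start=1):
        a_k = a0 / (k ** 0.7)                               # diminishing steps: sum a_k = inf, sum a_k^2 < inf
        w = math.exp(-((theta - x) ** 2) / (2.0 * h * h))   # Gaussian kernel weight
        theta += a_k * w * (x - theta) / (h * h)            # move toward high-density regions
    return theta

if __name__ == "__main__":
    random.seed(1)
    data = (random.gauss(2.0, 1.0) for _ in range(50000))   # unimodal toy density, mode at 2.0
    print(mode_estimate(data))                              # typically close to 2.0
```

The averaged drift of this update is the derivative of the kernel-smoothed density, so the associated ODE has the smoothed mode as its stable equilibrium, in the spirit of the ODE-based analysis mentioned above.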
In the third part of our thesis, we propose a learning algorithm to solve a reinforcement learning
problem. In a discounted reward Markov Decision Process (MDP), the objective is to find the optimal
value function, i.e., the value function corresponding to an optimal policy. This problem reduces to
solving a functional equation known as the Bellman equation, and a fixed-point iteration scheme known
as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation based
value iteration scheme has been proposed to speed up the computation of the optimal value function. The
speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to
the optimal value function. However, in many practical applications, the model information is not
known and we resort to reinforcement learning algorithms to estimate the optimal policy and value
function. One such popular algorithm is Q-learning. In our work, we propose a novel Successive
Over-Relaxation (SOR) Q-learning algorithm. We first derive a modified fixed-point iteration for
the SOR Q-values and utilize stochastic approximation to obtain a learning algorithm that computes the
optimal value function and an optimal policy. We then prove the almost sure convergence of the SOR
Q-learning algorithm to the SOR Q-values. Through numerical experiments, we show that our proposed
SOR Q-learning algorithm converges faster than the standard Q-learning algorithm.
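For illustration, the following tabular sketch shows an SOR-style Q-learning update with a relaxation factor w; setting w = 1 recovers the standard Q-learning target. The toy MDP, the uniform exploration scheme, and the choice w = 1.2 are illustrative assumptions (in the SOR setting, the admissible range of w depends on the self-transition probabilities of the MDP), not the exact experimental setup of the thesis.

```python
# Illustrative sketch of a tabular SOR-style Q-learning update.
import numpy as np

def sor_q_update(Q, s, a, r, s_next, alpha, gamma=0.9, w=1.2):
    """One SOR-style Q-learning step; w = 1 gives the standard Q-learning target."""
    target = w * (r + gamma * np.max(Q[s_next])) + (1.0 - w) * np.max(Q[s])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a = 2, 2
    # Toy MDP with sizeable self-transition probabilities (the regime where SOR helps).
    P = np.array([[[0.9, 0.1], [0.6, 0.4]],
                  [[0.3, 0.7], [0.2, 0.8]]])               # P[s, a, s']
    R = np.array([[1.0, 0.0], [0.5, 2.0]])                 # R[s, a]
    Q, s = np.zeros((n_s, n_a)), 0
    for k in range(1, 50001):
        a = int(rng.integers(n_a))                         # uniform exploratory behaviour policy
        s_next = int(rng.choice(n_s, p=P[s, a]))
        Q = sor_q_update(Q, s, a, R[s, a], s_next, alpha=1.0 / k ** 0.6)
        s = s_next
    print(np.argmax(Q, axis=1))                            # greedy policy w.r.t. the learned Q-values
```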
Finally, we explore second-order methods in Markov Decision Processes (MDPs). Value iteration
is a fixed-point iteration technique utilized to obtain the optimal value function and policy in a discounted
reward MDP. Here, a contraction operator is constructed and applied recursively to arrive at
the optimal solution. Value iteration is a first-order method and may therefore take a large number of
iterations to converge to the optimal solution. As discussed above, successive relaxation is a popular
technique that can be applied to solve a fixed-point equation. It has been shown in the literature that, under a special structure of the MDP, the successive over-relaxation technique computes the optimal
value function faster than standard value iteration. In our work, we propose a second-order value iteration
procedure that is obtained by applying the Newton-Raphson method to the successive relaxation
value iteration scheme. We prove the global asymptotic convergence of our algorithm to the optimal solution
and establish its second-order convergence. Through experiments, we then demonstrate
the effectiveness of our proposed approach.
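To illustrate the flavour of such a second-order scheme, the sketch below applies a Newton-Raphson step to the Bellman fixed-point equation F(V) = T(V) - V = 0. For simplicity it uses the standard Bellman operator rather than the successive relaxation operator treated in the thesis; in this plain form the Newton iteration coincides with policy iteration. The toy transition array P[a, s, s'] and reward array R[s, a] are hypothetical.

```python
# Illustrative sketch: Newton-Raphson applied to the Bellman fixed-point equation.
import numpy as np

def newton_value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=100):
    n_a, n_s, _ = P.shape
    V = np.zeros(n_s)
    for _ in range(max_iter):
        Q = R + gamma * np.einsum("asx,x->sa", P, V)       # Q(s, a) under the current V
        pi = np.argmax(Q, axis=1)                          # greedy policy (active maximiser)
        residual = Q[np.arange(n_s), pi] - V               # F(V) = T(V) - V
        if np.max(np.abs(residual)) < tol:
            break
        P_pi = P[pi, np.arange(n_s), :]                    # transition matrix of the greedy policy
        # Newton step: V <- V - (gamma*P_pi - I)^{-1} F(V) = V + (I - gamma*P_pi)^{-1} F(V)
        V = V + np.linalg.solve(np.eye(n_s) - gamma * P_pi, residual)
    return V, pi

if __name__ == "__main__":
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],                # P[a, s, s']
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[1.0, 0.0], [0.0, 2.0]])                 # R[s, a]
    V, pi = newton_value_iteration(P, R)
    print(V, pi)                                           # optimal value function and a greedy policy
```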