Online Learning and Simulation Based Algorithms for Stochastic Optimization
Abstract
In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic such as a longrun average over some random cost samples. In such cases finding the gradient of the objective is not possible. It is in this setting that stochastic approximation algorithms are used. These algorithms use some estimates of the gradient and are stochastic in nature. Amongst gradient estimation techniques, Simultaneous Perturbation Stochastic Approximation (SPSA) and Smoothed Functional(SF) scheme are widely used. In this thesis we have proposed a novel multitime scale quasiNewton based smoothed functional (QNSF) algorithm for unconstrained as well as constrained optimization. The algorithm uses the smoothed functional scheme for estimating the gradient and the quasiNewton method to solve the optimization problem. The algorithm is shown to converge with probability one.
We have also provided here experimental results on the problem of optimal routing in a multistage network of queues. Policies like Join the Shortest Queue or Least Work Left assume knowledge of the queue length values that can change rapidly or hard to estimate. If the only information available is the expected endtoend delay as with our case, such policies cannot be used. The QNSF based probabilistic routing algorithm uses only the total endtoend delay for tuning the probabilities. We observe from the experiments that the QNSF algorithm has better performance than the gradient and Jacobi versions of Newton based smoothed functional algorithms. Next we consider constrained routing in a similar queueing network. We extend the QNSF algorithm to this case. We study the convergence behavior of the algorithm and observe that the constraints are satisfied at the point of convergence. We provide experimental results for the constrained routing setup as well.
Next we study reinforcement learning algorithms which are useful for solving Markov Decision Process(MDP) when the precise information on transition probabilities is not known. When the state, and action sets are very large, it is not possible to store all the stateaction tuples. In such cases, function approximators like neural networks have been used. The popular Qlearning algorithm is known to diverge when used with linear function approximation due to the ’offpolicy’ problem. Hence developing stable learning algorithms when used with function approximation is an important problem.
We present in this thesis a variant of Qlearning with linear function approximation that is based on twotimescale stochastic approximation. The Qvalue parameters for a given policy in our algorithm are updated on the slower timescale while the policy parameters themselves are updated on the faster scale. We perform a gradient search in the space of policy parameters. Since the objective function and hence the gradient are not analytically known, we employ the efficient onesimulation simultaneous perturbation stochastic approximation(SPSA) gradient estimates that employ Hadamard matrix based deterministic perturbations. Our algorithm has the advantage that, unlike Qlearning, it does not suffer from high oscillations due to the offpolicy problem when using function approximators. Whereas it is difficult to prove convergence of regular Qlearning with linear function approximation because of the offpolicy problem, we prove that our algorithm which is onpolicy is convergent. Numerical results on a multistage stochastic shortest path problem show that our algorithm exhibits significantly better performance and is more robust as compared to Qlearning. Future work would be to compare it with other policybased reinforcement learning algorithms.
Finally, we develop an online actorcritic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the longrun average cost Markov decision process(MDP) framework in which both the objective and the constraint functions are suitable policydependent longrun averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multistage queueing network with constraints on longrun average queue lengths. We observe that our algorithm exhibits good performance on this setting and converges to a feasible point.
Collections
Related items
Showing items related by title, author, creator and subject.

Image Reconstruction Based On Hilbert And Hybrid Filtered Algorithms With Inverse Distance Weight And No Backprojection Weight
Narasimhadhan, A V (20140716)Filtered backprojection (FBP) reconstruction algorithms are very popular in the field of Xray computed tomography (CT) because they give advantages in terms of the numerical accuracy and computational complexity. Ramp ... 
Generalizations Of The Quantum Search Algorithm
Tulsi, Tathagat Avatar (20101203)Quantum computation has attracted a great deal of attention from the scientific community in recent years. By using the quantum mechanical phenomena of superposition and entanglement, a quantum computer can solve certain ... 
Performance Evaluation Of Fanbeam And Conebeam Reconstruction Algorithms With No Backprojection Weight On Truncated Data Problems
Sumith, K (20140716)This work focuses on using the linear prediction based projection completion for the fanbeam and conebeam reconstruction algorithm with no backprojection weight. The truncated data problems are addressed in the computed ...