|Optimization is a very important field with diverse applications in physical, social and biological sciences and in various areas of engineering. It appears widely in ma-chine learning, information retrieval, regression, estimation, operations research and a wide variety of computing domains. The subject is being deeply studied both theoretically and experimentally and several algorithms are available in the literature. These algorithms which can be executed (sequentially or concurrently) on a computing machine explore the space of input parameters to seek high quality solutions to the optimization problem with the search mostly guided by certain structural properties of the objective function. In certain situations, the setting might additionally demand for “absolute optimum” or solutions close to it, which makes the task even more challenging.
In this thesis, we propose an optimization algorithm which is “gradient-free”, i.e., does not employ any knowledge of the gradient or higher order derivatives of the objective function, rather utilizes objective function values themselves to steer the search. The proposed algorithm is particularly effective in a black-box setting, where a closed-form expression of the objective function is unavailable and gradient or higher-order derivatives are hard to compute or estimate. Our algorithm is inspired by the well known cross entropy (CE) method. The CE method is a model based search method to solve continuous/discrete multi-extremal optimization problems, where the objective function has minimal structure. The proposed method seeks, in the statistical manifold of the parameters which identify the probability distribution/model defined over the input space to find the degenerate distribution concentrated on the global optima (assumed to be finite in quantity). In the early part of the thesis, we propose a novel stochastic approximation version of the CE method to the unconstrained optimization problem, where the objective function is real-valued and deterministic. The basis of the algorithm is a stochastic process of model parameters which is probabilistically dependent on the past history, where we reuse all the previous samples obtained in the process till the current instant based on discounted averaging. This approach can save the overall computational and storage cost. Our algorithm is incremental in nature and possesses attractive features such as stability, computational and storage efficiency and better accuracy. We further investigate, both theoretically and empirically, the asymptotic behaviour of the algorithm and find that the proposed algorithm exhibits global optimum convergence for a particular class of objective functions.
Further, we extend the algorithm to solve the simulation/stochastic optimization problem. In stochastic optimization, the objective function possesses a stochastic characteristic, where the underlying probability distribution in most cases is hard to comprehend and quantify. This begets a more challenging optimization problem, where the ostentatious nature is primarily due to the hardness in computing the objective function values for various input parameters with absolute certainty. In this case, one can only hope to obtain noise corrupted objective function values for various input parameters. Settings of this kind can be found in scenarios where the objective function is evaluated using a continuously evolving dynamical system or through a simulation. We propose a multi-timescale stochastic approximation algorithm, where we integrate an additional timescale to accommodate the noisy measurements and decimate the eﬀects of the gratuitous noise asymptotically. We found that if the objective function and the noise involved in the measurements are well behaved and the timescales are compatible, then our algorithm can generate high quality solutions.
In the later part of the thesis, we propose algorithms for reinforcement learning/Markov decision processes using the optimization techniques we developed in the early stage. MDP can be considered as a generalized framework for modelling planning under uncertainty. We provide a novel algorithm for the problem of prediction in reinforcement learning, i.e., estimating the value function of a given stationary policy of a model free MDP (with large state and action spaces) using the linear function approximation architecture. Here, the value function is defined as the long-run average of the discounted transition costs. The resource requirement of the proposed method in terms of computational and storage cost scales quadratically in the size of the feature set. The algorithm is an adaptation of the multi-timescale variant of the CE method proposed in the earlier part of the thesis for simulation optimization. We also provide both theoretical and empirical evidence to corroborate the credibility and effectiveness of the approach.
In the final part of the thesis, we consider a modified version of the control problem in a model free MDP with large state and action spaces. The control problem most commonly addressed in the literature is to find an optimal policy which maximizes the value function, i.e., the long-run average of the discounted transition payoffs. The contemporary methods also presume access to a generative model/simulator of the MDP with the hidden premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. In this thesis, we consider a modified version, where the cost function to be optimized is a real-valued performance function (possibly non-convex) of the value function. Additionally, one has to seek the optimal policy without presuming access to the generative model. In this thesis, we propose a stochastic approximation algorithm for this peculiar control problem. The only information, we presuppose, available to the algorithm is the sample trajectory generated using a priori chosen behaviour policy. The algorithm is data (sample trajectory) efficient, stable, robust as well as computationally and storage efficient. We provide a proof of convergence of our algorithm to a high performing policy relative to the behaviour policy.