## Novel First-order Algorithms for Non-smooth Optimization Problems in Machine Learning

dc.contributor.advisor | Bhattacharyya, Chiranjib | |

dc.contributor.author | Kundu, Achintya | |

dc.date.accessioned | 2022-04-04T04:51:18Z | |

dc.date.available | 2022-04-04T04:51:18Z | |

dc.date.submitted | 2021 | |

dc.identifier.uri | https://etd.iisc.ac.in/handle/2005/5678 | |

dc.description.abstract | This thesis is devoted to designing efficient optimization algorithms for machine learning (ML) problems where the underlying objective function to be optimized is convex but not necessarily differentiable. Such non-smooth objective functions are ubiquitous in ML mainly due to the use of one or more of the following: (a) non-differentiable loss function (hinge loss in binary classification), (b) sparsity promoting regularization term (L1 norm in regression), (c) constraint sets (elliptope in graph embedding) to induce specific structure on the parameters to be learned. Motivated by such a wide range of learning problems that can be cast as optimization of a non-smooth convex objective, we focus on developing first-order algorithms with non-asymptotic convergence rate guarantees to solve such problems in a large-scale setting. Based on shortcomings of the existing research in this domain, we address the following specific issues in this thesis. First, we consider the problem of minimizing a convex function over a feasible set given by the intersection of finitely many simple sets, each of which is equipped with a projection oracle. Examples of constraint sets that possess such structure include the set of doubly stochastic matrices, elliptope, the intersection of PSD cone with an L1-norm ball, etc. The main difficulty lies in computing the projection of a point onto the feasible set. Existing approaches yield an infeasible point with no guarantees on its in-feasibility. Exploiting the intersecting sets' linear regularity property, we present an exact penalty approach that leads to first-order algorithms with explicit guarantees on the approximate solution's distance from the feasible set. Further, we show improved iteration-complexity when the objective possesses structural smoothness / strong convexity. This is achieved through a saddle-point reformulation where the proximal operators required by the primal-dual algorithms can be computed in closed form. We illustrate the benefits of our approach on a graph transduction problem and graph matching. Next, we consider the classic composite problem of minimizing the sum of two convex functions: a smooth one (possessing Lipschitz continuous gradient) and a simple non-smooth one with easy to compute proximal operator. The well-known FISTA algorithm (also Nesterov's accelerated gradient method) achieves the optimal O(1/T^2) non-ergodic convergence rate for this problem. One of the drawbacks of these fast gradient methods is that they require computing gradients of the smooth function at points different from those on which the convergence rate guarantee applies. Inspired by Polyak's Heavy Ball method and the use of past gradients as momentum in training deep nets, we propose an accelerated gradient algorithm to rectify this drawback keeping the convergence rate intact. We achieve this through a judicious choice of momentum in both primal and dual space. To the best of our knowledge, this is the first accelerated gradient algorithm that achieves O(1/T^2) convergence rate guarantee on the iterates at which gradients are evaluated. Next, we propose modifications of Nesterov's accelerated gradient method by calling the first-order oracle at the same points on which the convergence rate guarantees apply. Algorithms thus obtained additionally achieve a linear convergence rate when the objective function is strongly convex. This fills a significant research gap as Polyak's Heavy Ball method guarantees accelerated convergence rate through gradient momentum only for a restrictive class of twice differentiable and strongly convex objective functions. Third, we focus on the problem of learning a positive semidefinite (PSD) kernel matrix from m similarity matrices under a general convex loss. The existing algorithms do not apply if one employs arbitrary loss functions and often can not handle m>1 case. Based on the black-box approach of Mirror Descent (MD), we present several provably convergent iterative algorithms that exploit the availability of off-the-shelf support vector machine (SVM) solvers. One of the significant contributions involves an extension of the well-known MD algorithm for simplex to handle the Cartesian product of PSD matrices leading to a novel algorithm called Entropic Multiple Kernel Learning. We also show simulation results on protein structure classification involving several similarity matrices to demonstrate the proposed algorithms' efficacy. Finally, we consider the class of problems possessing convex-concave saddle point structure with bilinear interaction. This model encompasses most convex optimization problems arising in ML and includes minimizing the sum of many simple non-smooth convex functions as a special case; thereby, it subsumes learning problems involving complex regularization terms such as total-variation based image denoising, overlapping group lasso, isotonic regression, etc. We first propose a primal-dual algorithm for this general class of problems that can achieve the O(1/T) convergence rate guarantee on the non-ergodic primal-dual iterate pair. Further, assuming strong convexity in the primal domain, we show an improved non-ergodic convergence rate of O(1/T^2). In contrast, the existing primal-dual algorithms achieve such convergence rates only in the ergodic or semi-ergodic setting. Further, we propose another primal-dual algorithm utilizing an additional prox-operator computation (in the primal space) per iteration. This variant additionally enjoys a \emph{non-ergodic} accelerated linear convergence rate when the objective function is strongly convex-strongly concave. The proposed algorithms were designed by cleverly incorporating the key ingredients from Nesterov's accelerated gradient method in the standard primal-dual algorithmic framework of Chambolle-Pock. | en_US |

dc.language.iso | en_US | en_US |

dc.rights | I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation | en_US |

dc.subject | Machine Learning | en_US |

dc.subject | Convex Optimization | en_US |

dc.subject | convex function | en_US |

dc.subject.classification | Research Subject Categories::TECHNOLOGY::Information technology::Computer science | en_US |

dc.title | Novel First-order Algorithms for Non-smooth Optimization Problems in Machine Learning | en_US |

dc.type | Thesis | en_US |

dc.degree.name | PhD | en_US |

dc.degree.level | Doctoral | en_US |

dc.degree.grantor | Indian Institute of Science | en_US |

dc.degree.discipline | Engineering | en_US |