Structured Regularization Through Convex Relaxations Of Discrete Penalties
Motivation. Empirical risk minimization(ERM) is a popular framework for learning predictive models from data, which has been used in various domains such as computer vision, text processing, bioinformatics, neuro-biology, temporal point processes, to name a few. We consider the cases where one has apriori information regarding the model structure, the simplest one being the sparsity of the model. The desired sparsity structure can be imputed into ERM problems using a regularizer, which is typically a norm on the model vector. Popular choices of the regularizers include the `1 norm (LASSO) which encourages sparse solutions, block-`1 which encourages block level sparsity, among many others which induce more complicated structures. To impute the structural prior, recently, many studies have considered combinatorial functions on the model's support to be the regularizer. But this leads to an NP-hard problem, which thus motivates us to consider convex relaxations of the corresponding combinatorial functions, which are tractable. Existing work and research gaps. The convex relaxations of combinatorial functions have been studied recently in the context of structured sparsity, but they still lead to inefficient computational procedures in general cases: e.g., even when the combinatorial function is submodular, whose convex relaxations are well studied and easier to work with than the general ones, the resulting problem is computationally expensive (the proximal operator takes O(d6) time for d variables). Hence, the associated high expenses have limited the research interest towards these these regularizers, despite the submodular functions which generate them are expressive enough to encourage many of the structures such as those mentioned before. Hence it remains open to design efficient optimization procedures to work with the submodular penalties, and with combinatorial penalties in general. It is also desirable that the optimization algorithms designed to be applicable across the possible choices of the loss functions arising from various applications. We identify four such problems from these existing research gaps, and address them through the contributions which are listed below. We provide the list of publications related to this thesis following this abstract Contributions. First, we propose a novel kernel learning technique termed as Variable sparsity kernel learning (VSKL) for support vector machines (SVM), which are applicable when there is an apriori information regarding the grouping among the kernels. In such cases, we propose a novel mixed-norm regularizer, which encourages sparse selection of the kernels within a group while selecting all groups. This regularizer can also be viewed as the convex relaxation of a specifi c discrete penalty on the model's support. The resulting non-smooth optimization problem is difficult, where o -the-shelf techniques are not applicable. We propose a mirror descent based optimization algorithm to solve this problem, which has a guaranteed convergence rate of O(1= p T) over T iterations. We demonstrate the efficiency of the proposed algorithm in various scaling experiments, and applicability of the regularizer in an object recognition task. Second, we introduce a family of regularizers termed as Smoothed Ordered Weighted L1 (SOWL) norms, which are derived as the convex relaxation of non-decreasing cardinalitybased submodular penalties, which form an important special class of the general discrete penalties. Considering linear regression, where the presence of correlated predictors cause the traditional regularizers such as Lasso to fail recovering the true support, SOWL has the property of selecting the correlated predictors as groups. While Ordered Weighted `1 (OWL) norms address similar use cases, they are biased to promote undesirable piece-wise constant solutions, which SOWL does not have. SOWL is also shown equivalent to group-lasso, but without specifying the groups explicitly. We develop efficient proximal operators for SOWL, and illustrate its computational and theoretical benefi ts through various simulations. Third, we discuss Hawkes-OWL, an application of OWL regularizers for the setting of multidimensional Hawkes processes. Hawkes process is a multi-dimensional point process (collection of multiple event streams) with self and mutual in fluences between the event streams. While the popular `1 regularizer fails to recover the true models in the presence of strongly correlated event streams, OWL regularization address this issue and groups the correlated predictors. This is the first instance in the literature, where OWL norms, which predominantly have been studied with respect to simple loss functions such as the squared loss, are extended to the Hawkes process with similar theoretical and computational guarantees. In the fourth part, we discuss generic first-order algorithms for learning with Subquadraic norms, a special sub-family of convex relaxations of submodular penalties. We consider subquadratic norms regularization in a very general setting, covering all loss functions, and propose different reformulations of the original problem. The reformulations enable us to propose two different primal-dual algorithms, CP- and ADMM- , both of which having a guaranteed convergence rate of O(1=T ). This study thus provides the rst ever algorithms with e cient convergence rates for learning with subquadratic norms.