Structured Regularization Through Convex Relaxations Of Discrete Penalties
Abstract
Motivation. Empirical risk minimization (ERM) is a popular framework for learning predictive
models from data, and has been used in domains such as computer vision, text processing,
bioinformatics, neurobiology, and temporal point processes, to name a few. We consider
settings where one has a priori information regarding the model structure, the simplest
example being sparsity of the model. The desired sparsity structure can be imposed on ERM
problems using a regularizer, which is typically a norm on the model vector. Popular choices of
regularizers include the ℓ1 norm (LASSO), which encourages sparse solutions, and the block-ℓ1
norm, which encourages block-level sparsity, among many others that induce more complicated
structures. To impose the structural prior, many recent studies have considered combinatorial
functions of the model's support as regularizers. However, this leads to NP-hard problems, which
motivates us to consider convex relaxations of the corresponding combinatorial functions,
which are tractable.
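As a schematic illustration (generic notation assumed here, rather than taken verbatim from the thesis), the ERM problem with a combinatorial support penalty and its convex surrogate can be written as
\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, \langle w, x_i \rangle\big) + \lambda\, F(\mathrm{supp}(w))
\quad\longrightarrow\quad
\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, \langle w, x_i \rangle\big) + \lambda\, \Omega_F(w),
\]
where F is a combinatorial (e.g., submodular) set function evaluated on the support of w, and Ω_F is a convex relaxation of the penalty w ↦ F(supp(w)). For instance, F(S) = |S| yields the ℓ0 penalty, whose standard convex relaxation is the ℓ1 norm.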
Existing work and research gaps. Convex relaxations of combinatorial functions have
been studied recently in the context of structured sparsity, but they still lead to inefficient
computational procedures in general: e.g., even when the combinatorial function is submodular,
whose convex relaxations are well studied and easier to work with than those of general
functions, the resulting problem is computationally expensive (the proximal operator takes O(d^6)
time for d variables). Hence, the associated computational cost has limited research interest
in these regularizers, even though the submodular functions that generate them are
expressive enough to encourage many structures, such as those mentioned above. It therefore
remains open to design efficient optimization procedures for submodular penalties,
and for combinatorial penalties in general. It is also desirable that the optimization
algorithms be applicable across the possible choices of loss functions arising
from various applications. We identify four such problems from these research gaps,
and address them through the contributions listed below. We provide the list of
publications related to this thesis following this abstract.
Contributions. First, we propose a novel kernel learning technique termed Variable
Sparsity Kernel Learning (VSKL) for support vector machines (SVMs), which is applicable
when there is a priori information regarding the grouping of the kernels. In such cases, we
propose a novel mixed-norm regularizer, which encourages sparse selection of the kernels within
a group while selecting all groups. This regularizer can also be viewed as the convex relaxation
of a specific discrete penalty on the model's support. The resulting non-smooth optimization
problem is difficult, and off-the-shelf techniques are not applicable. We propose a mirror-descent-based
optimization algorithm to solve this problem, which has a guaranteed convergence
rate of O(1/√T) over T iterations. We demonstrate the efficiency of the proposed algorithm in
various scaling experiments, and the applicability of the regularizer in an object recognition task.
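One illustrative form of such a mixed norm (a sketch only, not necessarily the exact regularizer used in the thesis) combines an ℓ1 norm within each kernel group with an ℓp norm, p > 1, across groups:
\[
\Omega(w) \;=\; \Big( \sum_{g \in \mathcal{G}} \|w_g\|_1^{\,p} \Big)^{1/p}, \qquad p > 1,
\]
where the inner ℓ1 norm promotes sparsity among the kernels inside each group g, while the outer ℓp norm, being non-sparsity-inducing for p > 1, tends to keep all groups active.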
Second, we introduce a family of regularizers termed Smoothed Ordered Weighted
ℓ1 (SOWL) norms, which are derived as convex relaxations of non-decreasing cardinality-based
submodular penalties, an important special class of general discrete penalties.
Considering linear regression, where the presence of correlated predictors causes
traditional regularizers such as the Lasso to fail to recover the true support, SOWL has the
property of selecting the correlated predictors as groups. While Ordered Weighted ℓ1 (OWL)
norms address similar use cases, they are biased towards undesirable piecewise-constant
solutions, a bias that SOWL does not share. SOWL is also shown to be equivalent to the group
Lasso, but without specifying the groups explicitly. We develop efficient proximal operators for SOWL,
and illustrate its computational and theoretical benefits through various simulations.
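For reference, the OWL norms mentioned above have the standard form
\[
\Omega_{\mathrm{OWL}}(w) \;=\; \sum_{i=1}^{d} c_i\, |w|_{[i]}, \qquad c_1 \ge c_2 \ge \cdots \ge c_d \ge 0,
\]
where |w|_{[i]} denotes the i-th largest entry of w in absolute value (the OSCAR weights c_i = λ1 + λ2(d − i) are a well-known special case); SOWL is obtained instead by relaxing the corresponding cardinality-based submodular penalty, as detailed in the thesis.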
Third, we discuss Hawkes-OWL, an application of OWL regularizers to the setting of
multidimensional Hawkes processes. A Hawkes process is a multi-dimensional point process (a
collection of multiple event streams) with self- and mutual influences between the event streams.
While the popular ℓ1 regularizer fails to recover the true models in the presence of strongly
correlated event streams, OWL regularization addresses this issue and groups the correlated predictors.
This is the first instance in the literature where OWL norms, which have predominantly
been studied with simple loss functions such as the squared loss, are extended
to the Hawkes process with similar theoretical and computational guarantees.
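For context, a standard parameterization of a D-dimensional Hawkes process (a generic form assumed here; the thesis may use a different kernel or parameterization) specifies the conditional intensity of the i-th event stream as
\[
\lambda_i(t) \;=\; \mu_i \;+\; \sum_{j=1}^{D} \sum_{t_{j,k} < t} a_{ij}\, \phi(t - t_{j,k}),
\]
where μ_i is the base intensity, φ is a decay kernel (e.g., φ(s) = β e^{-β s}), the t_{j,k} are the event times of stream j, and the matrix A = (a_{ij}) of mutual-excitation coefficients is the quantity being estimated and regularized.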
In the fourth part, we discuss generic first-order algorithms for learning with subquadratic
norms, a special sub-family of convex relaxations of submodular penalties. We consider
subquadratic-norm regularization in a very general setting, covering all loss functions, and propose
different reformulations of the original problem. The reformulations enable us to propose two
different primal-dual algorithms, one CP-based and one ADMM-based, both of which have a guaranteed
convergence rate of O(1/T). This study thus provides the first algorithms with efficient
convergence rates for learning with subquadratic norms.
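As a generic illustration of the primal-dual reformulation underlying such methods (a sketch under standard convexity assumptions, not the thesis' exact derivation), the regularized problem can be recast as a saddle-point problem,
\[
\min_{w} \; L(w) + \lambda\,\Omega(w) \;=\; \min_{w} \max_{u} \; L(w) + \langle w, u \rangle - (\lambda\Omega)^{*}(u),
\]
and solved with Chambolle-Pock-type iterations
\[
u^{k+1} = \mathrm{prox}_{\sigma (\lambda\Omega)^{*}}\big(u^{k} + \sigma \bar{w}^{k}\big), \qquad
w^{k+1} = \mathrm{prox}_{\tau L}\big(w^{k} - \tau u^{k+1}\big), \qquad
\bar{w}^{k+1} = 2 w^{k+1} - w^{k},
\]
which, for suitable step sizes σ and τ, attain an O(1/T) ergodic convergence rate; the reformulations in the thesis presumably exploit the structure of subquadratic norms to keep each iteration efficient.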