Design and Analysis of Consistent Algorithms for Multiclass Learning Problems
Harish, Guruprasad Ramaswami
MetadataShow full item record
We consider the broad framework of supervised learning, where one gets examples of objects together with some labels (such as tissue samples labeled as cancerous or non-cancerous, or images of handwritten digits labeled with the correct digit in 0-9), and the goal is to learn a prediction model which given a new object, makes an accurate prediction. The notion of accuracy depends on the learning problem under study and is measured by a performance measure of interest. A supervised learning algorithm is said to be 'statistically consistent' if it returns an `optimal' prediction model with respect to the desired performance measure in the limit of infinite data. Statistical consistency is a fundamental notion in supervised machine learning, and therefore the design of consistent algorithms for various learning problems is an important question. While this has been well studied for simple binary classification problems and some other specific learning problems, the question of consistent algorithms for general multiclass learning problems remains open. We investigate several aspects of this question as detailed below. First, we develop an understanding of consistency for multiclass performance measures defined by a general loss matrix, for which convex surrogate risk minimization algorithms are widely used. Consistency of such algorithms hinges on the notion of 'calibration' of the surrogate loss with respect to target loss matrix; we start by developing a general understanding of this notion, and give both necessary conditions and sufficient conditions for a surrogate loss to be calibrated with respect to a target loss matrix. We then define a fundamental quantity associated with any loss matrix, which we term the `convex calibration dimension' of the loss matrix; this gives one measure of the intrinsic difficulty of designing convex calibrated surrogates for a given loss matrix. We derive lower bounds on the convex calibration dimension which leads to several new results on non-existence of convex calibrated surrogates for various losses. For example, our results improve on recent results on the non-existence of low dimensional convex calibrated surrogates for various subset ranking losses like the pairwise disagreement (PD) and mean average precision (MAP) losses. We also upper bound the convex calibration dimension of a loss matrix by its rank, by constructing an explicit, generic, least squares type convex calibrated surrogate, such that the dimension of the surrogate is at most the (linear algebraic) rank of the loss matrix. This yields low-dimensional convex calibrated surrogates - and therefore consistent learning algorithms - for a variety of structured prediction problems for which the associated loss is of low rank, including for example the precision @ k and expected rank utility (ERU) losses used in subset ranking problems. For settings where achieving exact consistency is computationally difficult, as is the case with the PD and MAP losses in subset ranking, we also show how to extend these surrogates to give algorithms satisfying weaker notions of consistency, including both consistency over restricted sets of probability distributions, and an approximate form of consistency over the full probability space. Second, we consider the practically important problem of hierarchical classification, where the labels to be predicted are organized in a tree hierarchy. We design a new family of convex calibrated surrogate losses for the associated tree-distance loss; these surrogates are better than the generic least squares surrogate in terms of easier optimization and representation of the solution, and some surrogates in the family also operate on a significantly lower dimensional space than the rank of the tree-distance loss matrix. These surrogates, which we term the `cascade' family of surrogates, rely crucially on a new understanding we develop for the problem of multiclass classification with an abstain option, for which we construct new convex calibrated surrogates that are of independent interest by themselves. The resulting hierarchical classification algorithms outperform the current state-of-the-art in terms of both accuracy and running time. Finally, we go beyond loss-based multiclass performance measures, and consider multiclass learning problems with more complex performance measures that are nonlinear functions of the confusion matrix and that cannot be expressed using loss matrices; these include for example the multiclass G-mean measure used in class imbalance settings and the micro F1 measure used often in information retrieval applications. We take an optimization viewpoint for such settings, and give a Frank-Wolfe type algorithm that is provably consistent for any complex performance measure that is a convex function of the entries of the confusion matrix (this includes the G-mean, but not the micro F1). The resulting algorithms outperform the state-of-the-art SVMPerf algorithm in terms of both accuracy and running time. In conclusion, in this thesis, we have developed a deep understanding and fundamental results in the theory of supervised multiclass learning. These insights have allowed us to develop computationally efficient and statistically consistent algorithms for a variety of multiclass learning problems of practical interest, in many cases significantly outperforming the state-of-the-art algorithms for these problems.
Showing items related by title, author, creator and subject.
Narasimhan, Harikrishna (2017-12-07)We consider supervised learning problems, where one is given objects with labels, and the goal is to learn a model that can make accurate predictions on new objects. These problems abound in applications, ranging from ...
Bapat, Tanuja (2013-06-20)Many real-world problems like hand-written digit recognition or semantic scene classiﬁcation are treated as multiclass or multi-label classiﬁcation prob-lems. Solutions to these problems using support vector machines (SVMs) ...
Chakraborty, Avijit (2017-05-04)Scheduling policies adopted for statistical multiplexing should provide delay differentiation between different traffic classes, where each class represents an aggregate traﬃc of individual applications having same ...