Design and Analysis of Consistent Algorithms for Multiclass Learning Problems
Abstract
We consider the broad framework of supervised learning, where one gets examples of objects together with some labels (such as tissue samples labeled as cancerous or noncancerous, or images of handwritten digits labeled with the correct digit in 09), and the goal is to learn a prediction model which given a new object, makes an accurate prediction. The notion of accuracy depends on the learning problem under study and is measured by a performance measure of interest. A supervised learning algorithm is said to be 'statistically consistent' if it returns an `optimal' prediction model with respect to the desired performance measure in the limit of infinite data. Statistical consistency is a fundamental notion in supervised machine learning, and therefore the design of consistent algorithms for various learning problems is an important question. While this has been well studied for simple binary classification problems and some other specific learning problems, the question of consistent algorithms for general multiclass learning problems remains open. We investigate several aspects of this question as detailed below.
First, we develop an understanding of consistency for multiclass performance measures defined by a general loss matrix, for which convex surrogate risk minimization algorithms are widely used. Consistency of such algorithms hinges on the notion of 'calibration' of the surrogate loss with respect to target loss matrix; we start by developing a general understanding of this notion, and give both necessary conditions and sufficient conditions for a surrogate loss to be calibrated with respect to a target loss matrix. We then define a fundamental quantity associated with any loss matrix, which we term the `convex calibration dimension' of the loss matrix; this gives one measure of the intrinsic difficulty of designing convex calibrated surrogates for a given loss matrix. We derive lower bounds on the convex calibration dimension which leads to several new results on nonexistence of convex calibrated surrogates for various losses. For example, our results improve on recent results on the nonexistence of low dimensional convex calibrated surrogates for various subset ranking losses like the pairwise disagreement (PD) and mean average precision (MAP) losses. We also upper bound the convex calibration dimension of a loss matrix by its rank, by constructing an explicit, generic, least squares type convex calibrated surrogate, such that the dimension of the surrogate is at most the (linear algebraic) rank of the loss matrix. This yields lowdimensional convex calibrated surrogates  and therefore consistent learning algorithms  for a variety of structured prediction problems for which the associated loss is of low rank, including for example the precision @ k and expected rank utility (ERU) losses used in subset ranking problems. For settings where achieving exact consistency is computationally difficult, as is the case with the PD and MAP losses in subset ranking, we also show how to extend these surrogates to give algorithms satisfying weaker notions of consistency, including both consistency over restricted sets of probability distributions, and an approximate form of consistency over the full probability space.
Second, we consider the practically important problem of hierarchical classification, where the labels to be predicted are organized in a tree hierarchy. We design a new family of convex calibrated surrogate losses for the associated treedistance loss; these surrogates are better than the generic least squares surrogate in terms of easier optimization and representation of the solution, and some surrogates in the family also operate on a significantly lower dimensional space than the rank of the treedistance loss matrix. These surrogates, which we term the `cascade' family of surrogates, rely crucially on a new understanding we develop for the problem of multiclass classification with an abstain option, for which we construct new convex calibrated surrogates that are of independent interest by themselves. The resulting hierarchical classification algorithms outperform the current stateoftheart in terms of both accuracy and running time.
Finally, we go beyond lossbased multiclass performance measures, and consider multiclass learning problems with more complex performance measures that are nonlinear functions of the confusion matrix and that cannot be expressed using loss matrices; these include for example the multiclass Gmean measure used in class imbalance settings and the micro F1 measure used often in information retrieval applications. We take an optimization viewpoint for such settings, and give a FrankWolfe type algorithm that is provably consistent for any complex performance measure that is a convex function of the entries of the confusion matrix (this includes the Gmean, but not the micro F1). The resulting algorithms outperform the stateoftheart SVMPerf algorithm in terms of both accuracy and running time.
In conclusion, in this thesis, we have developed a deep understanding and fundamental results in the theory of supervised multiclass learning. These insights have allowed us to develop computationally efficient and statistically consistent algorithms for a variety of multiclass learning problems of practical interest, in many cases significantly outperforming the stateoftheart algorithms for these problems.
Collections
Related items
Showing items related by title, author, creator and subject.

Learning with Complex Performance Measures : Theory, Algorithms and Applications
Narasimhan, Harikrishna (20171207)We consider supervised learning problems, where one is given objects with labels, and the goal is to learn a model that can make accurate predictions on new objects. These problems abound in applications, ranging from ... 
Sparse Multiclass And MultiLabel Classifier Design For Faster Inference
Bapat, Tanuja (20130620)Many realworld problems like handwritten digit recognition or semantic scene classiﬁcation are treated as multiclass or multilabel classiﬁcation problems. Solutions to these problems using support vector machines (SVMs) ... 
Bayes Optimal Feature Selection for Supervised Learning
Saneem Ahmed, C G (20180218)The problem of feature selection is critical in several areas of machine learning and data analysis such as, for example, cancer classification using gene expression data, text categorization, etc. In this work, we consider ...