Robust Risk Minimization under Label Noise
Abstract
In the setting of supervised learning, one learns a classifier from training data consisting of
patterns and the corresponding labels. Errors in the labels of the training examples are
referred to as label noise. In practice, label noise is unavoidable. For example, when
labelling of patterns is done by human experts, we may have label noise due to the unavoidable
subjective biases and/or random human errors. Nowadays, in many applications, large data
sets are often labelled through crowd-sourcing, which also results in label noise, due both
to human errors and to variations in the quality of the crowd-sourced labellers. Many
studies have shown that label errors adversely affect standard classifier learning algorithms
such as Support Vector Machines (SVMs), Logistic Regression, Neural Networks, etc. Thus, robustness
of classifier learning algorithms to label noise is an important desired property. This
thesis investigates the robustness of risk minimization algorithms to label noise.
Many approaches have been suggested in the literature for mitigating the adverse effects of
label noise. One can use some heuristics to detect examples with noisy labels and remove them
from training data. Using similar heuristics, modifications have been suggested for algorithms such
as perceptron, AdaBoost, etc. for mitigating adverse effects of label noise. Another important
approach is to treat the true labels as missing data and, using some probabilistic model of label
corruption, estimate the posterior probability of the true labels using, e.g., the EM algorithm.
In this thesis, we study the robustness of classifier learning algorithms which can be formulated
as risk minimization methods. In the risk minimization framework, one learns a classifier by minimizing
the expectation of a loss function with respect to the underlying unknown distribution.
Many of the standard classifier learning algorithms (e.g., Naive Bayes, Backpropagation for
learning feedforward neural networks, SVMs etc.) can be posed as risk minimization. One
approach to robust risk minimization is called loss correction. Here, to minimize risk with loss
L with respect to the true label distribution, one creates a new loss function L' and minimizes
risk with it under the corrupted labels. However, to find the proper L' for a given L, one
needs knowledge of the label corruption probabilities (which may be estimated from the data).
Another approach to robust risk minimization is to seek loss functions that result in inherent
robustness of risk minimization. An advantage of this approach is that one need not distinguish
between noisy and noise-free training data. The classifier learning algorithm remains
the same. This is the approach that is investigated in this thesis.
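The framework above can be written compactly as follows. This is only a sketch: the notation (f, D, D_eta, R_L) is assumed here for illustration and is not taken from the text. Under this reading, loss correction minimizes the noisy risk with a modified loss L', whereas the approach studied here asks when a minimizer of the noisy risk under L itself performs as well on clean data as a minimizer of the clean risk.

```latex
% Notation assumed for illustration; it does not appear in the source.
% f: classifier, L: loss, D: clean distribution over (x, y),
% D_\eta: distribution over (x, \tilde{y}) with noisy labels \tilde{y}.
\begin{align*}
R_L(f) &= \mathbb{E}_{(x,y)\sim D}\bigl[L(f(x), y)\bigr], \\
R_L^{\eta}(f) &= \mathbb{E}_{(x,\tilde{y})\sim D_\eta}\bigl[L(f(x), \tilde{y})\bigr], \\
f^{*} \in \arg\min_f R_L(f), &\qquad f^{*}_{\eta} \in \arg\min_f R_L^{\eta}(f).
\end{align*}
```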
The robustness of risk minimization depends on the loss function used. In this thesis we
derive sufficient conditions on the loss function so that risk minimization under that loss function
is robust to different types of label noise. We call loss functions that satisfy these
conditions robust losses. Our main theoretical results address the robustness of risk minimization
under the symmetric and class-conditional label noise models. In symmetric label noise, the
probability of mislabelling a sample as another class is the same irrespective of the pattern. The symmetric
label noise model is suitable for applications where errors in the labels are random. In class-conditional
label noise, errors in labels depend on the underlying true class of a pattern.
This model is suitable for applications where some pairs of classes are more likely to be confused
than others. We also discuss our results on the most general noise model, called non-uniform
label noise, where the probability of a labelling error also depends on the pattern vector. All our
theoretical results are for the case of multi-class classification, and these results generalize some
similar results known for the case of binary classification. All our theoretical results concern
minimization of risk, though in practice one can only minimize empirical risk. We provide one
result on the consistency of empirical risk minimization under symmetric label noise.
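The first two noise models can be simulated in a few lines. The sketch below is illustrative only: the function names, the integer encoding of classes as 0..k-1, and the choice of a uniform distribution over the other classes in the symmetric case are assumptions made here, not definitions taken from the text.

```python
import random


def symmetric_noise(labels, num_classes, eta, rng):
    """Flip each label with probability eta, independently of the pattern.

    Assumption: a flipped label is drawn uniformly from the other classes.
    """
    noisy = []
    for y in labels:
        if rng.random() < eta:
            # An offset in 1..k-1 (mod k) always lands on a different class.
            offset = rng.randrange(1, num_classes)
            noisy.append((y + offset) % num_classes)
        else:
            noisy.append(y)
    return noisy


def class_conditional_noise(labels, transition, rng):
    """Corrupt labels using a row-stochastic matrix.

    transition[i][j] is the assumed probability of observing label j
    when the true class is i, so the error rate depends only on the class.
    """
    classes = list(range(len(transition)))
    return [rng.choices(classes, weights=transition[y])[0] for y in labels]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [i % 3 for i in range(12)]
    print(symmetric_noise(clean, num_classes=3, eta=0.3, rng=rng))
    # Hypothetical matrix: classes 0 and 1 are confused, class 2 is clean.
    confusion = [[0.8, 0.2, 0.0], [0.3, 0.7, 0.0], [0.0, 0.0, 1.0]]
    print(class_conditional_noise(clean, confusion, rng))
```

Non-uniform noise would additionally make the flip probability a function of the pattern itself, which this sketch does not model.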
We also empirically demonstrate the utility of our theoretical results using neural network
classifiers. We consider three loss functions commonly used with deep neural networks, namely,
Categorical Cross Entropy (CCE), Mean Square Error (MSE) and Mean Absolute Error (MAE).
Of these three, the MAE loss satisfies the sufficient conditions for a robust loss while the other
two do not. Through empirical investigation on synthetic and standard real data sets, we show
the robustness of MAE loss compared to the others.
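The sufficient conditions are not spelled out here, so the following check rests on an assumption: it uses the symmetry condition common in the robust-loss literature, namely that the loss summed over all class labels is the same constant for every classifier output. The loss definitions (softmax outputs compared against one-hot targets, MSE as a squared Euclidean distance) and all function names are likewise illustrative choices. Under these assumptions, the sum is constant for MAE but varies with the output for CCE and MSE.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, k):
    return [1.0 if i == label else 0.0 for i in range(k)]


def cce(probs, label):
    """Categorical cross entropy against a one-hot target."""
    return -math.log(probs[label])


def mse(probs, label):
    """Squared Euclidean distance to the one-hot target (assumed form)."""
    target = one_hot(label, len(probs))
    return sum((p - t) ** 2 for p, t in zip(probs, target))


def mae(probs, label):
    """L1 distance to the one-hot target (assumed form)."""
    target = one_hot(label, len(probs))
    return sum(abs(p - t) for p, t in zip(probs, target))


def loss_sum_over_classes(loss, probs):
    """Sum of loss(probs, j) over every possible label j."""
    return sum(loss(probs, j) for j in range(len(probs)))


if __name__ == "__main__":
    for scores in ([0.0, 0.0, 0.0], [5.0, 0.0, -5.0]):
        probs = softmax(scores)
        for name, loss in (("CCE", cce), ("MSE", mse), ("MAE", mae)):
            print(scores, name, loss_sum_over_classes(loss, probs))
```

For k classes, the MAE sum works out to 2(k - 1) because each term equals 2 - 2u_j and the softmax outputs u_j sum to one.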
While the MAE loss is robust, it is difficult to minimize empirical risk under this loss,
as our empirical results show. Optimizing the MAE loss takes a very large number of epochs and a good
initialization point compared to CCE and MSE, neither of which is
robust. To alleviate this issue, we propose a novel robust loss called Robust Log Loss (RLL).
This loss can be viewed as a modification of CCE to make it robust. Empirical risk minimization
under RLL is similar to that under CCE in terms of learning rate. However, RLL satisfies
the sufficient condition for robustness and we show empirically that RLL is superior to CCE in
terms of robustness to label noise. Learning with RLL is also more efficient than learning with
MAE.
We further extend our concept of robust risk minimization under label noise to multi-label
categorization problems. In multi-label problems, a pattern may belong to more than one class,
unlike in multi-class problems, where only one label is associated with a pattern.
We first define a symmetric label noise model in the context of multi-label classification problems,
which is a useful model for random errors in labelling. Next, we study robust learning of
multi-label classifiers under risk minimization and propose sufficient conditions for a loss to be
robust under symmetric label noise. These sufficient conditions are satisfied by the Hamming
loss and its surrogate robust losses. In the case of multi-label problems also, we empirically
demonstrate our theoretical results.
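For concreteness, the Hamming loss on binary label vectors can be sketched as below. The noise function is a guess at what symmetric noise might look like in the multi-label setting (each binary label flipped independently with the same probability); the definition used in the thesis is not given here, and both function names are hypothetical.

```python
import random


def hamming_loss(predicted, target):
    """Fraction of label positions where prediction and target disagree."""
    if len(predicted) != len(target):
        raise ValueError("label vectors must have the same length")
    return sum(p != t for p, t in zip(predicted, target)) / len(target)


def flip_labels(target, eta, rng):
    """Assumed multi-label symmetric noise: flip each 0/1 label w.p. eta."""
    return [1 - t if rng.random() < eta else t for t in target]


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [1, 0, 1, 1, 0]
    prediction = [1, 0, 0, 1, 0]
    noisy_truth = flip_labels(truth, eta=0.2, rng=rng)
    print(hamming_loss(prediction, truth))
    print(hamming_loss(prediction, noisy_truth))
```

One property that is easy to verify is that the Hamming loss against a label vector and the Hamming loss against its complement always add up to one, whatever the prediction.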