Kernel Methods Fast Algorithms and real life applications
Vishwanathan, S V N
MetadataShow full item record
Support Vector Machines (SVM) have recently gained prominence in the field of machine learning and pattern classification (Vapnik, 1995, Herbrich, 2002, Scholkopf and Smola, 2002). Classification is achieved by finding a separating hyperplane in a feature space, which can be mapped back onto a non-linear surface in the input space. However, training an SVM involves solving a quadratic optimization problem, which tends to be computationally intensive. Furthermore, it can be subject to stability problems and is non-trivial to implement. This thesis proposes an fast iterative Support Vector training algorithm which overcomes some of these problems. Our algorithm, which we christen Simple SVM, works mainly for the quadratic soft margin loss (also called the l2 formulation). We also sketch an extension for the linear soft-margin loss (also called the l1 formulation). Simple SVM works by incrementally changing a candidate Support Vector set using a locally greedy approach, until the supporting hyperplane is found within a finite number of iterations. It is derived by a simple (yet computationally crucial) modification of the incremental SVM training algorithms of Cauwenberghs and Poggio (2001) which allows us to perform update operations very efficiently. Constant-time methods for initialization of the algorithm and experimental evidence for the speed of the proposed algorithm, when compared to methods such as Sequential Minimal Optimization and the Nearest Point Algorithm are given. We present results on a variety of real life datasets to validate our claims. In many real life applications, especially for the l2 formulation, the kernel matrix K є R n x n can be written as K = Z T Z + Λ , where, Z є R n x m with m << n and Λ є R n x n is diagonal with nonnegative entries. Hence the matrix K - Λ is rank-degenerate, Extending the work of Fine and Scheinberg (2001) and Gill et al. (1975) we propose an efficient factorization algorithm which can be used to find a L D LT factorization of K in 0(nm2) time. The modified factorization, after a rank one update of K, can be computed in 0(m2) time. We show how the Simple SVM algorithm can be sped up by taking advantage of this new factorization. We also demonstrate applications of our factorization to interior point methods. We show a close relation between the LDV factorization of a rectangular matrix and our LDLT factorization (Gill et al., 1975). An important feature of SVM's is that they can work with data from any input domain as long as a suitable mapping into a Hilbert space can be found, in other words, given the input data we should be able to compute a positive semi definite kernel matrix of the data (Scholkopf and Smola, 2002). In this thesis we propose kernels on a variety of discrete objects, such as strings, trees, Finite State Automata, and Pushdown Automata. We show that our kernels include as special cases the celebrated Pair-HMM kernels (Durbin et al., 1998, Watkins, 2000), the spectrum kernel (Leslie et al., 20024, convolution kernels for NLP (Collins and Duffy, 2001), graph diffusion kernels (Kondor and Lafferty, 2002) and various other string-matching kernels. Because of their widespread applications in bio-informatics and web document based algorithms, string kernels are of special practical importance. By intelligently using the matching statistics algorithm of Chang and Lawler (1994), we propose, perhaps, the first ever algorithm to compute string kernels in linear time. This obviates dynamic programming with quadratic time complexity and makes string kernels a viable alternative for the practitioner. We also propose extensions of our string kernels to compute kernels on trees efficiently. This thesis presents a linear time algorithm for ordered trees and a log-linear time algorithm for unordered trees. In general, SVM's require time proportional to the number of Support Vectors for prediction. In case the dataset is noisy a large fraction of the data points become Support Vectors and thus time required for prediction increases. But, in many applications like search engines or web document retrieval, the dataset is noisy, yet, the speed of prediction is critical. We propose a method for string kernels by which the prediction time can be reduced to linear in the length of the sequence to be classified, regardless of the number of Support Vectors. We achieve this by using a weighted version of our string kernel algorithm. We explore the relationship between dynamic systems and kernels. We define kernels on various kinds of dynamic systems including Markov chains (both discrete and continuous), diffusion processes on graphs and Markov chains, Finite State Automata, various linear time-invariant systems etc Trajectories arc used to define kernels introduced on initial conditions lying underlying dynamic system. The same idea is extended to define Kernels on a. dynamic system with respect to a set of initial conditions. This framework leads to a large number of novel kernels and also generalize many previously proposed kernels. Lack of adequate training data is a problem which plagues classifiers. We propose n new method to generate virtual training samples in the case of handwritten digit data. Our method uses the two dimensional suffix tree representation of a set of matrices to encode an exponential number of virtual samples in linear space thus leading to an increase in classification accuracy. This in turn, leads us naturally to a, compact data dependent representation of a test pattern which we call the description tree. We propose a new kernel for images and demonstrate a quadratic time algorithm for computing it by wing the suffix tree representation of an image. We also describe a method to reduce the prediction time to quadratic in the size of the test image by using techniques similar to those used for string kernels.