Genetic Programming Based Multicategory Pattern Classification
Abstract
Nature has created complex biological structures that exhibit intelligent behaviour through an evolutionary process. Thus, intelligence and evolution are intimately connected. This has inspired evolutionary computation (EC) that simulates the evolutionary process to develop powerful techniques such as genetic algorithms (GAs), genetic programming (GP), evolutionary strategies (ES) and evolutionary programming (EP) to solve real-world problems in learning, control, optimization and classification. GP discovers the relationship among data and expresses it as a LISP-S expression i.e., a computer program. Thus the goal of program discovery as a solution for a problem is addressed by GP in the framework of evolutionary computation. In this thesis, we address for the first time the problem of applying GP to mu1ticategory pattern classification. In supervised pattern classification, an input vector of m dimensions is mapped onto one of the n classes. It has a number of application areas such as remote sensing, medical diagnosis etc., A supervised classifier is developed by using a training set that contains representative samples of various classes present in the application. Supervised classification has been done earlier with maximum likelihood classifier: neural networks and fuzzy logic. The major considerations in applying GP to pattern classification are listed below:
(i) GP-based techniques are data distribution-free i.e., no a priori knowledge is needed abut the statistical distribution of the data or no assumption such as normal distribution for data needs to be made as in MLC.
(ii) GP can directly operate on the data in its original form.
(iii) GP can detect the underlying but unknown relationship that mists among data and express it as a mathematical LISP S-expression. The generated LISP S-expressions can be directly used in the application environment.
(iv) GP can either discover the most important discriminating features of a class during evolution or it requires minor post-processing of the LISP-S expression to discover the discriminant features. In a neural network, the knowledge learned by the neural network about the data distributions is embedded in the interconnection weights and it requires considerable amount of post-processing of the weights to understand the decision of the neural network.
In 2-category pattern classification, a single GP expression is evolved as a discriminant function. The output of the GP expression can be +l for samples of one class and -1 for samples of the other class. When the GP paradigm is applied to an n-class problem, the following questions arise:
Ql. As a typical GP expression returns a value (+l or -1) for a 2-class problem, how does
one apply GP for the n-class pattern classification problem?
Q2. What should be the fitness function during evolution of the GP expressions?
Q3. How does the choice of a function set affect the performance of GP-based classification?
Q4. How should training sets be created for evaluating fitness during the evolution of GP
classifier expressions?
Q5. How does one improve learning of the underlying data distributions in a GP framework?
Q6. How should conflict resolution be handled before assigning a class to the input feature vector?
Q7. How does GP compare with other classifiers for an n-class pattern classification problem?
The research described here seeks to answer these questions. We show that GP can be applied to an n-category pattern classification problem by considering it as n 2-class problems. The suitability of this approach is demonstrated by considering a real-world problem based on remotely sensed satellite images and Fisher's Iris data set. In a 2-class problem, simple thresholding is sufficient for a discriminant function to divide the feature space into two regions. This means that one genetic programming classifier expression (GPCE) is sufficient to say whether or not the given input feature vector belongs to that class; i.e., the GP expression returns a value (+1 or -1). As the n-class problem is formulated as n 2-class problems, n GPCEs are evolved. Hence, n GPCE specific training sets are needed to evolve these n GPCEs. For the sake of illustration, consider a 5-class pat tern classification problem. Let n, be the number of samples that belong to class j, and N, be the number of samples that do not belong to class j, (j = 1,..., 5). Thus,
N1=n2+n3+n4+n5
N2=n1+n3+n4+n5
N3=n1+n2+n4+n5
N4=n1+n2+n3+n5
N5=n1+n2+n3+n4
Thus, When the five class problem is formulated as five 2-class problems. we need five GPCEs as discriminant functions to resolve between n1 and N1, n2 and N2, n3 and N3, n4 and N4 and lastly n5 and N5. Each of these five 2-class problems is handled as a separate 2-class problem with simple thresholding. Thus, GPCE# l resolves between samples of class# l and the remaining n - 1 classes. A training set is needed to evaluate the fitness of GPCE during its evolution. If we directly create the training set, it leads to skewness (as n1 < N1). To overcome the skewness, an interleaved data format is proposed for the training set of a GPCE. For example, in the training set of GPCE# l, samples of class# l are placed alternately between samples of the remaining n - 1 classes. Thus, the interleaved data format is an artifact to create a balanced training set.
Conventionally, all the samples of a training set are fed to evaluate the fitness of every member of the population in each generation. We call this "global" learning 3s GP tries to learn the entire training set at every stage of the evolution. We have introduced incremental learning to simplify the task of learning for the GP paradigm. A subset of the training set is fed and the size of the subset is gradually increased over time to cover the entire training
data. The basic motivation for incremental learning is to improve learning during evolution as it is easier to learn a smaller task and then to progress from a smaller task to a bigger task. Experimental results are presented to show that the interleaved data format and incremental learning improve the performance of the GP classifier.
We also show that the GPCEs evolved with an arithmetic function set are able to track variation in the input better than GPCEs evolved with function sets containing logical and nonlinear elements. Hence, we have used arithmetic function set, incremental learning, and interleaved data format to evolve GPCEs in our simulations. AS each GPCE is trained to recognize samples belonging to its own class and reject samples belonging to other classes a strength of association measure is associated with each GPCE to indicate the degree to which it can recognize samples belonging to its own class. The strength of association measures are used for assigning a class to an input feature vector. To reduce misclassification of samples, we also show how heuristic rules can be generated in the GP framework unlike in either MLC or the neural network classifier. We have also studied the scalability and generalizing ability of the GP classifier by varying the number of classes.
We also analyse the performance of the GP classifier by considering the well-known Iris data set. We compare the performance of classification rules generated from the GP classifier with those generated from neural network classifier, (24.5 method and fuzzy classifier for the Iris data set. We show that the performance of GP is comparable to other classifiers for the Iris data set. We notice that the classification rules can be generated with very little post-processing and they are very similar to the rules generated from the neural network and C4.5 for the Iris data set.
Incremental learning influences the number of generations available for GP to learn the data distribution of classes whose d is -1 in the interleaved data format. This is because the samples belonging to the true class (desired output d is +1) are alternately placed between samples belonging to other classes i.e., they are repeated to balance the training set in the interleaved data format. For example, in the evolution of GPCE for class# l, the fitness function can be fed initially with samples of class#:! and subsequently with the samples of class#3, class#4 and class#. So in the evaluation of the fitness function, the samples of class#kt5 will not be present when the samples of class#2 are present in the initial stages. However, in the later stages of evolution, when samples of class#5 are fed, the fitness function will utilize the samples of both class#2 and class#5. As learning in evolutionary computation is guided by the evaluation of the fitness function, GPCE# l gets lesser number of generations to learn how to reject data of class#5 as compared to the data of class#2. This is because the termination criterion (i.e., the maximum number of generations) is defined a priori.
It is clear that there are (n-l)! Ways of ordering the samples of classes whose d is -1 in the interleaved data format. Hence a heuristic is presented to determine a possible order to feed data of different classes for the GPCEs evolved with incremental learning and interleaved data format. The heuristic computes an overlap index for each class based on its spatial spread and distribution of data in the region of overlap with respect to other classes in each feature. The heuristic determines the order in which classes whose desired output d is –1 should be placed in each GPCE-specific training set for the interleaved data format. This ensures that GP gets more number of generations to learn about the data distribution of a class with higher overlap index than a class with lower overlap index. The ability of the GP classifier to learn the data distributions depends upon the number of classes and the spatial spread of data. As the number of classes increases, the GP classifier finds it difficult to resolve between classes. So there is a need to partition the feature space and identify subspaces with reduced number of classes. The basic objective is to divide the feature space into subspaces and hence the data set that contains representative samples of n classes into subdata sets corresponding to the subspaces of the feature space, so that some of the subdata sets/spaces can have data belonging to only p classes (p < n). The GP classifier is then evolved independently for the subdata sets/spaces of the feature space. This results in localized learning as the GP classifier has to learn the data distribution in only a subspace of the feature space rather than in the entire feature space. By integrating the GP classifier with feature space partitioning (FSP), we improve classification accuracy due to localized learning.
Although serial computers have increased steadily in their performance, the quest for parallel implementation of a given task has continued to be of interest in any computationally intensive task since parallel implementation leads to a faster execution than a serial implementation
As fitness evaluation, selection strategy and population structures are used to evolve a solution in GP, there is scope for a parallel implementation of GP classifier. We have studied distributed GP and massively parallel GP for our approach to GP-based multicategory pattern classification. We present experimental results for distributed GP with Message Passing Interface on IBM SP2 to highlight the speedup that can be achieved over the serial implementation of GP. We also show how data parallelism can be used to further speed up fitness evaluation and hence the execution of the GP paradigm for multicategory pat tern classification.
We conclude that GP can be applied to n-category pattern classification and its potential lies in its simplicity and scope for parallel implementation. The GP classifier developed in this thesis can be looked upon as an addition to the earlier statistical, neural and fuzzy approaches to multicategory pattern classification.