| dc.description.abstract | In this work, we present a scheme for selecting optimal prototypes from large data sets as part of the data mining process. Data mining is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information, such as knowledge rules, constraints, and regularities, from data in databases. The prototypes are chosen so that they suffice to classify any new input pattern with reasonably high classification accuracy. Such representative patterns are also adequate for generating association rules. Handwritten character data is used for all the exercises in the work. The prototypes are selected using medoids and leaders. A medoid is the most centrally located pattern in a cluster. Both the medoids and the leaders of a cluster are members of that cluster. After selecting an initial set of prototypes that provides high classification accuracy, evolutionary algorithms are used to compute the optimal set of prototypes. Further, the dimensionality of the optimal prototypes is reduced through optimal feature selection using evolutionary algorithms, so that each optimal prototype is represented using a minimum number of features. The entire work can be summarized in five stages, viz., (1) data pre-processing, (2) selection of representative patterns using medoids and leaders, (3) optimal prototype selection using Steady State Genetic Algorithms (SSGA), (4) optimal feature selection using SSGA, and (5) association rule generation for classification. The work addresses the challenges in (1) clustering large data sets, (2) demonstrating the utility of medoids, leaders, and their variants in finding prototypes, (3) using SSGA for learning, and (4) evolving a general procedure to deal with large data sets of labeled patterns within the framework of Knowledge Discovery in Databases (KDD). 
A large data set of handwritten digits is used to conduct all the experiments reported in the thesis. Selecting prototypes by means of medoids and leaders provided good classification accuracy (CA). Computing medoids is expensive in terms of computation time, whereas computing leaders is less time-consuming. Among the alternative approaches for reducing the number of prototypes, distance-threshold-based reduction provided the best results in terms of both CA and reduction in the number of prototypes. Among the optimal prototype selection approaches, the Steady State Genetic Algorithm for medoid selection, which maps medoids one-to-one to the alleles of a chromosome, provided the smallest number of prototypes. The CA obtained for optimal leaders is the highest. Optimal feature selection using SSGA provided good results. Together, the optimal prototypes, each represented by an optimal feature set, help in generating effective association rules. Such a scheme would classify any new pattern with high classification accuracy. | |