Simple item record

dc.contributor.advisor: Narasimha Murty, M
dc.contributor.author: Babu, T Ravindra
dc.date.accessioned: 2009-03-20T11:36:35Z
dc.date.accessioned: 2018-07-31T04:39:34Z
dc.date.available: 2009-03-20T11:36:35Z
dc.date.available: 2018-07-31T04:39:34Z
dc.date.issued: 2009-03-20T11:36:35Z
dc.date.submitted: 2006
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/440
dc.description.abstract: Data Mining deals with extracting valid, novel, potentially useful, easily understood, and general abstractions from large data. Data is large when the number of patterns, the number of features per pattern, or both are large; such largeness is characterized by a size that exceeds the main-memory capacity of a computer. Data Mining is an interdisciplinary field involving database systems, statistics, machine learning, visualization, and computational aspects, and the focus of its algorithms is scalability and efficiency. Clustering and classification of large data are important Data Mining activities. Clustering algorithms are predominantly iterative, requiring multiple scans of the dataset, which is very expensive when the data is stored on disk.

In the current work we propose different schemes that have both theoretical validity and practical utility in dealing with such large data. The schemes broadly encompass data compaction, classification, prototype selection, use of domain knowledge, and hybrid intelligent systems. The proposed approaches can be broadly classified as (a) compressing the data losslessly and both clustering and classifying the patterns directly in their compressed form through a novel algorithm; (b) compressing the data in a lossy fashion such that a very high degree of compression and abstraction is obtained in terms of 'distinct subsequences', and classifying the data in this compressed form to improve prediction accuracy; (c) obtaining simultaneous prototype and feature selection with the help of incremental clustering, a lossy compression scheme, and a rough set approach; (d) demonstrating that prototype selection and data-dependent techniques can reduce the number of comparisons in multiclass classification with SVMs; and (e) showing that, by making use of domain knowledge of the problem and the data under consideration, we obtain very high classification accuracy with fewer AdaBoost iterations.

The schemes have pragmatic utility. The prototype selection algorithm is incremental, requires a single scan of the dataset, and has linear time and space requirements. We provide results obtained with a large, high-dimensional handwritten (hw) digit dataset. The compression algorithm is based on simple concepts; we demonstrate that classifying the compressed data reduces the computation time by a factor of 5, with the prediction accuracy on compressed and original data being identical at 92.47%. With the proposed lossy compression scheme and pruning methods, we demonstrate that the prediction accuracy improves even when the number of distinct subsequences is reduced by a factor of 6 (690 to 106). Specifically, with the original data containing 690 distinct subsequences, the classification accuracy is 92.47%; with an appropriate choice of pruning parameters, the number of distinct subsequences reduces to 106 with a corresponding classification accuracy of 92.92%. The best classification accuracy, 93.3%, is obtained with 452 distinct subsequences. With the scheme of simultaneous feature and prototype selection, we improve the classification accuracy to 93.58%, better than that obtained with kNNC, while significantly reducing the number of features and prototypes, achieving a compaction of 45.1%.

With hybrid schemes based on SVMs, prototypes, and a domain-knowledge-based tree (KB-Tree), we demonstrate a 50% reduction in SVM training time and about a 30% reduction in testing time compared to the complete data, with classification accuracy improving to 94.75%. With AdaBoost the classification accuracy is 94.48%, better than that obtained with NNC and kNNC on the entire data; the training time is reduced because prototypes are used instead of the complete data. Another important aspect of the work is the design of a KB-Tree (with a maximum depth of 4) that classifies 10-category data in just 4 comparisons. In addition to the hw data, we apply the schemes to Network Intrusion Detection data (the 10% dataset of KDDCUP99) and demonstrate that they provide a lower overall cost than the reported values.
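The abstract does not give pseudocode for the incremental prototype selection; its stated properties (a single scan of the dataset, linear time and space) match a Leader-style clustering pass, so the following is a minimal sketch under that assumption. The threshold value, the Euclidean metric, and all names here are illustrative, not the thesis's algorithm as specified.

import numpy as np

def leader_prototypes(patterns, threshold):
    # Leader-style prototype selection: one sequential pass over the data.
    # A pattern becomes a new leader (prototype) only if it is farther than
    # `threshold` from every existing leader; otherwise it is absorbed by
    # the first leader within the threshold. Each pattern is examined once,
    # and storage grows only with the number of leaders retained.
    leaders = []
    for x in patterns:
        if not any(np.linalg.norm(x - l) <= threshold for l in leaders):
            leaders.append(x)
    return leaders

# Toy usage: two well-separated clusters yield two prototypes.
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
print(len(leader_prototypes(data, threshold=1.0)))  # -> 2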
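Similarly, the lossless scheme that clusters and classifies patterns directly in compressed form is not spelled out in the abstract. As one hedged illustration (an assumption, not necessarily the novel algorithm of the thesis), binary patterns such as binarized hw digits can be run-length encoded, and the Hamming dissimilarity needed by NNC/kNNC can be computed on the runs without ever decompressing them.

def run_length_encode(bits):
    # Encode a binary sequence as [bit, run_length] pairs.
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def hamming_on_runs(r1, r2):
    # Hamming distance between two equal-length binary patterns, computed
    # directly on their run-length encodings: walk both run lists in step,
    # adding the overlap length wherever the current bits differ.
    i = j = 0
    left1, left2 = r1[0][1], r2[0][1]
    dist = 0
    while i < len(r1) and j < len(r2):
        step = min(left1, left2)
        if r1[i][0] != r2[j][0]:
            dist += step
        left1 -= step
        left2 -= step
        if left1 == 0:
            i += 1
            if i < len(r1):
                left1 = r1[i][1]
        if left2 == 0:
            j += 1
            if j < len(r2):
                left2 = r2[j][1]
    return dist

# Toy usage: 0011 vs 0110 differ in exactly two positions.
a = run_length_encode([0, 0, 1, 1])
b = run_length_encode([0, 1, 1, 0])
print(hamming_on_runs(a, b))  # -> 2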
dc.language.iso: en_US
dc.relation.ispartofseries: G20561
dc.subject: Data Mining
dc.subject: Data Classification
dc.subject: Image Processing
dc.subject: Data Clustering
dc.subject: Data Compaction
dc.subject: Data Mining - Algorithms
dc.subject: Hybrid Intelligent Systems
dc.subject: Data Reduction
dc.subject: Data Representation
dc.subject: Hybrid Schemes
dc.subject: Hybrid Intelligent Methods
dc.subject.classification: Computer Science
dc.title: Large Data Clustering And Classification Schemes For Data Mining
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.discipline: Faculty of Engineering

