Efficient Frequent Closed Itemset Algorithms With Applications To Stream Mining And Classification
Data mining is an area to find valid, novel, potentially useful, and ultimately understandable abstractions in a data. Frequent itemset mining is one of the important data mining approaches to find those abstractions in the form of patterns. Frequent Closed itemsets provide complete and condensed information for non-redundant association rules generation. For many applications mining all the frequent itemsets is not necessary, and mining frequent Closed itemsets are adequate. Compared to frequent itemset mining, frequent Closed itemset mining generates less number of itemsets, and therefore improves the efficiency and effectiveness of these tasks. Recently, much research has been done on Closed itemsets mining, but it is mainly for traditional databases where multiple scans are needed, and whenever new transactions arrive, additional scans must be performed on the updated transaction database; therefore, they are not suitable for data stream mining. Mining frequent itemsets from data streams has many potential and broad applications. Some of the emerging applications of data streams that require association rule mining are network traffic monitoring and web click streams analysis. Different from data in traditional static databases, data streams typically arrive continuously in high speed with huge amount and changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data. Recent works on data stream mining based on sliding window method slide the window by one transaction at a time. But when the window size is large and support threshold is low, the existing methods consume significant time and lead to a large increase in user response time. In our first work, we propose a novel algorithm Stream-Close based on sliding window model to mine frequent Closed itemsets from the data streams within the current sliding window. We enhance the scalabality of the algorithm by introducing several optimization techniques such as sliding the window by multiple transactions at a time and novel pruning techniques which lead to a considerable reduction in the number of candidate itemsets to be examined for closure checking. Our experimental studies show that the proposed algorithm scales well with large data sets. Still the notion of frequent closed itemsets generates a huge number of closed itemsets in some applications. This drawback makes frequent closed itemsets mining infeasible in many applications since users cannot interpret the large volume of output (which sometimes will be greater than the data itself when support threshold is low) and may lead to an overhead to develop extra applications which post processes the output of original algorithm to reduce the size of the output. Recent work on clustering of itemsets considers strictly either expression(consists of items present in itemset) or support of the itemsets or partially both to reduce the number of itemsets. But the drawback of the above approaches is that in some situations, number of itemsets does not reduce due to their restricted view of either considering expressions or support. So we propose a new notion of frequent itemsets called clustered itemsets which considers both expressions and support of the itemsets in summarizing the output. We introduce a new distance measure w.r.t expressions and also prove the problem of mining clustered itemsets to be NP-hard. In our second work, we propose a deterministic locality sensitive hashing based classifier using clustered itemsets. Locality sensitive hashing(LSH)is a technique for efficiently finding a nearest neighbour in high dimensional data sets. The idea of locality sensitive hashing is to hash the points using several hash functions to ensure that for each function the probability of collision is much higher for objects that are close to each other than those that are far apart. We propose a LSH based approximate nearest neighbour classification strategy. But the problem with LSH is, it randomly chooses hash functions and the estimation of a large number of hash functions could lead to an increase in query time. From Classification point of view, since LSH chooses randomly from a family of hash functions the buckets may contain points belonging to other classes which may affect classification accuracy. So, in order to overcome these problems we propose to use class association rules based hash functions which ensure that buckets corresponding to the class association rules contain points from the same class. But associative classification involves generation and examination of large number of candidate class association rules. So, we use the clustered itemsets which reduce the number of class association rules to be examined. We also establish formal connection between clustering parameter(delta used in the generation of clustered frequent itemsets) and discriminative measure such as Information gain. Our experimental studies show that the proposed method achieves an increase in accuracy over LSH based near neighbour classification strategy.