Effective Characterization of Sequence Data through Frequent Episodes

Ibrahim, A

View/Open

G27234.pdf (1.357Mb)

Date

2018-08-14

Author

Ibrahim, A

Metadata

Show full item record

Abstract

Pattern discovery is an important area of data mining referring to a class of techniques designed for the extraction of interesting patterns from the data. A pattern is some kind of a local structure that captures correlations and dependencies present in the elements of the data. In general, pattern discovery is about finding all patterns of `interest' in the data and a popular measure of interestingness for a pattern is its frequency of occurrence in the data. Thus the problem of frequent pattern discovery is to find all patterns in the data whose frequency of occurrence exceeds some user defined threshold. However, frequency of a pattern is not the only measure for finding patterns of interest and there also exist other measures and techniques for finding interesting patterns. This thesis is concerned with efficient discovery of inherent patterns from long sequence (or temporally ordered) data. Mining of such sequentially ordered data is called temporal data mining and the temporal patterns that are discovered from large sequential data are called episodes. More specifically, this thesis explores efficient methods for finding small and relevant subsets of episodes from sequence data that best characterize the data. The thesis also discusses methods for comparing datasets, based on comparing the sets of patterns representing the datasets. The data in a frequent episode discovery framework is abstractly viewed as a single long sequence of events. Here, the event is a tuple, (Ei; ti), where Ei is referred to as an event-type (taking values from a finite alphabet set) and ti is the time of occurrence. The events are ordered in the non-decreasing order of the time of occurrence. The pattern of interest in such a sequence is called an episode, which is a collection of event-types with a partial order defined over it. In this thesis, the focus is on a special type of episode called serial episode, where there is a total order defined among the collection of event-types representing the episode. The occurrence of an episode is essentially a subset of events from the data whose event-types match the set of eventtypes associated with the episode and the order in which they occur conforms to the underlying partial order of the episode. The frequency of an episode is some measure of how often it occurs in the event stream. Many different notions of frequency have been defined in literature. Given a frequency definition, the goal of frequent episode discovery is to unearth all episodes which have a frequency greater than a user-defined threshold. The size of an episode is the number of event-types in the episode. An episode β is called a subepisode of another episode β, if the collection of event-types of β is a subset of the corresponding collection of α and the event-types of β satisfy the same partial order relationships present among the corresponding event-types of α. The set of all episodes can be arranged in a partial order lattice, where each level i contains episodes of size i and the partial order is the subepisode relationship. In general, there are two approaches for mining frequent episodes, based on the way one traverses this lattice. The first approach is to traverse this lattice in a breadth-first manner, and is called the Apriori approach. The other approach is the Pattern growth approach, where the lattice is traversed in a depth-first manner. There exist different frequency notions for episodes, and many Apriori based algorithms have been proposed for mining frequent episodes under the different frequencies. However there do not exist Pattern-growth based methods for many of the frequency notions. The first part of the thesis proposes new Pattern-growth methods for discovering frequent serial episodes under two frequency notions called the non-overlapped frequency and the total frequency. Special cases, where certain additional conditions, called the span and gap constraints, are imposed on the occurrences of the episodes are also considered. The proposed methods, in general, consist of two steps: the candidate generation step and the counting step. The candidate generation step involves finding potential frequent episodes. This is done by following the general Pattern growth approach for finding the candidates, which is the depth-first traversal of the lattice of all episodes. The second step, which is the counting step, involves counting the frequencies of the episodes. The thesis presents efficient methods for counting the occurrences of serial episodes using occurrence windows of subepisodes for both the non-overlapped and total frequency. The relative advantages of Pattern-growth approaches over Apriori approaches are also discussed. Through detailed simulation results, the effectiveness of this approach on a host of synthetic and real data sets is shown. It is shown that the proposed methods are highly scalable and efficient in runtime as compared to the existing Apriori approaches. One of the main issues in frequent pattern mining is the huge number of frequent patterns, returned by the discovery methods, irrespective of the approach taken to solve the problems. The second part of this thesis, addresses this issue and discusses methods of selecting a small subset of relevant episodes from event sequences. There have been a few approaches, discussed in the literature, for finding a small subset of patterns. One set of methods are information theory based methods, where patterns that provide maximum information are searched for. Another approach is the Minimum Description Length (MDL) principle based summarization schemes. Here the data is encoded using a subset of patterns (which forms the model for the data) and its occurrences. The subset of patterns that has the maximum efficiency in encoding the data is the best representative model for the data. The MDL principle takes into account both the encoding efficiency of the model as well as model complexity. A method, called Constrained Serial episode Coding(CSC), is proposed based on the MDL principle, which returns a highly relevant, non-redundant and small subset of serial episodes. This also includes an encoding scheme, where the model representation and the encoding of the data are efficient. An interesting feature of this algorithm for isolating a small set of relevant episodes is that it does not need a user-specified threshold on frequency. The effectiveness of this method is shown on two types of data. The first is data obtained from a detailed simulator for a reconfigurable coupled conveyor system. The conveyor system consists of different intersecting paths and packages flow through such a network. Mining of such data can allow one to unearth the main paths of package ows which can be useful in remote monitoring and visualization of the system. On this data, it is shown that the proposed method is able to return highly consistent sub paths, in the form of serial episodes, with great encoding efficiency as compared to other known related sequence summarization schemes, like SQS and GoKrimp. The second type of data consists of a collection of multi-class sequence datasets. It is shown that the selected episodes from the proposed method form good features in classi cation. The proposed method is compared with SQS and GoKrimp, and it is shown that the episodes selected by this method help in achieving better classification results as compared to other methods. The third and nal part of the thesis discusses methods for comparing sets of patterns representing different datasets. There are many instances when one is interested in comparing datasets. For example, in streaming data, one is interested in knowing whether the characteristics of the data are the same or have changed significantly. In other cases, one may simply like to compare two datasets and quantify the degree of similarity between them. Often, data are characterized by a set of patterns as described above. Comparing sets of patterns representing datasets gives information about the similarity/dissimilarity between the datasets. However not many measures exist for comparing sets of patterns. This thesis proposes a similarity measure for comparing sets of patterns which in turn aids in comparison of di erent datasets. First, a kernel for comparing two patterns, called the Pattern Kernel, is proposed. This kernel is proposed for three types of patterns: serial episodes, sequential patterns and itemsets. Using this kernel, a Pattern Set Kernel is proposed for comparing different sets of patterns. The effectiveness of this kernel is shown in classification and change detection. The thesis concludes with a summary of the main contributions and some suggestions for extending the work presented here.

URI

https://etd.iisc.ac.in/handle/2005/3969

Collections

Electrical Engineering (EE) [412]