dc.description.abstract | Pattern discovery is an important area of data mining, referring to a class of techniques designed for the extraction of interesting patterns from data. A pattern is some kind of local structure that captures correlations and dependencies present in the elements of the data. In general, pattern discovery is about finding all patterns of `interest' in the data, and a popular measure of interestingness for a pattern is its frequency of occurrence in the data. Thus, the problem of frequent pattern discovery is to find all patterns in the data whose frequency of occurrence exceeds some user-defined threshold. However, frequency is not the only measure for finding patterns of interest, and other measures and techniques for finding interesting patterns also exist.
This thesis is concerned with efficient discovery of inherent patterns from long
sequence (or temporally ordered) data. Mining of such sequentially ordered data is
called temporal data mining and the temporal patterns that are discovered from large
sequential data are called episodes. More specifically, this thesis explores efficient
methods for finding small and relevant subsets of episodes from sequence data that
best characterize the data. The thesis also discusses methods for comparing datasets,
based on comparing the sets of patterns representing the datasets.
The data in a frequent episode discovery framework is abstractly viewed as a single
long sequence of events. Here, an event is a tuple (Ei, ti), where Ei is referred to as an event-type (taking values from a finite alphabet) and ti is the time of occurrence.
The events are ordered in non-decreasing order of their times of occurrence. The
pattern of interest in such a sequence is called an episode, which is a collection of
event-types with a partial order defined over it. In this thesis, the focus is on a special
type of episode called a serial episode, where there is a total order defined among the
collection of event-types representing the episode. The occurrence of an episode is
essentially a subset of events from the data whose event-types match the event-types associated with the episode and whose order of occurrence conforms to the underlying partial order of the episode. The frequency of an episode is some measure of how often it occurs in the event stream. Many different notions of frequency have been defined in the literature. Given a frequency definition, the goal of frequent episode discovery is to unearth all episodes whose frequency exceeds a user-defined threshold. The size of an episode is the number of event-types in the episode. An episode β is called a subepisode of another episode α if the collection of event-types of β is a subset of the corresponding collection of α and the event-types of β satisfy the same partial order relationships present among the corresponding event-types of α.
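As a concrete illustration of these definitions, the following sketch (in Python, with a toy alphabet and a hypothetical helper name; it is not an algorithm from the thesis) represents an event sequence as a list of (event-type, time) tuples and checks whether a chosen subset of events is an occurrence of a serial episode:

```python
# Toy event sequence: (event_type, time) tuples in non-decreasing time order.
events = [("A", 1), ("B", 2), ("A", 3), ("C", 5), ("B", 6), ("C", 7)]

# A serial episode is a totally ordered collection of event-types, e.g. A -> B -> C.
episode = ("A", "B", "C")

def is_occurrence(episode, candidate):
    """Return True if `candidate` (a time-ordered subset of events) is an
    occurrence of the serial episode: the event-types must match and the
    times must respect the episode's total order."""
    if len(candidate) != len(episode):
        return False
    types_match = all(etype == ev[0] for etype, ev in zip(episode, candidate))
    times_ordered = all(candidate[i][1] < candidate[i + 1][1]
                        for i in range(len(candidate) - 1))
    return types_match and times_ordered

print(is_occurrence(episode, [events[0], events[1], events[3]]))  # True: (A,1),(B,2),(C,5)
print(is_occurrence(episode, [events[2], events[1], events[3]]))  # False: times out of order
```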
The set of all episodes can be arranged in a partial order lattice, where each level
i contains episodes of size i and the partial order is the subepisode relationship. In
general, there are two approaches for mining frequent episodes, based on the way one
traverses this lattice. The first approach is to traverse this lattice in a breadth-first
manner, and is called the Apriori approach. The other is the Pattern-growth approach, where the lattice is traversed in a depth-first manner. There exist different frequency notions for episodes, and many Apriori-based algorithms have been proposed for mining frequent episodes under the different frequencies. However, Pattern-growth based methods do not exist for many of these frequency notions.
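To make the distinction concrete, the following minimal sketch (hypothetical function names and a toy frequency table, not the algorithms developed in this thesis) shows how a Pattern-growth style miner grows serial episodes one event-type at a time in a depth-first manner, pruning extensions whose frequency falls below the threshold; an Apriori style miner would instead generate the full set of size-(i+1) candidates from the frequent size-i episodes, level by level:

```python
# Sketch of depth-first (Pattern-growth style) exploration of the serial-episode
# lattice.  `count_freq` can be any concrete frequency measure (non-overlapped,
# total, ...) supplied by the caller.
def mine_depth_first(prefix, alphabet, count_freq, threshold, results=None):
    """Grow `prefix` one event-type at a time, recursing only on frequent extensions."""
    if results is None:
        results = {}
    for e in alphabet:
        candidate = prefix + (e,)
        freq = count_freq(candidate)
        if freq >= threshold:          # infrequent extensions are pruned immediately
            results[candidate] = freq
            mine_depth_first(candidate, alphabet, count_freq, threshold, results)
    return results

# Toy usage: frequencies read from a fixed table, unknown episodes get 0.
toy = {("A",): 5, ("B",): 4, ("A", "B"): 3}
print(mine_depth_first((), "ABC", lambda ep: toy.get(ep, 0), threshold=3))
# -> {('A',): 5, ('A', 'B'): 3, ('B',): 4}
```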
The first part of the thesis proposes new Pattern-growth methods for discovering
frequent serial episodes under two frequency notions called the non-overlapped frequency
and the total frequency. Special cases, in which additional conditions called span and gap constraints are imposed on the occurrences of the episodes, are also considered. The proposed methods, in general, consist of two steps: the candidate generation step and the counting step. The candidate generation step involves finding potential frequent episodes; this is done by following the general Pattern-growth approach of depth-first traversal of the lattice of all episodes. The second step involves counting the frequencies of the candidate episodes. The thesis presents efficient methods for counting
the occurrences of serial episodes using occurrence windows of subepisodes for both
the non-overlapped and total frequency. The relative advantages of Pattern-growth
approaches over Apriori approaches are also discussed. Through detailed simulation
results, the effectiveness of this approach is demonstrated on a host of synthetic and real datasets. The proposed methods are shown to be highly scalable and efficient in runtime as compared to the existing Apriori approaches.
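For concreteness, one simple baseline for the non-overlapped frequency is a greedy left-to-right scan, sketched below; this illustrative fragment ignores span and gap constraints and is not the occurrence-window based counting scheme developed in the thesis:

```python
def count_non_overlapped(episode, events):
    """Greedily count non-overlapped occurrences of a serial episode (a tuple of
    event-types) in an event sequence of (event_type, time) tuples.  After one
    occurrence completes, the scan restarts so that occurrences share no events."""
    count, pos = 0, 0
    for etype, _t in events:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):    # one complete occurrence found
                count += 1
                pos = 0                # look for the next, non-overlapping occurrence
    return count

events = [("A", 1), ("B", 2), ("A", 3), ("C", 5), ("B", 6), ("C", 7)]
print(count_non_overlapped(("A", "B", "C"), events))   # 1
```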
One of the main issues in frequent pattern mining is the huge number of frequent patterns returned by the discovery methods, irrespective of the approach taken to solve the problem. The second part of this thesis addresses this issue and discusses methods for selecting a small subset of relevant episodes from event sequences. A few approaches for finding a small subset of patterns have been discussed in the literature. One class of methods is based on information theory, where one searches for patterns that provide maximum information. Another approach uses summarization schemes based on the Minimum Description Length (MDL) principle. Here, the data is encoded using a subset of patterns (which forms the model for the data) and their occurrences. The subset of patterns that has the maximum efficiency in encoding
the data is the best representative model for the data. The MDL principle takes into
account both the encoding efficiency of the model and the model complexity. A
method, called Constrained Serial episode Coding (CSC), is proposed based on the
MDL principle, which returns a highly relevant, non-redundant and small subset of
serial episodes. This also includes an encoding scheme in which both the model representation and the encoding of the data are efficient. An interesting feature of this algorithm for isolating a small set of relevant episodes is that it does not need a user-specified frequency threshold. The effectiveness of this method is shown on two types of data. The first is data obtained from a detailed simulator for a reconfigurable coupled conveyor system. The conveyor system consists of different intersecting paths, and packages flow through such a network. Mining of such data can allow one to unearth the main paths of package flows, which can be useful in remote monitoring
and visualization of the system. On this data, it is shown that the proposed method
is able to return highly consistent sub-paths, in the form of serial episodes, with better encoding efficiency than other known sequence summarization schemes, such as SQS and GoKrimp. The second type of data consists of a collection
of multi-class sequence datasets. The episodes selected by the proposed method are shown to form good features for classification. The proposed method is compared
with SQS and GoKrimp, and the episodes selected by this method help achieve better classification results as compared to the other methods.
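As a rough illustration of the MDL idea used in this part of the thesis, the sketch below computes a toy two-part description length for a candidate set of episodes (hypothetical cost model and function name; the actual CSC encoding scheme is different and more refined):

```python
import math

def description_length(episodes, occurrence_counts, n_events, alphabet_size):
    """Toy two-part MDL cost: model cost = bits to spell out each chosen episode;
    data cost = bits to reference episode occurrences plus bits for the events
    left uncovered, each written as a bare symbol.  Smaller is better."""
    sym_bits = math.log2(alphabet_size)                  # bits per event-type symbol
    model_cost = sum(len(ep) * sym_bits for ep in episodes)
    covered = sum(len(ep) * occurrence_counts[ep] for ep in episodes)
    ref_bits = math.log2(len(episodes)) if len(episodes) > 1 else 1.0
    data_cost = (sum(occurrence_counts.values()) * ref_bits   # references to episodes
                 + (n_events - covered) * sym_bits)           # residual singleton events
    return model_cost + data_cost

# A model whose episodes cover the data compactly receives a smaller total length.
counts = {("A", "B", "C"): 40, ("D", "E"): 25}
print(description_length(list(counts), counts, n_events=300, alphabet_size=26))
```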
The third and final part of the thesis discusses methods for comparing sets of patterns representing different datasets. There are many instances when one is interested in comparing datasets. For example, in streaming data, one would like to know whether the characteristics of the data have remained the same or have changed significantly.
In other cases, one may simply like to compare two datasets and quantify the degree
of similarity between them. Often, data are characterized by a set of patterns as
described above. Comparing sets of patterns representing datasets gives information
about the similarity/dissimilarity between the datasets. However, not many measures
exist for comparing sets of patterns. This thesis proposes a similarity measure for
comparing sets of patterns, which in turn aids in the comparison of different datasets.
First, a kernel for comparing two patterns, called the Pattern Kernel, is proposed.
This kernel is proposed for three types of patterns: serial episodes, sequential patterns and itemsets. Using this kernel, a Pattern Set Kernel is proposed for comparing
different sets of patterns. The effectiveness of this kernel is shown in classification and
change detection. The thesis concludes with a summary of the main contributions and some suggestions for extending the work presented here. | en_US |