dc.description.abstract | Frequent Pattern mining is a popular area of data mining, where the goal is to unearth interesting patterns that occur often in a given data. A pattern is a local structure that captures correlations and dependencies present in the elements of the data. Pattern discovery refers to the problem of extracting such patterns of interest from data. A popular way of defining the interestingness of a pattern is the frequency of its occurrence in the data, leading to the topic of frequent pattern mining. Depending on the type of data being considered, there are many ways to define a pattern. In this thesis, we focus on frequent episode mining, where we are interested in temporal patterns called episodes that occur often in a large sequential data.
In the framework of frequent episode mining, the data is abstractly represented as a single long sequence of events. Each event is a tuple (E_i,t_i), where E_i represents the event-type and t_i refers to the time of occurrence of the event.
The patterns of interest are called episodes and each episode is a collection of event-types constrained by a partial order. An episode is said to occur in the data if there exists a subset of events in the data whose event-types match with the event-types of the episode and the timestamps of those events in the data obey the partial order constraint of the episode. In the literature, many
different notions of the frequency of an episode are defined based on the types of occurrences that would be counted. The goal is to discover all episodes whose frequencies exceed a user-defined threshold. Depending on the nature of the partial order, different types of episodes are distinguished. For serial episodes, the underlying partial order is a total order; for parallel episodes the partial order is empty. For general episodes, no restrictions are imposed on the partial order. Episode mining algorithms proposed in the literature employ either Breadth-First Search (BFS) or Depth-First Search (DFS) to explore the space of all possible episodes.
This thesis presents some novel algorithms for discovering frequent episodes and also some novel approaches for assessing statistical significance of discovered episodes.
For the case of serial and parallel episodes, there exist many efficient mining techniques under both the BFS and the DFS approaches. But for the case of general episodes, the few algorithms that are currently available mostly follow the BFS approach and are not very efficient. In the first part of the thesis, we propose a new DFS based algorithm for two special subclasses of general episodes, namely, injective general episodes and chain episodes. DFS algorithms in general consist of two steps: the pattern extension step and the frequency counting step. Our algorithm includes a novel method to efficiently extend any given episode, and we formally prove that we do not miss any potentially frequent episode in the episode space and no episode is generated twice. Our algorithm also includes a novel method for determining the occurrences of the child episodes from those of the parent episode which improves the efficiency of the frequency counting step. We illustrate the efficiency of our novel pattern growth approach over the existing algorithms through various simulations involving both synthetic and real data sets.
In the second part of the thesis, we address episodes with a more complicated structure. Here the data sequence is allowed to contain multiple event-types occurring at the same timestamp. Unlike the traditional episodes, the patterns here deal with temporal dependencies over sets of event-types. Such patterns are termed as complex episodes or episodes with simultaneous events. In this work, we focus on serial episodes with simultaneous events, where
each node of the episode is associated with a set of event-types (in contrast to traditional episodes where each node is mapped to a single event-type). Thus the size of the episode has two components: the number of nodes called the length and the number of event-types associated with each node called the width. Extending any episode in both length and width is a challenging task. We present a novel BFS approach for mining serial episodes with simultaneous events that efficiently generates candidate episodes without generating duplicates. We provide a formal proof of correctness of our algorithm. We also propose an efficient way of counting frequencies of episodes. For pruning redundant episodes, we include an efficient frequency closure check as a post-processing step. Through simulations on both synthetic and real data sets, we demonstrate the effectiveness and efficiency of our proposed algorithm.
One of the main issues of frequent pattern mining is the huge number of frequent patterns returned by the pattern mining algorithms. Many of them are redundant and/or uninteresting. One of the methods to remove many of the spurious or uninteresting episodes is to assess the statistical significance of the discovered episodes. The goal is to deduce the threshold on frequency for each episode individually on a statistical basis. The later parts of the thesis deal with such significance analysis in the framework of hypothesis testing.
In the third part of the thesis, we consider significance analysis under the non-overlapped occurrences (NO) based frequency. For this, we need to choose a null-model and provide techniques for determining a threshold on frequency for rejecting the null-model at a given level of significance. There are existing algorithms assuming iid-null model. In this thesis, we present a significance analysis with NO frequency under a markov-null hypothesis. We use a central limit theorem for renewal processes to derive an approximate distribution of NO frequency of a given episode under a markov null model, thereby deriving an episode-specific threshold on the frequency. Our method is applicable to all serial episodes and also to general episodes with a unique maximal element. Through simulation studies, we demonstrate the effectiveness of our technique.
In the fourth part of the thesis, we address the problem of assessing the significance of episodes under minimal occurrences (MO) based frequency. Currently, there are no techniques for characterizing distribution of minimal occurrences even under an iid-null model. We present a method for approximating the distribution of MO frequency under an iid-null model using which we can calculate the threshold on frequency for an episode to be significant. This task is quite challenging since minimal occurrences can overlap. We adopt some results on the asymptotic distribution of overlapping occurrences of regular expression for approximating the distribution of minimal window frequency of episodes. We also propose a novel way of efficiently calculating the needed frequency thresholds. This is the first method for assessing statistical significance using MO frequency as the test statistic. Through simulations on synthetic data, we demonstrate the effectiveness of our method.
Finally, we conclude the thesis by pointing out some interesting extensions possible for the work reported in this thesis. | en_US |