Knowledge-based mining of multi-database for associations
Abstract
Knowledge Discovery in Databases (KDD) uses knowledge engineering tools and database technology to extract hidden patterns from databases. Data mining is a step in the KDD process that finds useful patterns in the data. There are many categories of data mining, and one of the most important among them is association rule mining (ARM). The ARM activity establishes an association between two non-overlapping sets of frequently occurring values in a database. Most ARM activity addressed in the literature is based on a single transaction database.
In real-world applications, we have a large collection of data organized in several databases. Mining relations across different databases might reveal useful hidden associations. For example, data about a person’s behavior is distributed across several organizations. More specifically:
CUSTOMER database describes a person’s purchase behavior.
EMPLOYEE database gives professional information.
PATIENT database provides medical history.
AIRLINE database gives travel details.
There may be an association between salary, region of living, mode of traveling, and disease; for instance, people who have a salary in the range of US$10,000–US$20,000, eat frequently at CITY X hotels, and travel by air in executive class may have cardiovascular disease, where CITY X is marked with a high pollution rating. These hidden associations can be discovered if we mine across multiple databases.
Most ARM algorithms are data-driven and hence generate a large number of associations. However, in practice, only a relevant subset of this rule set, which aids in decision-making, is of interest. Such mining activity can be performed using knowledge for mining only those rules that are useful for decision-making. The use of knowledge not only reduces the number of associations generated but also reduces the computational time and space required for mining.
The research work presented in this thesis covers a spectrum of novel methods and data structures for efficient generation of associations. The following issues are addressed in this thesis:
• Multiple Database-Oriented ARM Activity
Not enough attention is paid to mining multiple databases for associations. We call such associations inter-database associations. In this thesis, we propose and implement efficient schemes to obtain inter-database associations. The proposed schemes use domain knowledge in the form of a semantic network to efficiently link various databases.
• Knowledge-Based Mining
In ARM activity, knowledge in the form of an is-a hierarchy is used for discovering generalized rules. In this thesis, we propose a taxonomy based on functionality and a restricted way of generalizing values, and use it for efficient ARM.
• Single Scan of the Database
Data handled by ARM activity is very large; methods to handle this activity should be efficient in terms of computational resources like disk and memory. The best possible algorithm reported in the literature requires at least two database scans. In this thesis, we propose schemes that need only a single scan of the database for finding associations.
• Abstraction Generation
In this thesis, we propose schemes based on intermediate representations (structures) of the data, which we call abstractions. These structures are generated based on data and/or domain knowledge. We use these abstractions instead of the raw data in the database for mining associations. Discovery of associations using abstractions is more efficient because these structures represent data in a compact form and are appropriate for association finding.
• Dynamic Mining
ARM activity based on change of data, called incremental mining, and change of support value are described in the literature. Mining under incremental change to knowledge without having to access the original database is not reported in the literature. In this thesis, we propose novel schemes for mining that can handle changes in data, knowledge, and support value, which we call dynamic mining.
In this thesis, we report four schemes:
Extended Inverted File (EIF)-based scheme
And-Or Taxonomy (AO taxonomy)-based scheme
Pattern Count Tree (PC-tree)-based scheme
Multi-database Domain Link Network (MDLN)-based scheme
Even though all four schemes support discovery of associations across multiple databases, the PC-tree-based and AO taxonomy-based schemes are basically single-database-oriented schemes.
Based on theoretical and experimental studies, we make the following observations:
The MDLN structure-based scheme is ideally suited for mining multiple databases for discovering associations.
The scheme that exploits knowledge in the best possible way is the AO taxonomy-based scheme. Knowledge helps reduce the number of candidates generated and the number of database scans required.
The PC-tree-based scheme is ideally suited for dynamic mining.

