Knowledge-based preanalysis for multilevel clustering
Abstract
Multilevel clustering offers the cluster analyst flexibility of choosing different algorithms at different levels with the possibility of reducing overall computational effort in comparison with the single-level application of a clustering algorithm.
Even if the same algorithm is used at different levels, computational savings may occur because of the division of data at any level into blocks for applying the algorithm on individual blocks separately and collecting the samples from the clusters to form data for the next level.
If the computer's main memory is a limitation for the use of a versatile algorithm like hierarchical agglomerative clustering, multilevel approach becomes inescapable.
In the first part of this thesis, the computational properties of two-level clustering algorithms are analysed.
In particular, it is shown that in the case of two-level hierarchical algorithm, solution of a cubic equation together with a simple rounding procedure gives optimal number of blocks at the first level to minimize the number of similarity computations.
A more general formulation of multilevel clustering which allows for unequal size blocks is introduced.
And a constructive proof for the existence of an optimal scheme of division is given.
In addition to the Prolog program for solving the cubic equation, a Prolog program which calculates the order of time-complexity for hybrid algorithm is given.
It is also proved that two-level hierarchical clustering is optimal if and only if the number of data items is a product of two prime numbers.
In the second part, the problem of optimal number of levels for a multilevel clustering algorithm is solved by using horn clause specification of the optimization logic.
The advantage of horn clause specification is its closeness to a working Prolog program.
From the results of the Prolog program, a regularity is observed and a theorem is proved that the observed regularity is true in all cases.
Similar theorem is proved for the multilevel hierarchical algorithm with unequal size blocks restricted the following way: the number of blocks at any level is less than or equal to half of the number of blocks at the previous level.
Pre-analysis for multilevel clustering involves not only the choices of the number of blocks at each level but also qualitative reasoning about the appropriateness of algorithms.
Qualitative reasoning must be modeled in computer in order to make a knowledge-based approach to preprocessing for multilevel clustering possible.
A rule interpreter is built in Prolog using forward chaining.
The rule interpreter coupled with the horn clause specification of general framework for algorithm selection forms a prototype knowledge system.
Applicability to clustering algorithm selection is illustrated by means of an example.
Possibilities of extending this work and suggestions for improvement to overcome the limitations of this study are also given.

