×

Decomposition methodology for knowledge discovery and data mining. Theory and applications. (English) Zbl 1122.68042

Series in Machine Perception and Artificial Intelligence 61. River Edge, NJ: World Scientific (ISBN 981-256-079-3/hbk). xx, 323 p. (2005).
Data mining is the technology of exploring data in order to discover previously unknown patterns. The recent growth of the field has made several mature approaches available to practitioners. However, there is a sort of “impedance mismatch”: such approaches should be customized to real-world complex problems, but, according to the “no free lunch theorem”, no technique outperforms the others in all possible domains.
The book is devoted to the study of decomposition methodologies for data mining. The data mining problem the authors are interested in is predictive modeling. Predictive modeling can be described as the problem of providing a good approximation of an unknown function by exploiting an example set where the value of the function is known. Roughly speaking, given a tuple with an unknown attribute value, we would like to find a good predictor for that value, given the values of the other attributes. When the value to predict is discrete, the predictive modeling problem is also denoted as “classification”. The main subject of study in this book is how to decompose a classification problem into several, smaller problems. Hopefully, such subproblems are easier to solve by using the existing tools, and likely their combination provides a solution for the original classification problem. The benefits of decomposition include: increased accuracy, easier interpretability of the results, and scalability to large databases.
The book proposes five main decomposition approaches: Feature set decomposition, Space decomposition, Sample decomposition, Function decomposition, and Concept decomposition.
Feature set decomposition is a sort of generalization of the feature selection task, extensively used in data mining. The essence is that the original feature set is decomposed into several subsets. Then, an inducer is trained independently for each feature set. Then, unlabelled instances can be classified by combining the predictions of the resulting inducers. Within the book, three main algorithms that search for the most promising decompositions are described.
Space decomposition subdivides the original feature space into several subspaces, where specific inducers are easier to train. Within the book, this is approached by resorting to clustering: tuples in the original training set are clustered in order to discover homogeneities, which likely influence the prediction. The classification of a new unlabelled tuple is hence accomplished by first associating it with the cluster of membership, and by next exploiting the classifier associated with that cluster.
Sample decomposition, essentially consists in partitioning the original training set into several subsets. A classifier is then trained for each subset, and again a combination procedure is applied in order to define the general predictor. Again, the authors study a methodology based on clustering in order to define the sampling process.
Function decomposition is a generalization of the preprocessing step in data mining: new attributes are computed from the ones available, which hopefully allow us to better characterize the target concept. The book here provides a high-level overview of the current literature, and provides a specific preprocessing algorithm to combine more attributes into a single attribute.
Finally, Concept decomposition aims essentially at providing a hierarchical classification methodology: target concepts can be aggregated in order to improve the classification accuracy at a first stage. A refined classification can hence be accomplished in a second step, by working on a subset of the target concepts.
The authors concentrate on four main issues:
1. What kind of decomposition methods exist;
2. Which elementary decomposition type performs best for which problem;
3. How we could infer the best decomposition structure automatically;
4. How the sub-problems should be recomposed.
It is clear that objectives 1, 3, and 4 are related. In fact, the description of the classes of decomposition methodologies is exhaustive in the book. Each decomposition methodology is approached and discussed in a different chapter. However, a little concern to me is the apparent lack of uniform treatment throughout the chapters: some of them (e.g., feature set decomposition) are thoroughly detailed and analyzed theoretically, while other chapters propose specific algorithms, which are presented and analyzed only experimentally. It seems that a general study about the issues related to each decomposition methodology is omitted, and instead each chapter presents some ad-hoc solutions from the current literature. The same problem can be found in the introductory chapters on data mining, where some classification methodologies are detailed with high emphasis, whereas other techniques are simply outlined.
Also, I did not found a methodological analysis that justifies the choice for a decomposition methodology rather than another. Chapter 11 proposes a study where, starting from the empirical results obtained on some standard benchmark datasets, a classifier is trained that evaluates the characteristics of a dataset and chooses the best method. However, again this is a nice specific experiment, which cannot be generalized to a methodology.
To summarize, I found the book an interesting survey of the decomposition problems in data mining. For sure, data mining experts and practitioners shall appreciate such a survey: decomposition techniques can indeed be applied in a variety of problems where obtaining an accurate predictor is a complex and somehow unachievable task. In such a context, the book presents an interesting an pleasant introduction to the topic, which can guide the reader to the knowledge of the existing decomposition methods, and to the choice of those most promising for the problems he has to face. However, the book does not provide a systematic study of these issues. In particular, it does not guide to the solution of a generic classification problem by suggesting, on the basis of the specific characteristics of the given problem, the best fitting decomposition methodology. Also, it does not suggest application fields where decomposition techniques can profitably be applied.

MSC:

68P05 Data structures
68T05 Learning and adaptive systems in artificial intelligence
68-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to computer science
68P15 Database theory
68T10 Pattern recognition, speech recognition