Data Mining Challenges for Digital Libraries

Robert L. Grossman

University of Illinois at Chicago, Laboratory for Advanced Computing
815 S. Morgan Street, Chicago, IL 60607, USA, grossman@uic.edu
&
Magnify, Inc.
815 S. Garfield Street, Oak Park, IL 60304, USA, rlg@magnify.com

Abstract: Data mining is the automatic discovery of patterns, associations, anomalies and changes in data. We describe data mining issues for digital libraries of both textural and numerical data.

General Terms: Data Mining, Digital Libraries
Additional Key Words and Phrases: document, conceptual clustering, tabular data

Introduction

Digital libraries of textural and multi-media data are now common and will soon be ubiquitous, while digital libraries of numerical data, especially tabular data, are growing in importance. In this note, we discuss research challenges arising from the data mining of digital libraries. By data mining, we mean the automatic uncovering of patterns, associations and anomalies in data.

As a simple motivating example, each quarter, public companies in the US file a 10-Q form with the Security and Exchange Commission (SEC). Most companies file this file electronically: it contains a variety of information about the financial health of the company, including tables containing the company's balance sheet, income statement, and cash flow statement. These forms are available on the web and can be retrieved by attribute (company name, company exchange symbol, date filed, etc.) or sometimes via full text search. Note that by retrieving documents in this way only refers to data contained within a single document. These may be thought as intra-document queries.

In contrast, a data mining query is concerned with some collection of documents. This is because data mining queries are essentially statistical in nature and the statistics must be computed with respect to some (reference) set. A typical data mining query might begin by focusing on a line within a table in the document giving the revenues for the current quarter and the past five quarters. An example of an association query is to identify other attributes, both tabular and non-tabular, which have a strong correlation with those companies having the highest growth (within the reference collection of documents). As another example of a data mining query, 10-Q forms for companies with similar growth can be clustered together. Notice that both these examples make use of information between documents (the reference set) and in this sense data mining queries can be thought of as inter-document queries.

Initial work in digital libraries focused on the archiving, searching, and retrieval of documents. As digital libraries supporting this basic functionality become widely available, the next challenge is to develop algorithms and software for the analysis and mining of information available within digital libraries.

In this note, we describe in more detail mining digital libraries of tabular data and mining digital libraries of textural and multi-media documents. Finally, we conclude with a brief description of relevant mathematical techniques and directions for future research.

Mining tabular data

Digital libraries containing tabular data are becoming common. Examples range from the sale price for new homes in a region to the number of violent crimes per 100,000 per county. Viewing rows as records and columns as fields provides several natural data mining queries:

Prediction . Given enough tabular data, a predictive model can be generated to predict the numerical value of a designated field, given a new row with that field element missing.

Classification. Consider a table in which one of the columns is a categorical variable, such as a customer's level of satisfaction (A, B, C, D or F) and the other columns are measured attributes of the customer. With enough data, a classification model can be generated to predict the satisfaction of a customer, given the other attributes. It is distinguished from the problem of prediction since different statistical techniques are generally used to predict continuous and discrete attributes.

Clustering. Given one or more tables, each row can be viewed as a point in space and data can be clustered on the basis of how near they are to each other. This provides a natural means of grouping together similar rows.

Anomalies, deviations, and changes. These types of data mining queries return rows which are in some specified sense different than the other rows, such as those representing a statistically significant change from the previous rows.

Mining Textural Data

A typical query to a digital library of documents today retrieves documents either by keyword or attribute, or through a full text search. As the size and complexity of digital libraries grows, there is increasing information available between and among documents, not just within documents. Here are three examples of data mining queries on textural documents which exploit inter-document information:

Conceptual clustering . Queries can return documents which conceptually related to a specified reference document or to a specified topic. Unlike key word searches, a specified word need not appear for the document to be retrieved, simply a specified concept. For example, searching the SEC Edgar digital library for volatile might return companies described as having high risk, since both volatile and high risk would likely belong to the same semantic cluster.

Attribute-based associations . Imagine a digital library which is customized by allowing the owner to specify the significance he or she attaches to some of the documents. With this information, association queries can look for patterns involving the author, subject, date, or type of publication which the owner tends to associate with higher value. These patterns can then be used to customize the articles retrieved in response to a query.

Anomaly detection. Digital libraries can easily flood users with documents. It is important sometimes to return documents in response to a query which are anomalous in some sense and yet significant. For example, from a collection of weekly reports of interest rates, an expected range for the weekly change can easily be determined. A report which is significantly different can be flagged.

Although data mining techniques can be applied directly to textural documents, it is usually easier to compute metadata containing numerical data summarizing the textural data and to analyze statistically the meta-data. Statistical metadata can prepared from textural data in a variety of ways. The document itself can be summarized by a vector in an appropriate space or auxiliary matrix like structures can be constructed summarizing the occurrences of pairs or triples of words.

Discussion and Future Work

A variety of techniques have been used for data mining, including classification and regression trees, neural nets, Bayesian networks, and genetic algorithms. In this section, we briefly describe three fundamental research challenges in this area:

More complex data . By and large, the techniques listed at the beginning of the section apply only to numerical and categorical vectors. An important challenge is to develop algorithms and software to work with more complex data, including time series, random fields, unstructured data (such as text), and semi-structured data (such as textural files containing embedded formatting commands).
Scalability . Most general purpose data mining algorithms are limited to mining data which can fit in memory. They may sample from large data sets and databases, but, in the end, they do not scale well if out of memory data must be accessed. In addition, most data mining algorithms do not scale as the number of attributes grows. A fundamental challenge is to develop data mining algorithms which scale as the amount of data grows and as the number of attributes grows. An important related challenge is to develop parallel and distributed of data mining algorithms.
Distributed and Agent-based mining. Today, data mining is generally done by collecting data into a centralized digital library or repository and mining it for information. The third challenge is to develop appropriate distributed and agent-based data mining algorithms to mine distributed data.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

Last modified: November 8, 1996
Robert L. Grossman