The Terabyte Challenge

The Terabyte Challenge: An Open, Distributed Testbed
for Managing and Mining Massive Data Sets

Robert Grossman
University of Illinois at Chicago
and
Magnify, Inc.

This is a draft of the paper: R. L. Grossman, The Terabyte Challenge: An Open, Distributed Testbed for Managing and Mining Massive Data Sets, Proceedings of the 1996 Conference on Supercomputing, IEEE, 1996.

The Terabyte Challenge

High performance data management and mining are emerging as critical technologies. Although it is becoming more common today to spin a Terabyte of disk (1000 Gigabytes), it is still an open problem how to manage, mine, and analyze a Terabyte of information.

A fundamental transition is taking place in high performance computing: over the next decade problems are as likely to be data bound as compute bound; as limited by input-output bandwidth as by CPU power; more concerned with managing disks than managing processors; and more concerned with uncovering patterns in data than solving equations.

The recent dramatic growth of the web has focused attention on the work required to locate, integrate, and make effective use of the distributed information now available on wide area networks. This represents a critical challenge for wide area data management and information retrieval and for distributed data mining.

Within the context of distributed computing, it is now becoming more common to view the web as a distributed computing infrastructure. For this to succeed, wide area data management must make fundamental advances: it is no longer sufficient just to understand how to partition a computation; rather, the challenge is to understand how to distribute a large scale computation involving widely distributed data. Similarly, uncovering patterns in distributed data is emerging as an important alternative to today's more common approach of first collecting the data into one central data warehouse and then mining the warehoused data.

The Terabyte Challenge is an evolving, open testbed that can be used to test new algorithms and software for high performance and wide area management, mining and analysis of data as we try to understand the best technology for the next step of working with Petabytes of data (1000 Terabytes).

The first phase of the Terabyte Challenge culminated in a demonstration during the Supercomputing 1995 Conference in San Diego, in which the Terabyte Challenge team members illustrated managing, mining and analyzing a variety of very large data sets totaling over 100 Gigabytes.

The purpose of this paper is to provide a current (August, 1996) overview of the Terabyte Challenge. We begin by describing the background and goals. We then describe some of the technical approaches. Next is a description of some of the applications and services supported by the Terabyte Challenge. We conclude with research challenges for the future.

From Data Management to Data Mining

Managing data has always been an important component in high performance computing. It is sometimes helpful to interpret our view of data management as having undergone three fundamental transitions during the past two decades:

Data management systems. The first and most important transition was passing from ad hoc, application dependent data management to general data management systems. The goal was to develop databases providing safe support for storing and retrieving data through transactions. This is most commonly done today with tabular or relational data, but systems to handle more complex data, including objects, are now also common. During the past few years, several systems have emerged for working with distributed objects, including CORBA and OLE (now ActiveX), so that different (network) applications can operate on and share the same data.
Data warehouses. The second transition was passing from data management systems to data warehouses. The role of a data warehouse is to consolidate and aggregate data in a format suitable for data analysis. The requirements for data analysis differ from those for data storage and retrieval: storage and retrieval favor simple data structures optimized for insertion, while analysis favors more complex data structures optimized for analysis. Data warehouses have emerged since they can reduce the time required for data analysis dramatically, sometimes by several orders of magnitude.
Data mining. The third transition was passing from summarizing data using data warehouses to uncovering patterns in data using data mining. Data mining is concerned with uncovering patterns, associations, changes, and anomalies in data. Data mining can be used for a variety of purposes, including building better predictive models, the intelligent screening or filtering of large data sets, or as an aid to decision making.

Approaches

In this section, we discuss some approaches which have proved useful for managing and mining the terabyte size data sets which occur in some of the Terabyte Challenge applications. This list is not meant to be exhaustive, but rather simply reflects some of our own prejudices.

Exploit the natural structure in data by using the appropriate data management infrastructure. Traditionally, in high performance computing access to data is file based and data is typically viewed as tabular. In contrast, scientific, engineering and business data is usually structured. Databases provide the appropriate data management infrastructure for archiving data, data warehouses provide the appropriate infrastructure for performing simple analyses and aggregation of data, and data mining systems provide the appropriate infrastructure for the types of numerically and statistically intensive queries required to uncover patterns in data.

Object-oriented databases and object warehouses have been developed to work with more structured data, while light weight data management has been developed recently to provide low overhead, high performance access to data for specialized applications, such as data mining.

Whatever the specific data mining technology, the goal is to use the appropriate data management infrastructure so that the natural structure inherent in data can be exploited efficiently, without having to reassemble and recompute the structures which are thrown away when data is flattened into files or tables.

When possible, analyze and mine data using data centered parallelism and combine the results. The majority of data mining and data analysis algorithms do not scale when the computation is required to access data which is out of memory. Generally, these types of algorithms proceed by sampling large data sets or databases to obtain a sample which is small enough to fit into main memory. One of the most fundamental ideas in high performance computing is to exploit data centered parallelism. One of the simplest ways to apply this idea in this context is to analyze or mine data in parallel using a shared nothing approach. The challenge is then to combine the information obtained. Sometimes this is simple; in general it is not. Often, though, the information obtained is used for predictive modeling. In this case, it is sometimes easier to combine the predictions resulting from each of the predictive models than to combine the patterns that went into the various predictive models.
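The idea of combining predictions rather than patterns can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy one-dimensional data, the nearest-centroid model, the majority vote, and the function names are hypothetical and are not taken from the Terabyte Challenge software. A thread pool stands in for separate nodes, each of which would hold its own partition in a real shared nothing configuration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_centroids(partition):
    """Fit a nearest-centroid model on one partition: label -> mean of x."""
    by_label = {}
    for x, label in partition:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def combine_predictions(models, x):
    """Combine the per-partition predictions by majority vote."""
    votes = Counter(predict(model, x) for model in models)
    return votes.most_common(1)[0][0]


# Toy partitions (illustrative data only); no partition sees another's rows.
partitions = [
    [(1.0, "low"), (1.2, "low"), (9.0, "high"), (8.5, "high")],
    [(0.8, "low"), (1.5, "low"), (9.5, "high"), (10.0, "high")],
    [(1.1, "low"), (0.9, "low"), (8.8, "high"), (9.2, "high")],
]

# Each model is built independently from its own partition.
with ThreadPoolExecutor() as pool:
    models = list(pool.map(fit_centroids, partitions))

print(combine_predictions(models, 2.0))
print(combine_predictions(models, 7.0))
```

Only the small fitted models and their votes cross partition boundaries; the raw data never has to be gathered in one place.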

Exploit parallel input-output. Many data mining queries have a component which is input-output bound. Techniques for parallel input-output such as striping, parallel file transport protocols, or protocols for parallel transport of objects are often useful.
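Striping can be sketched in a few lines. This is an illustration under stated assumptions, not a description of the testbed's input-output layer: ordinary files in a temporary directory stand in for separate disks, the stripe unit is deliberately tiny, and the function names are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe unit; tiny so the layout is easy to follow


def write_striped(data, paths, stripe_size=STRIPE_SIZE):
    """Distribute consecutive stripe units round-robin across the files."""
    files = [open(path, "wb") for path in paths]
    try:
        for start in range(0, len(data), stripe_size):
            unit = start // stripe_size
            files[unit % len(files)].write(data[start:start + stripe_size])
    finally:
        for f in files:
            f.close()


def read_file(path):
    """Read one stripe file in full."""
    with open(path, "rb") as f:
        return f.read()


def read_striped(paths, stripe_size=STRIPE_SIZE):
    """Read all stripe files concurrently, then interleave the units in order."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        chunks = list(pool.map(read_file, paths))
    out = bytearray()
    offset = 0
    while any(offset < len(chunk) for chunk in chunks):
        for chunk in chunks:
            out += chunk[offset:offset + stripe_size]
        offset += stripe_size
    return bytes(out)


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, f"disk{i}.bin") for i in range(3)]
    payload = b"abcdefghijklmnopqrstuvwxyz"
    write_striped(payload, paths)
    restored = read_striped(paths)

print(restored == payload)
```

When the stripe files sit on independent devices, the concurrent reads can proceed at the aggregate bandwidth of those devices rather than the bandwidth of one.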

Exploit clusters of workstations. Recently, exploiting clusters of workstations has emerged as a valuable alternative to specialized MPP platforms. Many data mining applications in particular can efficiently exploit cluster computing.

Terabyte Challenge Applications

In this section, we give brief descriptions of several applications which have been the focus of the Terabyte Challenge to date, including mining structured and semi-structured data for anomalies, analyzing high energy physics data, searching for structures and patterns among flows of dynamical systems, and extracting information from digital libraries.

Consider a high volume sequence of transactions or messages containing many sub-populations: one of the sub-populations is considered abnormal, while the others are considered normal. The abnormal one has a low incidence (say less than 3%). An important problem is to uncover abnormal transactions given statistical descriptions of each sub-population. In general the statistical properties of the sub-populations overlap, and the challenge is to raise the detection rate as high as possible while minimizing false positives. Two examples are detecting fraud in insurance and detecting intruders in networks.
Particle detectors in high energy physics typically record hundreds of attributes about each collision. So much data is collected that the majority of it resides on tape. The goal is to analyze the data statistically in an attempt to discover new particles and to more accurately measure previously discovered particles. Only a small portion of the data (much less than 1%) is typically relevant in a given query. The challenge is to provide data management and data analysis software which is powerful and flexible enough for these types of statistical queries.
Dynamical systems are systems which evolve in time. Examples include flows of ordinary differential equations which model the flight paths of aircraft and flows of partial differential equations which model the evolution of a fluid. A fundamental problem is to search a large space of flows for flows having a certain relationship to each other. Examples include finding a sequence of flows which, when concatenated, approximate a desired flight path, or discovering that the weather in England is broadly approximated by the weather elsewhere in the world six months earlier. The problem is difficult due to the large amounts of data that may be present and to the complexity of the query.
Digital libraries can be used not only for searching, retrieving and browsing multi-media documents, but also for computing, simulating, and visualizing numerical and statistical data. Sometimes the latter is derived from the former: it is becoming more common to extract numerical and statistical meta-data from a large digital library of textual or multi-media data. The digital library of meta-data can then serve as the foundation for sophisticated searches, such as searching for changes in image sequences or for statistically based approaches to the automatic classification and filtering of large collections of textual data.
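For the first application above, detecting a low incidence abnormal sub-population from statistical descriptions of each sub-population, one simple formulation is to score each transaction by its posterior probability of being abnormal. The sketch below is an assumption made for illustration, not the method used in the Terabyte Challenge: it supposes a single numeric attribute, Gaussian sub-populations, and invented priors, means, and standard deviations.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical sub-populations: name -> (prior, mean, std).
# The abnormal sub-population has a low incidence (2% here).
POPULATIONS = {
    "normal_a": (0.60, 50.0, 10.0),
    "normal_b": (0.38, 80.0, 15.0),
    "abnormal": (0.02, 140.0, 20.0),
}


def posterior_abnormal(x, populations=POPULATIONS):
    """Posterior probability (Bayes' rule) that x is from the abnormal group."""
    weighted = {
        name: prior * gaussian_pdf(x, mu, sigma)
        for name, (prior, mu, sigma) in populations.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("x is too far from every sub-population to score")
    return weighted["abnormal"] / total


def flag(x, threshold=0.5):
    """Flag x as abnormal when its posterior reaches the threshold."""
    return posterior_abnormal(x) >= threshold


for amount in (60.0, 110.0, 150.0):
    print(amount, round(posterior_abnormal(amount), 4), flag(amount))
```

Because the sub-populations overlap, a value such as 110.0 is ambiguous: it is not flagged at the default threshold, but it is flagged if the threshold is lowered to 0.05. Lowering the threshold raises the detection rate at the cost of more false positives, which is the trade-off described above.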

Terabyte Challenge Services

Currently, the Terabyte Challenge Testbed consists of geographically distributed clusters of workstations located in Chicago, Philadelphia and College Park, Maryland which are configured to manage, analyze and compute with large data sets. All the clusters are connected by the internet. In addition, the Chicago and Philadelphia clusters will also be connected with a high performance network (vBNS) by the time of Supercomputing 96 (November, 1996).

The testbed is open and broadly based upon web protocols, their extensions, and variants. A formal definition of these services is currently being prepared. At this point, a Terabyte Challenge application can make requests using web protocols for a variety of network services, including:

Catalog and Search Services. These services manage requests which return URLs indicating the location of the requested resources or services.

Data Sources and Services. The other services access data by specifying the source of data and the data service to manage it. Different data services use different data management systems, including flat files, document management systems, databases, or light weight data management services. The data is returned using HTTP or its variants. This may be as simple as providing file based access to data specified by a URL through a standard web service. More typical, though, is providing access to a named data set through an object-oriented database or object warehouse.

Computing Services. Computing services execute specified computations on data sets specified by URLs or by URLs plus additional specifications, such as those provided by Query Services.

Query Services. Query services execute queries on data specified by URLs using query languages such as SQL (Structured Query Language), OQL (Object Query Language), or emerging query languages for semi-structured data.

Data Mining Services. Examples of data mining services include classification and prediction, change detection, and anomaly detection.

Open Problems

In this section, we mention several open problems.

Scaling algorithms. Today most data analysis and data mining algorithms do not scale well to large data sets. New algorithms may be required, such as when regression trees are used instead of classical regression for high dimensional spaces. Existing algorithms may have to be modified, such as when sparse solvers are used as components in statistical algorithms instead of more traditional linear solvers. Data management may have to be improved, such as when data warehouses are used to scale some statistical algorithms so that they can work with out-of-memory data.
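As a small illustration of restructuring an algorithm so that it can work with out-of-memory data, the following sketch fits a least squares line by accumulating sufficient statistics one chunk at a time, so that only one chunk needs to be in memory at once. It is an assumption made for illustration rather than a method described in this paper: the chunk generator is a stand-in for reading successive blocks from disk or a warehouse, and the function names are hypothetical.

```python
def chunks(n_rows, chunk_size):
    """Stand-in for reading successive blocks of (x, y) rows from disk.

    The synthetic rows satisfy y = 3x + 2 exactly.
    """
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [(float(i), 3.0 * i + 2.0) for i in range(start, stop)]


def fit_line_out_of_core(chunk_iter):
    """Fit y = slope * x + intercept from running sums over the chunks."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunk_iter:
        for x, y in chunk:
            n += 1.0
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0.0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


slope, intercept = fit_line_out_of_core(chunks(n_rows=1000, chunk_size=100))
print(slope, intercept)
```

The five running sums can also be computed independently on separate partitions and added together, which makes this one of the cases where combining results is simple. The closed form used here can lose precision on very large or badly scaled data, so it should be read as a sketch of the pattern rather than a production method.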

Working with more complex data types. Today most data is structured while most applications use file based or tabular (relational) data. Both data analysis and data mining make essential use of data structures: throwing away the structure and then recomputing it can often be prohibitively expensive. An important challenge is to develop integrated data analysis, data mining and data management systems for structured or semi-structured data.

Distributed Data Analysis and Mining. Today, almost all data analysis and data mining is done by first collecting data into one centralized location. This is no longer an effective strategy as the amount of distributed data grows. Understanding how to leave as much of the data in place as it is analyzed and mined is an important challenge.

Acknowledgments

At the University of Illinois at Chicago, this work was supported in part by DOE grant DE-FG02-92ER25133, and NSF grants IRI 9224605, CDA-9303433, and CDA-9413948.

At Magnify Research, Inc., this work was also supported in part by the Massive Digital Data Systems (MDDS) Program, through the Department of Defense.

Robert Grossman can be contacted at: Department of Mathematics, Statistics, & Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607, grossman@uic.edu or at Magnify, Inc., 815 Garfield Street, Oak Park, IL 60304, rlg@magnify.com.

Currently, the following individuals are developing applications and services for the Terabyte Challenge: Andrew Baden, University of Maryland at College Park; Stuart Bailey, University of Illinois at Chicago; Don Benton, University of Pennsylvania; Haim Bodek, Magnify, Inc.; Shirley Connelley, University of Illinois at Chicago; Dave Hanley, University of Illinois at Chicago; Bob Hollebeek, University of Pennsylvania; Dave Northcutt, Magnify, Inc.; Michael Ogg, University of Texas; Roque Oliveira, University of Pennsylvania; Ivan Pulleyn, Magnify, Inc.