Robert Grossman
University of Illinois at Chicago
and
Magnify, Inc.
This is a draft of the paper: R. L. Grossman, The Terabyte Challenge: An Open, Distributed Testbed for Managing and Mining Massive Data Sets, Proceedings of the 1996 Conference on Supercomputing, IEEE, 1996.
High performance data management and mining are emerging as critical technologies. Although it is becoming more common today to spin a Terabyte of disk (1000 Gigabytes), it is still an open problem how to manage, mine, and analyze a Terabyte of information.
A fundamental transition is taking place in high performance computing: over the next decade problems are as likely to be data bound as compute bound; as limited by input-output bandwidth as by cpu power; more concerned with managing disks than managing processors; and more concerned with uncovering patterns in data than solving equations.
The recent dramatic growth of the web has focused attention on the work required to locate, integrate, and make effective use of the distributed information now available on wide area networks. This represents a critical challenge for wide area data management and information retrieval and for distributed data mining.
Within the context of distributed computing, it is now becoming more common to view the web as a distributed computing infrastructure. For this to succeed, wide area data management must make fundamental advances: it is no longer sufficient just to understand how to partition a computation; rather, the challenge is to understand how to distribute a large scale computation involving widely distributed data. Similarly, uncovering patterns in distributed data is emerging as an important alternative to today's more common approach of first collecting the data into one central data warehouse and then mining the warehoused data.
The Terabyte Challenge is an evolving, open testbed that can be used to test new algorithms and software for high performance and wide area management, mining and analysis of data as we try to understand the best technology for the next step of working with Petabytes of data (1000 Terabytes).
The first phase of the Terabyte Challenge culminated in a demonstration during the Supercomputing 1995 Conference in San Diego, in which the Terabyte Challenge team members illustrated managing, mining and analyzing a variety of very large data sets totaling over 100 Gigabytes.
The purpose of this paper is to provide a current (August, 1996) overview of the Terabyte Challenge. We begin by describing the background and goals. We then describe some of the technical approaches. Next is a description of some of the applications and services supported by the Terabyte Challenge. We conclude with research challenges for the future.
Managing data has always been an important component of high performance computing. It is sometimes helpful to view data management as having undergone three fundamental transitions during the past two decades:
In this section, we discuss some approaches which have proved useful for managing and mining the terabyte-size data sets that occur in some of the Terabyte Challenge applications. This list is not meant to be exhaustive, but rather simply reflects some of our own prejudices.
Exploit the natural structure in data by using the appropriate data management infrastructure. Traditionally, in high performance computing, access to data is file based and data is typically viewed as tabular. In contrast, scientific, engineering and business data is usually structured. Databases provide the appropriate data management infrastructure for archiving data, data warehouses provide the appropriate infrastructure for performing simple analyses and aggregation of data, and data mining systems provide the appropriate infrastructure for the types of numerically and statistically intensive queries required to uncover patterns in data.
Object-oriented databases and object warehouses have been developed to work with more structured data, while lightweight data management has recently been developed to provide low overhead, high performance access to data for specialized applications, such as data mining.
Whatever the specific data mining technology, the goal is to use the appropriate data management infrastructure so that the natural structure inherent in data can be exploited efficiently, without having to reassemble and recompute the structures that are thrown away when data is flattened into files or tables.
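As a concrete, if hypothetical, illustration of this point, the C++ sketch below keeps a one-to-many relationship between events and tracks as nested objects and queries it directly. The Event and Track classes and the query are illustrative assumptions only, and persistence is elided entirely; this is not the interface of any particular object warehouse or lightweight data manager.

```cpp
// A minimal sketch of exploiting natural structure in data rather than
// flattening it into tables.  Event, Track, and the query below are
// hypothetical stand-ins for application data structures.
#include <cstddef>
#include <iostream>
#include <vector>

struct Track {
    double momentum;   // e.g., in GeV/c
    int    charge;     // +1 or -1
};

struct Event {
    int                id;
    std::vector<Track> tracks;   // the one-to-many structure kept intact
};

// A structured query: count events containing at least one track above a
// momentum threshold.  The nesting is traversed directly; a flattened
// table would have to rejoin events and tracks at query time.
std::size_t count_events_with_hard_track(const std::vector<Event>& events,
                                         double threshold) {
    std::size_t n = 0;
    for (const Event& e : events) {
        for (const Track& t : e.tracks) {
            if (t.momentum > threshold) { ++n; break; }
        }
    }
    return n;
}

int main() {
    std::vector<Event> events = {
        {1, {{12.5, +1}, {3.2, -1}}},
        {2, {{1.1, -1}}},
    };
    std::cout << count_events_with_hard_track(events, 10.0) << "\n";  // prints 1
}
```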
When possible, analyze and mine data using data-centered parallelism and combine the results. The majority of data mining and data analysis algorithms do not scale when the computation is required to access data which is out of memory. Generally, these types of algorithms proceed by sampling large data sets or databases to obtain a sample which is small enough to fit into main memory. One of the most fundamental ideas in high performance computing is to exploit data-centered parallelism. One of the simplest ways to apply this idea in this context is to analyze or mine data in parallel using a shared-nothing approach. The challenge is then to combine the information obtained. Sometimes this is simple; in general it is not. Often, though, the information obtained is used for predictive modeling. In this case, it is sometimes easier to combine the predictions produced by each of the predictive models than to combine the underlying patterns that went into the various models.
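The sketch below illustrates this idea under deliberately simple assumptions: each partition is mined by a trivial mean-value model standing in for a real mining algorithm, the partitions live in memory rather than on separate disks or workstations, and the per-partition models are combined by averaging their predictions.

```cpp
// A minimal sketch of shared-nothing, data-centered mining followed by
// combining predictions rather than patterns.  The partitioning and the
// "model" are hypothetical simplifications.
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// A "model" built from a single partition: here just the partition mean.
double fit_partition(const std::vector<double>& partition) {
    return std::accumulate(partition.begin(), partition.end(), 0.0) /
           static_cast<double>(partition.size());
}

int main() {
    // In a true shared-nothing setting each partition would reside on its
    // own disk or workstation; here they simply live in memory.
    std::vector<std::vector<double>> partitions = {
        {1.0, 2.0, 3.0}, {4.0, 5.0}, {6.0, 7.0, 8.0, 9.0}};

    std::vector<double> models(partitions.size());
    std::vector<std::thread> workers;

    // Mine each partition independently and in parallel.
    for (std::size_t i = 0; i < partitions.size(); ++i)
        workers.emplace_back([&, i] { models[i] = fit_partition(partitions[i]); });
    for (std::thread& w : workers) w.join();

    // Combine the predictions of the per-partition models (by averaging)
    // rather than trying to merge the patterns each model was built from.
    double combined = std::accumulate(models.begin(), models.end(), 0.0) /
                      static_cast<double>(models.size());
    std::cout << "combined prediction: " << combined << "\n";
}
```

The same combining step applies to richer models: averaging or voting over predictions is usually much simpler than merging the rules, trees, or clusters learned on each partition.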
Exploit parallel input-output. Many data mining queries have a component which is input-output bound. Techniques for parallel input-output, such as striping, parallel file transport protocols, or protocols for the parallel transport of objects, are often useful.
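As a rough illustration, the sketch below reads a data set that has been striped across several files, one thread per stripe. The stripe file names are hypothetical, and a real configuration would place the stripes on separate disks or servers so that the reads actually overlap.

```cpp
// A minimal sketch of parallel input-output over a striped data set.
// The stripe names are placeholders; no particular striping system or
// transport protocol is implied.
#include <cstddef>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Read one stripe of the data set into its own buffer.
void read_stripe(const std::string& path, std::string& buffer) {
    std::ifstream in(path, std::ios::binary);
    buffer.assign(std::istreambuf_iterator<char>(in),
                  std::istreambuf_iterator<char>());
}

int main() {
    std::vector<std::string> stripes = {"data.stripe0", "data.stripe1",
                                        "data.stripe2", "data.stripe3"};
    std::vector<std::string> buffers(stripes.size());
    std::vector<std::thread> readers;

    // One reader thread per stripe, so the reads proceed concurrently.
    for (std::size_t i = 0; i < stripes.size(); ++i)
        readers.emplace_back(read_stripe, stripes[i], std::ref(buffers[i]));
    for (std::thread& r : readers) r.join();

    std::size_t total = 0;
    for (const std::string& b : buffers) total += b.size();
    std::cout << "read " << total << " bytes from " << stripes.size()
              << " stripes\n";
}
```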
Exploit clusters of workstations. Recently, exploiting clusters of workstations has emerged as a valuable alternative to specialized MPP platforms. Many data mining applications in particular can efficiently exploit cluster computing.
In this section, we give brief descriptions of several applications which have been the focus of the Terabyte Challenge to date, including mining structured and semi-structured data for anomalies, analyzing high energy physics data, searching for structures and patterns among flows of dynamical systems, and extracting information from digital libraries.
Currently, the Terabyte Challenge Testbed consists of geographically distributed clusters of workstations located in Chicago, Philadelphia and College Park, Maryland which are configured to manage, analyze and compute with large data sets. All the clusters are connected by the internet. In addition, the Chicago and Philadelphia clusters will also be connected with a high performance network (vBNS) by the time of Supercomputing 96 (November, 1996).
The testbed is open and broadly based upon web protocols, their extensions, and variants. A formal definition of these services is currently being prepared. At this point, a Terabyte Challenge application can make requests using web protocols for a variety of network services, including:
In this section, we mention several open problems.
At the University of Illinois at Chicago, this work was supported in part by DOE grant DE-FG02-92ER25133, and NSF grants IRI 9224605, CDA-9303433, and CDA-9413948.
At Magnify Research, Inc., this work was also supported in part by the Massive Digital Data Systems (MDDS) Program, through the Department of Defense.
Robert Grossman can be contacted at: Department of Mathematics, Statistics, & Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607, grossman@uic.edu or at Magnify, Inc., 815 Garfield Street, Oak Park, IL 60304, rlg@magnify.com.
Currently, the following individuals are developing applications and services for the Terabyte Challenge: Andrew Baden, University of Maryland at College Park; Stuart Bailey, University of Illinois at Chicago; Don Benton, University of Pennsylvania; Haim Bodek, Magnify, Inc.; Shirley Connelley, University of Illinois at Chicago; Dave Hanley, University of Illinois at Chicago; Bob Hollebeek, University of Pennsylvania; Dave Northcutt, Magnify, Inc.; Michael Ogg, University of Texas; Roque Oliveira, University of Pennsylvania; Ivan Pulleyn, Magnify, Inc.