*Robert Grossman is also with the Open Data Partners.
This is an early draft of the article: Robert Grossman and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August 2002, pages 44-51.
Data mining is the semi-automatic extraction of changes, correlations, associations, anomalies, and other statistically significant structures from large data sets. It differs from statistics through its emphasis on semi-automated processes that are data driven rather than human driven. The process of mining data is often thought of as consisting of the following stages:

1. accessing and extracting the relevant data;
2. cleaning, transforming, and exploring the data;
3. building and validating statistical models.
For many problems, Steps 1 and 2 are the most time consuming. For example, it is not atypical for a study to spend two or three times longer on Steps 1 and 2 than on Step 3. Often Step 1 is a barrier to exploring the data at all: if the work it requires is too great, certain data will simply not be examined in a study. As a simple thought experiment, consider how many more documents one examines with the world wide web than one did when it was necessary to ftp documents and then open them. Despite the importance of Steps 1 and 2, the majority of recent research in data mining has focused on building better statistical models to improve Step 3.
There are probably a variety of reasons for this focus. Algorithms producing statistical models can be analyzed and studied theoretically. They can also be compared to each other on specific data sets. Of course, this often produces a false sense of comfort, in that published experimental studies usually include only those data sets which show off the new algorithm in a positive light.
In this paper, we describe an infrastructure called DataSpace designed to reduce the time required to accomplish Steps 1 and 2. This same infrastructure also has the advantage that it facilitates the use of data produced by others and enables the distributed exploration of remote data. DataSpace is an example of a data web, that is, of a web based infrastructure for working with data.
We believe that DataSpace is novel in that it provides a simple mechanism to lower the cost of extracting, cleaning, transforming, and exploring remote and distributed data. With this type of infrastructure, the data mining of scientific and engineering data becomes significantly easier. Although there are a variety of tools for the exploratory analysis of data, these are designed to work with local data, not remote or distributed data.
Section 2 contains a simple example and describes some of the basic ideas involved. Section 3 describes background material and related work. Section 4 describes the logical structure of DataSpace. Section 5 describes how DataSpace can be implemented as a web service. Section 6 describes some experimental studies. Section 7 describes additional EDA functionality being added to DataSpace based upon our experimental studies. Section 8 contains a discussion and Section 9 concludes.
In this section we describe a simple example of a DataSpace query. An article in Science noted that there is a relation between El Nino and outbreaks of cholera [Pascual:2000]. Although it is easy to find many articles on the web about El Nino and cholera, it was extremely difficult to find either the El Nino or the cholera data on the web in a format in which this hypothesis could be readily checked. Some cholera data is available from the WHO, but not the data used in the study. While El Nino data is available, the proper El Nino data is not easy to locate, and when located it is available only as HTML pages or via FTP, neither of which allows it to be directly correlated with other data. The challenge is to make these types of correlations as easy as pointing and clicking, the same criterion we expect today when viewing remote documents. An infrastructure supporting such casual exploration of data would change in a fundamental fashion the data mining of scientific and engineering data.
To explore data mining from this perspective, we imported atmospheric data from the National Center for Atmospheric Research (NCAR) into DataSpace. We also imported cholera data from the World Health Organization (WHO) into DataSpace.
The importing of this data into DataSpace is relatively simple. DataSpace Clients and Servers communicate using a protocol called the DataSpace Transfer Protocol or DSTP. In this example, the data was managed by an open source DSTP Server and the client was a simple web browser. Putting data into DataSpace consists of the following three steps:

1. Assign one or more UCKs (universal correlation keys, described in Section 4) to the attributes of the data.
2. Prepare the metadata describing the data set and its attributes.
3. Load the data, together with its UCKs and metadata, onto a DSTP Server.
Here is what happens between the DSTP Client and Server in a simple DataSpace query:

1. The client retrieves the UCKs available on the server.
2. The client retrieves the metadata for the data sets and attributes of interest.
3. Using the metadata, the client selects and retrieves an appropriate amount of the data itself.
We close this section with a few comments about the query:
Comment 1. Notice that the interaction between the DSTP client and server proceeds through three phases: first UCKs are retrieved, then metadata, then data. By using UCKs, different, possibly distributed, columns can be meaningfully compared. By next retrieving the metadata, the client and server can select an appropriate amount of data to return in the final step. The DSTP servers we have developed treat the UCKs, metadata and data quite differently. Different storage formats and different caching policies are used for each. For example, the UCKs and metadata can be stored in XML and kept in memory during operation, while the data files can be stored more efficiently and retrieved on a per query basis.
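To make the three phases concrete, here is a minimal sketch in Python. The DSTPConnection class is a hypothetical stand-in for a real DSTP client library, not part of DataSpace itself; it only illustrates the order of the phases and how metadata retrieved in the second phase can bound the data requested in the third.

```python
# A hypothetical, in-memory stand-in for a DSTP client connection.
class DSTPConnection:
    def __init__(self, ucks, metadata, rows):
        self.ucks, self.metadata, self.rows = ucks, metadata, rows

    def get_ucks(self):                      # phase 1: UCK names and IDs
        return self.ucks

    def get_metadata(self, attribute):       # phase 2: units, statistics, counts
        return self.metadata[attribute]

    def get_data(self, attribute, sample=1.0):   # phase 3: the data itself
        n = int(len(self.rows) * sample)
        return [(k, row[attribute]) for k, row in self.rows[:n]]

conn = DSTPConnection(
    ucks={"latitude": 11110, "longitude": 11111},
    metadata={"temperature": {"units": "celsius", "count": 3}},
    rows=[(0.0, {"temperature": 26.1}),
          (5.0, {"temperature": 25.4}),
          (10.0, {"temperature": 24.8})])

print(conn.get_ucks())                       # phase 1
meta = conn.get_metadata("temperature")      # phase 2
sample = min(1.0, 2 / meta["count"])         # use the metadata to bound the transfer
print(conn.get_data("temperature", sample))  # phase 3
```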
Comment 2. Notice that the basic transformations required for this query are handled directly and automatically by the client. For known UCKs, simple direct mechanisms can be created by hand to implement the required transformations using, for example, the templated transformations that are part of PMML Version 2.0 [DMG]. Although this may be counterintuitive, interesting distributed queries of data can be done with this approach. The reason is that a basic set of UCKs can sometimes cover a fairly large class of data from within a community (see the examples in Section 6); think of this as the 80-20 rule for UCKs. The alternative of using ontologies and RDF, although much more flexible and powerful, creates a higher barrier before data can be put into a data web [W3C:2001]. As an analogy, URLs allow documents to be placed easily onto the web, but create significant barriers to finding documents. One could argue, though, that requiring documents to use URIs, RDF, and ontologies (which are equally applicable to documents and make search easier) would have significantly lowered the likelihood that the web would have been a success.
Comment 3. This example uses multidimensional UCKs (latitude, longitude, and time), each of which is ordered. In practice, range queries proved to be among the most common queries, and a client application with direct support for range queries was also developed; a sketch of such a query appears below.
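Here is a minimal client-side sketch of such a range query in Python. The records and bounds are invented for illustration; on an actual DSTP server the range restriction would be applied server side, so that only matching records cross the network.

```python
records = [
    # (latitude, longitude, time, sea surface temperature) -- invented values
    (-5.0, -120.0, "1997-11", 29.4),
    (2.0, -110.0, "1997-12", 29.1),
    (47.5, 8.0, "1997-11", 12.1),
]

def in_range(rec, lat=(-10.0, 10.0), lon=(-150.0, -80.0),
             time=("1997-01", "1998-12")):
    la, lo, t, _ = rec
    return (lat[0] <= la <= lat[1] and lon[0] <= lo <= lon[1]
            and time[0] <= t <= time[1])      # ISO dates compare lexicographically

print([r for r in records if in_range(r)])   # the two equatorial Pacific records
```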
By a data web [Grossman:2002] we mean a web based infrastructure for data. In Section 5 below, we describe a simple data web server and data web client which communicate with a protocol called the DataSpace Transfer Protocol (DSTP) and which support the following services: UCK queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values.
In this section we describe how a data web with these types of services compares with related infrastructures, such as data mining systems, distributed data mining systems, data grids, and semantic webs.
Data mining systems generally assume that the data is local, importing it from flat files, databases, or proprietary file formats. Once the data is imported, it can be examined using a variety of statistical and data mining algorithms. Many data mining systems also include visualization packages.
Distributed data mining systems [Kargupta:2000] are typically of two types: those that use agents to move models (MM) and those that use agents to move data (MD). The most common type of distributed data mining system moves models. Local models at each of several distributed sites are built and then moved to a centralized site where they can be combined. The end result is an ensemble of models or a hierarchical model which combines local models built from data at each of the distributed sites. Another approach is to move data from a number of distributed sites to a central location. Once the data has been centralized a single model or ensemble of models is built. Some systems also support a hybrid strategy in which both the data and the models may be moved.
Data grids are grid based infrastructures [Foster:1999] for working with data. A grid is a distributed collection of computing resources which appears as a single virtual computing infrastructure through shared security services, including single sign on, an LDAP-based information infrastructure, and resource management services. A data grid adds data specific services [Chervenak:2000]. These currently include GridFTP, for moving data over a grid, and data replication services, so that data may be efficiently cached across a grid.
Semantic webs [W3C:2001] extend the web's HTML infrastructure to include semantic information defined by XML and the Resource Description Framework (RDF). The semantic web also interoperates with protocols such as SOAP, which is a serialization protocol so that remote objects can be accessed over the web. RDF views information as a directed labeled graph and serializes it in XML [W3C:1999]. Less formally, RDF codes information using subject-verb-object triples. For example, (www.ncar.ucar.edu/ccm/1/1, Temperature, 45.5) is a subject-verb-object triple giving the Temperature for a particular data record specified by the URL. The semantic web also supports ontologies so that data taxonomies can be used.
Briefly, distributed data mining systems are agent based systems which either move data or move statistical models between sites as one stage in the process of building a statistical model on distributed data. Data webs provide just some of the functionality of semantic webs: precisely the functionality required to work with remote and distributed data. Semantic webs are designed to support general knowledge based computations, viewed as computations over RDF and XML data, together with the transport of general objects via SOAP and of semi-structured data via its serialization as XML. Data grids, unlike data webs, work with data using the infrastructure of a virtual distributed computer. For this reason, the assumption with data grids is that you have logged on and have been authorized to examine and compute with remote data.
The point of view proposed here is that a data web supporting the ability to view, explore, and merge remote and distributed data is sufficient for the initial phases of the data mining process and, therefore, that data webs have the potential for fundamentally changing the way scientists and engineers work with other peoples' data.
To demonstrate this, we have implemented DSTP servers which separate UCKs, metadata, and data and support multidimensional UCKs, UCK-based queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values. These services are described in more detail below. The DSTP servers also support ad hoc normalizations, transformations, and aggregations. We have also developed several DSTP applications, three of which are described below. As far as we are aware, the general infrastructure and services for semantic webs and data grids would not have significantly simplified this work. On the other hand, it is clear that one could develop DSTP servers as either a semantic web application or a data grid application. Our point of view in this paper is that a DSTP server implemented as a data web application is practical, useful, and much simpler for many types of client applications.
One way to make sense of the different technologies is to place them along two axes. Along the horizontal axis is what you do with the data (an action), such as viewing it, mining it, or computing with it. Along the vertical axis is the object of the action, which may be a file, rows and columns (records and fields), or higher order concepts such as the ontologies and related concepts underlying a knowledge management system. This viewpoint is summarized in Table 1 below.
| | View | Mine/Discover | Compute |
|---|---|---|---|
| Knowledge | Digital Libraries | Knowledge Mining | Semantic Webs |
| Attributes/Columns | Web accessible databases | Data Webs | Data Grids |
| Files | Persistent Archives | Distributed Data Mining | Grids |
Table 1. Technologies for exploring remote and distributed numerical data: Data webs, data grids, and semantic webs can all provide web based access to numerical data. Data webs provide direct access to distributed rows and columns of data. Data grids enable large scale resource sharing of computational and data resources. Semantic webs provide knowledge based access to data using ontologies, RDF and agent based architectures.
DataSpace provides a foundation for remote data analysis and distributed data mining of scientific and engineering data using just four key concepts: universal correlation keys (UCKs), the DataSpace Transfer Protocol (DSTP), attribute and data set metadata, and a small set of data services. We describe each in turn below.
Universal correlation keys (UCKs) form the basic glue among attributes available on different DataSpace servers. All data published by a DataSpace server has to be attached to at least one UCK. Every UCK is characterized by its name (not necessarily unique) and its ID number (globally unique). Suppose one DataSpace server lists temperature values of the earth's surface according to longitude/latitude and another server lists precipitation values. A scientist might want to correlate precipitation with temperature. In this case the UCKs on both servers would be "latitude" (suppose the ID number is 11110) and "longitude" (suppose the ID number is 11111). Identical ID numbers guarantee that the key on both servers is the same and that the unit of the key on all servers is the same. For example, the ID number 11111 might specify that the key represents longitude and that longitude is measured in one degree intervals. If longitude is measured in 5 degree intervals the ID number would have to be changed, although the UCK name might be "longitude" in both cases.
More precisely, UCKs allow distributed columns to be correlated in the following fashion: pairs (k_i, x_i), where k_i is a UCK value and x_i is an attribute value, on DataSpace Server 1 can be combined with pairs (k_j, y_j) on DataSpace Server 2 by matching equal UCK values k_i = k_j = k, producing a table of pairs (x_k, y_k) in a DataSpace client. The DataSpace client can then, for example, find a function y = f(x) relating x and y.
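A minimal Python sketch of this UCK-based join, with invented values:

```python
# Pairs of (UCK value, attribute value) as served by two DSTP servers.
server1 = [(0, 26.1), (5, 25.4), (10, 24.8)]   # (longitude UCK, temperature)
server2 = [(0, 120.0), (5, 80.0), (15, 60.0)]  # (longitude UCK, precipitation)

# Join on equal UCK values to produce the (x, y) table in the client.
y_by_key = dict(server2)
table = [(x, y_by_key[k]) for k, x in server1 if k in y_by_key]
print(table)   # [(26.1, 120.0), (25.4, 80.0)], ready to fit y = f(x)
```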
The open source DataSpace servers and clients we have developed communicate with a protocol called the DataSpace Transfer Protocol or DSTP. Depending upon the request, DSTP servers may return one or more columns, one or more rows, or entire tables. DSTP is broadly based upon the Network News Transfer Protocol (NNTP) [Kantor:1986], a protocol for retrieval of news articles.
DSTP uses XML to describe the metadata. For efficiency, the data itself is transmitted as records delimited by carriage returns, with fields delimited by commas.
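For illustration, here is a short Python sketch parsing a payload in this format; the payload itself is invented, and in practice the field order and types would come from the XML metadata.

```python
# Carriage-return delimited records with comma-delimited fields.
payload = "0,26.1\r5,25.4\r10,24.8\r"

rows = [line.split(",") for line in payload.split("\r") if line]
data = [(int(k), float(x)) for k, x in rows]   # (UCK value, attribute value)
print(data)
```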
DSTP servers use stream connections and NNTP-like commands and responses. They are designed to accept connections from DSTP Clients and to provide a simple interface to the data columns on the DSTP Server. A DSTP Server functions as an interface between DSTP applications and remote data.
The basic DSTP commands allow DSTP Clients to retrieve metadata; specify UCKs, data sets, ranges, and sampling parameters; and retrieve the specified data. The metadata includes not only information about the units and about how the data was collected and processed, but also the minimum, maximum, and other basic statistics of the attributes, which are important in EDA operations.
The hard part about working with data over the web is deciding what data to transport, what units the data is in, what processes were used to prepare the data, and how the data should be normalized so that it can be used with other data. These questions are all essential for exploratory data analysis and data mining. DSTP servers support services which return data set metadata and attribute metadata to answer them. Each DSTP server also has a special file called the catalog file, containing metadata about the data sets on the server, to facilitate searching for and locating remote data sets.
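As a sketch, attribute metadata of this kind might be represented and read as follows. The XML element names here are hypothetical; they are not the actual DSTP metadata schema, but they illustrate the kind of information served.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute metadata, not the actual DSTP schema.
metadata = """
<attribute name="temperature">
  <units>celsius</units>
  <min>-89.2</min>
  <max>56.7</max>
  <missing-value>-999</missing-value>
</attribute>
"""

attr = ET.fromstring(metadata)
lo, hi = float(attr.find("min").text), float(attr.find("max").text)
print(attr.get("name"), attr.find("units").text, (lo, hi))
```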
To summarize, DSTP Clients and Servers support the following services: UCK queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values.
Example 1. Earth Science Data. We placed approximately 100 Gigabytes of Community Climate Model (CCM3) data from the National Center for Atmospheric Research (NCAR) on a DSTP Server located at NCAR. CCM3 data is used by scientists to study CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer and nuclear winter. The data consists of monthly satellite measurements of global surface temperatures, precipitation, ozone levels and vegetation index.
There are three UCKs for this application: latitude, longitude, and time. Figure 2 shows the result of a DataSpace query for sea surface temperature. The DSTP Client for this application supports queries by UCK, by attribute ranges, by attribute, and by data set. The client can view the data, download the data, graph the data, and do simple EDA operations on the data.
The DSTP Client can also compare the CCM3 data and overlay the CCM3 data with other data sets sharing one or more of the same UCKs.
Example 2. Protein Data. In this application [Hamelberg:2002], we took data from the Protein Data Bank [PDB] and placed it on a DSTP Server in Halifax. The data consisted of records of the form
C,PRO,2,28.901,38.374,3.596
In this example, C is the type of atom (Carbon), PRO is the residue to which the atom belongs (Proline), 2 is the ID of the molecule of which the amino acid is a part, and the final three values are the x, y, and z coordinates of the atom. The x, y, z coordinates serve as the UCKs. We also placed data describing drugs on a DSTP Server in Amsterdam using the same UCKs. Both of these DSTP Servers were on a testbed for DataSpace which allowed us to measure the performance of various queries involving large data sets.
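A short Python sketch of parsing such a record into (UCK, attribute) form, where the x, y, z coordinates serve as a three-dimensional UCK:

```python
record = "C,PRO,2,28.901,38.374,3.596"

atom, residue, mol_id, x, y, z = record.split(",")
uck = (float(x), float(y), float(z))    # the three-dimensional UCK
value = (atom, residue, int(mol_id))    # the attribute values
print(uck, value)
```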
The DSTP Client application in Figure 3 can retrieve and interactively explore proteins. The proteins can be displayed in the Chemical Markup Language (CML) or in PDB file formats. The data can also be visualized using a graphics program like Rasmol or a web visualization tool such as Webmol.
Of more interest, a distributed query can be done between a protein molecule from the Halifax DSTP Server and a small organic compound from the Amsterdam DSTP Server. For example, Figure 3 shows a drug candidate molecule docking with a protein from the PDB.
In addition, the DSTP Client can query NCBI's PubMed for all references related to the proteins and drugs on the DSTP Servers.
Example 3. Astronomical Data. In this example, taken from [Grossman:2001], we queried two geographically distributed astronomical source catalogs: 2MASS (Two Micron All Sky Survey) survey data from a DSTP Server at CalTech and DPOSS (Digital Palomar Observatory Sky Survey) survey data, which we replicated on a DSTP Server in Chicago. The DPOSS data are in the optical wavelengths (0.4 - 0.7 micron), while the 2MASS data are in the infrared (1.2 - 2.2 micron) range.
The positions of the light sources in both data sets are given in polar coordinates. The UCKs for this example are the right ascension and declination, measured in degrees.
A typical query of interest to astronomers is of the form: "Find all pairs, one from the DPOSS catalog and one from 2MASS, whose angular separation is less than a given tolerance." Figure 4 illustrates a DataSpace query with a tolerance of 2 arc seconds in the region of the sky with right ascension from 183 to 270 degrees and declination from 17 to 47 degrees.
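A minimal Python sketch of this cross-match appears below. It assumes separations small enough that a flat-sky approximation suffices, and the two tiny catalogs are invented for illustration.

```python
from math import cos, radians, sqrt

dposs = [(200.001, 30.000), (250.500, 40.000)]    # (ra, dec) in degrees
twomass = [(200.0015, 30.0001), (190.000, 20.000)]

tolerance = 2.0 / 3600.0                          # 2 arc seconds, in degrees

pairs = []
for ra1, dec1 in dposs:
    for ra2, dec2 in twomass:
        # Flat-sky angular separation: sqrt((dRA * cos(dec))^2 + dDec^2).
        sep = sqrt(((ra1 - ra2) * cos(radians(dec1))) ** 2 + (dec1 - dec2) ** 2)
        if sep < tolerance:
            pairs.append(((ra1, dec1), (ra2, dec2)))

print(pairs)   # the matched DPOSS/2MASS pairs
```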
After working with the examples described above and related examples for the past two years, we have decided to standardize on certain primitives for normalizing, transforming, and aggregating data. The following four transformations, which are defined in [DMG], seem sufficient to cover most of the transformations we use in practice: normalization, which maps continuous or discrete values to numbers; discretization, which maps continuous values to discrete values; value mapping, which maps discrete values to discrete values; and aggregation, which summarizes or collects groups of values, for example by computing averages.
We have been working with the Data Mining Group to standardize these operations. These four operations are part of Version 2.0 of the Predictive Model Markup Language, an XML standard for statistical and data mining models [DMG]. Today these transformations are done in an ad hoc fashion by DataSpace clients. DMG compliant implementations of these operations will be supported in the next version of DataSpace.
Paraphrasing the Data Mining Group's description of the PMML Version 2.0 transformations [DMG], the approach is not to cover the full set of preprocessing functions which may be needed to collect and prepare data for mining, but rather to introduce a templated set of basic operations designed to cover many of the most common ones. The operations above cover the normalization of input values required for neural networks. They also cover the quantile-range discretizations that are used to transform skewed data. Indeed, it has been our experience that they cover many of the transformations we use in practice to prepare data for mining.
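As an illustration, here are Python sketches of two of these operations, in the spirit of PMML's piecewise-linear normalization and interval discretization; the breakpoints and bins are invented, not taken from the PMML specification.

```python
def norm_continuous(x, pairs):
    """Piecewise-linear map defined by (orig, norm) breakpoint pairs."""
    for (o1, n1), (o2, n2) in zip(pairs, pairs[1:]):
        if o1 <= x <= o2:
            return n1 + (x - o1) * (n2 - n1) / (o2 - o1)
    raise ValueError("value outside the breakpoints")

def discretize(x, bins):
    """Map a continuous value to the label of the first covering bin."""
    for lo, hi, label in bins:
        if lo <= x < hi:
            return label
    return "other"

print(norm_continuous(15.0, [(0, 0.0), (30, 1.0)]))          # 0.5
print(discretize(15.0, [(0, 10, "low"), (10, 25, "medium"),
                        (25, 40, "high")]))                  # "medium"
```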
We have used these transformations, for example, to merge two data sets with different UCKs: correlating El Nino anomalies with cholera outbreaks requires aggregating and normalizing the data from the two data sets so that they can be meaningfully compared. Today, our DSTP clients essentially hard code some of the more common transformations for some of the more common UCKs, as the sketch below illustrates.
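A minimal sketch of such a client-side registry, with hypothetical UCK IDs and conversions:

```python
# Hypothetical registry mapping UCK IDs to the conversion applied
# before two columns keyed by that UCK can be compared.
TRANSFORMS = {
    11111: lambda lon: round(lon / 5.0) * 5.0,   # 1-degree -> 5-degree grid
    11112: lambda t: (t - 32.0) * 5.0 / 9.0,     # Fahrenheit -> Celsius
}

def to_common_units(uck_id, value):
    return TRANSFORMS.get(uck_id, lambda v: v)(value)

print(to_common_units(11111, 23.0))   # 25.0: snapped to the coarser grid
```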
We feel that, together with the basic graphics and visualization that are already part of DataSpace, these transformations will cover many of the common EDA operations desired, further enhancing the ability of scientists using DataSpace to casually explore remote and distributed data. For example, overlaying remote data on local data can be done transparently for data sets employing known UCKs.
The work described here can be thought of as a specific semantic web application and implementation. The challenge we faced was understanding a) what specific services, data, and metadata are required for exploratory data analysis using data webs and b) how to implement these services in a scalable manner.
We next describe the relation between data webs and semantic webs [W3C:2001] in more detail.
XML. It is clear that any of the metadata required for data mining can be put into XML. Based upon our experiences building data web applications, we have been an active participant in the Data Mining Group [DMG] and our data web applications use the XML metadata standards developed by them.
RDF. We have chosen not to use the semantic web's RDF standard, since our interest is in data mining, not knowledge management. Although in some sense data is a "trivial" type of knowledge, from a practical viewpoint a scalable, robust web infrastructure for working with data is useful for the same reason we still have databases and data archive systems, even though knowledge management and AI systems should have made these "trivial" long ago.
WSDL. Webs are built from services; DSTP is a web service for working with remote and distributed data. Recently we released a WSDL description of DSTP. This will be useful as semantic web applications become more common.
SOAP. We chose to implement DSTP directly rather than over SOAP for several reasons: SOAP was not available when we started, and SOAP does not scale as well to high volume data streams. We will shortly release a version of DSTP which uses SOAP and which is suitable for DSTP client applications involving small data sets.
The majority of the time for a typical scientific or engineering data mining application is spent cleaning, transforming, and exploring the data. In this paper, we have described a data web which is designed to simplify these operations. In particular, the infrastructure called DataSpace allows scientists to view, retrieve, and apply simple transformations to remote and distributed data.
We have described how DataSpace can be used for preliminary exploratory analysis of atmospheric, biological and astronomical data. If the data appears interesting, both the data and the relevant metadata can be retrieved with a few points and clicks so that it can be analyzed locally using statistical and data mining software.
We feel that data webs make it easier to use unfamiliar data and that this ability should enable certain data driven scientific discoveries which might not occur otherwise. One of the hopes for data mining is an enhanced ability to make scientific discoveries from data that would otherwise be ignored. Although there has been a great deal of research on new data mining algorithms, without the data these algorithms have no place to start. The data web infrastructure described in this paper encourages the casual exploration of remote and distributed data, making it easier for scientists to use data that might otherwise be overlooked.
[DMG] Data Mining Group, The Predictive Model Markup Language (PMML), A Markup Language for Statistical and Data Mining Models, Version 2.0. Retrieved from www.dmg.org on January 10, 2002.
[Pascual:2000] Mercedes Pascual, Xavier Rodó, Stephen P. Ellner, Rita Colwell, and Menno J. Bouma, Cholera Dynamics and El Niño-Southern Oscillation, Science, Volume 289, 2000, pp. 1766-1769.
[Chervenak:2000] A. Chervenak, I. Foster, C. Kesselman, and S. Tuecke, Protocols and Services for Distributed Data-Intensive Science, ACAT2000 Proceedings, pp. 161-163, 2000.
[Foster:1999] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, 1999.
[Grossman:2001] Robert Grossman, Emory Creel, Marco Mazzucco, and Roy Williams, A DataSpace Infrastructure for Astronomical Data, in R. L. Grossman, C. Kamath, W. Philip Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.
[Grossman:2002] R. L. Grossman, M. Hornick, and G. Meyer, Emerging KDD Standards, Communications of the ACM, Special Issue on Data Mining, to appear.
[Hamelberg:2002] D. Hamelberg and R. L. Grossman, A DataSpace Infrastructure for Bioinformatics Data.
[Kantor:1986] Brian Kantor and Phil Lapsley, Network News Transfer Protocol, February 1986, RFC 977. Retrieved from www.w3.org/Protocols/rfc977/rfc977.html on January 10, 2002.
[Kargupta:2000] H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, AAAI Press/The MIT Press, Menlo Park, California, 2000.
[PDB] Protein Data Bank, www.rcsb.org/pdb.
[W3C:2001] The Semantic Web. Retrieved from http://www.w3.org/2001/sw/ on February 10, 2002.
[W3C:1999] Web Architecture: Describing and Exchanging Data, W3C Note 7 June 1999. Retrieved from www.w3.org/1999/04/WebData on February 10, 2002.
Figure 1: Result of a DataSpace query on El Nino data. Using UCKs and the DataSpace client's basic EDA capabilities, the El Nino data can be compared to cholera data with a few points and clicks, even though the data is from different sites and originally in quite different formats.
Figure 2: Result of a DataSpace query for sea surface temperature on NCAR data from a DSTP Server at NCAR in Boulder.
Figure 3: Result of a distributed DataSpace query. One of the sites contains 3-D protein data from the Protein Data Bank which we replicated in Halifax for testing of the DataSpace infrastructure. The other contains 3-D data describing small organic compounds, which may be used as drugs. This data is on a DataSpace server in Amsterdam which is also part of our testbed. The query results in the docking of the potential drug in the protein.
Figure 4: The result of a DataSpace query of astronomical data from two sky surveys. One data set is 2MASS (Two Micron All Sky Survey) survey data from a DSTP Server at CalTech and the other is DPOSS (Digital Palomar Observatory Sky Survey) survey data, which we replicated on a DSTP Server in Chicago.