*Robert Grossman is also with the Open Data Partners.
This is an early draft of the article: Robert Grossman and Marco Mazzucco, DataSpace - A Web Infrastructure for the Exploratory Analysis and Mining of Data, IEEE Computing in Science and Engineering, July/August 2002, pages 44-51.
Data mining is the semi-automatic extraction of changes, correlations, associations, anomalies, and other statistically significant structures from large data sets. It differs from statistics through its emphasis on semi-automated processes that are data driven rather than human driven. The process of mining data is often thought of as consisting of the following stages:

1. accessing and extracting the relevant data;
2. cleaning, transforming, and exploring the data;
3. building and validating statistical models.
For many problems, Steps 1 and 2 are the most time consuming. For example, it is not atypical for a study to spend two or three times longer on Steps 1 and 2 than on Step 3. Often Step 1 is a barrier to exploring the data at all: if the work it requires is too great, certain data will simply not be examined in a study. As a simple thought experiment, consider how many more documents one examines with the world wide web than one did when it was necessary to ftp documents and then open them. Despite the importance of Steps 1 and 2, the majority of recent research in data mining has focused on building better statistical models to improve Step 3.
There are probably a variety of reasons for this focus. Algorithms producing statistical models can be analyzed and studied theoretically. They can also be compared to each other on specific data sets. Of course, this often produces a false sense of comfort, in that published experimental studies usually include only those data sets which show off the new algorithm in a positive light.
In this paper, we describe an infrastructure called DataSpace designed to reduce the time required to accomplish Steps 1 and 2. This same infrastructure also has the advantage that it facilitates the use of data produced by others and enables the distributed exploration of remote data. DataSpace is an example of a data web, that is, of a web based infrastructure for working with data.
We believe that DataSpace is novel in that it provides a simple mechanism to lower the cost of extracting, cleaning, transforming, and exploring remote and distributed data. With this type of infrastructure, the data mining of scientific and engineering data becomes significantly easier. Although there are a variety of tools for the exploratory analysis of data, these are designed to work with local data, not remote or distributed data.
Section 2 contains a simple example and describes some of the basic ideas involved. Section 3 describes background material and related work. Section 4 describes the logical structure of DataSpace. Section 5 describes how DataSpace can be implemented as a web service. Section 6 describes some experimental studies. Section 7 describes additional EDA functionality being added to DataSpace based upon our experimental studies. Section 8 contains a discussion and Section 9 concludes.
In this section we describe a simple example of a DataSpace query. An article in Science noted that there is a relation between El Nino and outbreaks of cholera [Pascual:2000]. Although it is easy to find many articles on the web about El Nino and cholera, it was extremely difficult to find either the El Nino or the cholera data on the web in a format in which this hypothesis could be readily checked. Some cholera data is available from the WHO, but not the data used in the study. While El Nino data is available, the proper El Nino data is not easy to locate, and when located it is available only as HTML pages or via FTP, neither of which allows it to be directly correlated with other data. The challenge is to make these types of correlations as easy as pointing and clicking, the same criterion we expect today when viewing remote documents. An infrastructure supporting such casual exploration of data would change in a fundamental fashion the data mining of scientific and engineering data.
To explore data mining from this perspective, we imported atmospheric data from the National Center for Atmospheric Research (NCAR) into DataSpace. We also imported cholera data from the World Health Organization (WHO) into DataSpace.
The importing of this data into DataSpace is relatively simple. DataSpace Clients and Servers communicate using a protocol called the DataSpace Transfer Protocol or DSTP. In this example, the data was managed by an open source DSTP Server and the client was a simple web browser. Putting data into DataSpace consists of the following three steps:

1. Assign one or more UCKs (universal correlation keys, described in Section 4) to the attributes of the data.
2. Prepare the metadata describing the data set and its attributes.
3. Load the data, together with its UCKs and metadata, onto a DSTP Server.
Here is what happens between the DSTP Client and Server in a simple DataSpace query:

1. The client retrieves the UCKs available on the server.
2. The client retrieves the metadata for the data sets and attributes of interest.
3. Using the metadata, the client selects and retrieves an appropriate amount of the data itself.
We close this section with a few comments about the query:
Comment 1. Notice that the interaction between the DSTP client and server proceeds through three phases: first UCKs are retrieved, then metadata, then data. By using UCKs, different, possibly distributed, columns can be meaningfully compared. By next retrieving the metadata, the client and server can select an appropriate amount of data to return in the final step. The DSTP servers we have developed treat the UCKs, metadata and data quite differently. Different storage formats and different caching policies are used for each. For example, the UCKs and metadata can be stored in XML and kept in memory during operation, while the data files can be stored more efficiently and retrieved on a per query basis.
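To make the three phases concrete, here is a minimal sketch in Python. The DSTPConnection class is a hypothetical stand-in for a real DSTP client library, not part of DataSpace itself; it only illustrates the order of the phases and how metadata retrieved in the second phase can bound the data requested in the third.

```python
# A hypothetical, in-memory stand-in for a DSTP client connection.
class DSTPConnection:
    def __init__(self, ucks, metadata, rows):
        self.ucks, self.metadata, self.rows = ucks, metadata, rows

    def get_ucks(self):                      # phase 1: UCK names and IDs
        return self.ucks

    def get_metadata(self, attribute):       # phase 2: units, statistics, counts
        return self.metadata[attribute]

    def get_data(self, attribute, sample=1.0):   # phase 3: the data itself
        n = int(len(self.rows) * sample)
        return [(k, row[attribute]) for k, row in self.rows[:n]]

conn = DSTPConnection(
    ucks={"latitude": 11110, "longitude": 11111},
    metadata={"temperature": {"units": "celsius", "count": 3}},
    rows=[(0.0, {"temperature": 26.1}),
          (5.0, {"temperature": 25.4}),
          (10.0, {"temperature": 24.8})])

print(conn.get_ucks())                       # phase 1
meta = conn.get_metadata("temperature")      # phase 2
sample = min(1.0, 2 / meta["count"])         # use the metadata to bound the transfer
print(conn.get_data("temperature", sample))  # phase 3
```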
Comment 2. Notice that the basic transformations required for this query are handled directly and automatically by the client. For known UCKs, simple direct mechanisms can be created by hand to implement the required transformations using, for example, the templated transformations that are part of PMML Version 2.0 [DMG]. Although this may be counterintuitive, interesting distributed queries of data can be done with this approach. The reason is that a basic set of UCKs can sometimes cover a fairly large class of data from within a community (see the examples in Section 6); think of this as the 80-20 rule for UCKs. The alternative of using ontologies and RDF, although much more flexible and powerful, creates a higher barrier before data can be put into a data web [W3C:2001]. As an analogy, URLs allow documents to be placed easily onto the web, but create significant barriers to finding documents. One could argue, though, that requiring documents to use URIs, RDF, and ontologies (which are equally applicable to documents and make search easier) would have significantly lowered the likelihood that the web would have been a success.
Comment 3. This example uses multidimensional UCKs (latitude, longitude, and time), each of which is ordered. In practice, range queries proved to be among the most common queries, and a client application with direct support for range queries was also developed; a sketch of such a query appears below.
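Here is a minimal client-side sketch of such a range query in Python. The records and bounds are invented for illustration; on an actual DSTP server the range restriction would be applied server side, so that only matching records cross the network.

```python
records = [
    # (latitude, longitude, time, sea surface temperature) -- invented values
    (-5.0, -120.0, "1997-11", 29.4),
    (2.0, -110.0, "1997-12", 29.1),
    (47.5, 8.0, "1997-11", 12.1),
]

def in_range(rec, lat=(-10.0, 10.0), lon=(-150.0, -80.0),
             time=("1997-01", "1998-12")):
    la, lo, t, _ = rec
    return (lat[0] <= la <= lat[1] and lon[0] <= lo <= lon[1]
            and time[0] <= t <= time[1])      # ISO dates compare lexicographically

print([r for r in records if in_range(r)])   # the two equatorial Pacific records
```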
By a data web [Grossman:2002] we mean a web based infrastructure for data. In Section 5 below, we describe a simple data web server and data web client which communicate with a protocol called the DataSpace Transfer Protocol (DSTP) and which support the following services: UCK queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values.
In this section we describe how a data web with these types of services compares with related infrastructures, such as data mining systems, distributed data mining systems, data grids, and semantic webs.
Data mining systems generally assume that the data is local, importing it from flat files, databases, or proprietary file formats. Once the data is imported, it can be examined using a variety of statistical and data mining algorithms. Many data mining systems also include visualization packages.
Distributed data mining systems [Kargupta:2000] are typically of two types: those that use agents to move models (MM) and those that use agents to move data (MD). The most common type of distributed data mining system moves models. Local models at each of several distributed sites are built and then moved to a centralized site where they can be combined. The end result is an ensemble of models or a hierarchical model which combines local models built from data at each of the distributed sites. Another approach is to move data from a number of distributed sites to a central location. Once the data has been centralized a single model or ensemble of models is built. Some systems also support a hybrid strategy in which both the data and the models may be moved.
Data grids are grid based infrastructures [Foster:1999] for working with data. A grid is a distributed collection of computing resources which appears as a single virtual computing infrastructure through shared security services, including single sign on, an LDAP-based information infrastructure, and resource management services. A data grid adds data specific services [Chervenak:2000]. These currently include GridFTP, for moving data over a grid, and data replication services, so that data may be efficiently cached across a grid.
Semantic webs [W3C:2001] extend the web's HTML infrastructure to include semantic information defined by XML and the Resource Description Framework (RDF). The semantic web also interoperates with protocols such as SOAP, which is a serialization protocol so that remote objects can be accessed over the web. RDF views information as a directed labeled graph and serializes it in XML [W3C:1999]. Less formally, RDF codes information using subject-verb-object triples. For example, (www.ncar.ucar.edu/ccm/1/1, Temperature, 45.5) is a subject-verb-object triple giving the Temperature for a particular data record specified by the URL. The semantic web also supports ontologies so that data taxonomies can be used.
Briefly, distributed data mining systems are agent based systems which either move data or move statistical models between sites as one stage in the process of building a statistical model on distributed data. Data webs provide just some of the functionality of semantic webs: precisely the functionality required to work with remote and distributed data. Semantic webs are designed to support general knowledge based computations, viewed as computations over RDF and XML data, together with the transport of general objects via SOAP and of semi-structured data via its serialization as XML. Data grids, unlike data webs, work with data using the infrastructure of a virtual distributed computer. For this reason, the assumption with data grids is that you have logged on and have been authorized to examine and compute with remote data.
The point of view proposed here is that a data web supporting the ability to view, explore, and merge remote and distributed data is sufficient for the initial phases of the data mining process and, therefore, that data webs have the potential for fundamentally changing the way scientists and engineers work with other peoples' data.
To demonstrate this, we have implemented DSTP servers which separate UCKs, metadata, and data and support multidimensional UCKs, UCK-based queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values. These services are described in more detail below. The DSTP servers also support ad hoc normalizations, transformations, and aggregations. We have also developed several DSTP applications, three of which are described below. As far as we are aware, the general infrastructure and services for semantic webs and data grids would not have significantly simplified this work. On the other hand, it is clear that one could develop DSTP servers as either a semantic web application or a data grid application. Our point of view in this paper is that a DSTP server implemented as a data web application is practical, useful, and much simpler for many types of client applications.
One way to make sense of the different technologies is to place them along two axes. Along the horizontal axis is what you do with the data (an action), such as viewing it, mining it, or computing with it. Along the vertical axis is the object of the action, which may be a file, rows and columns (records and fields), or higher order concepts such as the ontologies and related concepts underlying a knowledge management system. This viewpoint is summarized in Table 1 below.
| | View | Mine/Discover | Compute |
|---|---|---|---|
| Knowledge | Digital Libraries | Knowledge Mining | Semantic Webs |
| Attributes/Columns | Web accessible databases | Data Webs | Data Grids |
| Files | Persistent Archives | Distributed Data Mining | Grids |
Table 1. Technologies for exploring remote and distributed numerical data: Data webs, data grids, and semantic webs can all provide web based access to numerical data. Data webs provide direct access to distributed rows and columns of data. Data grids enable large scale resource sharing of computational and data resources. Semantic webs provide knowledge based access to data using ontologies, RDF and agent based architectures.
DataSpace provides a foundation for remote data analysis and distributed data mining of scientific and engineering data using just four key concepts: universal correlation keys (UCKs), the DataSpace Transfer Protocol (DSTP), attribute and data set metadata, and a small set of data services. We describe each in turn below.
Universal correlation keys (UCKs) form the basic glue among attributes available on different DataSpace servers. All data published by a DataSpace server has to be attached to at least one UCK. Every UCK is characterized by its name (not necessarily unique) and its ID number (globally unique). Suppose one DataSpace server lists temperature values of the earth's surface according to longitude/latitude and another server lists precipitation values. A scientist might want to correlate precipitation with temperature. In this case the UCKs on both servers would be "latitude" (suppose the ID number is 11110) and "longitude" (suppose the ID number is 11111). Identical ID numbers guarantee that the key on both servers is the same and that the unit of the key on all servers is the same. For example, the ID number 11111 might specify that the key represents longitude and that longitude is measured in one degree intervals. If longitude is measured in 5 degree intervals the ID number would have to be changed, although the UCK name might be "longitude" in both cases.
More precisely, UCKs allow distributed columns to be correlated in the following fashion: pairs (k_i, x_i), where k_i is a UCK value and x_i is an attribute value, on DataSpace Server 1 can be combined with pairs (k_j, y_j) on DataSpace Server 2 by matching equal UCK values k_i = k_j = k, producing a table of pairs (x_k, y_k) in a DataSpace client. The DataSpace client can then, for example, find a function y = f(x) relating x and y.
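A minimal Python sketch of this UCK-based join, with invented values:

```python
# Pairs of (UCK value, attribute value) as served by two DSTP servers.
server1 = [(0, 26.1), (5, 25.4), (10, 24.8)]   # (longitude UCK, temperature)
server2 = [(0, 120.0), (5, 80.0), (15, 60.0)]  # (longitude UCK, precipitation)

# Join on equal UCK values to produce the (x, y) table in the client.
y_by_key = dict(server2)
table = [(x, y_by_key[k]) for k, x in server1 if k in y_by_key]
print(table)   # [(26.1, 120.0), (25.4, 80.0)], ready to fit y = f(x)
```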
The open source DataSpace servers and clients we have developed communicate with a protocol called the DataSpace Transfer Protocol or DSTP. Depending upon the request, DSTP servers may return one or more columns, one or more rows, or entire tables. DSTP is broadly based upon the Network News Transfer Protocol (NNTP) [Kantor:1986], a protocol for retrieval of news articles.
DSTP uses XML to describe the metadata. For efficiency, the data itself is transmitted as records delimited by carriage returns, with fields delimited by commas.
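For illustration, here is a short Python sketch parsing a payload in this format; the payload itself is invented, and in practice the field order and types would come from the XML metadata.

```python
# Carriage-return delimited records with comma-delimited fields.
payload = "0,26.1\r5,25.4\r10,24.8\r"

rows = [line.split(",") for line in payload.split("\r") if line]
data = [(int(k), float(x)) for k, x in rows]   # (UCK value, attribute value)
print(data)
```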
DSTP servers use stream connections and NNTP-like commands and responses. They are designed to accept connections from DSTP Clients and to provide a simple interface to the data columns on the DSTP Server. A DSTP Server functions as an interface between DSTP applications and remote data.
The basic DSTP commands allow DSTP Clients to retrieve metadata; specify UCKs, data sets, ranges, and sampling parameters; and retrieve the specified data. The metadata includes not only information about the units and about how the data was collected and processed, but also the minimum, maximum, and other basic statistics of the attributes, which are important in EDA operations.
The hard part about working with data over the web is deciding what data to transport, what units the data is in, what processes were used to prepare the data, and how the data should be normalized so that it can be used with other data. These questions are all essential for exploratory data analysis and data mining. DSTP servers support services which return data set metadata and attribute metadata to answer them. Each DSTP server also has a special file called the catalog file, containing metadata about the data sets on the server, to facilitate searching for and locating remote data sets.
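As a sketch, attribute metadata of this kind might be represented and read as follows. The XML element names here are hypothetical; they are not the actual DSTP metadata schema, but they illustrate the kind of information served.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute metadata, not the actual DSTP schema.
metadata = """
<attribute name="temperature">
  <units>celsius</units>
  <min>-89.2</min>
  <max>56.7</max>
  <missing-value>-999</missing-value>
</attribute>
"""

attr = ET.fromstring(metadata)
lo, hi = float(attr.find("min").text), float(attr.find("max").text)
print(attr.get("name"), attr.find("units").text, (lo, hi))
```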
To summarize, DSTP Clients and Servers support the following services: UCK queries, metadata queries, data queries, UCK-based range queries, server side sampling, and the handling of missing values.
Example 1. Earth Science Data. We placed approximately 100 Gigabytes of Community Climate Model (CCM3) data from the National Center for Atmospheric Research (NCAR) on a DSTP Server located at NCAR. CCM3 data is used by scientists to study CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer and nuclear winter. The data consists of monthly satellite measurements of global surface temperatures, precipitation, ozone levels and vegetation index.
There are three UCKs for this application: latitude, longitude, and time. Figure 2 shows the result of a DataSpace query for sea surface temperature. The DSTP Client for this application supports queries by UCK, by attribute ranges, by attribute, and by data set. The client can view the data, download the data, graph the data, and do simple EDA operations on the data.
The DSTP Client can also compare the CCM3 data and overlay the CCM3 data with other data sets sharing one or more of the same UCKs.
Example 2. Protein Data. In this application [Hamelberg:2002], we took data from the Protein Data Bank [PDB] and placed it on a DSTP Server in Halifax. The data consisted of records of the form
C,PRO,2,28.901,38.374,3.596
In this example, C is the type of atom (Carbon), PRO is the residue to which the atom belongs (Proline), 2 is the ID of the molecule of which the amino acid is a part, and the final three values are the x, y, and z coordinates of the atom. The x, y, z coordinates serve as the UCKs. We also placed data describing drugs on a DSTP Server in Amsterdam using the same UCKs. Both of these DSTP Servers were on a testbed for DataSpace which allowed us to measure the performance of various queries involving large data sets.
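A short Python sketch of parsing such a record into (UCK, attribute) form, where the x, y, z coordinates serve as a three-dimensional UCK:

```python
record = "C,PRO,2,28.901,38.374,3.596"

atom, residue, mol_id, x, y, z = record.split(",")
uck = (float(x), float(y), float(z))    # the three-dimensional UCK
value = (atom, residue, int(mol_id))    # the attribute values
print(uck, value)
```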
The DSTP Client application in Figure 3 can retrieve and interactively explore proteins. The proteins can be displayed in the Chemical Markup Language (CML) or in PDB file formats. The data can also be visualized using a graphics program like Rasmol or a web visualization tool such as Webmol.
Of more interest, a distributed query can be done between a protein molecule from the Halifax DSTP Server and a small organic compound from the Amsterdam DSTP Server. For example, Figure 3 shows a drug candidate molecule docking with a protein from the PDB.
In addition, the DSTP Client can query NCBI's PubMed for all references related to the proteins and drugs on the DSTP Servers.
Example 3. Astronomical Data. In this example, taken from [Grossman:2001], we queried two geographically distributed astronomical source catalogs: 2MASS (Two Micron All Sky Survey) survey data from a DSTP Server at CalTech and DPOSS (Digital Palomar Observatory Sky Survey) survey data, which we replicated on a DSTP Server in Chicago. The DPOSS data are in the optical wavelengths (0.4 - 0.7 micron), while the 2MASS data are in the infrared (1.2 - 2.2 micron) range.
The positions of the light sources in both data sets are given in polar coordinates. The UCKs for this example are the right ascension and declination, measured in degrees.
A typical query of interest to astronomers is of the form: "Find all pairs, one from the DPOSS catalog and one from 2MASS, whose angular separation is less than a given tolerance." Figure 4 illustrates a DataSpace query with a tolerance of 2 arc seconds in the region of the sky with right ascension from 183 to 270 degrees and declination from 17 to 47 degrees.
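A minimal Python sketch of this cross-match appears below. It assumes separations small enough that a flat-sky approximation suffices, and the two tiny catalogs are invented for illustration.

```python
from math import cos, radians, sqrt

dposs = [(200.001, 30.000), (250.500, 40.000)]    # (ra, dec) in degrees
twomass = [(200.0015, 30.0001), (190.000, 20.000)]

tolerance = 2.0 / 3600.0                          # 2 arc seconds, in degrees

pairs = []
for ra1, dec1 in dposs:
    for ra2, dec2 in twomass:
        # Flat-sky angular separation: sqrt((dRA * cos(dec))^2 + dDec^2).
        sep = sqrt(((ra1 - ra2) * cos(radians(dec1))) ** 2 + (dec1 - dec2) ** 2)
        if sep < tolerance:
            pairs.append(((ra1, dec1), (ra2, dec2)))

print(pairs)   # the matched DPOSS/2MASS pairs
```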
After working with the examples described above and related examples for the past two years, we have decided to standardize on certain primitives for normalizing, transforming, and aggregating data. The following four transformations, which are defined in [DMG], seem sufficient to cover most of the transformations we use in practice: normalization, which maps continuous or discrete values to numbers; discretization, which maps continuous values to discrete values; value mapping, which maps discrete values to discrete values; and aggregation, which summarizes or collects groups of values, for example by computing averages.
We have been working with the Data Mining Group to standardize these operations. These four operations are part of Version 2.0 of the Predictive Model Markup Language, an XML standard for statistical and data mining models [DMG]. Today these transformations are done in an ad hoc fashion by DataSpace clients. DMG compliant implementations of these operations will be supported in the next version of DataSpace.
Paraphrasing the Data Mining Group's description of the PMML Version 2.0 transformations [DMG], the approach is not to cover the full set of preprocessing functions which may be needed to collect and prepare data for mining, but rather to introduce a templated set of basic operations designed to cover many of the most common ones. The operations above cover the normalization of input values required for neural networks. They also cover the quantile-range discretizations that are used to transform skewed data. Indeed, it has been our experience that they cover many of the transformations we use in practice to prepare data for mining.
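As an illustration, here are Python sketches of two of these operations, in the spirit of PMML's piecewise-linear normalization and interval discretization; the breakpoints and bins are invented, not taken from the PMML specification.

```python
def norm_continuous(x, pairs):
    """Piecewise-linear map defined by (orig, norm) breakpoint pairs."""
    for (o1, n1), (o2, n2) in zip(pairs, pairs[1:]):
        if o1 <= x <= o2:
            return n1 + (x - o1) * (n2 - n1) / (o2 - o1)
    raise ValueError("value outside the breakpoints")

def discretize(x, bins):
    """Map a continuous value to the label of the first covering bin."""
    for lo, hi, label in bins:
        if lo <= x < hi:
            return label
    return "other"

print(norm_continuous(15.0, [(0, 0.0), (30, 1.0)]))          # 0.5
print(discretize(15.0, [(0, 10, "low"), (10, 25, "medium"),
                        (25, 40, "high")]))                  # "medium"
```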
We have used these transformations, for example, to merge two data sets with different UCKs: correlating El Nino anomalies with cholera outbreaks requires aggregating and normalizing the data from the two data sets so that they can be meaningfully compared. Today, our DSTP clients essentially hard code some of the more common transformations for some of the more common UCKs, as the sketch below illustrates.
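A minimal sketch of such a client-side registry, with hypothetical UCK IDs and conversions:

```python
# Hypothetical registry mapping UCK IDs to the conversion applied
# before two columns keyed by that UCK can be compared.
TRANSFORMS = {
    11111: lambda lon: round(lon / 5.0) * 5.0,   # 1-degree -> 5-degree grid
    11112: lambda t: (t - 32.0) * 5.0 / 9.0,     # Fahrenheit -> Celsius
}

def to_common_units(uck_id, value):
    return TRANSFORMS.get(uck_id, lambda v: v)(value)

print(to_common_units(11111, 23.0))   # 25.0: snapped to the coarser grid
```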
We feel that, together with the basic graphics and visualization that are already part of DataSpace, these transformations will cover many of the common EDA operations desired, further enhancing the ability of scientists using DataSpace to casually explore remote and distributed data. For example, overlaying remote data on local data can be done transparently for data sets employing known UCKs.
The work described here can be thought of as a specific semantic web application and implementation. The challenge we faced was understanding a) what specific services, data, and metadata are required for exploratory data analysis using data webs and b) how to implement these services in a scalable manner.
We next describe the relation between data webs and semantic webs [W3C:2001] in more detail.
XML. It is clear that any of the metadata required for data mining can be put into XML. Based upon our experiences building data web applications, we have been an active participant in the Data Mining Group [DMG] and our data web applications use the XML metadata standards developed by them.
RDF. We have chosen not to use the semantic web's RDF standard, since our interest is in data mining, not knowledge management. Although in some sense data is a "trivial" type of knowledge, from a practical viewpoint a scalable, robust web infrastructure for working with data is useful for the same reason we still have databases and data archive systems, even though knowledge management and AI systems should have made these "trivial" long ago.
WSDL. Webs are built from services; DSTP is a web service for working with remote and distributed data. Recently we released a WSDL description of DSTP. This will be useful as semantic web applications become more common.
SOAP. We chose to implement DSTP directly rather than over SOAP for several reasons: SOAP was not available when we started, and SOAP does not scale as well to high volume data streams. We will shortly release a version of DSTP which uses SOAP and which is suitable for DSTP client applications involving small data sets.
The majority of the time for a typical scientific or engineering data mining application is spent cleaning, transforming, and exploring the data. In this paper, we have described a data web which is designed to simplify these operations. In particular, the infrastructure called DataSpace allows scientists to view, retrieve, and apply simple transformations to remote and distributed data.
We have described how DataSpace can be used for preliminary exploratory analysis of atmospheric, biological and astronomical data. If the data appears interesting, both the data and the relevant metadata can be retrieved with a few points and clicks so that it can be analyzed locally using statistical and data mining software.
We feel that data webs make it easier to use unfamiliar data and that this ability should enable certain data driven scientific discoveries which might not occur otherwise. One of the hopes for data mining is an enhanced ability to make scientific discoveries from data that would otherwise be ignored. Although there has been a great deal of research on new data mining algorithms, without the data these algorithms have no place to start. The data web infrastructure described in this paper encourages the casual exploration of remote and distributed data, making it easier for scientists to use data that might otherwise be overlooked.
[DMG] Data Mining Group, The Predictive Model Markup Language (PMML), A Markup Language for Statistical and Data Mining Models, Version 2.0. Retrieved from www.dmg.org on January 10, 2002.
[Pascual:2000] Mercedes Pascual, Xavier Rodó, Stephen P. Ellner, Rita Colwell, and Menno J. Bouma, Cholera Dynamics and El Niño-Southern Oscillation, Science, Volume 289, 2000, pp. 1766-1769.
[Chervenak:2000] A. Chervenak, I. Foster, C. Kesselman, and S. Tuecke, Protocols and Services for Distributed Data-Intensive Science, ACAT2000 Proceedings, pp. 161-163, 2000.
[Foster:1999] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, 1999.
[Grossman:2001] Robert Grossman, Emory Creel, Marco Mazzucco, and Roy Williams, A DataSpace Infrastructure for Astronomical Data, in R. L. Grossman, C. Kamath, W. Philip Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.
[Grossman:2002] R. L. Grossman, M. Hornick, and G. Meyer, Emerging KDD Standards, Communications of the ACM, Special Issue on Data Mining, to appear.
[Hamelberg:2002] D. Hamelberg and R. L. Grossman, A DataSpace Infrastructure for Bioinformatics Data.
[Kantor:1986] Brian Kantor and Phil Lapsley, Network News Transfer Protocol, February 1986, RFC 977. Retrieved from www.w3.org/Protocols/rfc977/rfc977.html on January 10, 2002.
[Kargupta:2000] H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, AAAI Press/The MIT Press, Menlo Park, California, 2000.
[PDB] Protein Data Bank, www.rcsb.org/pdb.
[W3C:2001] The Semantic Web. Retrieved from http://www.w3.org/2001/sw/ on February 10, 2002.
[W3C:1999] Web Architecture: Describing and Exchanging Data, W3C Note 7 June 1999. Retrieved from www.w3.org/1999/04/WebData on February 10, 2002.
Figure 1: Result of a DataSpace query on El Nino data. Using UCKs and the DataSpace client's basic EDA capabilities, the El Nino data can be compared to cholera data with a few points and clicks, even though the data is from different sites and originally in quite different formats.
Figure 2: Result of a DataSpace query for sea surface temperature on NCAR data from a DSTP Server at NCAR in Boulder.
Figure 3: Result of a distributed DataSpace query. One of the sites contains 3-D protein data from the Protein Data Bank which we replicated in Halifax for testing of the DataSpace infrastructure. The other contains 3-D data describing small organic compounds, which may be used as drugs. This data is on a DataSpace server in Amsterdam which is also part of our testbed. The query results in the docking of the potential drug in the protein.
Figure 4: The result of a DataSpace query of astronomical data from two sky surveys. One data set is 2MASS (Two Micron All Sky Survey) survey data from a DSTP Server at CalTech and the other is DPOSS (Digital Palomar Observatory Sky Survey) survey data, which we replicated on a DSTP Server in Chicago.