Data Integration in a Bandwidth Rich World

Ian Foster
Robert L. Grossman

This is an early draft of the article: Ian Foster and Robert L. Grossman, Data Integration in a Bandwidth Rich World, Communications ACM, Volume 46, Issue 11, November, 2003, pages 50-57.

Introduction

Exponential advances in sensors, storage systems, and computers are producing data of unprecedented quantity and quality. Multi-terabyte and even petabyte (1,000TB) data sets are emerging as major assets. For example, the climate science community has access to hundreds of terabytes of observational data from NASA's Earth-observing system and simulation data from high-performance climate models; these data sources can yield new insights into global change. The World-Wide Telescope linking hundreds of digital sky surveys is revolutionizing astronomy [11]. And in industry, multi-terabyte (soon to be petabyte) data warehouses of consumer transactional data are increasingly common.

The key to deriving insight and knowledge is often the correlation of data from multiple sources, as these examples show. The traditional paradigm for such syntheses is to gather data at a single location and transform it into a common format prior to exploring it. However, the expense of this approach in terms of network resources has meant that most data is never correlated or compared to other data. In a world of more and more data, storage systems, computers, and networks, it is both necessary and feasible for system architects to think in terms of a new paradigm based on data integration—the flexible and managed federation, exploration, and processing of data from many sources.

One factor driving this new paradigm is dramatic improvement in network performance. Few of today's Internet paths move data at more than a megabit per second (Mbps), so moving a terabyte takes weeks or even months. Fortunately, advances in networking technologies are ushering in an era of bandwidth abundance based on Tbps optical backbones providing routine access to end-to-end paths of 10Gbps or more, a four-orders-of-magnitude improvement (see the article by DeFanti et al. in this section). For example, in 2002, Earth science data striped over a three-node cluster was transported at a rate of 2.4Gbps between Amsterdam and Chicago, a terabyte in an hour [9].

Just as critical for effective data integration, and our focus here, is the distributed system middleware beginning to allow distributed communities, or virtual organizations, to access and share data, networks, and other resources in a controlled and secure manner. Recent advances promise to provide required capabilities. For example, Open Grid Services Architecture (OGSA) standards and technologies provide for the secure and reliable virtualization and management of distributed data and computing resources [2, 6]. And data Web infrastructures support discovery, exploration, analysis, integration, and mining of remote and distributed data [10]. Such efforts are pioneering a new generation of distributed data discovery, access, and exploration technologies promising to transform the Internet into a data-integration platform. On it, users will be able to perform sophisticated operations on remote and distributed petascale data sets (see the sidebar "Data-Integration Technologies").

Petascale Scenarios

The following scenarios illustrate applications that are impossible today but become achievable over optical networks combined with data-integration services.

Virtual data warehouses. Today, data warehouses are centralized repositories of data used for reporting and querying. High-speed optical networks make it possible for data to instead be stored at its source. When reports are required, bandwidth can be requested, data merged from multiple sources, and reports generated using the most current data. In effect, virtual data warehouses are constructed on the fly. Here, as elsewhere, new data architectures become possible when wide-area networks are able to transport data at speeds comparable to those of a computer's backplane.
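
As a minimal sketch of the idea, the following code treats two local SQLite databases as stand-ins for remote regional sources (the database, table, and column names are hypothetical) and assembles a report on demand by querying each source and merging the results, rather than loading everything into a central warehouse first.

    import sqlite3

    # Hypothetical regional sources; in a real deployment each would be a
    # remote database reached over a high-speed network connection.
    SOURCES = {"north": "north.db", "south": "south.db"}

    def setup_demo():
        # Create small demo databases so the sketch is self-contained.
        for region, path in SOURCES.items():
            con = sqlite3.connect(path)
            con.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, amount REAL)")
            con.execute("DELETE FROM sales")
            con.executemany("INSERT INTO sales VALUES (?, ?)",
                            [("2003-11-01", 100.0), ("2003-11-02", 250.0)])
            con.commit()
            con.close()

    def on_the_fly_report():
        # Build the "virtual warehouse" at query time: pull a summary from
        # each source and merge, so the report reflects the current data.
        report = {}
        for region, path in SOURCES.items():
            con = sqlite3.connect(path)
            for day, total in con.execute(
                    "SELECT day, SUM(amount) FROM sales GROUP BY day"):
                report.setdefault(day, {})[region] = total
            con.close()
        return report

    if __name__ == "__main__":
        setup_demo()
        for day, totals in sorted(on_the_fly_report().items()):
            print(day, totals)

In a deployment over optical networks, the per-source queries would be preceded by a bandwidth request and followed by its release, with the merge performed wherever the report is needed.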

Data replication for business continuity. Businesses providing critical infrastructure for disaster recovery and business continuity are increasingly locating secondary, even tertiary, backup facilities far from primary sites. In the past, large data volumes, high transaction speeds, slow networks, and poor distributed data management infrastructure made such distributed architectures difficult or impossible. However, with all-optical networks and distributed data services, it becomes feasible to replicate transactions from core systems to remote backup systems. Financial exchanges, telecommunication systems, reservation systems, and vendor dispatch and scheduling operations are examples of systems that can be replicated in this way.

Stream-based distributed processing of sensor data. Large centralized detectors and distributed sensor nets in such fields as physics, astronomy, seismology, and national security produce high-volume data streams requiring extensive processing prior to analysis. Today, processing is performed offline, and data sets are prepared and distributed only periodically. Optical networks and data-integration services can enable a new paradigm in which even large data sets are continuously updated, so users always have access to the most current data. Data can also be merged from multiple sources, processed in real time, and analyzed for changes, alerts, and other significant patterns.
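
The flavor of this continuous, in-stream analysis can be conveyed by a simple sliding-window monitor; the window size, threshold, and injected anomaly below are illustrative, not drawn from any particular detector.

    from collections import deque

    def monitor(readings, window=100, threshold=3.0):
        """Yield (index, value) alerts when a reading deviates sharply from
        the mean of the recent window: a stand-in for the kind of continuous
        processing of sensor streams described above."""
        recent = deque(maxlen=window)
        for i, x in enumerate(readings):
            if len(recent) >= 10:
                mean = sum(recent) / len(recent)
                std = (sum((r - mean) ** 2 for r in recent) / len(recent)) ** 0.5
                if std > 0 and abs(x - mean) > threshold * std:
                    yield i, x          # significant change: raise an alert
            recent.append(x)

    if __name__ == "__main__":
        import random
        random.seed(0)
        stream = [random.gauss(20.0, 0.5) for _ in range(1000)]
        stream[500] += 10.0             # inject an anomaly into the stream
        for i, x in monitor(stream):
            print("alert at reading", i, "value", round(x, 2))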

Requirements and Technologies

Distributed data sources can be diverse in their formats, schema, quality, access mechanisms, ownership, access policies, and capabilities. Overcoming this multi-tiered Tower of Babel to achieve distributed data integration requires technical solutions and standards in three closely related areas: data discovery and access; data exploration and analysis; and resource management, security, and policy.

Data discovery and access. The first step in integrating data is discovering data that may be relevant, often through middleware that examines metadata. Metadata can be represented, federated, and accessed in a variety of ways. Relevant technologies include Web services mechanisms (such as the Web Services Description Language); Grid-enabled data access and integration services [2]; directory services (such as the Lightweight Directory Access Protocol); XML and relational databases; Semantic Web technologies [5]; and text-based Web search mechanisms applied to unstructured text-based metadata.

Having identified data sets that might be relevant, the user's next step is to access the data and determine whether it is actually worth investigating. Data formats, schema, and access mechanisms span a broad range. Widely adopted access mechanisms include: the Open-source Project for a Network Data Access Protocol (OPeNDAP) in the environmental community; the Storage Resource Broker (SRB) [3] in scientific projects; Data Web protocols for data mining (the DataSpace Transfer Protocol, or DSTP); and GridFTP for high-performance and striped data movement. The OGSA-based Data Access and Integration (OGSA-DAI) [2] standards emerging from the Global Grid Forum seek to integrate these and other approaches.

Data access can demand high transport performance and require parallel data access and movement. For example, if remote data is being delivered at a rate of 1Gbps, and a particular application's data-integration activity involves reading 10 local bytes per remote byte received and performing 100 operations per local byte read, then the application requires 10Gbps of local read bandwidth and roughly 125 Gigaops/sec. of local computing to keep up with data delivery (a substantial and necessarily parallel resource). Striping data across multiple network connections linking pairs of nodes in distributed clusters is becoming a core technique in high-performance data transport [10]. The GridFTP extensions to the popular FTP protocol represent a standard approach to exploiting parallelism in data transfers, allowing multiple data channels to be coordinated via FTP control-channel commands. Also relevant is the work on advanced protocols described in the article by Falk et al. in this section.
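
The core idea of parallel data channels can be illustrated without GridFTP itself. The following sketch assumes a placeholder URL served by an HTTP server that honors byte-range requests; it fetches a file in stripes over several concurrent connections and reassembles it in order.

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://example.org/large-dataset.bin"   # placeholder data source
    STREAMS = 4                                    # number of parallel channels

    def fetch_stripe(start, end):
        # Fetch bytes [start, end] of the file over its own connection.
        req = urllib.request.Request(URL,
                                     headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return start, resp.read()

    def parallel_fetch():
        # Learn the total size, split it into stripes, fetch the stripes
        # concurrently, then reassemble them in order.
        head = urllib.request.Request(URL, method="HEAD")
        with urllib.request.urlopen(head) as resp:
            size = int(resp.headers["Content-Length"])
        step = size // STREAMS
        ranges = [(i * step, size - 1 if i == STREAMS - 1 else (i + 1) * step - 1)
                  for i in range(STREAMS)]
        with ThreadPoolExecutor(max_workers=STREAMS) as pool:
            parts = list(pool.map(lambda r: fetch_stripe(*r), ranges))
        return b"".join(data for _, data in sorted(parts))

In GridFTP, the analogous stripes are negotiated over the FTP control channel and may originate from multiple nodes of a storage cluster rather than a single server.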

We anticipate the emergence of data access services supporting the flexible creation and manipulation of views on data sources (whether files or tables) and access to those views using a variety of operations, including database-style operations (such as SQL "select") and other more general operations (such as attribute selection, row selection via range queries, and record selection via sampling). Integrating these mechanisms with high-performance transport protocols remains a major unresolved problem.
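
Such a view-based access service might look roughly like the following sketch. The TableView class and its methods are purely illustrative, not an existing API, but they show attribute selection, row selection via a range query, and record selection via sampling over an in-memory table.

    import random

    class TableView:
        """Illustrative view over a list of dict records, supporting the
        operations mentioned above."""

        def __init__(self, records):
            self.records = list(records)

        def project(self, *attrs):
            # Attribute selection: keep only the named columns.
            return TableView([{a: r[a] for a in attrs} for r in self.records])

        def where_range(self, attr, lo, hi):
            # Row selection via a range query on one attribute.
            return TableView([r for r in self.records if lo <= r[attr] <= hi])

        def sample(self, fraction, seed=0):
            # Record selection via sampling, e.g. for quick exploration.
            rng = random.Random(seed)
            return TableView([r for r in self.records if rng.random() < fraction])

    # Example: a small climate-like table.
    view = TableView([{"lat": 41.9, "lon": -87.6, "temp": 281.2},
                      {"lat": 52.4, "lon": 4.9, "temp": 278.9}])
    subset = view.where_range("lat", 40, 50).project("lat", "temp")
    print(subset.records)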

Data exploration and analysis. Data rendered accessible can be analyzed in detail. Here, data exploration services are needed to address the challenges inherent in finding relevant data that can be combined with local data or with other remote data to achieve new discoveries. These services can provide basic statistical summaries, enable visual exploration of data, and support standard exploratory functions (such as building clusters or computing the regression of one variable on another).

Efficient integration of distributed data requires protocols and services for managing the data records constituting data archives. Unlike files of bits, data archives of records have attributes, attribute metadata, keys, and missing values. Mechanisms for providing attribute- and record-based access to remote and distributed data include: SQL-based access methods for relational data; protocols designed to work with remote data (such as the DataSpace Transfer Protocol [10], OPeNDAP, and OGSA-DAI [2]); and protocols designed to work with remote and distributed semi-structured data (such as XPath). Data Webs support the exploration and mining of distributed data using templated data-mining operations.
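
For semi-structured data, record- and attribute-based access of this kind can be expressed with XPath. The sketch below uses the limited XPath subset in Python's standard library over a made-up record format with keys, attributes, and a missing value.

    import xml.etree.ElementTree as ET

    # A made-up archive of records with keys, attributes, and a missing value.
    DOC = """
    <archive>
      <record key="41.9,-87.6,2002-09-24"><temp>281.2</temp><veg>0.31</veg></record>
      <record key="52.4,4.9,2002-09-24"><temp>278.9</temp></record>
    </archive>
    """

    root = ET.fromstring(DOC)

    # Record selection by key (an XPath subset supported by ElementTree).
    rec = root.find(".//record[@key='52.4,4.9,2002-09-24']")

    # Attribute-level access, tolerating missing values.
    for name in ("temp", "veg"):
        elem = rec.find(name)
        print(name, elem.text if elem is not None else "MISSING")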

The transformation, analysis, and synthesis performed during data integration can be complex and computationally intensive. Data-transformation primitives incorporated into data middleware cannot capture arbitrary computations but can express many common data-preparation operations [10]. More general workflow services are also required to support the integration and scheduling of arbitrary user- and community-defined transformations. Users benefit from tools that record, organize, and exploit knowledge about how these activities derive new data from old. Virtual data systems aim to capture this information so as to allow reuse of generated data, explanation of data provenance, and other activities [8].
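
The essence of a virtual data catalog can be conveyed by a toy sketch: a mapping from a derivation (transformation name plus inputs) to the data it produced, so results can be reused and their provenance explained. This is only an illustration of the idea behind [8], not the systems described there.

    catalog = {}   # (transformation name, inputs) -> derived result

    def derive(name, func, *inputs):
        """Return the derived data, reusing a prior result when the same
        transformation has already been applied to the same inputs."""
        key = (name, inputs)
        if key not in catalog:
            catalog[key] = func(*inputs)
        return catalog[key]

    def provenance(name, *inputs):
        # Explain how a derived data product was (or would be) produced.
        return {"transformation": name, "inputs": inputs}

    # Example: a trivial "transformation" whose result is cached for reuse.
    mean = derive("mean", lambda *xs: sum(xs) / len(xs), 281.2, 278.9)
    print(mean, provenance("mean", 281.2, 278.9))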

Resource management, security, and policy. Accustomed to today's bandwidth- and data-poor world, users often assume that only standard schemas and access methods are required to render remote data accessible. But the distributed analysis of large quantities of data is computationally (and bandwidth) intensive, and a high-performance Internet can expose popular data resources to the risk of essentially unlimited loads. Efficient petascale data integration can require the harnessing and coordinated management of multiple computational and network resources at multiple sites.

Thus, clients (and brokers acting on their behalf) need to negotiate service level agreements (SLAs) with computers, storage systems, and networks. They also need to deploy applications able to achieve the desired end-to-end performance across these resources, as well as monitor progress and adapt to problems at either the network or SLA level [7]. For example, an application might request an end-to-end optical network plus associated computing and storage resources, use the resources to integrate remote and local data, then release them. Another effective optimization is to decouple data movement and computation so data is staged to locations "near" (in terms of some access-cost metric) the sites where it is required [12]. Data replication and distribution of data across the network [4] are also effective techniques.
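
The acquire-use-release pattern might be organized roughly as follows; the ResourceBroker interface here is entirely hypothetical, a stand-in for whatever SLA negotiation services a particular deployment provides.

    from contextlib import contextmanager

    class ResourceBroker:
        """Hypothetical stand-in for an SLA negotiation service; a real
        deployment would talk to network, storage, and compute managers."""

        def reserve(self, kind, **sla):
            print("reserved", kind, sla)
            return {"kind": kind, "sla": sla}

        def release(self, reservation):
            print("released", reservation["kind"])

    @contextmanager
    def end_to_end_path(broker, gbps, cpus, staging_tb):
        # Acquire the network path plus nearby compute and staging space,
        # hand them to the application, and release them when done.
        held = [broker.reserve("lightpath", bandwidth_gbps=gbps),
                broker.reserve("compute", cpus=cpus),
                broker.reserve("storage", terabytes=staging_tb)]
        try:
            yield held
        finally:
            for r in reversed(held):
                broker.release(r)

    if __name__ == "__main__":
        with end_to_end_path(ResourceBroker(), gbps=10, cpus=64, staging_tb=5):
            pass   # stage remote data "near" the computation, integrate, analyze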

Along with the data itself, the physical resources employed for data integration are frequently precious and thus subject to access controls. Data-integration middleware must therefore provide comprehensive security, policy, and resource management solutions. These solutions are required at multiple levels, ranging from the individual user ("Can I access this file?"), to the user community ("How many Gb-hours is this community allocated?"), and from the local ("Allocate me 1Gbps bandwidth"), to the end-to-end ("Allocate resources to achieve 10Gbps throughput for this pipeline"), to the global ("Ensure that the most popular data sets are replicated"). Security and policy solutions must address the concerns of both the institutions that own specific resources and the communities wishing to achieve distributed analysis.

Implications

Two examples from the sciences illustrate some practical implications and applications of these issues:

Joins of distributed Earth science data. The National Center for Atmospheric Research's Community Climate Model 3 (CCM3) helps research CO2 warming and climate change, climate prediction and predictability, atmospheric chemistry, paleoclimate, biosphere-atmosphere transfer, and nuclear winter. Scientists regularly want to integrate their own data with CCM3 data. For example, they might wish to join their historical data about vegetation levels with CCM3 data to study the effect of global climate change on certain types of vegetation. A typical data-integration operation is to join a field xi, say, temperature, from the remote CCM3 data set, which is indexed by a key ki consisting of a latitude-longitude-time triple (ri, si, ti), with a field yi from the other data set representing the vegetation level for the same key (ri, si, ti). In this way, scientists can estimate functional relationships of the form y = f(k; x) that capture how vegetation levels change over time with changes in climatic variables.
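
A minimal sketch of such a join, with made-up field names and toy in-memory streams standing in for the remote CCM3 and local vegetation data: records are matched on their (latitude, longitude, time) keys, yielding (temperature, vegetation) pairs from which a relationship y = f(k; x) could be estimated. The experiment described next used the DataSpace/DSTP infrastructure, not this toy code.

    def keyed_join(ccm3_stream, vegetation_stream):
        """Join records on their (lat, lon, time) key. Each record is a dict
        with a 'key' tuple plus the field of interest. A simple hash join:
        buffer one stream's keys, then scan the other."""
        temps = {rec["key"]: rec["temp"] for rec in ccm3_stream}
        for rec in vegetation_stream:
            if rec["key"] in temps:
                yield rec["key"], temps[rec["key"]], rec["veg"]

    # Toy streams standing in for the remote CCM3 and local vegetation data.
    ccm3 = [{"key": (41.9, -87.6, "2002-09"), "temp": 281.2},
            {"key": (52.4, 4.9, "2002-09"), "temp": 278.9}]
    veg = [{"key": (41.9, -87.6, "2002-09"), "veg": 0.31}]

    for key, temp, veg_level in keyed_join(ccm3, veg):
        print(key, temp, veg_level)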

In one study, the goal was to integrate data on the fly, without co-locating it, in order to estimate whether such a relationship is probable, in which case more careful follow-up studies would be warranted. The study, performed in conjunction with the iGrid 2002 conference in Amsterdam, The Netherlands, evaluated various algorithms for transporting and joining distributed streams of data indexed by latitude, longitude, and time. One stream contained temperature and related CCM3 data, the other vegetation levels. One data set was located on a three-node cluster in Chicago, the other on a three-node cluster in Amsterdam. The DataSpace Data Web software was used to move the data across the Atlantic and perform a streaming join in Amsterdam; a parallel version of the Simple Available Bandwidth Utilization Library (SABUL) protocol was used for data transport, and DSTP was used to manage keys, metadata, and data. Data was moved at a rate greater than 900Mbps per node (2.4Gbps, or roughly 1TB per hour, with the three-node cluster) and merged at approximately half that speed [9]. The experiment demonstrated that geographical distance need not be an obstacle to data integration.

Galaxy cluster identification in Sloan data. The Sloan Digital Sky Survey (SDSS) is a digital imaging survey that will, by 2007, have mapped a quarter of the sky in five colors with a sensitivity two orders of magnitude greater than previous large sky surveys. The SDSS data is being made available online as both a large collection (~10TB) of images and a smaller set of catalogs (~2TB) containing measurements on each of 250 million detected objects. SDSS is just one example of a growing set of digital sky survey projects that will soon yield an unprecedented international, distributed multi-petabyte collection of digital astronomical data [11].

Another recent experiment [1] showed how this online data could be integrated with distributed computing and storage resources to perform computationally intensive analysis of unprecedented scale. The challenge was to search the Sloan database for galaxy clusters, the largest gravitationally dominated structures in the universe. Software developed for the GriPhyN project—the so-called Virtual Data Toolkit—was used to plan, then manage the required workflow, ultimately involving computational clusters at four sites across the U.S. This illustrates how even large-scale distributed data analysis tasks might become routine once appropriate infrastructure is in place.

Conclusion

The data tsunami already upon us offers great opportunities for new insight and knowledge but demands significant advances in middleware for integrating data from diverse distributed sources. That's why we have sought to explore here not only the state of the art but likely future directions for this middleware.

Data mining emerged from statistics as a new discipline during the past decade, as large data sets became more and more common and the need for new technologies to mine them became critical. In the coming decade, data integration will emerge from distributed computing and data mining, fueled by the increasing number of distributed data sets and enabled by improving network performance. Data integration promises to have at least as great an effect as data mining has had.

References

1. Annis, J., Zhao, Y., Voeckler, J., Wilde, M., Kent, S., and Foster, I. Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey. In Proceedings of SC2002 (Baltimore, MD, Nov. 16–22). ACM Press, New York, 2002.

2. Atkinson, M., Chervenak, A., Kunszt, P., Narang, I., Paton, N., Pearson, D., Shoshani, A., and Watson, P. Data access, integration, and management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, 2004.

3. Baru, C., Moore, R., Rajasekar, A., and Wan, M. The SDSC storage resource broker. In Proceedings of the 8th Annual IBM Centers for Advanced Studies Conference (Toronto, Canada, 1998).

4. Beck, M., Moore, T., and Plank, J. An end-to-end approach to globally scalable network storage. In Proceedings of ACM SIGCOMM'02 (Pittsburgh, PA, Aug. 19–23). ACM Press, New York, 2002.

5. Berners-Lee, T., Hendler, J., and Lassila, O. The Semantic Web. Sci. Am. 284, 5 (May 2001), 34–43.

6. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., and Tuecke, S. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific data sets. J. Net. Comput. Applic. 23, 3 (July 2000), 187–200.

7. Czajkowski, K., Foster, I., and Kesselman, C. Resource and Service Management. In The Grid: Blueprint for a New Computing Infrastructure, 2nd Ed., I. Foster and C. Kesselman, Eds. Morgan Kaufmann, San Francisco, CA, 2004.

8. Foster, I., Voeckler, J., Wilde, M., and Zhao, Y. The Virtual Data Grid: A new model and architecture for data-intensive collaboration. In Proceedings of the Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 5–8, 2003).

9. Grossman, R., Gu, Y., Hanley, D., Hong, X., Lillethun, D., Levera, J., Mambretti, J., Mazzucco, M., and Weinberger, J. Experimental studies using photonic data services at iGrid 2002. Future Gen. Comput. Syst. 19, 6 (2003).

10. Grossman, R. Standards and infrastructures for data mining. Commun. ACM 45, 8 (Aug. 2002), 45–48.

11. Szalay, A. and Gray, J. The World-Wide Telescope. Science 293 (2001), 2037–2040.

12. Thain, D., Basney, J., Son, S.-C., and Livny, M. The Kangaroo approach to data movement on the Grid. In Proceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (San Francisco, CA, Aug. 7–9). IEEE Computer Society Press, 2001.

Sidebar: Data-Integration Technologies

The following projects are contributing to the development of data-integration middleware:

Data Web. This open source, Web-based software supports access, exploration, analysis, integration, and mining of remote and distributed data (see www.dataspaceweb.net).

Earth System Grid. This U.S. government-funded project applies data Grid technologies to the integration of Earth system modeling data (see www.earthsystemgrid.org).

EU DataGrid. This European Union-funded project develops and applies data Grid technologies in high-energy physics and other domains (see www.eu-datagrid.org).

Globus Toolkit. This open-source software provides the basic infrastructure for many Grid deployments worldwide (see www.globus.org).

Open Grid Services Architecture. This integration of Grid and Web services technologies defines standard interfaces and behaviors for distributed system integration and management (see www.ggf.org/ogsa-wg and www.globus.org/ogsa).

OGSA Data Access and Integration. This Global Grid Forum working group defines service-oriented interfaces for manipulating distributed data sources (see www.ggf.org/6_DATA/dais.htm).

Storage Resource Broker. This data access and federation technology provides data-mediation functions for data-intensive science (see www.npaci.edu/DICE/SRB).

Semantic Web. This extension of the Web aims to give information well-defined meaning (see www.w3.org/2001/sw).

Virtual Data Toolkit. This product of the National Science Foundation's GriPhyN project (see www.griphyn.org) integrates data management and analysis technologies, including the Globus Toolkit, Condor, and Chimera.

Web Services. This widely used set of standards specifies how applications define, discover, and access network-accessible services (see www.w3.org/2002/ws).