Emerging Data Mining Standards and Interfaces

Robert Grossman, University of Illinois at Chicago & Open Data Partners
Mark Hornick, Oracle Corporation
Gregor Meyer, IBM


This is an early version of the article: Robert Grossman, Mark Hornick, and Gregor Meyer, Emerging Standards and Interfaces in Data Mining, Handbook of Data Mining, Nong Ye, editor, Kluwer Academic Publishers.


1. Introduction

Data mining standards are concerned with one or more of the following issues:

  1. A standard representation for data mining and statistical models. This is the most straightforward issue to standardize and can be addressed readily using XML. It includes, for example, the parameters defining a classification tree.
  2. A standard representation for cleaning, transforming, and aggregating attributes to provide the inputs for data mining models. This includes, for example, the parameters defining how zip codes are mapped to three-digit codes prior to their use as a categorical variable in a classification tree.
  3. A standard representation for specifying the settings required to build models and to use the outputs of models in other systems. This includes, for example, the name of the training set used to build a classification tree.
  4. Interfaces and Application Programming Interfaces (APIs) to other languages and systems. There are standard data mining APIs for Java and SQL. This includes, for example, a description of the API so that a classification tree can be built on data in a SQL database.
  5. The overall process by which data mining models are produced, used, and deployed. This includes, for example, a description of the business interpretation of the output of a classification tree.
  6. Standards for viewing, analyzing, and mining remote and distributed data. This includes, for example, standards for the format of the data and metadata so that a classification tree can be built on distributed web-based data.

In the sections below, we discuss XML standards, data mining APIs, process standards, and emerging standards for integrating data mining into data webs and other web-based applications.

This chapter is based in part on the summary of data mining standards (Grossman et al., 2002).

2. XML Standards

There are XML standards for the statistical and data mining models themselves, as well as for the metadata associated with them, such as the metadata that specifies the settings for building and applying the models.

2.1 XML for Data Mining Models. The Predictive Model Markup Language (PMML) is being developed by the Data Mining Group (DMG, 2002), a vendor-led consortium that currently includes over a dozen members, including Angoss, IBM, Magnify, MINEit, Microsoft, National Center for Data Mining at the University of Illinois (Chicago), Oracle, NCR, Salford Systems, SPSS, SAS, and Xchange (PMML, 2002).

PMML is used to specify the models themselves and consists of the following components:

  1. Data Dictionary. The data dictionary defines the attributes input to models and specifies the type and value range for each attribute.
  2. Mining Schema. Each model contains one mining schema that lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions that do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
  3. Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.
  4. Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation, median, etc. of numerical attributes.
  5. Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models per se. Models in PMML Version 2.0 include regression models, clustering models, trees, neural networks, Bayesian models, association rules, and sequence models. Additional models are being planned for PMML Version 2.1. There is no mechanism in PMML to define statistical or data mining models that are not one of the supported types.
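The four kinds of derived fields listed under the Transformation Dictionary can be illustrated with a short sketch. The Python below is not PMML and does not follow PMML's element definitions; the function names, the interval labels, and the sample records are assumptions chosen only to make each kind of transformation concrete.

```python
from collections import defaultdict
from statistics import mean


def normalize(value, low, high):
    """Normalization: map a continuous value linearly onto [0, 1]."""
    return (value - low) / (high - low)


def discretize(value, bins):
    """Discretization: return the label of the first half-open
    interval [lo, hi) that contains the value."""
    for lo, hi, label in bins:
        if lo <= value < hi:
            return label
    return None  # the value falls outside every interval


def map_value(value, table, default=None):
    """Value mapping: map one discrete value to another via a lookup table."""
    return table.get(value, default)


def aggregate(records, group_by, field, func=mean):
    """Aggregation: summarize `field` over records that share `group_by`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record[field])
    return {key: func(values) for key, values in groups.items()}


# Hypothetical inputs, chosen only for illustration.
transactions = [
    {"zip3": "606", "amount": 10.0},
    {"zip3": "606", "amount": 30.0},
    {"zip3": "100", "amount": 5.0},
]
print(normalize(30.0, low=10.0, high=50.0))
print(discretize(30.0, [(0, 25, "short"), (25, 70, "long")]))
print(map_value("60607", {"60607": "606", "60612": "606"}))
print(aggregate(transactions, group_by="zip3", field="amount"))
```

The value mapping call mirrors the zip code example from the introduction: a five-digit code is replaced by a three-digit code before it is used as a categorical input.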

As an example, here is a data dictionary for Fisher's Iris data set:

<DataDictionary numberOfFields="5">
  <DataField name="Petal_length" optype="continuous"/>
  <DataField name="Petal_width" optype="continuous"/>
  <DataField name="Sepal_length" optype="continuous"/>
  <DataField name="Sepal_width" optype="continuous"/>
  <DataField name="Species_name" optype="categorical">
   <Value value="Setosa"/>
   <Value value="Virginica"/>
   <Value value="Versicolor"/>
  </DataField>
</DataDictionary>

and here is the corresponding mining schema:

  <MiningSchema>
   <MiningField name="Petal_length" usageType="active"/>
   <MiningField name="Petal_width" usageType="active"/>
   <MiningField name="Sepal_length" usageType="supplementary"/>
   <MiningField name="Sepal_width" usageType="supplementary"/>
   <MiningField name="Species_name" usageType="predicted"/>
  </MiningSchema>

Finally, here is a fragment describing a node of a decision tree built from the data:

   <Node score="Setosa" recordCount="50">
    <SimplePredicate field="Petal_length" operator="lessThan" value="24.5"/>
    <ScoreDistribution value="Setosa" recordCount="50"/>
    <ScoreDistribution value="Virginica" recordCount="0"/>
    <ScoreDistribution value="Versicolor" recordCount="0"/>
   </Node>
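To suggest how a consuming application might read such fragments, the following is a minimal sketch in Python using only the standard library XML parser. It is not a PMML implementation: the helper names fields_by_usage and node_matches are hypothetical, only the lessThan operator is handled, the ScoreDistribution children are omitted for brevity, and the sample records are invented.

```python
import xml.etree.ElementTree as ET

MINING_SCHEMA = """
<MiningSchema>
  <MiningField name="Petal_length" usageType="active"/>
  <MiningField name="Petal_width" usageType="active"/>
  <MiningField name="Sepal_length" usageType="supplementary"/>
  <MiningField name="Sepal_width" usageType="supplementary"/>
  <MiningField name="Species_name" usageType="predicted"/>
</MiningSchema>
"""

# ScoreDistribution children are omitted here for brevity.
NODE = """
<Node score="Setosa" recordCount="50">
  <SimplePredicate field="Petal_length" operator="lessThan" value="24.5"/>
</Node>
"""


def fields_by_usage(schema_xml):
    """Group MiningField names by their usageType attribute."""
    usage = {}
    for field in ET.fromstring(schema_xml).findall("MiningField"):
        usage.setdefault(field.get("usageType"), []).append(field.get("name"))
    return usage


def node_matches(node_xml, record):
    """Return the node's score if its SimplePredicate holds for the record,
    otherwise None. Only the lessThan operator is handled in this sketch."""
    node = ET.fromstring(node_xml)
    predicate = node.find("SimplePredicate")
    if predicate.get("operator") != "lessThan":
        raise NotImplementedError(predicate.get("operator"))
    if record[predicate.get("field")] < float(predicate.get("value")):
        return node.get("score")
    return None


usage = fields_by_usage(MINING_SCHEMA)
print(usage["active"])  # the fields the model reads as inputs
print(node_matches(NODE, {"Petal_length": 14.0, "Petal_width": 2.0}))
```

A complete scoring engine would walk the whole tree and support the remaining predicate operators; the sketch shows only that the mining schema identifies which fields a record must supply and that a node's predicate can be evaluated directly from its attributes.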

PMML Version 1.0 was concerned primarily with defining standards for common data mining models, assuming that the inputs to the models had already been defined. These inputs are called DataFields. PMML Version 2.0 introduced the TransformationDictionary, which contains DerivedFields. The inputs to models in PMML Version 2.0 may be DataFields or DerivedFields. In principle, this approach is powerful enough to capture the process of preparing data for statistical and data mining models and for deploying these models in operational systems.

2.2 XML for Data Mining Metadata. Through the Object Management Group, a new specification for data mining metadata has recently been defined using the Common Warehouse Metamodel (CWM) specification (OMG, 2002). CWM supports interoperability among data warehouse vendors by defining Document Type Definitions (DTDs) that standardize the XML metadata interchanged between data warehouses. The CWM standard generates the DTDs in three steps. First, a model is created using the Unified Modeling Language (UML, 2002). Second, the UML model is used to generate a CWM interchange format called the Meta-Object Facility (MOF) / XML Metadata Interchange (XMI) (UML, 2002). Third, the MOF/XMI is converted automatically to DTDs.

CWM for Data Mining (CWM DM) was specified by members of the Java Data Mining (JDM) expert group (see Section 3.2) and, as such, has many elements in common with JDM. For example, CWM DM defines function and algorithm settings that specify the type of model to build and the parameters passed to the algorithm. CWM DM also defines tasks that associate inputs with mining operations, such as build, test, and apply (score).

3. Application Programming Interfaces (APIs)

Data mining must co-exist with other systems. In addition to XML standards such as PMML, there are efforts to define data mining APIs for Java, SQL, and Microsoft's OLE DB.

3.1 SQL APIs. ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) are in the process of adopting a data mining extension to SQL. It consists of a collection of SQL user-defined types and routines to compute and apply data mining models.

ISO and IEC form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees. The data mining extensions in SQL are part of the SQL Multimedia and Application Packages standard, or SQL/MM. The particular specification, called SQL/MM Part 6: Data Mining, specifies a SQL interface to data mining packages.

Database users are interested in a standard for data mining so that they can freely combine databases with data mining applications. From this perspective, data mining is just a sophisticated tool for extracting information from, or aggregating, the original data. This is similar to the functionality that SQL provides today. Hence, standardizing data mining through SQL is a natural extension of the more basic storage and retrieval mechanisms that SQL already supports.
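The idea of applying a model through a SQL routine, so that scoring composes with ordinary queries, can be sketched as follows. This is not SQL/MM and uses none of its types or routine names: the example relies on Python's built-in sqlite3 module only because it is self-contained, and the score_petal function, the flowers table, and the sample rows are all hypothetical.

```python
import sqlite3


def score_petal(petal_length):
    """Hypothetical stand-in for a stored model: a single tree split."""
    return "Setosa" if petal_length < 24.5 else "Other"


conn = sqlite3.connect(":memory:")
# Register the scoring function as a one-argument SQL routine.
conn.create_function("score_petal", 1, score_petal)

conn.execute("CREATE TABLE flowers (id INTEGER PRIMARY KEY, petal_length REAL)")
conn.executemany(
    "INSERT INTO flowers (id, petal_length) VALUES (?, ?)",
    [(1, 14.0), (2, 47.0), (3, 16.0)],
)

# Scoring and aggregation are combined in one ordinary SQL query.
rows = conn.execute(
    "SELECT score_petal(petal_length) AS predicted, COUNT(*) "
    "FROM flowers GROUP BY predicted ORDER BY predicted"
).fetchall()
print(rows)
conn.close()
```

Because the model is exposed as a routine, its output can be filtered, joined, and grouped like any other column, which is the sense in which data mining extends the retrieval mechanisms SQL already provides.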

3.2 Java APIs. Turning to Java, the Java Specification Request 73 (JSR-73), known as Java Data Mining (JDM), defines a pure Java (tm) API to support data mining operations. These operations include model building, scoring data using models, and the creation, storage, access, and maintenance of data and metadata supporting data mining results (JSR-73, 2002). JDM also includes selected data transformations.

JDM not only defines a representative set of functions and algorithms, but also provides a framework so that new mining algorithms can be introduced. A key goal of the JDM team was to make it relatively easy to add additional models. Since many vendors specialize and support only certain data mining models, JDM is defined in a number of packages to support à la carte package compliance. For example, a vendor specializing in neural networks is likely to be interested in the classification and approximation functions, along with specific neural network algorithms and representations. JDM strives to make data mining accessible to both experts and non-experts by separating high-level function specifications from lower-level algorithm specifications.

JDM leverages the J2EE platform to enable scalability, integration with existing information systems, extensibility and reusability, as well as choices of servers, tools, and components for vendor implementations. JDM influenced and was influenced by several existing standards, such as SQL/MM, CWM DM, and PMML. JDM makes explicit provision for the import and export of data mining objects. The most common representation will be XML, where PMML can be leveraged for model representations and CWM DM for other mining objects such as settings, tasks, and scoring output.
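The separation of high-level function specifications from lower-level algorithm specifications can be illustrated with a small sketch. The code is written in Python rather than Java and does not reproduce any JDM class or method; ClassificationSettings, TreeAlgorithmSettings, resolve, and the default values are all hypothetical names and numbers introduced only to show the design idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeAlgorithmSettings:
    """Lower-level, algorithm-specific parameters an expert might tune."""
    max_depth: int = 5
    min_node_size: int = 10


@dataclass
class ClassificationSettings:
    """Higher-level function settings: what to predict, not how."""
    target: str
    # None means the caller leaves the choice of algorithm details to the engine.
    algorithm: Optional[TreeAlgorithmSettings] = None


def resolve(settings: ClassificationSettings) -> TreeAlgorithmSettings:
    """Return the algorithm settings a build would use for these settings."""
    if settings.algorithm is not None:
        return settings.algorithm
    return TreeAlgorithmSettings()  # engine defaults for non-expert users


# A non-expert names only the target; an expert also tunes the algorithm.
novice = ClassificationSettings(target="Species_name")
expert = ClassificationSettings(
    target="Species_name",
    algorithm=TreeAlgorithmSettings(max_depth=3, min_node_size=5),
)
print(resolve(novice))
print(resolve(expert))
```

Under this arrangement a non-expert can request a classification model without knowing which algorithm is used, while an expert can override the details without changing the function-level request.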

3.3 OLE DB APIs. Turning to Microsoft's SQL environment, OLE DB for Data Mining (OLE DB for DM) defines a data mining API to Microsoft's OLE DB environment (Microsoft OLE DB, 2002). OLE DB for DM does not introduce any new OLE DB interfaces, but rather uses a SQL-like query language and a specialized data structure called a rowset so that data mining consumers can communicate with data mining producers using OLE DB. Recently, OLE DB for DM has been subsumed by Microsoft's Analysis Services for SQL Server 2000 (Microsoft Analysis, 2002). Microsoft's Analysis Services provide APIs to Microsoft's SQL Server 2000 services, which support data transformations, data mining, and OLAP.

4. Web Standards

4.1 Semantic Web. The World Wide Web Consortium (W3C) standards for the semantic web define a general structure for knowledge using XML, RDF, and ontologies (W3C SW, 2002). This infrastructure can in principle be used to store the knowledge extracted from data using data mining systems, although at present, one could argue that this is more of a goal than an achievement. As an example of the type of knowledge that can be stored in the semantic web, RDF can be used to code assertions such as "credit transactions with a dollar amount of $1 at merchants with a merchant category code (MCC) of 542 have a 30% likelihood of being fraudulent."
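One possible way to code the fraud assertion above as RDF statements is sketched below. The Python builds subject, predicate, object triples and prints them in N-Triples form. The http://example.org/fraud# namespace and every property name in it are hypothetical, since no standard vocabulary for such rules is assumed here; only the RDF type property and the XML Schema datatypes are existing identifiers.

```python
EX = "http://example.org/fraud#"  # hypothetical vocabulary namespace
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(term):
    """Format an absolute URI as an N-Triples term."""
    return f"<{term}>"


def literal(value, datatype):
    """Format a typed literal using an XML Schema datatype."""
    return f'"{value}"^^<{XSD}{datatype}>'


rule = uri(EX + "rule1")
triples = [
    (rule, uri(RDF_TYPE), uri(EX + "FraudRule")),
    (rule, uri(EX + "transactionType"), uri(EX + "Credit")),
    (rule, uri(EX + "dollarAmount"), literal("1", "decimal")),
    (rule, uri(EX + "merchantCategoryCode"), literal("542", "string")),
    (rule, uri(EX + "fraudLikelihood"), literal("0.30", "decimal")),
]


def to_ntriples(statements):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


print(to_ntriples(triples))
```

Each condition of the rule becomes one statement about the same subject, so other applications can query the stored rule by property without parsing free text.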

4.2 Data Webs. Less general than semantic webs are data webs. Data webs are web-based infrastructures for working with data (instead of knowledge). For example, in the same way that a web client can retrieve a web document from a remote web server using HTTP, a data web client can retrieve data and metadata from a remote data web server (Bailey et al., 2000; DSTP, 2002).

Data web client applications have been developed that allow users to browse remote data in data webs and set the PMML parameters defining the mining dictionary and transformation dictionary, enabling data webs to serve as the foundation for remote and distributed data mining (Grossman et al., 2001).

4.3 Other Web Services. More recently, Microsoft and Hyperion have introduced XML for Analysis, which is a Simple Object Access Protocol (SOAP)-based XML API designed for standardizing data access between a web client application and an analytic data provider, such as an OLAP or data mining application (Microsoft Analysis, 2002; Microsoft SQL, 2002). No details are available about XML for Analysis at this time (April, 2002).

5. Process Standards

The CRoss Industry Standard Process for Data Mining, or CRISP-DM, is the most mature standard for the process of building data mining models (CRISP, 2002). Process standards define not only data mining models and how to prepare data for them, but also the process by which the models are built and the underlying business problems the models are addressing. The CRISP 1.0 Process Model has six components: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

CRISP was an ESPRIT Program which included Teradata, ISL (which was acquired by SPSS), DaimlerChrysler, and OHRA.

6. Relationships

Data mining is used in many different ways and in combination with many different systems and services. Each of these emerging standards and interfaces was developed to address a particular need, which is why there are several different standards.

XML standards such as PMML and CWM are a natural choice to support vendor-neutral data interchange. PMML and CWM DM are complementary in that PMML specifies the detailed content of models, whereas CWM DM specifies data mining metadata such as settings and scoring output.

Application programming interfaces such as SQL/MM and JDM use existing XML standards (PMML) for exchanging model data and metadata. There is good coordination and a good working relationship among the standards committees developing PMML, CWM, JSR-73, and SQL/MM. For this reason, we expect that terminology, concepts, and object structures will continue to be shared as these standards evolve.

XML standards such as PMML can serve as common ground for the several different standards efforts in progress. CWM, JSR-73, and SQL/MM use PMML concepts and infrastructure where appropriate. In addition, for each large community interfacing to data mining systems, there will be a continuing need for an API. This includes SQL, Java, Microsoft's COM and .Net, and the W3C's Semantic Web.

7. Summary

As data mining has matured, so have the standards supporting it. PMML is a relatively mature XML standard for data mining models per se. Version 2.0 of PMML has begun to add support for the transformations, normalizations, and aggregations required for preparing data for modeling and for scoring and deploying models.

There are now well defined APIs for interfacing data mining with SQL, Java, and Microsoft's COM and .Net environments. These APIs incorporate PMML where appropriate.

There are also emerging web standards and services, such as WSDL, SOAP, and DSTP, for working with remote and distributed web data. These standards hold the promise of opening up data mining into entirely new application domains.

References

Bailey, S., Creel, E., Grossman, R., Gutti, S., and Sivakumar, H. (2000). A High Performance Implementation of the Data Space Transfer Protocol (DSTP). In M. J. Zaki & C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, (pp. 55-64). Berlin: Springer-Verlag.

CRoss Industry Standard Process for Data Mining (CRISP), CRISP 1.0 Process and User Guide. Retrieved from http://www.crisp-dm.org on March 10, 2002.

Data Mining Group (2002). Retrieved from www.dmg.org on March 10, 2002.

Data Space Transfer Protocol. National Center for Data Mining. Retrieved from www.ncdm.uic.edu on March 8, 2002.

Grossman, R., Creel, E., Mazzucco, M., and Williams, R. (2001). A DataSpace Infrastructure for Astronomical Data. In R. L. Grossman, C. Kamath, W. Philip Kegelmeyer, V. Kumar, and R. Namburu (Eds.), Data Mining for Scientific and Engineering Applications, (pp. 115-123). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Grossman, R., Hornick, M., and Meyer, G. (2002). Emerging KDD Standards, Communications of the ACM, August, 2002.

Java Specification Request 73. Retrieved from http://jcp.org/jsr/detail/073.jsp on March 8, 2002.

Microsoft OLE DB for Data Mining Specification 1.0. Retrieved from www.microsoft.com/data/oledb/default.htm on March 8, 2002.

Microsoft SQL Server 2000 Analysis Services. Retrieved from www.microsoft.com/SQL/techinfo/bi/analysis.asp on March 8, 2002.

Microsoft Analysis. Introduction to XML for Analysis, Microsoft, April, 2001. Retrieved from www.microsoft.com/data/xml/XMLAnalysis.htm on April 12, 2002.

Object Management Group's Common Warehouse Metamodel - DataMining. Retrieved from cgi.omg.org/cgi-bin/doclist.pl on March 8, 2002.

PMML (2002). Predictive Model Markup Language (PMML), Data Mining Group, Retrieved from www.dmg.org on March 10, 2002.

Semantic Web (SW), World Wide Web Consortium (W3C). Retrieved from www.w3c.org/2001/sw on March 8, 2002.

UML (2002). Unified Modeling Language. Retrieved from http://www.omg.org/uml/ on May 10, 2002.