This is an early version of the article Robert Grossman, Mark Hornick, and Gregor Meyer, Emerging Standards and Interfaces in Data Mining, Handbook of Data Mining, Nong Ye, editor, Kluwer Academic Publishers.
Data mining standards address one or more related concerns: how models are represented, how applications interface with data mining systems, and how the model-building process itself is organized.
In the sections below, we discuss XML standards, data mining APIs, process standards, and emerging standards for integrating data mining into data webs and other emerging web based applications.
This chapter is based in part on the summary of data mining standards in (Grossman et al., 2002).
There are XML standards for the statistical and data mining models that arise in data mining, as well as for the metadata associated with them, such as the settings for building and applying the models.
2.1 XML for Data Mining Models. The Predictive Model Markup Language (PMML) is being developed by the Data Mining Group (DMG, 2002), a vendor-led consortium which currently includes over a dozen members, including Angoss, IBM, Magnify, MINEit, Microsoft, the National Center for Data Mining at the University of Illinois at Chicago, Oracle, NCR, Salford Systems, SPSS, SAS, and Xchange (PMML, 2002).
PMML is used to specify the models themselves and consists of several components, including a data dictionary describing the input data, a mining schema, and elements defining each class of model.
As an example, here is a data dictionary for Fisher's Iris data set:
<DataDictionary numberOfFields="5">
  <DataField name="Petal_length" optype="continuous"/>
  <DataField name="Petal_width" optype="continuous"/>
  <DataField name="Sepal_length" optype="continuous"/>
  <DataField name="Sepal_width" optype="continuous"/>
  <DataField name="Species_name" optype="categorical">
    <Value value="Setosa"/>
    <Value value="Verginica"/>
    <Value value="Versicolor"/>
  </DataField>
</DataDictionary>
and here is the corresponding mining schema:
<MiningSchema>
  <MiningField name="Petal_length" usageType="active"/>
  <MiningField name="Petal_width" usageType="active"/>
  <MiningField name="Sepal_length" usageType="supplementary"/>
  <MiningField name="Sepal_width" usageType="supplementary"/>
  <MiningField name="Species_name" usageType="predicted"/>
</MiningSchema>
Finally, here is a fragment describing a node of a decision tree built from the data:
<Node score="Setosa" recordCount="50">
  <SimplePredicate field="Petal_length" operator="lessThan" value="24.5"/>
  <ScoreDistribution value="Setosa" recordCount="50"/>
  <ScoreDistribution value="Verginica" recordCount="0"/>
  <ScoreDistribution value="Versicolor" recordCount="0"/>
</Node>
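These fragments can be exercised with ordinary XML tooling. The following Python sketch, our own illustration rather than part of any PMML implementation, parses the tree-node fragment with the standard library and applies its SimplePredicate to a record; the helper names and record values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The decision-tree node fragment from the text, embedded as a string.
NODE_XML = """
<Node score="Setosa" recordCount="50">
  <SimplePredicate field="Petal_length" operator="lessThan" value="24.5"/>
  <ScoreDistribution value="Setosa" recordCount="50"/>
  <ScoreDistribution value="Verginica" recordCount="0"/>
  <ScoreDistribution value="Versicolor" recordCount="0"/>
</Node>
"""

# A few of PMML's comparison operators; lessThan is the one used above.
OPERATORS = {
    "lessThan": lambda x, y: x < y,
    "lessOrEqual": lambda x, y: x <= y,
    "greaterThan": lambda x, y: x > y,
    "greaterOrEqual": lambda x, y: x >= y,
}

def score_node(node_xml, record):
    """Return the node's score if the record satisfies its predicate, else None."""
    node = ET.fromstring(node_xml)
    pred = node.find("SimplePredicate")
    test = OPERATORS[pred.get("operator")]
    if test(record[pred.get("field")], float(pred.get("value"))):
        return node.get("score")
    return None
```

A record whose Petal_length falls below 24.5 is scored Setosa; a full scoring engine would walk a tree of such nodes rather than a single one.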
PMML Version 1.0 was concerned primarily with defining standards for the common data mining models, assuming that the inputs to the models, called DataFields, had already been defined. PMML Version 2.0 introduced the TransformationDictionary, which contains DerivedFields; the inputs to models in PMML Version 2.0 may be DataFields or DerivedFields. In principle, this approach is powerful enough to capture the process of preparing data for statistical and data mining models and of deploying these models in operational systems.
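As a rough illustration of a DerivedField, the sketch below applies a NormContinuous transformation, one of the PMML 2.0 transformation types, to a record. The fragment is drafted from the PMML transformation vocabulary but has not been validated against the schema, and the field name and value ranges are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical DerivedField mapping Petal_length linearly from [0, 70] to [0, 1].
DERIVED_XML = """
<DerivedField name="Petal_length_norm" optype="continuous">
  <NormContinuous field="Petal_length">
    <LinearNorm orig="0" norm="0"/>
    <LinearNorm orig="70" norm="1"/>
  </NormContinuous>
</DerivedField>
"""

def apply_norm(derived_xml, record):
    """Linearly interpolate a field value between two LinearNorm anchor points."""
    norm = ET.fromstring(derived_xml).find("NormContinuous")
    (o0, n0), (o1, n1) = [
        (float(p.get("orig")), float(p.get("norm")))
        for p in norm.findall("LinearNorm")
    ]
    x = record[norm.get("field")]
    return n0 + (x - o0) * (n1 - n0) / (o1 - o0)
```

A model listing Petal_length_norm in its mining schema would then receive the normalized value rather than the raw DataField.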
2.2 XML for Data Mining Metadata. Through the Object Management Group (OMG), a new specification for data mining metadata has recently been defined using the Common Warehouse Metamodel (CWM) (OMG, 2002). CWM supports interoperability among data warehouse vendors by defining Document Type Definitions (DTDs) that standardize the XML metadata interchanged between data warehouses. The CWM standard generates the DTDs in three steps: first, a model is created in the Unified Modeling Language (UML, 2002); second, the UML model is used to generate a CWM interchange format based on the Meta-Object Facility (MOF) and XML Metadata Interchange (XMI) (UML, 2002); third, the MOF/XMI is converted automatically to DTDs. CWM for Data Mining (CWM DM) was specified by members of the JDM expert group and, as such, has many elements in common with JDM. For example, CWM DM defines function and algorithm settings that specify the type of model to build, together with parameters to the algorithm. CWM DM also defines tasks that associate the inputs to mining operations, such as build, test, and apply (score).
Data mining must co-exist with other systems. In addition to XML standards such as PMML, there are efforts defining data mining APIs for Java, SQL, and Microsoft's OLE DB.
3.1 SQL APIs. ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) are in the process of adopting a data mining extension to SQL. It consists of a collection of SQL user-defined types and routines to compute and apply data mining models.
ISO and IEC form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees. The data mining extensions in SQL are part of the SQL Multimedia and Applications Packages Standard or SQL/MM. The particular specification, called SQL/MM Part 6: Data Mining, specifies a SQL interface to data mining packages.
Database users are interested in a data mining standard so that databases and data-mining-based applications can be combined freely. From this perspective, data mining is simply a sophisticated tool for extracting information from, or aggregating, the original data, functionality similar to what SQL provides today. Standardizing data mining through SQL is therefore a natural extension of the more basic storage and retrieval mechanisms SQL already supports.
3.2 Java APIs. Turning to Java, Java Specification Request 73 (JSR-73), known as Java Data Mining (JDM), defines a pure Java API to support data mining operations. These operations include model building and scoring data using models, as well as the creation, storage, access, and maintenance of data and metadata supporting data mining results (JSR-73, 2002). It also includes selected data transformations.

JDM not only defines a representative set of functions and algorithms, but also provides a framework into which new mining algorithms can be introduced. A key goal of the JDM team was to make it relatively easy to add additional models. Since many vendors specialize in and support only certain data mining models, JDM is divided into a number of packages to support à la carte package compliance. For example, a vendor specializing in neural networks is likely interested in the classification and approximation functions, along with specific neural network algorithms and representations. JDM strives to make data mining accessible to both experts and non-experts by separating high-level function specifications from lower-level algorithm specifications.

JDM leverages the J2EE platform to enable scalability, integration with existing information systems, extensibility, and reusability, as well as choices of servers, tools, and components for vendor implementations. JDM influenced, and was influenced by, several existing standards, such as SQL/MM, CWM DM, and PMML. JDM makes explicit provision for the import and export of data mining objects. The most common representation will be XML, where PMML can be leveraged for model representations and CWM DM for other mining objects such as settings, tasks, and scoring output.
3.3 OLE DB APIs. Turning to Microsoft's SQL environment, OLE DB for Data Mining (OLE DB for DM) defines a data mining API for Microsoft's OLE DB environment (Microsoft OLE DB, 2002). OLE DB for DM does not introduce any new OLE DB interfaces, but rather uses a SQL-like query language and a specialized data structure called a rowset, so that data mining consumers can communicate with data mining producers through OLE DB. Recently, OLE DB for DM has been subsumed by Microsoft's Analysis Services for SQL Server 2000 (Microsoft Analysis, 2002). Microsoft's Analysis Services provide APIs to Microsoft's SQL Server 2000 services, which support data transformations, data mining, and OLAP.
4.1 Semantic Web. The World Wide Web Consortium (W3C) standards for the Semantic Web define a general structure for knowledge using XML, RDF, and ontologies (W3C SW, 2002). In principle, this infrastructure can be used to store the knowledge extracted from data by data mining systems, although at present one could argue that this is more a goal than an achievement. As an example of the type of knowledge that can be stored in the Semantic Web, RDF can be used to encode assertions such as "credit transactions with a dollar amount of $1 at merchants with an MCC code of 542 have a 30% likelihood of being fraudulent."
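RDF expresses such assertions as subject-predicate-object triples. The sketch below encodes the fraud assertion using a hypothetical `ex:` vocabulary; the rule identifier and property names are ours, for illustration only.

```python
# RDF represents knowledge as (subject, predicate, object) triples.
# The ex: vocabulary below is hypothetical, invented for this example.
triples = {
    ("ex:rule42", "rdf:type", "ex:FraudRule"),
    ("ex:rule42", "ex:transactionAmountUSD", "1"),
    ("ex:rule42", "ex:merchantCategoryCode", "542"),
    ("ex:rule42", "ex:fraudLikelihood", "0.30"),
}

def objects(triples, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

A Semantic Web agent could then answer queries such as "what fraud likelihood is asserted for rule 42?" by pattern-matching over the triple store.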
4.2 Data Webs. Less general than semantic webs are data webs. Data webs are web-based infrastructures for working with data (instead of knowledge). For example, in the same way that a web client can retrieve a web document from a remote web server using HTTP, a data web client can retrieve data and metadata from a remote data web server (Bailey et al., 2000; DSTP, 2002).
Data web client applications have been developed that allow users to browse remote data in data webs and to set the PMML parameters defining the mining schema and transformation dictionary, enabling data webs to serve as a foundation for remote and distributed data mining (Grossman et al., 2001).
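As a sketch of the client side, the following Python fragment parses a dataset-metadata response of the kind a data web server might return; the response format shown here is hypothetical, invented for illustration, and is not taken from the DSTP specification.

```python
import xml.etree.ElementTree as ET

# A hypothetical metadata response from a data web server; the element
# names are illustrative and do not come from the DSTP specification.
RESPONSE = """
<Dataset name="iris">
  <Attribute name="Petal_length" type="continuous"/>
  <Attribute name="Species_name" type="categorical"/>
</Dataset>
"""

def parse_metadata(xml_text):
    """Map each attribute name in the response to its declared type."""
    root = ET.fromstring(xml_text)
    return {a.get("name"): a.get("type") for a in root.findall("Attribute")}
```

A client would use such metadata to populate the PMML data dictionary and mining schema before requesting the data itself.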
4.3 Other Web Services. More recently, Microsoft and Hyperion have introduced XML for Analysis which is a Simple Object Access Protocol (SOAP)-based XML API designed for standardizing data access between a web client application and an analytic data provider, such as an OLAP or data mining application (Microsoft Analysis, Microsoft SQL, 2002). No details are available about XML for Analysis at this time (April, 2002).
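Whatever shape XML for Analysis ultimately takes, a SOAP-based API exchanges XML payloads wrapped in a standard envelope. The sketch below builds a minimal SOAP 1.1 envelope around a request; the envelope structure and namespace are those of SOAP 1.1, while the ListModels message name is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def soap_envelope(body_xml):
    """Wrap an XML payload in a minimal SOAP 1.1 envelope."""
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    body.append(ET.fromstring(body_xml))
    return ET.tostring(env, encoding="unicode")

# A hypothetical request; the actual XML for Analysis message names may differ.
message = soap_envelope("<ListModels/>")
```

The client would POST such a message over HTTP to the analytic data provider and parse the SOAP response in the same way.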
The CRoss Industry Standard Process for Data Mining, or CRISP-DM, is the most mature standard for the process of building data mining models (CRISP, 2002). Process standards define not only data mining models and how to prepare data for them, but also the process by which the models are built and the underlying business problems the models address. The CRISP-DM 1.0 Process Model has six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
CRISP-DM grew out of an ESPRIT project whose partners included Teradata, ISL (which was acquired by SPSS), DaimlerChrysler, and OHRA.
Data mining is used in many different ways and in combination with many different systems and services. Each of these emerging standards and interfaces was developed to address a particular need, which is why several different standards exist.
XML standards such as PMML and CWM are a natural choice to support vendor neutral data interchange. PMML and CWM DM are complementary in that PMML specifies the detailed content of models, whereas CWM DM specifies data mining metadata such as settings and scoring output.
Application programming interfaces such as SQL/MM and JDM use existing XML standards (PMML) for exchanging model data and metadata. There are good working relationships among the standards committees developing PMML, CWM, JSR-73, and SQL/MM. For this reason, we expect that terminology, concepts, and object structures will continue to be shared as these standards evolve.
XML standards such as PMML can serve as common ground for the several different standards efforts in progress. CWM, JSR-73, and SQL/MM use PMML concepts and infrastructure where appropriate. In addition, for each large community interfacing with data mining systems, there will be a continuing need for an API. This includes SQL, Java, Microsoft's COM and .Net, and the W3C's Semantic Web.
As data mining has matured so have the standards supporting it. PMML is a relatively mature XML standard for data mining models per se. Version 2.0 of PMML has begun to add support for the transformations, normalizations, and aggregations required for preparing data for modeling and for scoring and deploying models.
There are now well defined APIs for interfacing data mining with SQL, Java, and Microsoft's COM and .Net environments. These APIs incorporate PMML where appropriate.
There are also emerging web standards and services, such as WSDL, SOAP, and DSTP, for working with remote and distributed web data. These standards hold the promise of opening up data mining into entirely new application domains.
Bailey, S., Creel, E., Grossman, R., Gutti, S., and Sivakumar, H. (2000). A High Performance Implementation of the Data Space Transfer Protocol (DSTP). In M. J. Zaki and C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, (pp. 55-64). Berlin: Springer-Verlag.
CRoss Industry Standard Process for Data Mining Process (CRISP), CRISP 1.0 Process and User Guide. Retrieved from http://www.crisp-dm.org on March 10, 2002.
Data Mining Group (2002). Retrieved from www.dmg.org on March 10, 2002.
Data Space Transfer Protocol (DSTP). National Center for Data Mining. Retrieved from www.ncdm.uic.edu on March 8, 2002.
Grossman, R., Creel, E., Mazzucco, M., and Williams, R. (2001). A DataSpace Infrastructure for Astronomical Data. In R. L. Grossman, C. Kamath, W. Philip Kegelmeyer, V. Kumar, and R. Namburu (Eds.), Data Mining for Scientific and Engineering Applications, (pp. 115-123). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Grossman, R., Hornick, M., and Meyer, G. (2002). Emerging KDD Standards. Communications of the ACM, August 2002.
Java Specification Request 73. Retrieved from http://jcp.org/jsr/detail/073.jsp on March 8, 2002.
Microsoft OLE DB for Data Mining Specification 1.0. Retrieved from www.microsoft.com/data/oledb/default.htm on March 8, 2002.
Microsoft SQL Server 2000 Analysis Services. Retrieved from www.microsoft.com/SQL/techinfo/bi/analysis.asp on March 8, 2002.
Microsoft Analysis. Introduction to XML for Analysis, Microsoft, April, 2001. Retrieved from www.microsoft.com/data/xml/XMLAnalysis.htm on April 12, 2002.
Object Management Group's Common Warehouse Metamodel - Data Mining. Retrieved from cgi.omg.org/cgi-bin/doclist.pl on March 8, 2002.
PMML (2002). Predictive Model Markup Language (PMML), Data Mining Group, Retrieved from www.dmg.org on March 10, 2002.
Semantic Web (SW), World Wide Web Consortium (W3C). Retrieved from www.w3c.org/2001/sw on March 8, 2002.
UML (2002). Unified Modeling Language. Retrieved from http://www.omg.org/uml/ on May 10, 2002.