An Introduction to PMML
Just what is Predictive Model Markup Language, and how can it help BI professionals?
By Ray DiGiacomo Jr., President, Lion Data Systems, LLC
After nearly 15 years of development, the Predictive Model Markup Language (PMML) has finally risen from the statistics underground into the business intelligence world to become the de facto pathway from traditional (that is, retrospective) BI to what is now known as "BI 2.0."
This shift is being driven by the natural coupling of traditional BI techniques with "classic" data mining algorithms that have existed for over 30 years. Until recently, the IT departments of the world's most cutting-edge companies "quietly" developed their own predictive models using a variety of well-known statistical software packages and then spent several months building custom SQL code to deploy these models into production.
The technical and financial benefits of PMML have made this painstaking and error-prone approach obsolete. This open, XML-based standard allows businesses to develop their own predictive models in the statistical tool of their choice and then "export" the models to a universal format that can be integrated with virtually any type of BI system in a matter of days.
PMML, which one could think of as "XML for predictive models," was first created by The Data Mining Group in 1997 and has evolved heavily over the past decade to its current 4.1 version that now supports a vast assortment of data mining concepts. The group, which is actually a consortium of predictive analytics software vendors and enthusiasts, had a common vision to create an open and universal language that would allow predictive analytics specialists (aka data miners) to collaborate on predictive models across toolsets and vendors. The objective was to allow a miner to create a predictive model in a specific tool and export the model to an industry-standard format with the click of a button, so the exported model could be deployed into a PMML-compliant application such as a BI system.
Because many business intelligence users have only recently embraced predictive technologies, the PMML scene has been relatively unknown to the BI community. Today, as we move further into a climate where predictive BI is becoming more of a core requirement, the PMML community is now growing faster than ever into a vast, enthusiastic, and well-organized society of global professionals.
The major benefit PMML brings to the BI community is that it shortens the time needed to deploy a predictive model into a BI system, so that the model can generate the information necessary to create competitive business analytics components (such as predictive dashboards). Before PMML, deploying a predictive model took as long as four to six months and required custom SQL code that was often based on a misinterpretation of the data miner's original model.
There are four main sections of a PMML file:
Header: This area holds basic information about the model, such as the timestamp of the model's creation and the name of the toolset (software) used to develop the model.
Data Dictionary: This section designates the names of the data variables. If you were to look at your data in spreadsheet form, these would be the column headers.
Transformations: This section describes any changes made to the original dataset to make it more "modelling-friendly." Sometimes data needs to be fine-tuned (transformed) so that the predictive algorithm can "dive" into the dataset as deeply as possible and extract as many patterns as it can from the data.
Model Description: This is the heart of the PMML file. It describes the mathematical parameters of the model. For example, if you export a linear regression model, this section will contain detailed statistical information such as the coefficients that define the model's "best fit" line.
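To make the four sections concrete, here is a minimal sketch in Python that reads a tiny, hand-written PMML 4.1 document with the standard library's XML parser. The tool name, field names, timestamp, and coefficient values are all invented for illustration; a file exported from a real statistical package would be considerably longer.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written PMML 4.1 document. Every name and number in it
# is made up for illustration; it is not output from any real tool.
PMML_TEXT = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header>
    <Application name="ExampleTool" version="1.0"/>
    <Timestamp>2013-01-01T00:00:00</Timestamp>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="cost" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="age_scaled" optype="continuous" dataType="double">
      <NormContinuous field="age">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="100" norm="1"/>
      </NormContinuous>
    </DerivedField>
  </TransformationDictionary>
  <RegressionModel functionName="regression" modelName="demo">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="cost" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="100.0">
      <NumericPredictor name="age_scaled" coefficient="250.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
NS = {"p": "http://www.dmg.org/PMML-4_1"}
root = ET.fromstring(PMML_TEXT.strip())

# Header: which tool produced the model, and when.
tool = root.find("p:Header/p:Application", NS).get("name")
timestamp = root.find("p:Header/p:Timestamp", NS).text

# Data Dictionary: the "column headers" of the dataset.
fields = [f.get("name") for f in root.findall("p:DataDictionary/p:DataField", NS)]

# Transformations: fields derived from the original columns.
derived = [d.get("name")
           for d in root.findall("p:TransformationDictionary/p:DerivedField", NS)]

# Model Description: here, the parameters of a linear regression.
table = root.find("p:RegressionModel/p:RegressionTable", NS)
intercept = float(table.get("intercept"))
coefficients = {p.get("name"): float(p.get("coefficient"))
                for p in table.findall("p:NumericPredictor", NS)}

print(tool, timestamp, fields, derived, intercept, coefficients)
```

Because the file is ordinary XML, any BI system with an XML parser can recover the same model parameters without knowing which statistical tool produced them.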
Here's a sample predictive model deployment scenario. First, a statistician creates a predictive model and exports the model to PMML format. The PMML code is integrated into off-the-shelf deployment software, which automatically creates SOAP calls that predict various outcomes based on a production database table's contents. A new column, named "Prediction," is added to the production database table. The SOAP calls score (that is, predict) each record in the production database table and write each score to that record's "Prediction" column. Finally, the production database table is ready to feed a "predictive dashboard."
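The scoring step in this scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local function stands in for the SOAP calls that deployment software would generate, an in-memory SQLite table stands in for the production database, and the table name, column names, and model coefficients are hypothetical.

```python
import sqlite3

# Hypothetical linear regression parameters, as they might be read
# from the Model Description section of a PMML file.
INTERCEPT = 100.0
COEFFICIENTS = {"age": 2.5}

def score(record):
    """Stand-in for a scoring web service: apply the linear model to one record."""
    return INTERCEPT + sum(coef * record[name] for name, coef in COEFFICIENTS.items())

# An in-memory table standing in for the production database table.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, age REAL)")
conn.executemany("INSERT INTO patients (id, age) VALUES (?, ?)",
                 [(1, 30.0), (2, 60.0)])

# Add the new "Prediction" column, then score every record into it.
conn.execute("ALTER TABLE patients ADD COLUMN Prediction REAL")
rows = conn.execute("SELECT id, age FROM patients").fetchall()
for row in rows:
    conn.execute("UPDATE patients SET Prediction = ? WHERE id = ?",
                 (score(row), row["id"]))
conn.commit()

# The table is now ready to feed a dashboard query.
predictions = {r["id"]: r["Prediction"]
               for r in conn.execute("SELECT id, Prediction FROM patients")}
print(predictions)
```

In a real deployment the scoring logic would come from the exported PMML file rather than from hard-coded constants, which is what removes the need for hand-written SQL translations of the model.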
You can watch a brief, vendor-neutral "Intro to PMML" video that further details this example here.
Ray DiGiacomo, Jr. is the president of Lion Data Systems, LLC, a healthcare predictive analytics consulting firm located in Southern California. He is also the president of The Orange County R User Group and a board member at TDWI's Los Angeles-Orange County Chapter. Ray can be reached at rayd@LionDataSystems.com.