RESEARCH & RESOURCES

Machine Learning Made Simple?

Skytree says its Infinity machine-learning platform picks up where BI leaves off.

Machine-learning (ML) specialist Skytree Inc. says it doesn't have any intention of replacing or displacing that business intelligence (BI) tool you've known or loathed for years.

Skytree's chief product officer, Martin Hack, says the company's Infinity product is more complementary and picks up where BI leaves off.

"We're basically using machine learning now in the era of big data where BI stops. When I want to know how my last quarter was, I use BI for that. It's a very nice workflow," says Hack. "The reality is that people are now using ML not only to make more accurate projections about how the upcoming quarter will be but to predict volatility or certain kinds of anomalies [in coming quarters] -- to make actionable predictions about what will happen and not just what has happened."

This is the classic use case for machine learning, or -- as it's sometimes called in BI -- predictive analytics (PA). ML and PA aren't the same thing, however. One good way of thinking about ML is as a kind of automated, iterated, self-improving "learning." A statistician, data scientist, or analyst interacts periodically (or in some cases continuously) with a PA tool; ML software is itself designed to "learn" such that it can be tended to less frequently by human overseers.

Traditionally, ML hasn't been used as extensively as PA in the enterprise, although there are exceptions. For example, rules engines use ML to improve the accuracy of the ways in which -- or the frequency with which -- rules are automatically invoked. Why does Skytree think it can make ML as widely used (or as enterprise-grade) as BI? It's all about the algorithms, says Hack.

Some vendors like to claim that ML algorithms have basically been commoditized and that companies can't expect to compete on the use of algorithms. Hack, noting that certain well-known algorithms have, in fact, been commoditized, concedes that there's some truth to this claim. You can't expect to compete on your use of commoditized algorithms, he says. You can, however, compete on your selection and use of just the right algorithm in just the right place and at just the right time, he maintains.

"Machine-learning algorithms are kind of Skytree's claim to fame. We came up with a whole new breed or set of algorithms that are statistically identical compared to what's out there but [which are] orders of magnitude faster. We literally changed the math underneath, the functions," claims Hack.

Algorithm-wise, Skytree can claim considerable credibility. Machine-learning luminary Michael Jordan (no, not the basketball player) sits on Skytree's technical advisory board, as do James Demmel, Pat Hanrahan, and David Patterson. Demmel, Jordan, and Patterson are at U.C. Berkeley; Hanrahan -- co-founder of Tableau Software -- is at Stanford.

"Sometimes it can seem like the algorithms are almost the only things that matter. There's a small little company called Google: ask them if algorithms matter," Hack points out. "In some [markets] you essentially are competing on algorithms. [For] the Internet of things guys, or the high-frequency trading guys, it's a zero sum game: if your algorithm is faster and more predictive than mine, you win. If it isn't, you lose."

Of course, it isn't all about the algorithms, Hack concedes. The practical stuff, such as the selection and preparation of data for machine learning analysis or the time-consuming task of analytic modeling, almost always trumps all, but this, too, is an area in which Skytree has done key preparatory work, he maintains. "Data preparation, that's where data scientists spend 60 percent of their time: prepping, cleaning, data. We have a couple of data transforms in [Skytree Infinity] that are very machine-learning specific: those are joins and splits. They come up over and over again. There are a lot of these [data transforms] in [the product] that you can use today, and you can also define your own. The way that you access [and manipulate] these is via Python or Java code."
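
To make "joins and splits" concrete, here is a minimal sketch in plain Python. It is not Skytree Infinity's API, which the article does not show; the function names, record fields, and sample data are all hypothetical, chosen only to illustrate the two transforms Hack describes.

```python
import random

def inner_join(left, right, key):
    """Inner-join two lists of dict records on a shared key (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # On a field-name clash, the value from the left record wins.
            joined.append({**match, **row})
    return joined

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then split it into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

# Made-up sample data: customer 5 has no order, so the inner join drops it.
customers = [
    {"id": 1, "region": "west"},
    {"id": 2, "region": "east"},
    {"id": 3, "region": "east"},
    {"id": 4, "region": "north"},
    {"id": 5, "region": "south"},
]
orders = [
    {"id": 1, "spend": 120.0},
    {"id": 2, "spend": 80.0},
    {"id": 3, "spend": 45.5},
    {"id": 4, "spend": 10.0},
]

rows = inner_join(customers, orders, "id")
train, test = train_test_split(rows, test_fraction=0.25)
print(len(rows), len(train), len(test))
```

The join assembles one record per entity from separate sources, and the split holds back a portion of those records so that a model can later be scored on data it has not seen.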

Skytree isn't aiming to replace or even to complement traditional data integration (DI) tools, Hack stresses: the data engineering transforms in its Infinity product are designed for ML-specific use cases. "There will always be a need for the Informaticas of the world to do enterprise-grade ETL. Our intention [in offering pre-packaged joins and splits] is that for the guys who use machine learning, what are the [transforms] they use over and over again," he says.

Data engineering for ML typically involves a mix of strictly structured sources (OLTP and data warehouse systems, flat files, hierarchical databases, etc.) and polystructured sources (NoSQL stores and files in the Hadoop Distributed File System, or HDFS), Hack explains. Infinity's prepackaged data transformation libraries can meaningfully accelerate this process, too, he maintains. Then there's analytic modeling, which is a science -- and an art -- unto itself.

"Modeling is basically the core task for a data scientist who is doing machine learning. It's what these guys do every day, all day long. We've changed the way people are doing it [in Infinity]. Traditionally, you trained your algorithm, tuned it, changed the parameters, and played around with it. Rinse and repeat until you get it essentially right," he says.

"What we've done is we've compacted all of this into one step: train, tune, and test are all one thing, with the goal of instead of running three discrete steps, you're running it once, and you not only get the best model accuracy out of it, the algorithm will be optimized for what you're doing."

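The idea of folding train, tune, and test into a single call can be sketched generically. The example below is an assumption-laden illustration and not Skytree's method: it uses a toy one-dimensional k-nearest-neighbor classifier, a hypothetical `train_tune_test` function, and synthetic data, and it tunes only one parameter (k) against a validation split before scoring the winner on a held-out test split.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, held_out, k):
    """Fraction of held-out points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return hits / len(held_out)

def train_tune_test(data, candidate_ks, seed=0):
    """Split once, pick the best k on validation data, then score it on test data."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    train = rows[: int(n * 0.6)]
    valid = rows[int(n * 0.6): int(n * 0.8)]
    test = rows[int(n * 0.8):]
    # Tune: keep the candidate k with the highest validation accuracy.
    best_k = max(candidate_ks, key=lambda k: accuracy(train, valid, k))
    return {
        "best_k": best_k,
        "validation_accuracy": accuracy(train, valid, best_k),
        # Test: use train plus validation points as the reference set.
        "test_accuracy": accuracy(train + valid, test, best_k),
    }

# Synthetic data: values below 5.0 are labeled "low", the rest "high".
data = [(x / 10, "low" if x < 50 else "high") for x in range(100)]
result = train_tune_test(data, candidate_ks=[1, 3, 5])
print(result)
```

The caller makes one call and gets back both the chosen setting and an accuracy estimate. The tuning loop still runs inside the function, so this shows the change in workflow Hack describes, not the speedup he claims.
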
None of this is necessarily distinctive, however. Vendors such as Predixion and the former InfoCentricity (which was acquired by Fair Isaac Corp. last year) purport to do some of the same things, particularly with their model design environments, but also with respect to data preparation. For example, Predixion touts its Machine Learning Semantic Model, which it claims makes data prep both reusable and portable. (According to Predixion, MLSM packages are nothing less than self-contained predictive apps that include the data transformations needed to build a predictive model. MLSM packages can be reused, enhanced, or adapted to support new use cases, Predixion says.)

Mainstream vendors such as SAS and IBM Corp. have also focused on improving ease of use and productivity in conjunction with predictive analytics or machine learning. This year, for example, SAS introduced its Visual Statistics product, which touts some of the same ease-of-use analytic modeling advantages that Skytree claims. Maybe it really is all about the algorithms?

Yes and no, says Hack. SAS and IBM can claim to have some killer algorithms, too, after all.

"You can't just say, 'I have an algorithm that will predict something.' You have to have the absolute best in accuracy, so your data model [i.e., analytic model] is absolutely critical, too. We try to give you the best of both in a modeling [environment] that can help you pick the absolute best algorithm for what you're doing," he says.

"Where SAS fits in essentially is that they're coming from a statistical origin. The SAS model always used to be, I create a risk model or a scoring model or whatever. So long as that fits in[to] RAM, I'm good to go. However, this expectation blows up in big data. If you want to take this [SAS model] and put it in a petabyte-scale environment, that just isn't going to work."

True, Hack acknowledges, SAS now offers an in-memory option in its SAS LASR Analytic Server, but LASR is considerably more expensive than a product like Infinity that can run in memory on a Hadoop cluster. Infinity can also persist to HDFS or other storage if or when physical memory is exhausted. Skytree gets its in-memory parallelism cheaply, Hack maintains; SAS doesn't.

There's also the question of accessibility, he argues: in spite of its history and installed base, SAS 4GL isn't as widely used as Java or Python (or, arguably, R). Both Java and, especially, Python are used extensively in advanced analytics and ML, Hack points out, thanks in part to the fact that the site of both practices is shifting to the Hadoop platform, which is a code-intensive environment.

"If you look at the trends, SAS, R, Java, and now Python are all languages you can use to do [ML or predictive analytics]," he says. "Python is actually becoming like the lingua franca of data scientists. It's intuitive [compared to] R, where you'd better be a statistician. I know guys who've invented programming languages -- and they're saying [Python] is it. With our support for Python, we're going to have a much broader community to work with, versus if we had just gone with R."
