MLbase: Scalable machine-learning made accessible

In the course of applying machine-learning to large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms, a process that may involve configuring a hodgepodge of tools that require different input files, programming languages, and interfaces. Some software tools do not scale to big data, so data scientists first sample and test ideas on smaller subsets before tackling the problem of implementing a distributed version of the final algorithm.

To increase productivity, data scientists should ideally be able to test ideas quickly without much coding, context switching, tuning, or configuration. A research project⁰ out of UC Berkeley’s AMPLab and Brown seems to do just that: MLbase aims to make cutting-edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL, discussed below), a library of distributed algorithms (ML-Library), an optimizer (ML-Optimizer), and a runtime (ML-Runtime).

Status:
According to project leaders Tim Kraska and Ameet Talwalkar, the first release, slated for August, will consist of a rule-based ML-Optimizer, ML-Library¹, and Spark as the runtime. It will also include sample workflows that illustrate how one can automate training and prediction using ML-Optimizer and ML-Library.

Describe rather than code:
To make scalable machine-learning even more accessible, a subsequent version of MLbase will come with a declarative language (MQL): to build a model or run an algorithm, users need only describe what they want to do. Specifically, MLbase will free users from having to configure and choose from among many competing algorithms, and it will also automatically optimize solutions for distributed execution.

To illustrate MQL² in action, here is an example of how to train a classifier:
    data = load("hdfs://path/to/als_clinical")
    // the features are stored in columns 2-10
    X = data[, 2 to 10]
    y = data[, 1]
    model = do_classify(y, X)
Under the covers, MQL is converted into a logical (learning) plan that an optimizer uses to identify algorithms that are testable in a reasonable timeframe.
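
To make the idea concrete, here is a minimal single-machine sketch in Python of what an optimizer might do behind a call like do_classify(y, X): try a few candidate algorithms on a holdout split and keep the best one. This is an illustration under stated assumptions, not MLbase's implementation. The candidate classifiers, the holdout strategy, and every function name other than do_classify are hypothetical, and a real optimizer would also have to weigh running time and distributed execution.

```python
import random

# Illustrative only: none of these helpers come from MLbase.

def sq_dist(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(train_X, train_y):
    """Predict the label whose mean feature vector is closest."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return lambda row: min(centroids, key=lambda lab: sq_dist(row, centroids[lab]))

def one_nearest_neighbor(train_X, train_y):
    """Predict the label of the single closest training row."""
    def predict(row):
        best = min(range(len(train_X)), key=lambda i: sq_dist(row, train_X[i]))
        return train_y[best]
    return predict

def majority_class(train_X, train_y):
    """Baseline: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda row: most_common

# Hypothetical search space; ties are broken in favor of earlier entries.
CANDIDATES = {
    "nearest_centroid": nearest_centroid,
    "one_nearest_neighbor": one_nearest_neighbor,
    "majority_class": majority_class,
}

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def do_classify(y, X, holdout_fraction=0.3, seed=0):
    """Score every candidate on a holdout split, then refit the winner on all data."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    holdout_size = max(1, int(len(idx) * holdout_fraction))
    test, train = idx[:holdout_size], idx[holdout_size:]
    train_X, train_y = [X[i] for i in train], [y[i] for i in train]
    test_X, test_y = [X[i] for i in test], [y[i] for i in test]
    scores = {name: accuracy(fit(train_X, train_y), test_X, test_y)
              for name, fit in CANDIDATES.items()}
    best_name = max(scores, key=scores.get)
    return best_name, CANDIDATES[best_name](X, y), scores

# Toy data: two well-separated groups of points.
X = [[i * 0.1, i * 0.05] for i in range(20)] + \
    [[10 + i * 0.1, 10 + i * 0.05] for i in range(20)]
y = [0] * 20 + [1] * 20
best_name, model, scores = do_classify(y, X)
print(best_name, scores)
```

The caller only states the task and supplies the data; which algorithm ends up being used is a detail the search decides.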

Taking the analogy with databases a step further, the optimized logical plan is turned into a physical (learning) plan consisting of operations such as data filtering/scaling, and synchronous & asynchronous operators (similar to Map/Reduce).
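
As a rough illustration of what such a physical plan could look like, the Python sketch below filters out incomplete rows, computes per-partition statistics in a map step, merges them in a reduce step, and then scales each partition. It is a toy that runs on one machine with in-memory lists standing in for distributed partitions. The operator names and the data are hypothetical and are not taken from MLbase or Spark.

```python
from functools import reduce

# Hypothetical partitions of a data set, as a distributed runtime might hold them.
partitions = [
    [[1.0, 200.0], [2.0, 400.0]],
    [[3.0, 600.0], [None, 800.0]],  # one row has a missing value
    [[5.0, 1000.0]],
]

def drop_incomplete(partition):
    """Filtering operator: discard rows that contain missing values."""
    return [row for row in partition if all(v is not None for v in row)]

def partial_stats(partition, n_cols):
    """Map step: per-partition row count, column sums, and column sums of squares."""
    count = len(partition)
    sums = [sum(row[j] for row in partition) for j in range(n_cols)]
    sq_sums = [sum(row[j] ** 2 for row in partition) for j in range(n_cols)]
    return count, sums, sq_sums

def merge_stats(a, b):
    """Reduce step: combine the statistics of two partitions."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])

def scale_partition(partition, means, stds):
    """Scaling operator: standardize each column to zero mean and unit variance."""
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in partition]

def run_physical_plan(partitions, n_cols):
    filtered = [drop_incomplete(p) for p in partitions]
    # Synchronous step: every partition must report before scaling can begin.
    count, sums, sq_sums = reduce(
        merge_stats, (partial_stats(p, n_cols) for p in filtered))
    means = [s / count for s in sums]
    stds = [max(sq / count - m * m, 0.0) ** 0.5 for sq, m in zip(sq_sums, means)]
    return [scale_partition(p, means, stds) for p in filtered]

scaled = run_physical_plan(partitions, n_cols=2)
print(scaled)
```

The global mean and standard deviation act as a barrier here, which is why the step is synchronous: no partition can be scaled until all partial statistics have been merged. An asynchronous operator would let partitions proceed without waiting on one another.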

For an end-user, the details of scaling, fault-tolerance, and implementing and configuring many algorithms come for free! In fact, MQL is simple enough that over time it can be hidden behind simple GUIs³ (similar to what reporting tools have done for SQL).

Summary:
There are several machine-learning and statistics tools that attack scale: a partial list includes ScaleR, H2O, Cetas, Skytree, Ayasdi, Mahout, and VW. Some are tied to particular runtimes and data sources; others specialize in specific algorithms. By introducing an optimizer and a simple declarative language for machine-learning, MLbase will open up many analytic tasks (e.g., cluster, classify, forecast) to non-technical users. Both expert and novice users benefit by being able to quickly test ideas across many distributed algorithms implemented by the mix of systems, database, and machine-learning researchers at AMPLab and Brown.

Along with Spark and Shark, MLbase is part of BDAS, the Berkeley Data Analytics Stack. With the upcoming release of version 0.7, Spark adds a Python API and stream-processing capabilities. (Spark becomes a single framework and programming model that can handle batch, interactive, real-time, and graph analytics.) If you want to learn more about BDAS (and Spark/Shark), the AMPLab team is offering a two-part tutorial (I and II) at the 2013 Strata Conference in Santa Clara next week.

Seven reasons why I like Spark

Shark: Real-time queries and analytics for big data

GraphChi: Graph analytics over billions of edges using your laptop

(0) At the time of its release, MLbase will be open source.
(1) The initial set of distributed algorithms in the August release includes ones for classification, regression, clustering, and collaborative filtering.
(2) Note the similarity of the syntax to MATLAB. The above example is part of the current draft version of MQL; the syntax is still likely to change slightly before its public release.
(3) As the MLbase team pointed out in a recent paper, while the popular system Weka comes with a simple UI, it “… requires expert knowledge to choose and configure the ML algorithm and is a single node system.”
