Streamlining feature engineering

Researchers and startups are building tools that enable feature discovery.

Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many0 data applications, feature engineering and feature selection are just as (if not more important) than choice of algorithm:

Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)

The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:

Feature engineering and discovery pipeline

Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents) and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features1 are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation2 of new features (feature engineering) and identifying the most essential ones (feature selection).

Read more…


Untapped opportunities in AI

Some of AI's viable approaches lie outside the organizational boundaries of Google and other large Internet companies.

Editor’s note: this post is part of an ongoing series exploring developments in artificial intelligence.

Here’s a simple recipe for solving crazy-hard problems with machine intelligence. First, collect huge amounts of training data — probably more than anyone thought sensible or even possible a decade ago. Second, massage and preprocess that data so the key relationships it contains are easily accessible (the jargon here is “feature engineering”). Finally, feed the result into ludicrously high-performance, parallelized implementations of pretty standard machine-learning methods like logistic regression, deep neural networks, and k-means clustering (don’t worry if those names don’t mean anything to you — the point is that they’re widely available in high-quality open source packages).

Google pioneered this formula, applying it to ad placement, machine translation, spam filtering, YouTube recommendations, and even the self-driving car — creating billions of dollars of value in the process. The surprising thing is that Google isn’t made of magic. Instead, mirroring Bruce Scheneier’s surprised conclusion about the NSA in the wake of the Snowden revelations, “its tools are no different from what we have in our world; it’s just better funded.” Read more…


A growing number of applications are being built with Spark

Many more companies want to highlight how they're using Apache Spark in production.

One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.

These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production1 has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, Scala), make Spark one of the most popular projects in big data. The charts below show the amount of active development in Spark:

Apache Spark contributions

For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals2 from companies building applications on top of Spark.

Read more…


How to be agile with your big data

Agile methodology brings flexibility to the EDW and offers ways to integrate open-source technologies with existing systems.

Data analysis, like other pursuits, is a balancing act. The rise of big data ratchets up the pressure on the traditional enterprise data warehouse (EDW) and associated software tools to handle rapidly evolving sets of new demands posed by the business. Companies want their EDW systems to be more flexible and more user friendly — without sacrificing processing speeds, data integrity, or overall reliability.

“The more data you give the business, the more questions they will ask,” says José Carlos Eiras, who has served as CIO at Kraft Foods, Philip Morris, General Motors, and DHL. “When you have big data, you have a lot of different questions, and suddenly you need an enterprise data warehouse that is very flexible.”

EDWs are remarkably powerful, but it takes considerable expertise and creativity to modify them on the fly. Adding new capabilities to the EDW generally requires significant investments of time and money. You can develop your own tools internally or purchase them from a vendor, but either way, it’s a hard slog. Read more…


5 reasons to learn D3

D3 doesn’t stand for data-design dictator

Designers and developers making data visualizations on the web are buzzing about d3.js. But why? Read more…

Comment: 1

Welcome to Intelligence Matters

Casting a critical eye on the exciting developments in the world of AI.

Editor’s note: this post was co-authored by Ben Lorica and Roger Magoulas


Siri screenshot.

Today we’re kicking off Intelligence Matters (IM), a new series exploring current issues in artificial intelligence, including the connection between artificial intelligence, human intelligence and the brain. IM offers a thoughtful take on recent developments, including a critical, and sometimes skeptical, view when necessary.

True AI has been “just around the corner” for 60 years, so why should O’Reilly start covering AI in a big way now? As computing power catches up to scientific and engineering ambitions, and as our ability to learn directly from sensory signals — i.e., big data — increases, intelligent systems are having a real and widespread impact. Every Internet user benefits from these systems today — they sort our email, plan our journeys, answer our questions, and protect us from fraudsters. And, with the Internet of Things, these system have already started to keep our houses and offices comfortable and well-lit, our data centers running more efficiently, our industrial processes humming, and even are driving our cars. Read more…

Comments: 2