"Big Data" entries

6 reasons why I like KeystoneML

The O'Reilly Data Show Podcast: Ben Recht on optimization, compressed sensing, and large-scale machine learning pipelines.

Thames_tunnel_shield

As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with my co-organizer Ben Recht for the the latest episode of the O’Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and member of AMPLab, and his research spans many areas of interest to data scientists including optimization, compressed sensing, statistics, and machine learning.

At the 2014 Strata + Hadoop World in NYC, Recht gave an overview of a nascent AMPLab research initiative into machine learning pipelines. The research team behind the project recently released an alpha version of a new software framework called KeystoneML, which gives developers a chance to test out some of the ideas that Recht outlined in his talk last year. We devoted a portion of this Data Show episode to machine learning pipelines in general, and a discussion of KeystoneML in particular.

Since its release in May, I’ve had a chance to play around with KeystoneML and while it’s quite new, there are several things I already like about it:

KeystoneML opens up new data types

Most data scientists don’t normally play around with images or audio files. KeystoneML ships with easy to use sample pipelines for computer vision and speech. As more data loaders get created, KeystoneML will enable data scientists to leverage many more new data types and tackle new problems. Read more…

Comments: 2

The key to agile data science: experimentation

A real-world example of how a short delivery cycle fosters creativity.

tree_of_science_by_markpiet-d32fnfaI lead a research team of data scientists responsible for discovering insights that lead to market and competitive intelligence for our company, Computer Sciences Corporation (CSC). We are a busy group. We get questions from all different areas of the company and it’s important to be agile.

The nature of data science is experimental. You don’t know the answer to the question asked of you — or even if an answer exists. You don’t know how long it will take to produce a result or how much data you need. The easiest approach is to just come up with an idea and work on it until you have something. But for those of us with deadlines and expectations, that approach doesn’t fly. Companies that issue you regular paychecks usually want insight into your progress.

This is where being agile matters. An agile data scientist works in small iterations, pivots based on results, and learns along the way. Being agile doesn’t guarantee that an idea will succeed, but it does decrease the amount of time it takes to spot a dead end. Agile data science lets you deliver results on a regular basis and it keeps stakeholders engaged.

The key to agile data science is delivering data products in defined time boxes — say, two- to three-week sprints. Short delivery cycles force us to be creative and break our research into small chunks that can be tested using minimum viable experiments. We deliver something tangible after almost every sprint for our stakeholders to review and give us feedback. Our stakeholders get better visibility into our work, and we learn early on if we are on track.

This approach might sound obvious, but it isn’t always natural for the team. We have to get used to working on just enough to meet stakeholder’s needs and resist the urge to make solutions perfect before moving on. After we make something work in one sprint, we make it better in the next only if we can find a really good reason to do so. Read more…

Comment

The race to build ‘‘big data machines’’ in financial investing

Robot wealth managers and approaches will grow and offer alternative ways of investing.

NASDAQ_studio_Wikipedia

Get the O’Reilly Money Newsletter for news and insights about finance and technology.

Editor’s note: This post originally published in Big Data at Mary Ann Liebert, Inc., Publishers, in Volume 3, Issue 2, on June 18, 2015, under the title “Should You Trust Your Money to a Robot?” It is republished here with permission.

Financial markets emanate massive amounts of data from which machines can, in principle, learn to invest with minimal initial guidance from humans. I contrast human and machine strengths and weaknesses in making investment decisions. The analysis reveals areas in the investment landscape where machines are already very active and those where machines are likely to make significant inroads in the next few years.

Computers are making more and more decisions for us, and increasingly so in areas that require human judgment. Driverless cars, which seemed like science fiction until recently, are expected to become common in the next 10 years. There is a palpable increase in machine intelligence across the touchpoints of our lives, driven by the proliferation of data feeding into intelligent algorithms capable of learning useful patterns and acting on them. We are living through one of the greatest revolutions in our lifestyles, in which computers are increasingly engaged in our lives and decision-making, to a degree that it has become second nature. Recommendations on Amazon or auto-suggestions on Google are now so routine, we find it strange to encounter interfaces that don’t anticipate what we want. The intelligence revolution is well under way, with or without our conscious approval or consent. We are entering the era of intelligence as a service, with access to building blocks for building powerful new applications. Read more…

Comment: 1

The truth about MapReduce performance on SSDs

Cost-per-performance is approaching parity with HDDs.

geometric_stone_Brian_Reynolds_Flickr

Karthik Kambatla co-authored this post.

It is well-known that solid-state drives (SSDs) are fast and expensive. But exactly how much faster — and more expensive — are they than the hard disk drives (HDDs) they’re supposed to replace? And does anything change for big data?

I work on the performance engineering team at Cloudera, a data management vendor. It is my job to understand performance implications across customers and across evolving technology trends. The convergence of SSDs and big data does have the potential to broadly impact future data center architectures. When one of our hardware partners loaned us a number of SSDs with the mandate to “find something interesting,” we jumped on the opportunity. This post shares our findings.

As a starting point, we decided to focus on MapReduce. We chose MapReduce because it enjoys wide deployment across many industry verticals — even as other big data frameworks such as SQL-on-Hadoop, free text search, machine learning, and NoSQL gain prominence.

We considered two scenarios: first, when setting up a new cluster, we explored whether SSDs or HDDs, of equal aggregate bandwidth, are superior; second, we explored how cluster operators should configure SSDs, when upgrading an HDDs-only cluster. Read more…

Comment: 1

Designing with all of the data

More than just filling in where big data leaves off, thick data can provide a new perspective on how people experience designs.

Download a free copy of our new report “Data-Informed Product Design, by Pamela Pavliscak. Editor’s note: this post is an excerpt from the report.

There is a lot of hype about “data-driven” or “data-informed” design, but there is very little agreement about what it really means. Even deciding how to define data is difficult for teams with spotty access to data in the organization, uneven understanding, and little shared language. For some interactive products, it’s possible to have analytics, A/B tests, surveys, intercepts, benchmarks, scores of usability tests, ethnographic studies, and interviews. But what counts as data? And more important, what will inform design in a meaningful way?

When it comes to data, we tend to think in dichotomies: quantitative and qualitative, objective and subjective, abstract and sensory, messy and curated, business and user experience, science and story. Thinking about the key differences can help us to sort out how it fits together, but it can also set up unproductive oppositions. Using data for design does not have to be an either/or; instead, it should be yes, and

Big data and the user experience

At its simplest, big data is data generated by machines recording what people do and say. Some of this data is simply counts — counts of who has come to your website, how they got there, how long they stayed, and what they clicked or tapped. They also could be counts of how many clicked A and how many clicked B, or perhaps counts of purchases or transactions.

For a site such as Amazon, there are a million different articles for sale, several million customers, a hundred million sales transactions, and billions of clicks. The big challenge is how to cluster only the 250,000 best customers or how to reduce 1,000 data dimensions to only two or three relevant ones. Big data has to be cleaned, segmented, and visualized to start getting a sense of what it might mean. Read more…

Comment

Building intelligent machines

To understand deep learning, let’s start simple.

neurons-582054_1280

Use code DATA50 to get 50% off of the new early release of “Fundamentals of Deep Learning: Designing Next-Generation Artificial Intelligence Algorithms.” Editor’s note: This is an excerpt of “Fundamentals of Deep Learning,” by Nikhil Buduma.

The brain is the most incredible organ in the human body. It dictates the way we perceive every sight, sound, smell, taste, and touch. It enables us to store memories, experience emotions, and even dream. Without it, we would be primitive organisms, incapable of anything other than the simplest of reflexes. The brain is, inherently, what makes us intelligent.

The infant brain only weighs a single pound, but somehow, it solves problems that even our biggest, most powerful supercomputers find impossible. Within a matter of days after birth, infants can recognize the faces of their parents, discern discrete objects from their backgrounds, and even tell apart voices. Within a year, they’ve already developed an intuition for natural physics, can track objects even when they become partially or completely blocked, and can associate sounds with specific meanings. And by early childhood, they have a sophisticated understanding of grammar and thousands of words in their vocabularies.

For decades, we’ve dreamed of building intelligent machines with brains like ours — robotic assistants to clean our homes, cars that drive themselves, microscopes that automatically detect diseases. But building these artificially intelligent machines requires us to solve some of the most complex computational problems we have ever grappled with, problems that our brains can already solve in a manner of microseconds. To tackle these problems, we’ll have to develop a radically different way of programming a computer using techniques largely developed over the past decade. This is an extremely active field of artificial computer intelligence often referred to as deep learning. Read more…

Comments: 3

Data modeling with multi-model databases

A case study for mixing different data models within the same data store.

Editor’s note: Full disclosure — the author is a developer and software architect at ArangoDB GmbH, which leads the development of the open source multi-model database ArangoDB.

In recent years, the idea of “polyglot persistence” has emerged and become popular — for example, see Martin Fowler’s excellent blog post. Fowler’s basic idea can be interpreted that it is beneficial to use a variety of appropriate data models for different parts of the persistence layer of larger software architectures. According to this, one would, for example, use a relational database to persist structured, tabular data; a document store for unstructured, object-like data; a key/value store for a hash table; and a graph database for highly linked referential data. Traditionally, this means that one has to use multiple databases in the same project, which leads to some operational friction (more complicated deployment, more frequent upgrades) as well as data consistency and duplication issues.

MultiModel_Oreilly

Figure 1: tables, documents, graphs and key/value pairs: different data models. Image courtesy of Max Neunhöffer.

This is the calamity that a multi-model database addresses. You can solve this problem by using a multi-model database that consists of a document store (JSON documents), a key/value store, and a graph database, all in one database engine and with a unifying query language and API that cover all three data models and even allow for mixing them in a single query. Without getting into too much technical detail, these three data models are specially chosen because an architecture like this can successfully compete with more specialised solutions on their own turf, both with respect to query performance and memory usage. The column-oriented data model has, for example, been left out intentionally. Nevertheless, this combination allows you — to a certain extent — to follow the polyglot persistence approach without the need for multiple data stores. Read more…

Comments: 5

Real-time analytics within the transaction

Integrated data stream platforms are poised to supplant the lambda architecture.

Architecture_public_domain_image_Internet_Archive_Flickr

Data generation is growing exponentially, as is the demand for real-time analytics over fast input data. Traditional approaches to analyzing data in batch mode overcome the computational problems of data volume by scaling horizontally using a distributed system like Apache Hadoop. However, this solution is not feasible for analyzing large data streams in real time due to the scheduling I/O overhead it introduces.

Two main problems occur when batch processing is applied to stream or fast data. First, by the time the analysis is complete, it may already have been outdated by new incoming data. Second, the data may be arriving so fast that it is not feasible to store and batch-process them later, so the data must be processed or summarized when it is received. The Square Kilometer Array (SKA) radio telescope is a good public example of a system in which data must be preprocessed before storage. The SKA is a distributed radio observation project where each base station will receive 10-30 TB/sec and the Central Unit will process 4PB/sec. In this scenario, online summaries of the input data must be computed in real time and then processed — and significantly reduced in size — data is what’s stored.

In the business world, common examples of stream data are sensor networks, Twitter, Internet traffic, logs, financial tickers, click streams, and online bids. Algorithmic solutions enable the computation of summaries, frequency (heavy hitter) and event detection, and other statistical calculations on the stream as a whole or detection of outliers within it.

But what if you need to perform transaction-level analysis — scans across different dimensions of the data set, for example — as well as store the streamed data for fast lookup and retrospective analysis? Read more…

Comment
Four short links: 7 July 2015

Four short links: 7 July 2015

SCIP Berkeley Style, Regular Failures, Web Material Design, and Javascript Breakouts

  1. CS 61AS — Berkeley self-directed Structure and Interpretation of Computer Programs course.
  2. Harbingers of Failure (PDF) — We show that some customers, whom we call ‘Harbingers’ of failure, systematically purchase new products that flop. Their early adoption of a new product is a strong signal that a product will fail – the more they buy, the less likely the product will succeed. Firms can identify these customers either through past purchases of new products that failed, or through past purchases of existing products that few other customers purchase.
  3. Google Material Design LiteA library of Material Design components in CSS, JS, and HTML.
  4. Breakoutsvarious implementations of the classic game Breakout in numerous different [Javascript] engines.
Comment

Unpacking technical jargon in machine learning

A new report explores how to evaluate your machine learning models.

Mathematics_concept_collage_Fir0002_Wikimedia_Commons

Get notified when our free report “Evaluating Machine Learning Models: A beginner’s guide to key concepts and pitfalls” is available for download. Editor’s note: This is an excerpt of “Evaluating Machine Learning Models,” by Alice Zheng.


Alice Zheng will be part of the Data Science Summit and Dato Conference in July — a non-profit event jointly organized by Intel, Comcast, Pandora, Dato, Cloudera, and O’Reilly Media — in San Francisco. Visit the conference website for more information on the program. Use the discount code OREILLY20 and get 20% off either one or both days of the conference.

This report on evaluating machine learning models arose out of a sense of need. The content was first published as a series of six technical posts on the Dato Machine Learning Blog. I was the editor of the blog, and I needed something to publish for the next day. Dato builds machine learning tools that help users build intelligent data products. In our conversations with the community, we sometimes ran into a confusion in terminology. For example, people would ask for cross validation as a feature, when what they really meant was hyperparameter tuning, a feature we already had. So, I thought, “Aha! I’ll just quickly explain what these concepts mean and point folks to the relevant sections in the user guide.”

I sat down to write a blog post to explain cross validation, hold-out data sets, and hyperparameter tuning. After the first two paragraphs, however, I realized that it would take a lot more than a single blog post. The three terms sit at different depths in the concept hierarchy of machine learning model evaluation. Cross validation and hold-out validation are ways of chopping up a data set in order to measure the model’s performance on “unseen” data. Hyperparameter tuning, on the other hand, is a more “meta” process of model selection. But why does the model need “unseen” data, and what’s meta about hyperparameters? In order to explain all of that, I needed to start from the basics. First, I needed to explain the high-level concepts and how they fit together. Only then could I dive into each one in detail. Read more…

Comment