The O'Reilly Data Show Podcast: Ben Recht on optimization, compressed sensing, and large-scale machine learning pipelines.
As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with my co-organizer Ben Recht for the latest episode of the O’Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and member of AMPLab, and his research spans many areas of interest to data scientists, including optimization, compressed sensing, statistics, and machine learning.
At the 2014 Strata + Hadoop World in NYC, Recht gave an overview of a nascent AMPLab research initiative into machine learning pipelines. The research team behind the project recently released an alpha version of a new software framework called KeystoneML, which gives developers a chance to test out some of the ideas that Recht outlined in his talk last year. We devoted a portion of this Data Show episode to machine learning pipelines in general, and a discussion of KeystoneML in particular.
Since its release in May, I’ve had a chance to play around with KeystoneML and while it’s quite new, there are several things I already like about it:
KeystoneML opens up new data types
Most data scientists don’t normally play around with images or audio files. KeystoneML ships with easy-to-use sample pipelines for computer vision and speech. As more data loaders get created, KeystoneML will enable data scientists to work with many more data types and tackle new problems. Read more…
A real-world example of how a short delivery cycle fosters creativity.
I lead a research team of data scientists responsible for discovering insights that lead to market and competitive intelligence for our company, Computer Sciences Corporation (CSC). We are a busy group. We get questions from all different areas of the company and it’s important to be agile.
The nature of data science is experimental. You don’t know the answer to the question asked of you — or even if an answer exists. You don’t know how long it will take to produce a result or how much data you need. The easiest approach is to just come up with an idea and work on it until you have something. But for those of us with deadlines and expectations, that approach doesn’t fly. Companies that issue you regular paychecks usually want insight into your progress.
This is where being agile matters. An agile data scientist works in small iterations, pivots based on results, and learns along the way. Being agile doesn’t guarantee that an idea will succeed, but it does decrease the amount of time it takes to spot a dead end. Agile data science lets you deliver results on a regular basis and it keeps stakeholders engaged.
The key to agile data science is delivering data products in defined time boxes — say, two- to three-week sprints. Short delivery cycles force us to be creative and break our research into small chunks that can be tested using minimum viable experiments. We deliver something tangible after almost every sprint for our stakeholders to review and give us feedback. Our stakeholders get better visibility into our work, and we learn early on if we are on track.
This approach might sound obvious, but it isn’t always natural for the team. We have to get used to working on just enough to meet stakeholders’ needs and resist the urge to make solutions perfect before moving on. After we make something work in one sprint, we make it better in the next only if we can find a really good reason to do so. Read more…
Robot wealth managers and approaches will grow and offer alternative ways of investing.
Editor’s note: This post was originally published in Big Data at Mary Ann Liebert, Inc., Publishers, in Volume 3, Issue 2, on June 18, 2015, under the title “Should You Trust Your Money to a Robot?” It is republished here with permission.
Financial markets emanate massive amounts of data from which machines can, in principle, learn to invest with minimal initial guidance from humans. I contrast human and machine strengths and weaknesses in making investment decisions. The analysis reveals areas in the investment landscape where machines are already very active and those where machines are likely to make significant inroads in the next few years.
Computers are making more and more decisions for us, and increasingly so in areas that require human judgment. Driverless cars, which seemed like science fiction until recently, are expected to become common in the next 10 years. There is a palpable increase in machine intelligence across the touchpoints of our lives, driven by the proliferation of data feeding into intelligent algorithms capable of learning useful patterns and acting on them. We are living through one of the greatest revolutions in our lifestyles, in which computers are increasingly engaged in our lives and decision-making, to a degree that it has become second nature. Recommendations on Amazon or auto-suggestions on Google are now so routine, we find it strange to encounter interfaces that don’t anticipate what we want. The intelligence revolution is well under way, with or without our conscious approval or consent. We are entering the era of intelligence as a service, with access to building blocks for building powerful new applications. Read more…
As augmented reality technologies emerge, we must place the focus on serving human needs.
Augmented reality (AR), wearable technology, and the Internet of Things (IoT) are all really about human augmentation. They are coming together to create a new reality that will forever change the way we experience the world. As these technologies emerge, we must place the focus on serving human needs.
The Internet of Things and Humans
Tim O’Reilly suggested the word “Humans” be appended to the term IoT. “This is a powerful way to think about the Internet of Things because it focuses the mind on the human experience of it, not just the things themselves,” wrote O’Reilly. “My point is that when you think about the Internet of Things, you should be thinking about the complex system of interaction between humans and things, and asking yourself how sensors, cloud intelligence, and actuators (which may be other humans for now) make it possible to do things differently.”
I share O’Reilly’s vision for the IoTH and propose we extend this perspective and apply it to the new AR that is emerging: let’s take the focus away from the technology and instead emphasize the human experience.
The definition of AR we have come to understand is a digital layer of information (including images, text, video, and 3D animations) viewed on top of the physical world through a smartphone, tablet, or eyewear. This definition of AR is expanding to include things like wearable technology, sensors, and artificial intelligence (AI) to interpret your surroundings and deliver a contextual experience that is meaningful and unique to you. It’s about a new sensory awareness, deeper intelligence, and heightened interaction with our world and each other. Read more…
Cost-per-performance is approaching parity with HDDs.
Karthik Kambatla co-authored this post.
It is well-known that solid-state drives (SSDs) are fast and expensive. But exactly how much faster — and more expensive — are they than the hard disk drives (HDDs) they’re supposed to replace? And does anything change for big data?
I work on the performance engineering team at Cloudera, a data management vendor. It is my job to understand performance implications across customers and across evolving technology trends. The convergence of SSDs and big data does have the potential to broadly impact future data center architectures. When one of our hardware partners loaned us a number of SSDs with the mandate to “find something interesting,” we jumped on the opportunity. This post shares our findings.
As a starting point, we decided to focus on MapReduce. We chose MapReduce because it enjoys wide deployment across many industry verticals — even as other big data frameworks such as SQL-on-Hadoop, free text search, machine learning, and NoSQL gain prominence.
We considered two scenarios: first, when setting up a new cluster, we explored whether SSDs or HDDs of equal aggregate bandwidth are superior; second, we explored how cluster operators should configure SSDs when upgrading an HDD-only cluster. Read more…
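To make "equal aggregate bandwidth" concrete, here is a minimal sketch of the arithmetic involved: how many drives of each kind are needed to reach the same combined bandwidth, and what each drive costs per unit of bandwidth. The drive speeds, prices, and target below are made-up placeholders chosen only to show the calculation; they are not figures from the post or from any vendor.

```python
import math

# All figures below are made-up placeholders, not measurements from the post.
HDD = {"mb_per_s": 120.0, "price_usd": 100.0}
SSD = {"mb_per_s": 500.0, "price_usd": 400.0}


def drives_needed(target_mb_per_s, drive):
    """Smallest number of drives whose combined bandwidth meets the target."""
    return math.ceil(target_mb_per_s / drive["mb_per_s"])


def cost_per_mb_per_s(drive):
    """Dollars paid per MB/s of bandwidth for a single drive."""
    return drive["price_usd"] / drive["mb_per_s"]


target = 1200.0  # hypothetical aggregate bandwidth per node, in MB/s
for name, drive in (("HDD", HDD), ("SSD", SSD)):
    count = drives_needed(target, drive)
    unit_cost = round(cost_per_mb_per_s(drive), 2)
    print(name, count, "drives,", unit_cost, "USD per MB/s")
```

Substituting measured bandwidth and current prices for the placeholders gives a rough cost-per-performance comparison for a specific cluster.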
More than just filling in where big data leaves off, thick data can provide a new perspective on how people experience designs.
Download a free copy of our new report “Data-Informed Product Design,” by Pamela Pavliscak. Editor’s note: this post is an excerpt from the report.
There is a lot of hype about “data-driven” or “data-informed” design, but there is very little agreement about what it really means. Even deciding how to define data is difficult for teams with spotty access to data in the organization, uneven understanding, and little shared language. For some interactive products, it’s possible to have analytics, A/B tests, surveys, intercepts, benchmarks, scores of usability tests, ethnographic studies, and interviews. But what counts as data? And more important, what will inform design in a meaningful way?
When it comes to data, we tend to think in dichotomies: quantitative and qualitative, objective and subjective, abstract and sensory, messy and curated, business and user experience, science and story. Thinking about the key differences can help us to sort out how it fits together, but it can also set up unproductive oppositions. Using data for design does not have to be an either/or; instead, it should be yes, and…
Big data and the user experience
At its simplest, big data is data generated by machines recording what people do and say. Some of this data is simply counts — counts of who has come to your website, how they got there, how long they stayed, and what they clicked or tapped. They also could be counts of how many clicked A and how many clicked B, or perhaps counts of purchases or transactions.
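As a rough illustration of the "counts" described above, the sketch below tallies how many visitors saw variant A or B and how many of them clicked. The event log, its field layout, and the function name `count_clicks` are hypothetical and invented for this example; real analytics data would come from a logging or analytics system.

```python
from collections import Counter

# Hypothetical click log: (visitor_id, variant, clicked) tuples, invented for illustration.
events = [
    ("v1", "A", True),
    ("v2", "B", False),
    ("v3", "A", False),
    ("v4", "B", True),
    ("v5", "B", True),
]


def count_clicks(events):
    """Return two Counters: visitors shown each variant, and visitors who clicked it."""
    shown = Counter(variant for _, variant, _ in events)
    clicked = Counter(variant for _, variant, did_click in events if did_click)
    return shown, clicked


shown, clicked = count_clicks(events)
for variant in sorted(shown):
    rate = clicked[variant] / shown[variant]
    print(f"{variant}: {clicked[variant]} of {shown[variant]} clicked ({rate:.0%})")
```

Counts like these say what people did, not why they did it, which is the gap the rest of the excerpt is concerned with.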
For a site such as Amazon, there are a million different articles for sale, several million customers, a hundred million sales transactions, and billions of clicks. The big challenge is how to cluster only the 250,000 best customers or how to reduce 1,000 data dimensions to only two or three relevant ones. Big data has to be cleaned, segmented, and visualized to start getting a sense of what it might mean. Read more…