- Three Best Practices for Building Successful Data Pipelines (Michael Li) — three key areas that are often overlooked in data pipelines: making your analysis reproducible, consistent, and productionizable.
- Amazon’s Culture Controversy Decoded (Rita J King) — very interesting culture map analysis of the reports of Amazon’s culture, and context for how companies make choices about what to be. (via Mike Loukides)
- How Will Real-Time Tracking Change the NFL? (New Yorker) — At the moment, the NFL is being tightfisted with the data. Commentators will have access during games, as will the betting and analytics firm Sportradar. Users of the league’s Xbox One app, which provides an interactive way of browsing video clips, fantasy-football statistics, and other metrics, will be able to explore a feature called Next Gen Replay, which allows them to track each player’s speed and trajectory, combining moving lines on a virtual field with live footage from the real one. But, for now, coaches are shut out; once a player exits the locker room on game day, the dynamic point cloud that is generated by his movement through space is a corporately owned data set, as outlined in the league’s 2011 collective-bargaining agreement. Which should tell you all you need to know about the NFL’s role in promoting sporting excellence.
- Giraffe: Using Deep Reinforcement Learning to Play Chess (Matthew Lai) — Giraffe, a chess engine that uses self-play to discover all its domain-specific knowledge, with minimal hand-crafted knowledge given by the programmer. See also the code. (via GitXiv)
"machine learning" entries
A beginner's guide to evaluating your machine learning models.
Everything today is being quantified, measured, and tracked — everything is generating data, and data is powerful. Businesses are using data in a variety of ways to improve customer satisfaction. For instance, data scientists are building machine learning models to generate intelligent recommendations to users so that they spend more time on a site. Analysts can use churn analysis to predict which customers are the best targets for the next promotional campaign. The possibilities are endless.
However, there are challenges in the machine learning pipeline. Typically, you build a machine learning model on top of your data. You collect more data. You build another model. But how do you know when to stop?
When is your smart model smart enough?
Evaluation is a key step when building intelligent business applications with machine learning. It is not a one-time task, but must be integrated with the whole pipeline of developing and productionizing machine learning-enabled applications.
In a new free O’Reilly report Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls, we cut through the technical jargon of machine learning, and elucidate, in simple language, the processes of evaluating machine learning models. Read more…
The O'Reilly Data Show Podcast: Alice Zheng on feature representations, model evaluation, and machine learning models.
Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.
As tools for advanced analytics become more accessible, data scientists’ roles will evolve. Most media stories emphasize a need for expertise in algorithms and quantitative techniques (machine learning, statistics, probability), and yet the reality is that expertise in advanced algorithms is just one aspect of industrial data science.
During the latest episode of the O’Reilly Data Show podcast, I sat down with Alice Zheng, one of Strata + Hadoop World’s most popular speakers. She has a gift for explaining complex topics to a broad audience, through presentations and in writing. We talked about her background, techniques for evaluating machine learning models, how much math data scientists need to know, and the art of interacting with business users.
Making machine learning accessible
People who work at getting analytics adopted and deployed learn early on the importance of working with domain/business experts. As excited as I am about the growing number of tools that open up analytics to business users, the interplay between data experts (data scientists, data engineers) and domain experts remains important. In fact, human-in-the-loop systems are being used in many critical data pipelines. Zheng recounts her experience working with business analysts:
It’s not enough to tell someone, “This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.” As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback. Read more…
The O'Reilly Radar Podcast: Bradley Voytek on data's role in neuroscience, the brain scanner, and zombie brains in STEM.
Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.
In this week’s Radar Podcast, O’Reilly’s Mac Slocum chats with Bradley Voytek, an assistant professor of cognitive science and neuroscience at UC San Diego. Voytek talks about using data-driven approaches in his neuroscience work, the brain scanner project, and applying cognitive neuroscience to the zombie brain.
Here are a few snippets from their chat:
In the neurosciences, we’ve got something like three million peer reviewed publications to go through. When I was working on my Ph.D., I was very interested, in particular, in two brain regions. I wanted to know how these two brain regions connect, what are the inputs to them and where do they output to. In my naivety as a Ph.D. student, I had assumed there would be some sort of nice 3D visualization, where I could click on a brain region and see all of its inputs and outputs. Such a thing did not exist — still doesn’t, really. So instead, I ended up spending three or four months of my Ph.D. combing through papers written in the 1970s … and I kept thinking to myself, this is ridiculous, and this just stewed in the back of my mind for a really long time.
Sitting at home [with my wife], I said, I think I’ve figured out how to address this problem I’m working on, which is basically very simple text mining. Let’s just scrape the text of these three million papers, or at least the titles and abstracts, and see what words co-occur frequently together. It was very rudimentary text mining, with the idea that if words co-occur frequently … this might give us an index of how related things are, and she challenged me to a code-off.
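The co-occurrence idea Voytek describes can be sketched in a few lines of Python. This is a minimal illustration, not the code from his project: the function name `cooccurrence_counts`, the toy abstracts, and the term list are all invented for the example, and it assumes a pair of terms is counted once for each document in which both appear.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, vocabulary):
    """Count, for each pair of vocabulary terms, how many documents contain both."""
    vocab = {term.lower() for term in vocabulary}
    pair_counts = Counter()
    for text in documents:
        # Reduce each document to its set of lowercase words.
        words = set(re.findall(r"[a-z]+", text.lower()))
        # Sort so each pair is always stored in the same order.
        present = sorted(words & vocab)
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy stand-ins for scraped titles and abstracts (hypothetical data).
abstracts = [
    "The hippocampus projects to the prefrontal cortex.",
    "Hippocampus and amygdala activity during fear learning.",
    "Prefrontal cortex lesions alter hippocampus dependent memory.",
]
terms = ["hippocampus", "amygdala", "prefrontal"]

counts = cooccurrence_counts(abstracts, terms)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with higher counts are the ones this rough index would treat as more closely related. A real corpus would also call for normalizing by how often each term appears on its own, which this sketch leaves out.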
How to almost necessarily succeed: An interview with Google research scientist Ilya Sutskever.
Get notified when our free report “Future of Machine Intelligence: Perspectives from Leading Practitioners” is available for download. The following interview is one of many that will be included in the report.
Ilya Sutskever is a research scientist at Google and the author of numerous publications on neural networks and related topics. Sutskever is a co-founder of DNNresearch and was named Canada’s first Google Fellow.
- Since humans can solve perception problems very quickly, despite our neurons being relatively slow, moderately deep and large neural networks have enabled machines to succeed in a similar fashion.
- Unsupervised learning is still a mystery, but a full understanding of that domain has the potential to fundamentally transform the field of machine learning.
- Attention models represent a promising direction for powerful learning algorithms that require ever less data to be successful on harder problems.
David Beyer: Let’s start with your background. What was the evolution of your interest in machine learning, and how did you zero-in on your Ph.D. work?
Ilya Sutskever: I started my Ph.D. just before deep learning became a thing. I was working on a number of different projects, mostly centered around neural networks. My understanding of the field crystallized when collaborating with James Martens on the Hessian-free optimizer. At the time, greedy layer-wise training (training one layer at a time) was extremely popular. Working on the Hessian-free optimizer helped me understand that if you just train a very large and deep neural network on a lot of data, you will almost necessarily succeed. Read more…
How machine learning plus expert sourcing can unify customer data at scale.
Watch the free webcast Integrating Customer Data at Scale to learn how Toyota Motor Europe was able to unify its customer data at scale.
Enterprises that are capable of gaining a unified view of their customer data can achieve added business enhancements and user opportunities. Capturing customer data, however, can be a difficult task, as most systems rely on traditional “top-down” approaches to standardizing data. In a recent O’Reilly webcast, Integrating Customer Data at Scale, Tamr field engineer Alan Wagner hosts a Q&A session with Matt Stevens, the general manager at Toyota Motor Europe, to demonstrate how a leading enterprise uses a third-generation system like Tamr to simplify the process of unifying customer data.
In the webcast, Stevens explains how Toyota Motor Europe has gained a 360-degree view of its customers through the Tamr Data Unification Platform, which takes a machine learning and expert-sourcing “human guided workflow” approach to data unification. Wagner provides a demo of the Tamr platform, applied within a Salesforce application, to demonstrate the ability to capture and unify customer data. Read more…