"data science" entries

Building a business that combines human experts and data science

The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this episode of the O’Reilly Data Show, I spoke with Eric Colson, chief algorithms officer at Stitch Fix and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing large data science teams at both Netflix and Stitch Fix.

Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation”

We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It’s a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings that are very passionate about fashion styling. What we can do is, we can abstract their talent into—you can think of it like an API; there’s certain tasks that only a human can do or we’re going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines. … We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that’s distributed across the resources. It’s a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans.
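
To make the idea concrete, here is a minimal sketch (not Stitch Fix’s actual system) of a single pipeline that routes some tasks to machine-learned models and others to human experts. All of the names, such as score_with_model and HumanTaskQueue, are hypothetical.

```python
# Hypothetical sketch of an algorithm "distributed across resources":
# some work is done by machine models, other work is routed to humans.
from dataclasses import dataclass

@dataclass
class Task:
    client_id: str
    features: dict
    needs_human_judgment: bool  # e.g., nuanced styling decisions

def score_with_model(task: Task) -> dict:
    """Placeholder for a machine-scored recommendation."""
    return {"client_id": task.client_id, "recommendation": "model_pick"}

class HumanTaskQueue:
    """Stand-in for 'programmatic access to human talent': tasks queued
    here are completed by expert stylists rather than by code."""
    def submit(self, task: Task) -> dict:
        return {"client_id": task.client_id, "recommendation": "stylist_pick"}

def run_pipeline(tasks, human_queue):
    results = []
    for task in tasks:
        if task.needs_human_judgment:
            results.append(human_queue.submit(task))  # human resource
        else:
            results.append(score_with_model(task))    # machine resource
    return results

tasks = [Task("c1", {"style": "casual"}, needs_human_judgment=True),
         Task("c2", {"style": "formal"}, needs_human_judgment=False)]
print(run_pipeline(tasks, HumanTaskQueue()))
```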

Read more…

Is 2016 the year you let robots manage your money?

The O’Reilly Data Show podcast: Vasant Dhar on the race to build “big data machines” in financial investing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this episode of the O’Reilly Data Show, I sat down with Vasant Dhar, a professor at the Stern School of Business and Center for Data Science at NYU, founder of SCT Capital Management, and editor-in-chief of the Big Data Journal (full disclosure: I’m a member of the editorial board). We talked about the early days of AI and data mining, and recent applications of data science to financial investing and other domains.

Dhar’s first steps in applying machine learning to finance

I joke with people, I say, ‘When I first started looking at finance, the only thing I knew was that prices go up and down.’ It was only when I actually went to Morgan Stanley and took time off from academia that I learned about finance and financial markets. … What I really did in that initial experiment is I took all the trades, I appended them with information about the state of the market at the time, and then I cranked it through a genetic algorithm and a tree induction algorithm. … When I took it to the meeting, it generated a lot of really interesting discussion. … Of course, it took several months before we actually finally found the reasons for why I was observing what I was observing.
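
As a rough illustration of the workflow Dhar describes, the sketch below joins trades with market-state features and then fits a decision tree. The column names and profitability labels are made up, and the genetic algorithm step is omitted.

```python
# Hypothetical sketch: append market-state features to each trade,
# then run tree induction over the result.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

trades = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-01-05 09:31", "2015-01-05 09:45"]),
    "symbol": ["AAA", "BBB"],
    "profitable": [1, 0],  # outcome label for each trade
})
market_state = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-01-05 09:31", "2015-01-05 09:45"]),
    "volatility": [0.8, 1.4],
    "trend": [0.2, -0.5],
})

# Append the state of the market at the time of each trade
data = trades.merge(market_state, on="timestamp", how="left")

X = data[["volatility", "trend"]]
y = data["profitable"]

# Tree induction: learn rules that separate good trades from bad ones
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
```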

Read more…

Investing in big data technologies

The O’Reilly Data Show podcast: A fireside chat with Ben Horowitz, plus Reynold Xin on the rise of Apache Spark in China.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this special holiday episode of the O’Reilly Data Show, I look back at two conversations I had earlier this year at the Spark Summit in San Francisco. The first segment is an on-stage fireside chat with Ben Horowitz, co-founder of Andreessen Horowitz and author of The Hard Thing About Hard Things.

In the second segment, Reynold Xin, one of the architects of Apache Spark, explains the rise of Apache Spark in China.

Subscribe to the O’Reilly Data Show Podcast

Stitcher, TuneIn, iTunes, SoundCloud, RSS

Related resources:

Read more…

Patrick Wendell on Spark’s roadmap, Spark R API, and deep learning on the horizon

The O'Reilly Radar Podcast: A special holiday cross-over of the O'Reilly Data Show Podcast.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this special holiday episode of the Radar Podcast, we’re featuring a cross-over of the O’Reilly Data Show Podcast, which you can find on iTunes, Stitcher, TuneIn, or SoundCloud. O’Reilly’s Ben Lorica hosts that podcast, and in this episode, he chats with Apache Spark release manager and Databricks co-founder Patrick Wendell about where Spark’s roadmap is headed and the interesting applications he’s seeing in the growing Spark ecosystem.

Here are some highlights from their chat:

We were really trying to solve research problems, so we were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. … From the beginning, we were focused on empowering other people and building platforms for other developers.

One of the early users was Conviva, a company that does analytics for real-time video distribution. They were a very early user of Spark, they continue to use it today, and a lot of their feedback was incorporated into our roadmap, especially around the types of APIs they wanted to have that would make data processing really simple for them, and of course, performance was a big issue for them very early on because in the business of optimizing real-time video streams, you want to be able to react really quickly when conditions change. … Early on, things like latency and performance were pretty important.

Read more…

Building a scalable platform for streaming updates and analytics

The O’Reilly Data Show podcast: Evan Chan on the early days of Spark+Cassandra, FiloDB, and cloud computing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In this episode of the O’Reilly Data Show, I sit down with Evan Chan, distinguished engineer at Tuplejump. We talk about the early days of Spark (particularly his contributions to Spark/Cassandra integration), his interesting new open source project (FiloDB), and recent trends in cloud computing.

Bringing Apache Spark & Apache Cassandra together

Datastax credits me with inspiring them to bring Spark into Cassandra … I think they’re very generous about that. I think I was one of the first folks to talk about the possibility of bringing Cassandra and Spark together. The vision that I saw was that Cassandra was really good for real-time updates, but what if we’re able to do more analytical queries on it? Then you could combine, basically, a platform that is really good for real-time updates with analytics.
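
As a loose sketch of that pattern, the PySpark snippet below reads a Cassandra table through the open source spark-cassandra-connector and runs an analytical query over it. The keyspace, table, and column names are hypothetical, and the connector package and contact point depend on your deployment.

```python
# Hypothetical sketch: Cassandra holds real-time updates, Spark runs
# analytical queries over the same data via the spark-cassandra-connector.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-analytics")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

clicks = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="events", table="clicks")  # hypothetical names
          .load())

# Analytical query over data that Cassandra keeps up to date in real time
daily_counts = clicks.groupBy("day", "page").count()
daily_counts.show()
```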

Read more…

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Given the quick pace of innovation in the data ecosystem, we like to step back from the details of individual components, architectures, and applications and take a wider view of the big data landscape. Doing so lets us gauge the progress of technology and infrastructure, shifting our attention from specific components like Spark and Kafka to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and developments in hardware:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a single-machine environment like scikit-learn or R, the set of things you can do in a distributed fashion is limited. … It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in a single-machine environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.
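
To make the contrast concrete, here is a small sketch (with made-up toy data) of the same logistic regression fit once on a single machine with scikit-learn and once as a distributed job with Spark MLlib.

```python
# Hypothetical sketch: single-machine vs. distributed model fitting.
from sklearn.linear_model import LogisticRegression
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression as SparkLR
from pyspark.ml.linalg import Vectors

# Toy data: two features, binary label
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

# Single-machine fit: everything lives in local memory
local_model = LogisticRegression().fit(X, y)

# Distributed fit: the same model over a cluster-resident DataFrame
spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(f), float(label)) for f, label in zip(X, y)],
    ["features", "label"])
distributed_model = SparkLR(featuresCol="features", labelCol="label").fit(df)
```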

Read more…

Four short links: 16 November 2015

Hospital Hacking, Security Data Science, Javascript Face-Substitution, and Multi-Agent Systems Textbook

  1. Hospital Hacking (Bloomberg) — interesting for both lax regulation (“The FDA seems to literally be waiting for someone to be killed before they can say, ‘OK, yeah, this is something we need to worry about,’ ” Rios says.) and the extent of the problem (Last fall, analysts with TrapX Security, a firm based in San Mateo, Calif., began installing software in more than 60 hospitals to trace medical device hacks. […] After six months, TrapX concluded that all of the hospitals contained medical devices that had been infected by malware.). It may take a Vice President’s defibrillator being hacked for things to change. Or would anybody notice?
  2. Cybersecurity and Data Science — pointers to papers in different aspects of using machine learning and statistics to identify misuse and anomalies.
  3. Real-time Face Substitution in Javascript — this is awesome. Moore’s Law is amazing.
  4. Multi-Agent Systems — undergraduate textbook covering distributed systems, game theory, auctions, and more. Electronic version as well as printed book.

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers on Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].
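
MillWheel itself is internal to Google, but the same ideas later surfaced publicly in the Dataflow model and Apache Beam. As a rough illustration of out-of-order, event-time processing, the Beam snippet below assigns event timestamps and counts keys per 60-second event-time window; the data is made up.

```python
# Hypothetical sketch of event-time windowing over out-of-order data.
import apache_beam as beam
from apache_beam.transforms import window

# (user, event_time_in_seconds) pairs arriving out of order
events = [("user_a", 12.0), ("user_b", 70.5), ("user_a", 3.0)]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     # Attach event time so windows reflect when events happened,
     # not the order in which they arrived
     | beam.Map(lambda e: window.TimestampedValue((e[0], 1), e[1]))
     | beam.WindowInto(window.FixedWindows(60))  # 60-second event-time windows
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```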

Read more…

Four short links: 26 October 2015

Dataflow Computers, Data Set Explorer, Design Brief, and Coping with Uncertainty

  1. Dataflow Computers: Their History and Future (PDF) — an entry from the 2008 Wiley Encyclopedia of Computer Science and Engineering.
  2. Mirador — open source tool for visual exploration of complex data sets. It enables users to discover correlation patterns and derive new hypotheses from the data.
  3. How 23AndMe Got Regulatory Approval Back (Fast Company) — In order to meet FDA requirements, the design team had to prove that the reports provided on the website would be comprehensible to any American consumer, regardless of their background or education level. And you thought YOUR design brief was hard.
  4. Getting Comfortable with Uncertainty (The Atlantic) — We have this natural distaste for things that are unfamiliar to us, things that are ambiguous. It goes up from situational stressors, on an individual level and a group level. And we’re stuck with it simply because we have to be ambiguity-reducers.

Four short links: 23 October 2015

Data Science, Temporal Graph, Biomedical Superstars, and VR Primer

  1. 50 Years of Data Science (PDF) — Because all of science itself will soon become data that can be mined, the imminent revolution in Data Science is not about mere “scaling up,” but instead the emergence of scientific studies of data analysis science-wide.
  2. badwolf — a temporal graph store from Google.
  3. Why Biomedical Superstars are Signing on with Google (Nature) — “To go all the way from foundational first principles to execution of vision was the initial draw, and that’s what has continued to keep me here.” Research to retail, at Google scale.
  4. VR Basics — an intro to the terminology and hardware of the next gen of VR, in case you’re late to the goldrush^w exciting field.