"O’Reilly Data Show Podcast" entries

Using Apache Spark to predict attack vectors among billions of users and trillions of events

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.


In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”

Several years ago, I found myself immersed in the security space, and at the time, tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.
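For readers who want a concrete picture of what “large-scale unsupervised algorithms on top of Apache Spark” can look like, here is a minimal PySpark sketch. It is not DataVisor’s actual system: it simply clusters per-user event features with k-means and flags users who land in unusually small clusters. The input path, column names, and threshold are all hypothetical.

```python
# A minimal, hypothetical sketch of unsupervised detection on Spark.
# This is NOT DataVisor's algorithm; the input path, columns, and size
# threshold are invented for illustration.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unsupervised-detection-sketch").getOrCreate()

# Assume per-user aggregates such as logins per day, distinct IPs, account age.
users = spark.read.parquet("s3://example-bucket/user_event_features/")

assembler = VectorAssembler(
    inputCols=["logins_per_day", "distinct_ips", "account_age_days"],
    outputCol="features")
vectors = assembler.transform(users)

model = KMeans(k=50, seed=42, featuresCol="features").fit(vectors)
clustered = model.transform(vectors)  # adds a "prediction" (cluster id) column

# Users in very small clusters behave unlike the bulk of the population;
# treat that as one crude, unsupervised signal worth reviewing.
cluster_sizes = clustered.groupBy("prediction").count()
suspicious = (clustered.join(cluster_sizes, "prediction")
                       .where(F.col("count") < 100))  # arbitrary cutoff
suspicious.select("user_id", "prediction").show()
```

A production system would combine many such signals and score them continuously; this only illustrates the unsupervised flavor of the approach.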


Building a business that combines human experts and data science

The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


In this episode of the O’Reilly Data Show, I spoke with Eric Colson, chief algorithms officer at Stitch Fix, and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing large data science teams at both Netflix and Stitch Fix.

Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation”

We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It’s a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings that are very passionate about fashion styling. What we can do is, we can abstract their talent into—you can think of it like an API; there’s certain tasks that only a human can do or we’re going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines. … We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that’s distributed across the resources. It’s a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans.
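To make the idea of “an algorithm distributed across machine and human resources” a bit more tangible, here is a small, hypothetical Python sketch of a router that sends a task to a model when the model is confident and to a human queue otherwise. Stitch Fix’s actual software is proprietary; every name, type, and threshold below is invented.

```python
# Hypothetical human-in-the-loop router, loosely inspired by the idea of
# "programmatic access to human talent." All names and thresholds are invented.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class HybridRouter:
    model: Callable[[dict], Tuple[str, float]]   # returns (label, confidence)
    confidence_threshold: float = 0.8
    human_queue: List[dict] = field(default_factory=list)

    def handle(self, task: dict) -> str:
        label, confidence = self.model(task)
        if confidence >= self.confidence_threshold:
            return label                          # machine resource handles it
        self.human_queue.append(task)             # route to a human expert
        return "pending_human_review"

# Toy model: "classify" a styling request, with low confidence on unusual ones.
def toy_model(task: dict) -> Tuple[str, float]:
    if task.get("style") in {"casual", "business"}:
        return task["style"], 0.95
    return "unknown", 0.30

router = HybridRouter(model=toy_model)
print(router.handle({"style": "casual"}))        # -> "casual"
print(router.handle({"style": "avant-garde"}))   # -> "pending_human_review"
print(len(router.human_queue))                   # -> 1
```

The interesting design decision is where to set the hand-off between machine and human, which is itself something teams tune with data.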


Investing in big data technologies

The O’Reilly Data Show podcast: A fireside chat with Ben Horowitz, plus Reynold Xin on the rise of Apache Spark in China.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


In this special holiday episode of the O’Reilly Data Show, I look back at two conversations I had earlier this year at the Spark Summit in San Francisco. The first segment is an on-stage fireside chat with Ben Horowitz, co-founder of Andreessen Horowitz and author of The Hard Thing About Hard Things.

In the second segment, Reynold Xin, one of the architects of Apache Spark, explains the rise of Apache Spark in China.

Subscribe to the O’Reilly Data Show Podcast

Stitcher, TuneIn, iTunes, SoundCloud, RSS


Building a scalable platform for streaming updates and analytics

The O’Reilly Data Show podcast: Evan Chan on the early days of Spark+Cassandra, FiloDB, and cloud computing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


In this episode of the O’Reilly Data Show, I sit down with Evan Chan, distinguished engineer at Tuplejump. We talk about the early days of Spark (particularly his contributions to Spark/Cassandra integration), his interesting new open source project (FiloDB), and recent trends in cloud computing.

Bringing Apache Spark & Apache Cassandra together

DataStax credits me with inspiring them to bring Spark into Cassandra … I think they’re very generous about that. I think I was one of the first folks to talk about the possibility of bringing Cassandra and Spark together. The vision that I saw was that Cassandra was really good for real-time updates, but what if we’re able to do more analytical queries on it? Then you could combine, basically, a platform that is really good for real-time updates with analytics.
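The pattern Chan describes, real-time writes landing in Cassandra while Spark scans the same data for analytics, is commonly wired up through the DataStax spark-cassandra-connector. Here is a minimal sketch of the read/analytics side; the connector version, keyspace, table, and column names are assumptions for illustration, not details from the episode.

```python
# Minimal sketch of Spark-on-Cassandra analytics using the DataStax
# spark-cassandra-connector. Keyspace, table, and columns are hypothetical.
# Launch with a connector package matching your Spark build, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 ...
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("cassandra-analytics-sketch")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Cassandra handles the real-time writes; Spark scans the same table for analytics.
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="app", table="user_events")   # hypothetical names
          .load())

daily_counts = (events
                .groupBy(F.to_date("event_time").alias("day"), "event_type")
                .count()
                .orderBy("day"))
daily_counts.show()
```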


Graph databases are powering mission-critical applications

The O’Reilly Data Show Podcast: Emil Eifrem on popular applications of graph technologies, cloud computing, and company culture.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


While most people associate graphs with social media analysis, there is a wide range of applications — including recommendations, fraud detection, IT operations, and security — that are routinely framed using graphs. This wide variety of use cases has given rise to many interesting tools for storing, managing, visualizing, and analyzing massive graphs. The important thing to note is that graph databases are not limited to reporting and analytics, but are also being used to power mission-critical applications.
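As one concrete example of framing a problem as a graph, here is a short, hypothetical sketch that uses the official neo4j Python driver to run a collaborative-filtering style recommendation query in Cypher. The connection details, node labels, and relationship types are invented for illustration.

```python
# Hypothetical sketch: a recommendation query expressed as a graph traversal,
# run through the official neo4j Python driver. Labels, relationship types,
# and credentials are made up.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Products bought by people who bought what I bought, ranked by overlap.
RECOMMEND = """
MATCH (me:Person {name: $name})-[:BOUGHT]->(p:Product)
      <-[:BOUGHT]-(other:Person)-[:BOUGHT]->(rec:Product)
WHERE NOT (me)-[:BOUGHT]->(rec)
RETURN rec.name AS recommendation, count(*) AS score
ORDER BY score DESC LIMIT 5
"""

with driver.session() as session:
    for record in session.run(RECOMMEND, name="Alice"):
        print(record["recommendation"], record["score"])

driver.close()
```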

In this episode of the O’Reilly Data Show, I sat down with Emil Eifrem, CEO and co-founder of Neo Technology. We talked about the early days of NoSQL, applications of graph databases, cloud computing, and company culture in the U.S. and Sweden.

Graph and NoSQL databases

The relational database had been an accelerator, and here it’s really slowing us down. What we ended up concluding was that the problem was this mismatch between the shape of the data and the abstractions that were exposed by our infrastructure. At that point, we said, okay, what if we had a database that just exposed these amazing network-oriented data structures or graph-oriented data structures, but other than that, had all the properties of a relational database. Wouldn’t that be great? …  Ultimately, we said the famous last words: ‘Hey, let’s just build it ourselves. How hard can it be?’ It turns out it’s 15 years later!

2007 is when the Dynamo paper and the BigTable paper had been published, out of Amazon and Google, respectively. That’s when, in early adopter circles, the discourse started to change … maybe the era of the one-size-fits-all database is over. Maybe our job isn’t to take all of our data and shove it through a relational database. Maybe there are some other tools and technologies and abstractions out there that make better sense for some data. That was in ’07. I really think it was as if lightning struck in the community. … [Dynamo and BigTable were announced] and the next day, 12 open source projects, implementing it, and then the next day, 24 new ones. It was just crazy back then.


Turning big data into actionable insights

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.


Can developments in data science and big data infrastructure drive corporate innovation? To be fair, many companies are still in the early stages of incorporating these ideas and tools into their organizations.

Evangelos Simoudis has spent many years interacting with entrepreneurs and executives at major global corporations. Most recently, he’s been advising companies interested in developing long-term strategies pertaining to big data, data science, cloud computing, and innovation. He began his career as a data mining researcher and practitioner, and is counted among the pioneers who helped data mining technologies get adopted in industry.

In this episode of the O’Reilly Data Show, I sat down with Simoudis and we talked about his thoughts on investing, data applications and products, and corporate innovation:

Open source software companies

I very much appreciate open source. I encourage my portfolio companies to use open source components as appropriate, but I’ve never seen the business model as being one that is particularly easy to really build the companies around them. Everybody points to Red Hat, and that may be the exception, but I have not seen companies that have, on the one hand, remained true to the open source principles and become big and successful companies that do not require constant investment. … The revenue streams never prove to be sufficient for building big companies. I think the companies that get started from open source in order to become big and successful … [are] ones that, at some point, decided to become far more proprietary in their model and in the services that they deliver. Or they become pure professional services companies as opposed to support services companies. Then they reach the necessary levels of success.


Resolving transactional access and analytic performance trade-offs

The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu — that’s good at both table scans (analytics) and random access (updates and inserts).

While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.

During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:

HDFS and HBase

[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS.
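Lipcon’s contrast maps onto very different client code. Below is a hedged sketch: a full-scan aggregation over immutable files in HDFS with PySpark, next to a single-row update and read against HBase via happybase (one common Python client for HBase’s Thrift gateway). The paths, hosts, tables, and column families are hypothetical.

```python
# Sketch of the distinction described above: HDFS is good at full scans,
# HBase at row-level random access. Paths, table names, and column families
# are hypothetical.
from pyspark.sql import SparkSession, functions as F
import happybase

# --- Scan-oriented access: read an entire dataset off HDFS and aggregate it.
spark = SparkSession.builder.appName("hdfs-scan-sketch").getOrCreate()
orders = spark.read.parquet("hdfs:///data/orders/")            # immutable files
orders.groupBy("customer_id").agg(F.sum("amount").alias("total")).show()

# --- Random access: read and update a single customer record in HBase.
connection = happybase.Connection("hbase-thrift-host")          # hypothetical host
table = connection.table("customers")
table.put(b"customer-42", {b"info:email": b"new@example.com"})  # in-place update
row = table.row(b"customer-42")                                 # millisecond lookup
print(row.get(b"info:email"))
```

Kudu’s pitch, as Lipcon describes it, is to make both of these access patterns reasonable within a single storage layer.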

Bridging the divide: Business users and machine learning experts

The O'Reilly Data Show Podcast: Alice Zheng on feature representations, model evaluation, and machine learning models.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

As tools for advanced analytics become more accessible, data scientists’ roles will evolve. Most media stories emphasize a need for expertise in algorithms and quantitative techniques (machine learning, statistics, probability), and yet the reality is that expertise in advanced algorithms is just one aspect of industrial data science.

During the latest episode of the O’Reilly Data Show podcast, I sat down with Alice Zheng, one of Strata + Hadoop World’s most popular speakers. She has a gift for explaining complex topics to a broad audience, through presentations and in writing. We talked about her background, techniques for evaluating machine learning models, how much math data scientists need to know, and the art of interacting with business users.

Making machine learning accessible

People who work at getting analytics adopted and deployed learn early on the importance of working with domain/business experts. As excited as I am about the growing number of tools that open up analytics to business users, the interplay between data experts (data scientists, data engineers) and domain experts remains important. In fact, human-in-the-loop systems are being used in many critical data pipelines. Zheng recounts her experience working with business analysts:

It’s not enough to tell someone, “This is done by boosted decision trees, and that’s the best classification algorithm, so just trust me, it works.” As a builder of these applications, you need to understand what the algorithm is doing in order to make it better. As a user who ultimately consumes the results, it can be really frustrating to not understand how they were produced. When we worked with analysts in Windows or in Bing, we were analyzing computer system logs. That’s very difficult for a human being to understand. We definitely had to work with the experts who understood the semantics of the logs in order to make progress. They had to understand what the machine learning algorithms were doing in order to provide useful feedback.
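Zheng’s example pairs a specific model family (boosted decision trees) with the need to explain results to analysts. As one small illustration of the second half of that, here is a scikit-learn sketch that trains a gradient-boosted classifier on synthetic data and prints named feature importances, a simple artifact to review with domain experts. The feature names are placeholders, not real log fields.

```python
# Minimal scikit-learn sketch: a boosted-tree classifier plus one simple way
# to give domain experts a window into what it learned. Data is synthetic and
# the feature names are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
feature_names = [f"log_field_{i}" for i in range(X.shape[1])]  # hypothetical

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))

# Surface which (named) inputs drive the predictions, so analysts who know
# the semantics of the logs can sanity-check the model.
for name, importance in sorted(zip(feature_names, model.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```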

Pattern recognition and sports data

The O'Reilly Data Show Podcast: Award-winning journalist David Epstein on the (data) science of sports.

Sign up now to receive a free download of the new O’Reilly report “Data Analytics in Sports: How Playing with Data Transforms the Game” when it publishes this fall.


Julien Vervaecke and Maurice Geldhof smoking a cigarette at the 1927 Tour de France. Public domain photo via Wikimedia Commons.

One of my favorite books from the last few years is The Sports Gene, David Epstein’s engaging tour through sports science, built on examples and stories from a wide variety of athletic endeavors. Epstein draws on individual sports (including track and field and winter sports) and major U.S. team sports (baseball, basketball, and American football), and uses the latest research to explain how data and science are being used to improve athletic performance.

In a recent episode of the O’Reilly Data Show Podcast, I spoke with Epstein about his book, data science and sports, and his recent series of articles detailing suspicious practices at one of the world’s premier track and field training programs (the Oregon Project).

Nature/nurture and hardware/software

Epstein’s book contains examples of sports where athletes with certain physical attributes start off with an advantage. In relation to that, we discussed feature selection and feature engineering — the relative importance of factors like training methods, technique, genes, equipment, and diet — topics which Epstein has written about and studied extensively:

One of the most important findings in sports genetics is that your ability to improve with respect to a certain training program is mediated by your genes, so it’s really important to find the kind of training program that’s best tailored to your physiology. … The skills it takes for team sports, these perceptual skills, nobody is born with those. Those are completely software, to use the computer analogy. But it turns out that once the software is downloaded, it’s like a computer. While your hardware doesn’t do anything alone without software, once you’ve got the software, the hardware actually makes a lot of difference in how good of an operating machine you have. It can be obscured when people don’t study it correctly, which is why I took on some of the 10,000 hours stuff.
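To connect the feature-selection framing to something runnable, here is a toy sketch on entirely synthetic “athlete” data: fit a model to a made-up performance measure and ask, via permutation importance, which factors mattered. It illustrates the question being asked, not any real finding from Epstein’s work.

```python
# Toy illustration of the feature-selection question: given synthetic athlete
# data, which factors explain performance most? All data here is fabricated.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
training_hours = rng.uniform(0, 30, n)
genetic_response = rng.normal(1.0, 0.3, n)      # stand-in for trainability
equipment_score = rng.uniform(0, 1, n)

# Synthetic "performance": response to training dominates, by construction.
performance = (training_hours * genetic_response
               + 2 * equipment_score
               + rng.normal(0, 2, n))

X = np.column_stack([training_hours, genetic_response, equipment_score])
names = ["training_hours", "genetic_response", "equipment_score"]

model = RandomForestRegressor(random_state=0).fit(X, performance)
result = permutation_importance(model, X, performance, n_repeats=10,
                                random_state=0)
for name, score in zip(names, result.importances_mean):
    print(f"{name}: {score:.2f}")
```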

Understanding neural function and virtual reality

The O'Reilly Data Show Podcast: Poppy Crum explains that what matters is efficiency in identifying and emphasizing relevant data.


Like many data scientists, I’m excited about advances in large-scale machine learning, particularly recent success stories in computer vision and speech recognition. But I’m also cognizant of the fact that press coverage tends to inflate what current systems can do, and their similarities to how the brain works.

During the latest episode of the O’Reilly Data Show Podcast, I had a chance to speak with Poppy Crum, a neuroscientist who gave a well-received keynote at Strata + Hadoop World in San Jose. She leads a research group at Dolby Labs and teaches a popular course at Stanford on Neuroplasticity in Musical Gaming. I wanted to get her take on AI and virtual reality systems, and hear about her experience building a team of researchers from diverse disciplines.

Understanding neural function

While it can sometimes be nice to mimic nature, in the case of the brain, machine learning researchers recognize that understanding and identifying the essential neural processes is much more critical. A related example cited by machine learning researchers is flight: wing flapping and feathers aren’t critical, but an understanding of physics and aerodynamics is essential.

Crum and other neuroscience researchers express the same sentiment. She points out that a more meaningful goal should be to “extract and integrate relevant neural processing strategies when applicable, but also identify where there may be opportunities to be more efficient.”

The goal in technology shouldn’t be to build algorithms that mimic neural function. Rather, it’s to understand neural function. … The brain is basically, in many cases, a Rube Goldberg machine. We’ve got this limited set of evolutionary building blocks that we are able to use to get to a sort of very complex end state. We need to be able to extract when that’s relevant and integrate relevant neural processing strategies when it’s applicable. We also want to be able to identify that there are opportunities to be more efficient and more relevant. I think of it as table manners. You have to know all the rules before you can break them. That’s the big difference between being really cool or being a complete heathen. The same thing kind of exists in this area. How we get to the end state, we may be able to compromise, but we absolutely need to be thinking about what matters in neural function for perception. From my world, where we can’t compromise is on the output. I really feel like we need a lot more work in this area.