"data scientists" entries
The O'Reilly Radar Podcast: Cait O'Riordan on Shazam's predictive analytics, and Francine Bennett on using data for evil.
Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.
In this week’s Radar Podcast, I chat with Cait O’Riordan, VP of product, music and platforms at Shazam. She talks about the current state of predictive analytics and how Shazam is able to predict the success of a song, often in the first few hours after its release. We also talk about the Internet of Things and how products like the Apple Watch affect Shazam’s product life cycles as well as the behaviors of their users.
Predicting the next pop hit
Shazam has more than 100 million monthly active users, and its users Shazam more than 20 million times per day. This, of course, generates a ton of data that Shazam uses in myriad ways, not the least of which is to predict the success of a song. O’Riordan explained how they approach their user data and how they’re able to accurately predict pop hits (and misses):
What’s interesting from a data perspective is when someone takes their phone out of their pocket, unlocks it, finds the Shazam app, and hits the big blue button, they’re not just saying, “I want to know the name of this song.” They’re saying, “I like this song sufficiently to do that.” There’s an amount of effort there that implies some level of liking. That’s really interesting, because you combine that really interesting intention on the part of the user plus the massive data set, you can cut that in lots and lots of different ways. We use it for lots of different things.
At the most basic level, we’re looking at what songs are going to be popular. We can predict, with a relative amount of accuracy, what will hit the Top 100 Billboard Chart 33 days out, roughly. We can look at that in lots of different territories as well. We can also look and see, in the first few hours of a track, whether a big track is going to go on to be successful. We can look at which particular part of the track is encouraging people to Shazam and what makes a popular hit. We know that, for example, for a big pop hit, you’ve got about 10 seconds to convince somebody to find the Shazam app and press that button. There are lots of different ways that we can look at that data, going right into the details of a particular song, zooming out worldwide, or looking in different territories just due to that big worldwide and very engaged audience.
The O'Reilly Data Show Podcast: Gary Kazantsev on how big data and data science are making a difference in finance.
Learn more about Next:Money, O’Reilly’s conference focused on the fundamental transformation taking place in the finance industry.
Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time-series). I quickly found that it was difficult to find and sustain profitable trading strategies that leveraged data sources that everyone else in the industry examined exhaustively. In the early-to-mid 2000s the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.
During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street. Read more…
A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.
Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.Dr. Renetta Garrison Tull is a recognized expert in women and minorities in education, and in the STEM gender gap — both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.
From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia and as preparation for the workforce beyond graduation.
Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…
How data-driven tech toys are — and aren’t — changing the nature of play.
Sign up to be notified when the new free report Data, Technology & The Future of Play becomes available. This post is part of a series investigating the future of play that will culminate in a full report.
When I was in first grade, I cut the fur pom-poms off of my dad’s mukluks. (If you didn’t grow up in the Canadian North and you don’t know what mukluks are, here’s a picture.) My dad’s mukluks were specially made for him, so he was pretty sore. I cut the pom-poms off because I had just seen The Trouble With Tribbles at a friend’s house, and I desperately wanted some Tribbles. I kept them in a shoebox, named them, brought them to show-and-tell, and pretended they were real.
It’s exactly this kind of imaginative play that a lot of parents are afraid is being lost as toys become smarter. And in exchange for what? There isn’t any real evidence yet that smart toys genuinely make kids smarter.
I tell this story not to emphasize what a terrible vandal I was as a child, rather, I tell it to show how irrepressible childrens’ imaginations are, and to explain why technological toys are not going to kill that imagination. Today’s “smart” toys are no different than dolls and blocks, or in my case, a pair of mukluks. By nature, all toys have affordances that imply how they should be used. The more complex the toy, the more focused the affordances are. Consider a stick: it can be a weapon, a mode of transport, or a magic wand. But an app that is designed to do a thing guides users toward that use case, just as a door handle suggests that you should grasp and turn it. Design has opinions. Read more…
Learning the fundamentals of natural language processing.
Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.
When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.
A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Analysis (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.
LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:
- There is some fixed number K of topics.
- There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
- There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
- Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).
In particular, we have a collection of
documents, each of which is a
list of words. And we have a corresponding collection of
document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…
A look at the winners from a showcase of some of the most innovative big data startups.
At Strata + Hadoop World in London last week, we hosted a showcase of some of the most innovative big data startups. Our judges narrowed the field to 10 finalists, from whom they — and attendees — picked three winners and an audience choice.
Underscoring many of these companies was the move from software to services. As industries mature, we see a move from custom consulting to software and, ultimately, to utilities — something Simon Wardley underscored in his Data Driven Business Day talk, and which was reinforced by the announcement of tools like Google’s Bigtable service offering.
This trend was front and center at the showcase:
- Winner Modgen, for example, generates recommendations and predictions, offering machine learning as a cloud-based service.
- While second-place Brytlyt offers their high-performance database as an on-premise product, their horizontally scaled-out architecture really shines when the infrastructure is elastic and cloud based.
- Finally, third-place OpenSensors’ real-time IoT message platform scales to millions of messages a second, letting anyone spin up a network of connected devices.
Ultimately, big data gives clouds something to do. Distributed sensors need a widely available, connected repository into which to report; databases need to grow and shrink with demand; and predictive models can be tuned better when they learn from many data sets. Read more…
The O'Reilly Data Show Podcast: Anima Anandkumar on tensor decomposition techniques for machine learning.
After sitting in on UC Irvine Professor Anima Anandkumar’s Strata + Hadoop World 2015 in San Jose presentation, I wrote a post urging the data community to build tensor decomposition libraries for data science. The feedback I’ve gotten from readers has been extremely positive. During the latest episode of the O’Reilly Data Show Podcast, I sat down with Anandkumar to talk about tensor decomposition, machine learning, and the data science program at UC Irvine.
Modeling higher-order relationships
The natural question is: why use tensors when (large) matrices can already be challenging to work with? Proponents are quick to point out that tensors can model more complex relationships. Anandkumar explains:
Tensors are higher order generalizations of matrices. While matrices are two-dimensional arrays consisting of rows and columns, tensors are now multi-dimensional arrays. … For instance, you can picture tensors as a three-dimensional cube. In fact, I have here on my desk a Rubik’s Cube, and sometimes I use it to get a better understanding when I think about tensors. … One of the biggest use of tensors is for representing higher order relationships. … If you want to only represent pair-wise relationships, say co-occurrence of every pair of words in a set of documents, then a matrix suffices. On the other hand, if you want to learn the probability of a range of triplets of words, then we need a tensor to record such relationships. These kinds of higher order relationships are not only important for text, but also, say, for social network analysis. You want to learn not only about who is immediate friends with whom, but, say, who is friends of friends of friends of someone, and so on. Tensors, as a whole, can represent much richer data structures than matrices.
Scale-out applications need scaled-in virtualization.
Data center operating systems are emerging as a first-class category of distributed system software. Hadoop, for example, is evolving from a MapReduce framework into YARN, a generic platform for scale-out applications.
To enable a rich ecosystem of diverse applications to coexist on these platforms, providing adequate isolation is crucial. The isolation mechanism must enforce resource limits, decouple software dependencies among applications and the host, provide security and privacy, confine failures, etc. Containers offer a simple and elegant solution to the problem. However, a question that comes up frequently is: Why not virtual machines (VMs)? After all, these systems face a number of the same challenges that have been solved by virtualization for traditional enterprise applications.
All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections” — David Wheeler
From linear models to neural networks: an interview with Reza Zadeh.
Get notified when our free report, “Future of Machine Intelligence: Perspectives from Leading Practitioners,” is available for download. The following interview is one of many that will be included in the report.
As part of our ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Reza Zadeh. Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics.
- Neural networks have made a comeback and are playing a growing role in new approaches to machine learning.
- The greatest successes are being achieved via a supervised approach leveraging established algorithms.
- Spark is an especially well-suited environment for distributed machine learning.
David Beyer: Tell us a bit about your work at Stanford
Reza Zadeh: At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs. Read more…
A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.
As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wrote a post on common options, and I closed that piece by asking:
Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?
At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:
- Reproduce their work — so they can rerun and/or audit when needed
- Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
- Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective
As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users). Read more…