"data science" entries

Dimensionality reduction at the command line

Introducing Tapkee, an efficient command-line tool and C++ library for linear and nonlinear dimensionality reduction.

Get 50% off the “Data Science at the Command Line” ebook with code DATA50.

Editor’s Note: This post is a slightly adapted excerpt from Jeroen Janssens’ recent book, “Data Science at the Command Line.” To follow along with the code, and learn more about the various tools, you can install the Data Science Toolbox, a free virtual machine that runs on Microsoft Windows, Mac OS X, and Linux, and has all the command-line tools pre-installed.

The goal of dimensionality reduction is to map high-dimensional data points onto a lower dimensional space. The challenge is to keep similar data points close together on the lower-dimensional mapping. As we’ll see in the next section, our data set contains 13 features. We’ll stick with two dimensions because that’s straightforward to visualize.

Dimensionality reduction is often regarded as part of the exploring step. It’s useful when there are too many features to plot: you could do a scatter plot matrix, but that only shows you two features at a time. It’s also useful as a preprocessing step for other machine-learning algorithms. Most dimensionality reduction algorithms are unsupervised, which means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.

In this post, we’ll use Tapkee, a new command-line tool for performing dimensionality reduction. More specifically, we’ll demonstrate two techniques: PCA, which stands for Principal Components Analysis (Pearson, 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008). Coincidentally, t-SNE was discussed in detail in a recent O’Reilly blog post. But first, let’s obtain, scrub, and explore the data set we’ll be using. Read more…
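The excerpt itself uses Tapkee from the command line; purely as a hedged sketch of the same two techniques, here is roughly how a 13-feature data set could be reduced to two dimensions with scikit-learn in Python (the file name, column layout, and parameters are assumptions, not the book’s code):

```python
# Rough sketch of the two techniques named above (PCA and t-SNE),
# using scikit-learn rather than Tapkee; file name and columns are assumed.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumed: a CSV with 13 numeric feature columns plus a "label" column.
data = pd.read_csv("wine.csv")
features = StandardScaler().fit_transform(data.drop(columns=["label"]))

pca_2d = PCA(n_components=2).fit_transform(features)                    # linear
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(features)  # nonlinear
```

Both calls return an array with one 2D point per sample, which can then be scatter-plotted, with similar data points ideally landing close together.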

Create your graphs with R

A deep-dive into exploratory and presentation graphs.

Buy “Graphing Data with R: An Introduction” in early release. Editor’s note: this is an excerpt from “Graphing Data with R: An Introduction,” by John Jay Hilfiger.

Graphs are useful both for exploration and for presentation. Exploration is the process of analyzing the data and finding relationships and patterns. Presentation of your findings is making your case to others who have not studied the data as intensively as you have yourself. While one is exploring the data, graphs can be stark, lean, and somewhat unattractive. The data analyst, who knows the data and is getting to know it better with each graph made, does not need all the titles, labels, reference details, and colors that someone sitting through a presentation might expect, and might, indeed, find necessary. Furthermore, adding all this stuff just slows down the analyst. Also, some graphs will prove to be dead ends, or just not very interesting. Consequently, many graphs may be discarded during the discovery journey.

As the process of exploration continues, adding some details may make relationships a little clearer. As the analyst gets closer to presentation and/or publication, the graphs become more detailed and prettier. There probably will have been many plain graphs in the process of analysis and relatively few beautiful graphs that appear in the final report. Read more…
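The book itself works in R; purely to illustrate the exploration-versus-presentation contrast described above, here is a minimal sketch of the same idea in Python with matplotlib (the data and styling choices are invented for the example):

```python
# Illustration only: a bare exploratory plot vs. a labeled presentation plot.
import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(50)                 # invented data for the sketch
y = 2 * x + np.random.randn(50) * 0.2

# Exploration: quick and unadorned, just enough to see the relationship.
plt.scatter(x, y)
plt.show()

# Presentation: title, axis labels, and styling for an audience.
plt.scatter(x, y, color="steelblue")
plt.title("Relationship between x and y")
plt.xlabel("x (predictor)")
plt.ylabel("y (response)")
plt.grid(True, alpha=0.3)
plt.show()
```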

Apache Spark: Powering applications on-premise and in the cloud

The O'Reilly Data Show Podcast: Patrick Wendell on the state of the Spark ecosystem.

As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013, and since then Spark adoption has exploded.

With Spark Summit SF right around the corner, I recently sat down with Patrick Wendell, release manager of Apache Spark and co-founder of Databricks, for this episode of the O’Reilly Data Show Podcast. (Full disclosure: I’m an advisor to Databricks). We talked about how he came to join the UC Berkeley AMPLab, the current state of Spark ecosystem components, Spark’s future roadmap, and interesting applications built on top of Spark.

User-driven from inception

From the beginning, Spark struck me as different from other academic research projects (many of which “wither away” when grad students leave). The AMPLab team behind Spark spoke at local SF Bay Area meetups, hosted two-day events (AMP Camp), and worked hard to help early users. That mindset continues to this day. Wendell explained:

We were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user-driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. We never really built it for our own use — it was not like we were at a company solving a problem and then we decided, “hey let’s let other people use this code for free”. … From the beginning, we were focused on empowering other people and building platforms for other developers, so I always thought that was quite unique about Spark.

Read more…

Validating data models with Kafka-based pipelines

A case for back-end A/B testing.

Start the O’Reilly “Introduction to Apache Kafka” training video for free. In this video, Gwen Shapira shows developers and administrators how to integrate Kafka into a data processing pipeline.

A/B testing is a popular method of using business intelligence data to assess possible changes to websites. In the past, when a business wanted to update its website in an attempt to drive more sales, decisions on the specific changes to make were driven by guesses, intuition, focus groups, and, ultimately, which executive yelled louder. These days, the data-driven solution is to set up multiple copies of the website, direct users randomly to the different variations, and measure which design improves sales the most. There are a lot of details to get right, but this is the gist of things.

When it comes to back-end systems, however, we are still living in the stone age. Suppose your business has grown significantly and you notice that your existing MySQL database is becoming less responsive as the load increases. If you consider moving to a NoSQL system, you need to decide which solution to pick, and there are a lot of options: Cassandra, MongoDB, Couchbase, or even Hadoop. There are also many possible data models: normalized, wide tables, narrow tables, nested data structures, etc.

A/B testing multiple data stores and data models in parallel

It is surprising how often a company will pick a solution based on intuition, or even on which architect yelled louder. Rather than being based on facts and numbers about capacity, scale, throughput, and data-processing patterns, back-end architecture decisions are made with fuzzy reasoning. In that scenario, what usually happens is that a data store and a data model are somehow chosen, and the entire development team dives into a six-month project to move the whole back-end system to the new thing. This project will inevitably take 12 months, and about nine months in, everyone will suspect that this was a bad idea, but by then it’s way too late to do anything about it. Read more…
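The post’s underlying idea is that a Kafka topic lets you replay the same event stream into several candidate back ends at once and compare them under real load, instead of betting everything on one migration. Here is a minimal sketch of that fan-out, assuming the kafka-python client, an “orders” topic, and placeholder write_to_cassandra / write_to_mongodb helpers:

```python
# Sketch: one event stream, separate consumer groups, multiple candidate stores.
# Assumes the kafka-python package; the write_to_* helpers are placeholders.
import json
from kafka import KafkaConsumer

def consume_into(store_writer, group_id):
    """Replay the shared "orders" topic into one candidate back end."""
    consumer = KafkaConsumer(
        "orders",
        group_id=group_id,                      # each group gets a full copy of the stream
        bootstrap_servers=["localhost:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        store_writer(message.value)             # write the event into this candidate store

# Run one of these per candidate (e.g., in separate processes) and compare
# throughput, latency, and query ergonomics before committing to a migration:
#   consume_into(write_to_cassandra, group_id="candidate-cassandra")
#   consume_into(write_to_mongodb, group_id="candidate-mongodb")
```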

Announcing Cassandra certification

A new partnership between O’Reilly and DataStax offers certification and training in Cassandra.

I am pleased to announce a joint program between O’Reilly and DataStax to certify Cassandra developers. This program complements our developer certification for Apache Spark and — just as in the case of Databricks and Spark — we are excited to be working with the leading commercial company behind Cassandra. DataStax has done a tremendous job growing and nurturing the Cassandra community, user base, and technology.

Once the certification program is ready, developers can take the exam online, in designated test centers, and at select training courses. O’Reilly will also be developing books, training days, and videos targeted at developers and companies interested in the Cassandra distributed storage system.

Cassandra is a popular component used for building big data and real-time analytic platforms. Its ability to comfortably scale to clusters with thousands of nodes makes it a popular option for solutions that need to ingest and make sense of large amounts of time series and event data. As noted in an earlier post, real-time event data are at the heart of one of the trends we’re closely following: the convergence of cheap sensors, fast networks, and distributed computation. Read more…

How Shazam predicts pop hits

The O'Reilly Radar Podcast: Cait O'Riordan on Shazam's predictive analytics, and Francine Bennett on using data for evil.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this week’s Radar Podcast, I chat with Cait O’Riordan, VP of product, music and platforms at Shazam. She talks about the current state of predictive analytics and how Shazam is able to predict the success of a song, often in the first few hours after its release. We also talk about the Internet of Things and how products like the Apple Watch affect Shazam’s product life cycles as well as the behaviors of their users.

Predicting the next pop hit

Shazam has more than 100 million monthly active users, and its users Shazam more than 20 million times per day. This, of course, generates a ton of data that Shazam uses in myriad ways, not the least of which is to predict the success of a song. O’Riordan explained how they approach their user data and how they’re able to accurately predict pop hits (and misses):

What’s interesting from a data perspective is when someone takes their phone out of their pocket, unlocks it, finds the Shazam app, and hits the big blue button, they’re not just saying, “I want to know the name of this song.” They’re saying, “I like this song sufficiently to do that.” There’s an amount of effort there that implies some level of liking. That’s really interesting, because you combine that really interesting intention on the part of the user plus the massive data set, you can cut that in lots and lots of different ways. We use it for lots of different things.

At the most basic level, we’re looking at what songs are going to be popular. We can predict, with a relative amount of accuracy, what will hit the Top 100 Billboard Chart 33 days out, roughly. We can look at that in lots of different territories as well. We can also look and see, in the first few hours of a track, whether a big track is going to go on to be successful. We can look at which particular part of the track is encouraging people to Shazam and what makes a popular hit. We know that, for example, for a big pop hit, you’ve got about 10 seconds to convince somebody to find the Shazam app and press that button. There are lots of different ways that we can look at that data, going right into the details of a particular song, zooming out worldwide, or looking in different territories just due to that big worldwide and very engaged audience.

Read more…

Data science makes an impact on Wall Street

The O'Reilly Data Show Podcast: Gary Kazantsev on how big data and data science are making a difference in finance.

Learn more about Next:Money, O’Reilly’s conference focused on the fundamental transformation taking place in the finance industry.

Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as a quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time series). I quickly found that it was difficult to find and sustain profitable trading strategies that leveraged data sources everyone else in the industry examined exhaustively. In the early-to-mid 2000s, the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.

During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street. Read more…

Cultivating a psychological sense of community

A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.

Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.

Dr. Renetta Garrison Tull is a recognized expert on women and minorities in education and on the STEM gender gap — both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.

From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia and as preparation for the workforce beyond graduation.

Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…

Connected play

How data-driven tech toys are — and aren’t — changing the nature of play.

Sign up to be notified when the new free report Data, Technology & The Future of Play becomes available. This post is part of a series investigating the future of play that will culminate in a full report.

When I was in first grade, I cut the fur pom-poms off of my dad’s mukluks. (If you didn’t grow up in the Canadian North and you don’t know what mukluks are, here’s a picture.) My dad’s mukluks were specially made for him, so he was pretty sore. I cut the pom-poms off because I had just seen The Trouble With Tribbles at a friend’s house, and I desperately wanted some Tribbles. I kept them in a shoebox, named them, brought them to show-and-tell, and pretended they were real.

It’s exactly this kind of imaginative play that a lot of parents are afraid is being lost as toys become smarter. And in exchange for what? There isn’t any real evidence yet that smart toys genuinely make kids smarter.

Low-fi toys. Public domain image, “Boys playing with hoops on Chesnut Street, Toronto, Canada.” Source: Wikimedia Commons.

I tell this story not to emphasize what a terrible vandal I was as a child; rather, I tell it to show how irrepressible children’s imaginations are, and to explain why technological toys are not going to kill that imagination. Today’s “smart” toys are no different from dolls and blocks, or, in my case, a pair of mukluks. By nature, all toys have affordances that imply how they should be used. The more complex the toy, the more focused the affordances are. Consider a stick: it can be a weapon, a mode of transport, or a magic wand. But an app that is designed to do a thing guides users toward that use case, just as a door handle suggests that you should grasp and turn it. Design has opinions. Read more…

Topic modeling for the newbie

Learning the fundamentals of natural language processing.

Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Allocation (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

  • There is some fixed number K of topics.
  • There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
  • There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
  • Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…
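The book goes on to build LDA from scratch; as a small, hedged illustration of the generative story in the bullets above, here is a Python sketch that produces a toy collection of documents together with the corresponding document_topics (the vocabularies, topic distributions, and document lengths are all invented for the example):

```python
# Sketch of the LDA generative story described above; all numbers are invented.
import random

K = 2  # fixed number of topics
# P(word | topic): one distribution over words per topic.
topic_word = [
    {"python": 0.5, "pandas": 0.4, "r": 0.1},            # topic 0
    {"regression": 0.6, "probability": 0.3, "r": 0.1},   # topic 1
]

def generate_document(doc_topic_mixture, length=5):
    """Generate one document (a list of words) plus its per-word topic assignments."""
    words, topics = [], []
    for _ in range(length):
        # First randomly pick a topic from the document's mixture of topics...
        k = random.choices(range(K), weights=doc_topic_mixture)[0]
        # ...then randomly pick a word from that topic's distribution over words.
        vocab, probs = zip(*topic_word[k].items())
        words.append(random.choices(vocab, weights=probs)[0])
        topics.append(k)
    return words, topics

# A collection of documents and the corresponding document_topics, where each
# assigned topic is a number between 0 and K - 1, as in the excerpt.
documents, document_topics = zip(*(generate_document(m) for m in ([0.9, 0.1], [0.2, 0.8])))
```

Fitting LDA inverts this process: given only the documents, it infers plausible topic-word distributions and per-document topic mixtures.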
