ENTRIES TAGGED "data analysis"
Using data science to predict the Oscars
Sophisticated algorithms are not going to write the perfect script or crawl YouTube to find the next Justin Bieber (that last one I think we can all be thankful for!). But a model can predict the probability of a nominee winning the Oscar, and recently our model has Argo overtaking Lincoln as the likely winner of Best Picture. Every day on FarsiteForecast.com we’ve been describing applications of data science for the media and entertainment industry, illustrating how our models work, and updating the likely winners based on the outcomes of the Awards Season leading up to the Oscars. Just as predictive analytics provides valuable decision-making tools in sectors from retail to healthcare to advocacy, data science can also empower smarter decisions for entertainment executives, which led us to launch the Oscar forecasting project. While the potential for data science to impact any organization is as unique as each company itself, we thought we’d offer a few use cases that have wide application for media and entertainment organizations.
A deconstructed web analytics report shows what the dashboard missed.
We can all agree that in 2013 web analytics is still a nightmare, right?
The last few years have brought about an enormous expansion in the top of the web analytics information overload funnel, and today I can discover just about any aspect of my web traffic that piques my curiosity.
I know how much traffic I’m getting, who told them to come here, how they got here, how long they’re staying, what they’re looking at, what they’re using to look at it, where they’re from, and just about anything else I want to know about them. If I don’t like what I’m looking at, I can customize everything from my dashboard to reports to parameters within those reports.
What none of this tells me is how I can be more successful at turning the words I put on the Internet into dollars in my pocket.
Now, I know what you’re thinking: “It’s all there! More information than you could ever figure out what to do with.”
The problem with that is that it’s all there. It’s more information than I could ever figure out what to do with.
In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.
At our successful Strata + Hadoop World conference (we even managed to avoid Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have their own takeaways from the conference.
1. In-memory data storage for faster queries and visualization
Interactive or real-time query for large datasets is seen as a key to analyst productivity (real-time as in query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). The existing large-scale data management schemes aren’t fast enough and reduce analytical effectiveness when users can’t explore the data by quickly iterating through various query schemes. We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, and new tools, like Cloudera’s Impala (announced at the conference), UC Berkeley AMPLab’s Spark, SAP HANA, and Platfora.
We saw this coming a few years ago when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, when trying to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured data set than to be able to quickly run off a series of iterative queries, each informed by the last.
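The iterative style described above, where each query is informed by the last, can be tried at small scale with any in-memory store. What follows is only a minimal sketch using Python's built-in sqlite3 module with an in-memory database. It does not involve any of the systems named above, and the table name, columns, and rows are made up for illustration.

```python
import sqlite3

# Hypothetical page-view rows (illustrative only): (country, page, seconds_on_page)
ROWS = [
    ("US", "/home", 12),
    ("US", "/pricing", 45),
    ("DE", "/home", 8),
    ("DE", "/pricing", 60),
    ("US", "/home", 20),
]


def load(rows):
    """Load rows into a SQLite database that lives entirely in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE views (country TEXT, page TEXT, seconds INTEGER)")
    conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)
    return conn


conn = load(ROWS)

# Query 1: which page gets the most views?
by_page = conn.execute(
    "SELECT page, COUNT(*) AS n FROM views GROUP BY page ORDER BY n DESC, page"
).fetchall()
top_page = by_page[0][0]

# Query 2, informed by query 1: how does time on that page vary by country?
by_country = conn.execute(
    "SELECT country, AVG(seconds) FROM views WHERE page = ? "
    "GROUP BY country ORDER BY country",
    (top_page,),
).fetchall()
```

Because nothing touches disk, each follow-up query returns quickly enough to keep the analyst in the flow of exploration, which is the point the paragraph above makes about sandboxes.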
O'Reilly's annual data anthology explores the maturation of big data and data science.
In the first edition of our free Big Data Now anthology, the O’Reilly team tracked the birth and early development of data tools and data science. Now, with the second edition, we’re seeing what happens when big data grows up: how it’s being applied, where it’s playing a role, and the consequences — good and bad alike — of data’s ascendance.
We’ve organized the 2012 edition of Big Data Now into five areas:
Getting Up to Speed With Big Data — Essential information on the structures and definitions of big data.
Big Data Tools, Techniques, and Strategies — Expert guidance for turning big data theories into big data products.
The Application of Big Data — Examples of big data in action, including a look at the downside of data.
What to Watch for in Big Data — Thoughts on how big data will evolve and the role it will play across industries and domains.
Big Data and Health Care — A special section exploring the possibilities that arise when data and health care come together.
Data analysis shows the structure of a network can separate true influencers from fake accounts.
There has been a lot of discussion recently about the effect fake Twitter accounts have on brands trying to keep track of social media engagement. A recent tweet spam attack offers an instructive example.
On the morning of October 1, the delegates attending the Strata Conference in London started to notice that a considerable number of spam tweets were being sent using the #strataconf hashtag. An analysis using a tool developed by Bloom Agency, with data from DataSift, sheds light on the spam attack directed at the conference.
The following diagram shows a snapshot of the Twitter conversation after a few tweets had been received containing the #strataconf hashtag. Each red or blue line represents a connection between two Twitter accounts and shows how information flowed as a result of the tweet being sent. By 11 a.m., individual communities had started to emerge that were talking to each other about the conference, and these can clearly be seen in the diagram.
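The excerpt does not describe how the Bloom Agency tool works, so the following is only a rough sketch of the general idea: treat accounts as nodes, treat interactions as connections, and group accounts that are linked to each other. The account names and edges are hypothetical, and connected components stand in here for whatever community detection the real analysis used.

```python
from collections import defaultdict

# Hypothetical interactions (illustrative only): each pair is a mention or retweet
EDGES = [
    ("alice", "bob"), ("bob", "carol"),  # one conversation
    ("dave", "erin"),                    # another conversation
]
# Accounts that used the hashtag, including ones nobody interacted with
ACCOUNTS = ["alice", "bob", "carol", "dave", "erin", "spam1", "spam2"]


def communities(accounts, edges):
    """Return the connected components of the undirected interaction graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, result = set(), []
    for start in accounts:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first walk over everything reachable from start
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append(component)
    return result


groups = communities(ACCOUNTS, EDGES)
# Accounts with no connections at all sit outside every conversation cluster
isolated = sorted(next(iter(g)) for g in groups if len(g) == 1)
```

In this toy data the two conversations form separate groups and the unconnected accounts stand apart. Whether an isolated account is actually spam is a separate judgment that this sketch does not make.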
Inside core features of specialized data analysis languages.
Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).
MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.
This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.
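To make the adjacency-matrix idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank in plain Python. It is a simplified illustration, not Google's production algorithm. The damping factor of 0.85, the iteration count, and the three-node graph are assumptions chosen for the example.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on an adjacency matrix.

    adj[i][j] == 1 means node i links to node j.
    """
    n = len(adj)
    out_degree = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Every node starts each round with the same small baseline share
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Dangling node: spread its rank evenly over all nodes
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                # Split this node's rank equally among the nodes it links to
                share = damping * rank[i] / out_degree[i]
                for j in range(n):
                    if adj[i][j]:
                        new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
ranks = pagerank(adj)
```

Each round is just a matrix-vector product plus a constant, which is why a matrix-oriented language handles it naturally. In this graph node 2 ends up with the highest rank because both other nodes link to it.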
Quickly perform and interpret the results of routine Small Data analysis
With so much focus on Big Data, the needs of many analysts who work with Small Data tend to get ignored. The default tool for many of these users remains spreadsheets (1) and/or statistical packages, which come with a lot of features and options. However, many analysts need only a very small subset of what these tools have to offer.
Enter Statwing, a software-as-a-service provider for routine statistical analysis. While the tool is still in the early stages, it can already do many basic “data analysis” tasks.
Consider the following example of a pivot table constructed in Excel: this required 8 mouse-clicks, if you do everything perfectly, and about 5 decisions (what variables to include, what metric to use, …)
The same task in Statwing required 4 mouse-clicks and 0 decisions! Plus it comes with visuals:
The lack of clutter and the addition of a simple “headline” (“Female tends to have much higher values for satisfaction than Male”) make the result much easier to interpret. The advanced tab contains detailed statistical analysis (in this case the p-value, counts, values). Many users get confused by the output/results produced by traditional statistical software. Let’s face it, many analysts have had little training in statistics. I welcome a tool that produces readily interpretable results.
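For readers who want to see what such a pivot computes, here is a minimal sketch in plain Python: group the responses, report the count and mean per group, and generate a one-line headline. The survey rows are made up for illustration, the function and variable names are hypothetical, and this is not Statwing's method. In particular, it does not compute a p-value.

```python
from collections import defaultdict
from statistics import mean

# Made-up survey responses (illustrative only): (gender, satisfaction on a 1-5 scale)
RESPONSES = [
    ("Female", 5), ("Female", 4), ("Female", 5),
    ("Male", 3), ("Male", 2), ("Male", 4),
]


def pivot_mean(rows):
    """Group rows by the first field and return {group: (count, mean)}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: (len(v), mean(v)) for g, v in groups.items()}


summary = pivot_mean(RESPONSES)

# This unpacking assumes exactly two groups, as in the example data
best, worst = sorted(summary, key=lambda g: summary[g][1], reverse=True)
headline = f"{best} tends to have higher values for satisfaction than {worst}"
```

A real tool would also test whether the difference is statistically meaningful before printing a headline. The sketch only shows the grouping and summarizing step.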
The company hopes to replicate the above example across a wide variety of routine data analysis tasks. Their initial focus is on tools for (consumer) survey analysis, a potentially huge market given that online companies have made surveys so much easier to conduct. Users of Statwing pay a small monthly subscription, making it cheaper than most (2) statistical packages, and its intuitive UI lets analysts get their tasks done quickly. More importantly, Statwing may nurture aspiring data scientists in your organization.
(1) As this recent Strata presentation points out: Spreadsheets are the glue that keeps many organizations together.
(2) Open source tools like OpenOffice, R and Octave are free. So is the use of Google spreadsheets.
Kris Hammond on replacing rows and columns with sentences and paragraphs.
Imagine a future where clear language supplants spreadsheets. In a recent interview, Narrative Science CTO Kris Hammond explained how we might get there.
An astonishing connection between web ops and medical care.
Machine learning and access to huge amounts of data allowed IBM to make an important discovery about premature infants. If web operations teams could capture everything — network data, environmental data, I/O subsystem data, etc. — what would they find out?
A look at lesser-known ways to extract insight from data.
Visualizations are one way to make sense of data, but they aren't the only way. Robbie Allen reveals six additional outputs that help users derive meaningful insights from data.