# "Big Data" entries

## Data science makes an impact on Wall Street

### The O'Reilly Data Show Podcast: Gary Kazantsev on how big data and data science are making a difference in finance.

Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as a quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time series). I quickly found that it was difficult to find and sustain profitable trading strategies that leveraged data sources everyone else in the industry had examined exhaustively. In the early-to-mid 2000s, the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.

During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street. Read more…

## Cultivating a psychological sense of community

### A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.

Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.

Dr. Renetta Garrison Tull is a recognized expert in women and minorities in education, and in the STEM gender gap — both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.

From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia and as preparation for the workforce beyond graduation.

Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…

## Connected play

### How data-driven tech toys are — and aren’t — changing the nature of play.

Sign up to be notified when the new free report Data, Technology & The Future of Play becomes available. This post is part of a series investigating the future of play that will culminate in a full report.

When I was in first grade, I cut the fur pom-poms off of my dad’s mukluks. (If you didn’t grow up in the Canadian North and you don’t know what mukluks are, here’s a picture.) My dad’s mukluks were specially made for him, so he was pretty sore. I cut the pom-poms off because I had just seen The Trouble With Tribbles at a friend’s house, and I desperately wanted some Tribbles. I kept them in a shoebox, named them, brought them to show-and-tell, and pretended they were real.

It’s exactly this kind of imaginative play that a lot of parents are afraid is being lost as toys become smarter. And in exchange for what? There isn’t any real evidence yet that smart toys genuinely make kids smarter.

*Low-fi toys. Public domain image, “Boys playing with hoops on Chesnut Street, Toronto, Canada.” Source: Wikimedia Commons.*

I tell this story not to emphasize what a terrible vandal I was as a child; rather, I tell it to show how irrepressible children’s imaginations are, and to explain why technological toys are not going to kill that imagination. Today’s “smart” toys are no different from dolls and blocks, or in my case, a pair of mukluks. By nature, all toys have affordances that imply how they should be used. The more complex the toy, the more focused the affordances. Consider a stick: it can be a weapon, a mode of transport, or a magic wand. But an app designed to do one thing guides users toward that use case, just as a door handle suggests that you should grasp and turn it. Design has opinions. Read more…

## Topic modeling for the newbie

### Learning the fundamentals of natural language processing.

Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Allocation (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

• There is some fixed number K of topics.
• There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
• There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
• Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…
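The generative story above can be inverted to recover topics: assign every word a random topic, then repeatedly resample each word’s topic conditioned on all the other assignments (collapsed Gibbs sampling). Here is a minimal sketch in plain Python; the toy `documents`, the smoothing constants `alpha` and `beta`, and the helper names are illustrative assumptions, not code from the book:

```python
import random
from collections import Counter

random.seed(0)

K = 2  # fixed number of topics

documents = [
    ["Hadoop", "Spark", "HBase", "Spark"],
    ["regression", "statistics", "probability"],
    ["Hadoop", "MapReduce", "Spark"],
    ["statistics", "regression", "probability", "regression"],
]

# start by assigning every word a random topic in 0..K-1
document_topics = [[random.randrange(K) for _ in doc] for doc in documents]

# counts the sampler maintains
doc_topic_counts = [Counter(topics) for topics in document_topics]
topic_word_counts = [Counter() for _ in range(K)]
topic_counts = [0] * K
for doc, topics in zip(documents, document_topics):
    for word, topic in zip(doc, topics):
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

vocab_size = len({w for doc in documents for w in doc})
alpha, beta = 0.1, 0.1  # Dirichlet smoothing hyperparameters (assumed values)

def topic_weight(d, word, k):
    """Unnormalized P(word | topic k) * P(topic k | document d)."""
    p_word = (topic_word_counts[k][word] + beta) / (topic_counts[k] + vocab_size * beta)
    p_topic = (doc_topic_counts[d][k] + alpha) / (len(documents[d]) - 1 + K * alpha)
    return p_word * p_topic

def sample_from(weights):
    """Draw an index with probability proportional to its weight."""
    r = random.uniform(0, sum(weights))
    for k, w in enumerate(weights):
        r -= w
        if r <= 0:
            return k
    return len(weights) - 1  # guard against floating-point drift

for _ in range(500):  # Gibbs sampling sweeps
    for d, (doc, topics) in enumerate(zip(documents, document_topics)):
        for i, (word, topic) in enumerate(zip(doc, topics)):
            # remove this word's current assignment from the counts...
            doc_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            # ...then resample its topic given everything else
            new_topic = sample_from([topic_weight(d, word, k) for k in range(K)])
            document_topics[d][i] = new_topic
            doc_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1

for k in range(K):
    print(k, topic_word_counts[k].most_common(3))
```

After a few hundred sweeps, the per-topic word counts typically separate the “big data” words from the “statistics” words, and `document_topics` gives each user’s mixture of topics.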

## Startups suggest big data is moving to the clouds

### A look at the winners from a showcase of some of the most innovative big data startups.

At Strata + Hadoop World in London last week, we hosted a showcase of some of the most innovative big data startups. Our judges narrowed the field to 10 finalists, from whom they — and attendees — picked three winners and an audience choice.

Underlying many of these companies was the move from software to services. As industries mature, we see a shift from custom consulting to software and, ultimately, to utilities — something Simon Wardley underscored in his Data Driven Business Day talk, and which was reinforced by the announcement of tools like Google’s Bigtable service offering.

This trend was front and center at the showcase:

• Winner Modgen, for example, generates recommendations and predictions, offering machine learning as a cloud-based service.
• While second-place Brytlyt offers their high-performance database as an on-premise product, their horizontally scaled-out architecture really shines when the infrastructure is elastic and cloud based.
• Finally, third-place OpenSensors’ real-time IoT message platform scales to millions of messages a second, letting anyone spin up a network of connected devices.

Ultimately, big data gives clouds something to do. Distributed sensors need a widely available, connected repository into which to report; databases need to grow and shrink with demand; and predictive models can be tuned better when they learn from many data sets. Read more…

## The tensor renaissance in data science

### The O'Reilly Data Show Podcast: Anima Anandkumar on tensor decomposition techniques for machine learning.

After sitting in on UC Irvine professor Anima Anandkumar’s presentation at Strata + Hadoop World 2015 in San Jose, I wrote a post urging the data community to build tensor decomposition libraries for data science. The feedback I’ve gotten from readers has been extremely positive. During the latest episode of the O’Reilly Data Show Podcast, I sat down with Anandkumar to talk about tensor decomposition, machine learning, and the data science program at UC Irvine.

## Modeling higher-order relationships

The natural question is: why use tensors when (large) matrices can already be challenging to work with? Proponents are quick to point out that tensors can model more complex relationships. Anandkumar explains:

Tensors are higher-order generalizations of matrices. While matrices are two-dimensional arrays consisting of rows and columns, tensors are multi-dimensional arrays. … For instance, you can picture a tensor as a three-dimensional cube. In fact, I have here on my desk a Rubik’s Cube, and sometimes I use it to get a better understanding when I think about tensors.  … One of the biggest uses of tensors is for representing higher-order relationships. … If you want to represent only pair-wise relationships, say co-occurrence of every pair of words in a set of documents, then a matrix suffices. On the other hand, if you want to learn the probability of a range of triplets of words, then you need a tensor to record such relationships. These kinds of higher-order relationships are not only important for text, but also, say, for social network analysis. You want to learn not only about who is immediate friends with whom, but, say, who is friends of friends of friends of someone, and so on. Tensors, as a whole, can represent much richer data structures than matrices.
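Anandkumar’s point can be made concrete in a few lines of NumPy (the toy corpus below is an illustrative assumption, not data from the interview): counting word pairs fits in a matrix, but counting word triples requires a third-order tensor.

```python
import numpy as np

# toy corpus of tokenized documents
docs = [["data", "science", "tensors"],
        ["tensors", "data", "matrices"],
        ["data", "science", "matrices"]]

vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# pair-wise co-occurrence: a V x V matrix suffices
pair_counts = np.zeros((V, V))
# triple co-occurrence: a third-order V x V x V tensor is needed
triple_counts = np.zeros((V, V, V))

for doc in docs:
    ids = [index[w] for w in doc]
    for i in ids:
        for j in ids:
            if i != j:
                pair_counts[i, j] += 1
    for i in ids:
        for j in ids:
            for k in ids:
                if len({i, j, k}) == 3:  # three distinct words
                    triple_counts[i, j, k] += 1

print(pair_counts.shape, triple_counts.shape)  # (V, V) vs. (V, V, V)
```

Each extra order of relationship adds a dimension to the array, which is why tensor decomposition methods, rather than matrix factorizations, are needed to exploit these richer structures.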

## Four short links: 7 May 2015

### Predicting Hits, Pricing Strategies, Quis Calculiet Shifty Custodes, and Docker Security

1. Predicting a Billboard Music Hit (YouTube) — Shazam VP of Music and Platforms at Strata London: “With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.”
2. Psychological Pricing Strategies — a handy wrap-up of evil^wuseful pricing strategies to know.
3. What Two Programmers Have Revealed So Far About Seattle Police Officers Who Are Still in Uniform: through their shrewd use of Washington’s Public Records Act, the two Seattle residents are now the closest thing the city has to a civilian police-oversight board. In the last year and a half, they have acquired hundreds of reports, videos, and 911 calls related to the Seattle Police Department’s internal investigations of officer misconduct between 2010 and 2013. And though they have only combed through a small portion of the data, they say they have found several instances of officers appearing to lie, use racist language, and use excessive force—with no consequences. In fact, they believe that the Office of Professional Accountability (OPA) has systematically “run interference” for cops. In the aforementioned cases of alleged officer misconduct, all of the involved officers were exonerated and still remain on the force.
4. Understanding Docker Security and Best Practices — explanation of container security and a benchmark for security practices, though email addresses will need to be surrendered in exchange for the good info.

## The unwelcome guest: Why VMs aren’t the solution for next-gen applications

### Scale-out applications need scaled-in virtualization.

Data center operating systems are emerging as a first-class category of distributed system software. Hadoop, for example, is evolving from a MapReduce framework into YARN, a generic platform for scale-out applications.

To enable a rich ecosystem of diverse applications to coexist on these platforms, providing adequate isolation is crucial. The isolation mechanism must enforce resource limits, decouple software dependencies among applications and the host, provide security and privacy, confine failures, etc. Containers offer a simple and elegant solution to the problem. However, a question that comes up frequently is: Why not virtual machines (VMs)? After all, these systems face a number of the same challenges that have been solved by virtualization for traditional enterprise applications.

“All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections” — David Wheeler

## On the evolution of machine learning

### From linear models to neural networks: an interview with Reza Zadeh.

Get notified when our free report, “Future of Machine Intelligence: Perspectives from Leading Practitioners,” is available for download. The following interview is one of many that will be included in the report.

As part of our ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Reza Zadeh. Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics.

## Key Takeaways

• Neural networks have made a comeback and are playing a growing role in new approaches to machine learning.
• The greatest successes are being achieved via a supervised approach leveraging established algorithms.
• Spark is an especially well-suited environment for distributed machine learning.

Reza Zadeh: At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs. Read more…

## Four short links: 30 April 2015

### Managing Complex Data Projects, Graphical Linear Algebra, Consistent Hashing, and NoTCP Manifesto

1. More Tools for Managing and Reproducing Complex Data Projects (Ben Lorica) — As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users).
2. Graphical Linear Algebra — or “Graphical The-Subject-That-Kicked-Nat’s-Butt” as I read it.
3. Consistent Hashing: A Guide and Go Implementation — easy-to-follow article (and source).
4. NoTCP Manifesto — a nice summary of the reasons to build custom protocols over UDP, masquerading as church-nailed heresy. Today’s heresy is just the larval stage of tomorrow’s constricting orthodoxy.