"data scientists" entries

2015 Data Science Salary Survey

Revealing patterns in tools, tasks, and compensation through clustering and linear models.

Download the free “2015 Data Science Salary Survey” report to learn about tools, trends, and what pays (and what doesn’t) for data professionals.

Data scientists are constantly looking outward, tapping into and extracting information from all manner of data in ways hardly imaginable not long ago. Much of the change is technological — data collection has multiplied, as have our means of processing it — but an important cultural shift has played a part, too, evidenced by the desire of organizations to become “data-driven” and the wide availability of public APIs.

But how much do we look inward, at ourselves? The variety of data roles, both in subject and method, means that even those of us who have a strong grasp of what it means to be a data scientist in a particular domain or sub-field may not have a complete view of the data space as a whole. Just as data we process and analyze for our organizations can be used to decide business actions, data about data scientists can help inform our career choices.

That’s where we come in. O’Reilly Media has been conducting an annual survey of data professionals, asking questions primarily about tools, tasks, and salary — and we are now releasing the third installment of the associated report, the 2015 Data Science Salary Survey. The 2015 edition features a complete graphic redesign of the report and our findings. In addition to estimating salary differences based on demographics and tool usage, we have taken a more detailed look at tasks — how data professionals spend their workdays — and titles. Read more…
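
As a rough illustration of the kind of modeling the report describes (a linear model relating respondent attributes to salary), here is a minimal sketch in Python with scikit-learn; the feature names and figures below are invented, not survey data:

    # A minimal sketch, assuming made-up features and salaries;
    # not the survey's actual code or data.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "years_experience": [2, 5, 8, 3, 10, 6],
        "uses_spark":       [0, 1, 1, 0, 1, 0],  # hypothetical tool-usage flag
        "salary":           [80000, 115000, 135000, 90000, 150000, 110000],
    })

    model = LinearRegression()
    model.fit(df[["years_experience", "uses_spark"]], df["salary"])

    # Each coefficient reads as an estimated salary difference per unit
    # of the corresponding feature, holding the others fixed.
    print(dict(zip(["years_experience", "uses_spark"], model.coef_)))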

The security infusion

Building access policies into data stores.

Hadoop jobs reflect the same security demands as other programming tasks. Corporate and regulatory requirements create complex rules concerning who has access to different fields in data sets; sensitive fields must be protected from internal users as well as external threats; and multiple applications running on the same data must treat different users with different access rights. The modern world of virtualization and containers adds security at the software level, but tears away the hardware protection formerly offered by network segments, firewalls, and DMZs.
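
To make field-level access rules concrete, here is a minimal sketch, with invented roles and field names; it is a generic illustration, not any particular product's API:

    # A minimal sketch of field-level access control: each role may
    # read only an allow-listed subset of fields. Everything here is
    # hypothetical; real deployments enforce this in the data store.
    ALLOWED_FIELDS = {
        "analyst": {"user_id", "region", "purchase_total"},
        "auditor": {"user_id", "region", "purchase_total", "ssn"},
    }

    def redact(record, role):
        """Return only the fields the given role is allowed to see."""
        allowed = ALLOWED_FIELDS.get(role, set())
        return {k: v for k, v in record.items() if k in allowed}

    row = {"user_id": 42, "region": "EMEA", "purchase_total": 99.5, "ssn": "123-45-6789"}
    print(redact(row, "analyst"))  # ssn is filtered out
    print(redact(row, "auditor"))  # ssn is visible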

Furthermore, security involves more than saying yes or no to a user running a Hadoop job. There are rules for archiving or backing up data on the one hand, and expiring or deleting it on the other. Audit logs are a must, both to track down possible breaches and to conform to regulation.

Best practices for managing data in these complex, sensitive environments implement the well-known principle of security by design. According to this principle, you can’t design a database or application in a totally open manner and then layer security on top if you expect it to be robust. Instead, security must be infused throughout the system and built in from the start. Defense in depth is a related principle that urges the use of many layers of security, so that an intruder breaking through one layer may be frustrated by the next. Read more…

How advanced analytics are impacting sports

The expanding role of data analytics in a trillion-dollar industry.

Download our new free report “Data Analytics in Sports: How Playing with Data Transforms the Game,” by Janine Barlow, to learn how advanced predictive analytics are impacting the world of sports.

Sports are the perfect playing field on which data scientists can play their game — there are finite structures and distinct goals. Many of the components in sports break down numerically — e.g., number of players; length of periods; and, taking a broader view, how much each player is paid.

This is why sports and data have gone hand-in-hand since the very beginning of the industry. What, after all, is baseball without baseball cards?

In a new O’Reilly report, Data Analytics in Sports: How Playing with Data Transforms the Game, we explore the role of data analytics and new technology in the sports industry. Through a series of interviews with experts at the intersection of data and sports, we break down some of the industry’s most prominent advances in the use of data analytics and explain what these advances mean for players, executives, and fans.

Read more…

Old-school DRM and new-school analytics

Piracy isn’t the threat; it’s centuries old. Music Science is the game changer.

Download our new free report “Music Science: How Data and Digital Content are Changing Music,” by Alistair Croll, to learn more about music, data, and music science.

In researching how data is changing the music industry, I came across dozens of entertaining anecdotes. One of the recurring themes was music piracy. As I wrote in my previous post on music science, industry incumbents think of piracy as a relatively new phenomenon — as one executive told me, “vinyl was great DRM.”

But the fight between protecting and copying content has gone on for a long time, and every new medium for music distribution has left someone feeling robbed. One of the first known cases of copy protection — and illegal copying — involved Mozart himself.

As a composer, Mozart saw his music spread far and wide. But he was also a performer, and wanted to be able to command a premium for playing in front of audiences. One way he ensured continued demand was through “flourishes,” small additions to songs that weren’t recorded in the written music. While Mozart’s flourishes are lost to history, researchers have attempted to understand how his music might once have been played. This video shows classical pianist Christina Kobb demonstrating a 19th-century technique.

Read more…

Apache Drill: Tracking its history as an open source community

A strong, open user community must be fostered to reveal a project's full potential.

A strong user community is essential to releasing the full potential of an open source project, and this influence is particularly important now for the newly developed Apache Drill project. Drill is a highly scalable SQL query engine for interactive access to a wide range of big data sources and formats. Some of the ways users have an impact are an expected part of the development process: by trying the software and reporting their experiences and use cases, users in the Drill community provide valuable feedback to developers as well as raise awareness with a larger audience of what this big data tool has to offer.
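
As one concrete example of that interactive access, here is a minimal sketch that queries a raw JSON file through Drill's REST API from Python; the host, port, and file path are assumptions for illustration (Drill can also be reached via JDBC/ODBC):

    # A minimal sketch: query a JSON file in place through Drill's
    # REST endpoint, with no upfront schema definition or ETL step.
    # localhost:8047 and the file path are assumptions.
    import requests

    resp = requests.post(
        "http://localhost:8047/query.json",
        json={
            "queryType": "SQL",
            # dfs is Drill's file-system storage plugin.
            "query": "SELECT * FROM dfs.`/data/logs/events.json` LIMIT 5",
        },
    )
    for row in resp.json().get("rows", []):
        print(row)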

This advantage was especially important with early versions of the software; users have helped Drill's development from its early days by reporting bugs and praising the features they like. And now, as Drill reaches maturity and refinement, users will likely also provide additional innovations: experimenting with Drill in their own projects, they may find new ways to use it that had not occurred to the developers.

Drill’s flexibility and extensibility lend themselves to innovation, but there’s also a natural tendency toward this type of change because the big data and Hadoop landscape is itself evolving quickly. In the case of Drill, we’re seeing the “unexpectedness benefit” of openness: the community gets out ahead of the leadership in use cases and technological change.

The first big Apache Drill design meeting in September 2012 in San Jose set the tone of openness and inclusion. This was an open meeting, organized by Drill co-founder Tomer Shiran and Drill mentor Ted Dunning, and sponsored by MapR Technologies through the Bay Area Apache Drill User Group. More than 60 people attended in person, and Webex connected a larger, international audience. I recall that in addition to speaker-led presentations and discussion, long strips of paper were mounted around the room for participants to write on during breaks in order to provide ideas or offer specific ways they might want to be involved. Practical steps like this surfaced good ideas immediately, and signaled openness for future ones. Read more…

Build better machine learning models

A beginner's guide to evaluating your machine learning models.

Everything today is being quantified, measured, and tracked — everything is generating data, and data is powerful. Businesses are using data in a variety of ways to improve customer satisfaction. For instance, data scientists are building machine learning models to generate intelligent recommendations to users so that they spend more time on a site. Analysts can use churn analysis to predict which customers are the best targets for the next promotional campaign. The possibilities are endless.

However, there are challenges in the machine learning pipeline. Typically, you build a machine learning model on top of your data. You collect more data. You build another model. But how do you know when to stop?

When is your smart model smart enough?

Evaluation is a key step when building intelligent business applications with machine learning. It is not a one-time task, but must be integrated with the whole pipeline of developing and productionizing machine learning-enabled applications.
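
As a small illustration of that discipline (generic scikit-learn usage, not code from the report): hold out data the model never sees during training, and score the model on it with more than one metric:

    # A minimal sketch: evaluate on held-out data the model never
    # trained on, and report more than one metric.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    model = LogisticRegression().fit(X_train, y_train)

    # Each metric answers a different question about the model.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))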

In a new free O’Reilly report, Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls, we cut through the technical jargon of machine learning and explain, in simple language, the process of evaluating machine learning models. Read more…

Big data is changing the face of fashion

How the fashion industry is embracing algorithms, natural language processing, and visual search.

Download Fashioning Data: A 2015 Update, our updated free report exploring data innovations from the fashion industry.

Fashion is an industry that struggles for respect — despite its enormous size globally, it is often viewed as frivolous or unnecessary.

And it’s true — fashion can be spectacularly silly and wildly extraneous. But somewhere between the glitzy, million-dollar runway shows and the ever-shifting hemlines, a very big business can be found. One industry profile of the global textiles, apparel, and luxury goods market reported that fashion had total revenues of $3.05 trillion in 2011, with revenues projected to reach $3.75 trillion in 2016.

Solutions for a unique business problem

The majority of clothing purchases are made not out of necessity, but out of a desire for self-expression and identity — two remarkably difficult things to quantify and define. Yet, established brands and startups throughout the industry are finding clever ways to use big data to turn fashion into “bits and bytes,” as much as threads and buttons.

In the newly updated O’Reilly report Fashioning Data: A 2015 Update, Data Innovations from the Fashion Industry, we explore applications of big data that carry lessons for industries of all types. Topics range from predictive algorithms to visual search (capturing structured data from photographs) to natural language processing. With specific examples spanning complex lifecycles and new startups, the report reveals how different companies are merging human input with machine learning. Read more…
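
As a generic sketch of the visual-search idea mentioned above (not any company's system): represent each catalog image as a feature vector, then return the nearest neighbors of a query image's vector. Random vectors stand in for real image features here:

    # A minimal sketch of visual search as nearest-neighbor lookup
    # over image feature vectors. Real systems extract the vectors
    # with a trained vision model; random data stands in here.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    catalog = rng.normal(size=(500, 128))  # 500 items, 128-dim features
    query = rng.normal(size=(1, 128))      # features of a shopper's photo

    nn = NearestNeighbors(n_neighbors=3).fit(catalog)
    _, idx = nn.kneighbors(query)
    print("closest catalog items:", idx[0])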

Three best practices for building successful data pipelines

Reproducibility, consistency, and productionizability let data scientists focus on the science.

Building a good data pipeline can be technically tricky. As a data scientist who has worked at Foursquare and Google, I can honestly say that one of our biggest headaches was locking down our Extract, Transform, and Load (ETL) process.

At The Data Incubator, our team has trained more than 100 talented Ph.D. data science fellows who are now data scientists at a wide range of companies, including Capital One, the New York Times, AIG, and Palantir. We commonly hear from Data Incubator alumni and hiring managers that implementing ETL pipelines is one of their biggest challenges, too.

Drawing on their experiences and my own, I’ve identified three key areas that are often overlooked in data pipelines; all three come down to making your analysis (a short sketch of the first point follows the list):

  1. Reproducible
  2. Consistent
  3. Productionizable
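
Here is the promised sketch of the first point. The paths and parameters are invented, but the pattern is common: pin a random seed, and write each run's output to its own location together with the parameters that produced it:

    # A minimal sketch of a reproducible ETL step: fixed seed, and
    # output written alongside the parameters that produced it.
    # Paths and parameters are invented for illustration.
    import json
    import random
    from pathlib import Path

    PARAMS = {"seed": 17, "sample_rate": 0.1}

    def run_etl(params, out_dir):
        random.seed(params["seed"])  # same seed -> same sample, every run
        rows = [i for i in range(1000) if random.random() < params["sample_rate"]]
        run_dir = Path(out_dir) / "run_seed{}".format(params["seed"])
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "sample.json").write_text(json.dumps(rows))
        (run_dir / "params.json").write_text(json.dumps(params))  # provenance

    run_etl(PARAMS, "/tmp/etl_output")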

While these areas alone cannot guarantee good data science, getting these three technical aspects of your data pipeline right helps ensure that your data and research results are both reliable and useful to an organization. Read more…

From search to distributed computing to large-scale information extraction

The O'Reilly Data Show Podcast: Mike Cafarella on the early days of Hadoop/HBase and progress in structured data extraction.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

February 2016 marks the 10th anniversary of Hadoop — at a point in time when many IT organizations actively use Hadoop and/or one of the open source big data projects that originated after it and, in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase.

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), HBase, and his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fit neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems. Read more…