"data scientists" entries
A field guide to the Apache Hadoop projects, subprojects, and related technologies.
IT managers, developers, data analysts, and system architects are encountering the largest and most disruptive change in data analysis since the ascendancy of the relational database in the early 1980s: the challenge to process, organize, and take full advantage of big data. With 73% of organizations making big data investments in 2014 and 2015, this transition is occurring at a historic pace, requiring new ways of thinking to go along with new tools and techniques.
Hadoop is the cornerstone of this change to the landscape of systems and skills we have traditionally relied on. In the nine short years since the project revolutionized data science at Yahoo!, an entire ecosystem of technologies has sprung up around it. While the power of this ecosystem is plain to see, it can be a challenge to navigate your way through the complex and rapidly evolving collection of projects and products.
A couple of years ago, my coworker Marshall Presser and I started our journey into the world of Hadoop. Like many folks, we found the company we worked for was making a major investment in the Hadoop ecosystem, and we had to find a way to adapt. We started in all of the typical places: blog posts, trade publications, Wikipedia articles, and project documentation. We quickly learned that many of these sources are highly biased, either too shallow or too deep, or just plain inconsistent. Read more…
Analytics can make combining or comparing data faster and less painful.
Entity resolution refers to processes that businesses and other organizations have to do all the time in order to produce full reports on people, organizations, or events. Entity resolution can be used, for instance, to:
- Combine your customer data with a list purchased from a data broker. Identical data may be in columns of different names, such as “last” and “surname.” Connecting columns from different databases is a common extract, transform, and load (ETL) task.
- Extract values from one database and match them against one or more columns in another. For instance, if you get a party list, you might want to find your clients among the attendees. A police detective might want to extract the names of people involved in a crime report and see whether any suspects are among them.
- Find a match in dirty data, such as a person whose name is spelled differently in different rows.
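The first and third tasks in this list can be illustrated with a minimal Python sketch that uses only the standard library. Everything named here is an assumption made for illustration: the `COLUMN_ALIASES` mapping, the sample records, and the 0.8 similarity threshold are hypothetical, and production entity resolution tools use far more elaborate rules than a character-level similarity ratio.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source-specific column names to a shared schema.
COLUMN_ALIASES = {"surname": "last", "family_name": "last", "given_name": "first"}

def harmonize(record):
    """Rename aliased columns and normalize values to trimmed lowercase strings."""
    return {COLUMN_ALIASES.get(key, key): str(value).strip().lower()
            for key, value in record.items()}

def name_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

def find_matches(customers, purchased, threshold=0.8):
    """Pair customers with purchased records whose first and last names look alike."""
    matches = []
    purchased_clean = [harmonize(p) for p in purchased]
    for customer in map(harmonize, customers):
        for candidate in purchased_clean:
            # Use the weaker of the two name scores so both parts must agree.
            score = min(name_similarity(customer["first"], candidate["first"]),
                        name_similarity(customer["last"], candidate["last"]))
            if score >= threshold:
                matches.append((customer, candidate, score))
    return matches

# Hypothetical sample data: the purchased list uses different column names
# and spells the last name differently.
customers = [{"first": "Marcia", "last": "Marquez"}]
purchased = [{"given_name": "Marcia", "surname": "Marques"},
             {"given_name": "John", "surname": "Smith"}]
matches = find_matches(customers, purchased)
```

Comparing every customer against every purchased record is quadratic, so a sketch like this only suits small lists; larger jobs usually narrow the candidate pairs first.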
Dirty, inconsistent, or unstructured data is the chief challenge in entity resolution. Jenn Reed, director of product management for Novetta Entity Analytics, points out that it’s easy for two numbers to get switched, such as a person’s driver’s license and social security numbers. Over time, sophisticated rules have been created to compare data, and confirming a match often requires comparing several fields. (For instance, health information exchanges use up to 17 different types of data to make sure the Marcia Marquez who just got admitted to the ER is the same Marcia Marquez who visited her doctor last week.) Read more…
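As a rough illustration of comparing several fields and tolerating two switched numbers, here is a small Python sketch. It is not how Novetta Entity Analytics or any health information exchange works; the field names, the sample values, and the rule of requiring three agreeing fields are all hypothetical choices made for the example.

```python
# Hypothetical set of fields to compare; real systems may use many more.
FIELDS = ["name", "dob", "license_no", "ssn"]

def field_agreement(a, b, fields):
    """Count fields that are filled in on record a and equal on both records."""
    return sum(1 for f in fields if a.get(f) and a.get(f) == b.get(f))

def swapped(a, b, field_x, field_y):
    """True when b holds a's two values in each other's column."""
    return (a.get(field_x) is not None
            and a.get(field_x) != a.get(field_y)
            and a.get(field_x) == b.get(field_y)
            and a.get(field_y) == b.get(field_x))

def is_same_entity(a, b, fields=FIELDS, min_agree=3):
    """Declare a match when enough fields agree, counting a clean swap as agreement."""
    agree = field_agreement(a, b, fields)
    if swapped(a, b, "license_no", "ssn"):
        agree += 2  # assumption: a swap of both numbers counts as two agreeing fields
    return agree >= min_agree

# Made-up records: the clinic record has the two ID numbers switched.
er_visit = {"name": "marcia marquez", "dob": "1980-05-17",
            "license_no": "D1234567", "ssn": "000-00-0000"}
clinic_visit = {"name": "marcia marquez", "dob": "1980-05-17",
                "license_no": "000-00-0000", "ssn": "D1234567"}
other_person = {"name": "marcia marquez", "dob": "1990-01-01",
                "license_no": "X7654321", "ssn": "000-00-0001"}

same = is_same_entity(er_visit, clinic_visit)       # name, dob, and the swap agree
different = is_same_entity(er_visit, other_person)  # only the name agrees
```

A fixed count of agreeing fields is the simplest possible rule; weighting fields by how distinctive they are would be a natural next step.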
The O'Reilly Data Show Podcast: Angie Ma on building a finishing school for science and engineering doctorates.
Editor’s note: The ASI will offer a two-day intensive course, Practical Machine Learning, at Strata + Hadoop World in London in May.
Back when I was considering leaving academia, the popular exit route was financial engineering. Many science and engineering Ph.D.s ended up in big Wall Street banks; I chose to be the lead quant at a small hedge fund — it was a natural choice for many of us. Financial engineering was topically close to my academic interests, and working with traders meant access to resources and interesting problems.
Today, there are many more options for people with science and engineering doctorates. A few organizations take science and engineering Ph.D.s and, over the course of 8 to 12 weeks, prepare them to join the ranks of industrial data scientists and data engineers.
I recently sat down with Angie Ma, co-founder and president of ASI, a London startup that runs a carefully structured “finishing school” for science and engineering doctorates. We talked about how Angie and her co-founders (all ex-physicists) arrived at the concept of the ASI, the structure of their training programs, and the data and startup scene in the UK. [Full disclosure: I’m an advisor to the ASI.] Read more…
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how to address large numbers of unsold tickets for an event, how to engage fans and market those undersold events to them, and how to stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
In the next decade, Year Zero will be how big data reaches everyone and will fundamentally change how we live.
Editor’s note: this post originally appeared on the author’s blog, Solve for Interesting. This lightly edited version is reprinted here with permission.
In 10 years, every human connected to the Internet will have a timeline. It will contain everything we’ve done since we started recording, and it will be the primary tool with which we administer our lives. This will fundamentally change how we live, love, work, and play. And we’ll look back at the time before our feed started — before Year Zero — as a huge, unknowable black hole.
This timeline — beginning for newborns at Year Zero — will be so intrinsic to life that it will quickly be taken for granted. Those without a timeline will be at a huge disadvantage. Those with a good one will have the tricks of a modern mentalist: perfect recall, suggestions for how to curry favor, ease maintaining friendships and influencing strangers, unthinkably higher Dunbar numbers — now, every interaction has a history.
This isn’t just about lifelogging health data, like your Fitbit or Jawbone. It isn’t about financial data, like Mint. It isn’t just your social graph or photo feed. It isn’t about commuting data like Waze or Maps. It’s about all of these, together, along with the tools and user interfaces and agents to make sense of it.
Every decade or so, something from military or enterprise technology finds its way, bent and twisted, into the mass market. The client-server computer gave us the PC; wide-area networks gave us the consumer web; pagers and cell phones gave us mobile devices. In the next decade, Year Zero will be how big data reaches everyone. Read more…
The O'Reilly Data Show Podcast: David Blei, co-creator of one of the most popular tools in text mining and machine learning.
I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).
I recently sat down with David Blei, co-author of the seminal paper on topic models, who remains one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.
Generating features for other machine learning tasks
Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry frequently use topic models for “feature generation.” The added bonus is that topic models produce features that are easy to explain and interpret:
“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”
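To make the idea of topic proportions as features concrete, here is a minimal Python sketch of a collapsed Gibbs sampler for a topic model, using only the standard library. It is an illustration under stated assumptions, not the implementation from Blei's group or from Mallet: the function name `lda_topic_features`, the hyperparameters `alpha` and `beta`, the iteration count, and the three toy documents are all hypothetical.

```python
import random

def lda_topic_features(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Return one topic-proportion vector per document (rows sum to 1)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                 # words per topic, per doc
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                               # total words per topic

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta)
                           / (topic_total[t] + vocab_size * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: the features handed to a downstream model.
    return [[(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
             for t in range(num_topics)]
            for d, doc in enumerate(docs)]

# Toy documents, already tokenized; the third mixes the two themes.
docs = ["game team score season coach".split(),
        "market stock profit shares investor".split(),
        "team owner profit stadium market".split()]
features = lda_topic_features(docs, num_topics=2)
```

Each row of `features` could then be passed to a classifier in place of, or alongside, raw word counts. With a corpus this small the discovered topics are not meaningful; the point is only the shape of the representation.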
From data-driven government to our age of intelligence, here are key insights from Strata + Hadoop World in San Jose, CA, 2015.
Experts from across the big data world came together for Strata + Hadoop World in San Jose, CA, 2015. We’ve gathered insights from the event below.
U.S. chief data scientist
With a special recorded introduction from President Barack Obama, DJ Patil talks about his new role as the U.S. government’s first ever chief data scientist and the nature of the U.S.’s emerging data-driven government, and defines his mission in leading the data-driven initiative:
“Responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.”
Tips on how to build effective human-machine hybrids, from crowdsourcing expert Adam Marcus.
In a recent O’Reilly webcast, “Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd,” Adam Marcus explains how to mitigate common challenges of managing crowd workers, how to make the most of human-in-the-loop machine learning, and how to establish effective and mutually rewarding relationships with workers. Marcus is the director of data on the Locu team at GoDaddy, where the “Get Found” service provides businesses with a central platform for managing their online presence and content.
In the webcast, Marcus draws on practical examples from his experience at GoDaddy to show how to:
- Offset the inevitability of wrong answers from the crowd
- Develop and train workers through a peer-review system
- Build a hierarchy of trusted workers
- Make crowd work inspiring and enable upward mobility
What to do when humans get it wrong
It turns out there is a simple way to offset human error: redundantly ask people the same questions. Marcus explains that when you ask five different people the same question, you can combine their responses in some creative ways, for example by taking a majority vote. Read more…
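A majority vote over redundant answers can be sketched in a few lines of Python. This is a minimal illustration, not the method used at GoDaddy: the function name, the sample answers, and the choice to return `None` on a tie are assumptions made for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winner, share of votes); the winner is None when the top answers tie."""
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common(2)
    top_answer, top_count = ranked[0]
    share = top_count / len(answers)
    # A tie for first place means the crowd has not settled on an answer.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share
    return top_answer, share

# Five hypothetical workers answer the same question about a business listing.
answers = ["open", "open", "closed", "open", "open"]
winner, share = majority_vote(answers)   # ("open", 0.8)

tied_winner, tied_share = majority_vote(["open", "closed"])   # (None, 0.5)
```

The share of votes is returned alongside the winner so that low-agreement questions can be routed to more workers or to a reviewer.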