Analytics can make combining or comparing data faster and less painful.
Entity resolution is the work of determining which records refer to the same person, organization, or event, something businesses and other organizations must do constantly to produce complete reports. It can be used, for instance, to:
- Combine your customer data with a list purchased from a data broker. Identical data may be in columns of different names, such as “last” and “surname.” Connecting columns from different databases is a common extract, transform, and load (ETL) task.
- Extract values from one database and match them against one or more columns in another. For instance, if you get a party list, you might want to find your clients among the attendees. A police detective might want to extract the names of people involved in a crime report and see whether any suspects are among them.
- Find a match in dirty data, such as a person whose name is spelled differently in different rows.
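The tasks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a description of any particular product: the column aliases, the sample records, and the 0.8 similarity threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source-specific column names to one canonical name.
COLUMN_ALIASES = {"last": "surname", "lname": "surname", "first": "given_name"}


def normalize_columns(record):
    """Rename aliased columns so records from different sources line up."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}


def name_similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_matches(ours, theirs, threshold=0.8):
    """Pair up records whose surnames are similar enough to be the same person."""
    matches = []
    for left in map(normalize_columns, ours):
        for right in map(normalize_columns, theirs):
            if name_similarity(left["surname"], right["surname"]) >= threshold:
                matches.append((left, right))
    return matches


# Invented sample data: our customer list and a purchased attendee list.
customers = [{"first": "Marcia", "last": "Marquez"}]
attendees = [
    {"given_name": "Marcia", "surname": "Marques"},
    {"given_name": "Lee", "surname": "Okafor"},
]
matches = find_matches(customers, attendees)
```

Here `matches` pairs the customer "Marquez" with the attendee "Marques" despite the different column names and spelling. Comparing every record against every other is fine for a sketch but scales poorly; a single surname comparison is also weak evidence on its own.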
Dirty, inconsistent, or unstructured data is the chief challenge in entity resolution. Jenn Reed, director of product management for Novetta Entity Analytics, points out that it’s easy for two numbers to get switched, such as a person’s driver’s license and social security numbers. Over time, sophisticated rules have been created to compare data, and confirming a match often requires comparing several fields. (For instance, health information exchanges use up to 17 different types of data to make sure the Marcia Marquez who just got admitted to the ER is the same Marcia Marquez who visited her doctor last week.) Read more…
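The idea of confirming a match across several fields can be illustrated as a weighted agreement score. The field names, weights, threshold, and sample records below are invented for illustration; real systems, such as the health information exchanges mentioned above, rely on their own tuned rules.

```python
# Hypothetical weights: how much agreement on each field counts toward a match.
FIELD_WEIGHTS = {"surname": 2.0, "birth_date": 3.0, "postal_code": 1.0, "phone": 2.0}


def agreement_score(a, b, weights=FIELD_WEIGHTS):
    """Return the weighted share (0..1) of comparable fields on which a and b agree."""
    agreed = 0.0
    compared = 0.0
    for field, weight in weights.items():
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            continue  # a missing value is neither evidence for nor against
        compared += weight
        if str(left).strip().lower() == str(right).strip().lower():
            agreed += weight
    return agreed / compared if compared else 0.0


def same_entity(a, b, threshold=0.75):
    """Treat two records as the same entity when enough weighted fields agree."""
    return agreement_score(a, b) >= threshold


# Invented records: same name everywhere, but only two describe the same person.
er_visit = {"surname": "Marquez", "birth_date": "1980-02-14",
            "postal_code": "02139", "phone": "555-0100"}
clinic_visit = {"surname": "marquez ", "birth_date": "1980-02-14", "postal_code": "02139"}
other_patient = {"surname": "Marquez", "birth_date": "1975-09-30", "postal_code": "60614"}
```

With these records, `same_entity(er_visit, clinic_visit)` is true, while `same_entity(er_visit, other_patient)` is false because only the surname agrees. Skipping missing fields keeps an incomplete record from counting as a mismatch.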
The O'Reilly Data Show Podcast: Angie Ma on building a finishing school for science and engineering doctorates.
Editor’s note: The ASI will offer a two-day intensive course, Practical Machine Learning, at Strata + Hadoop World in London in May.
Back when I was considering leaving academia, the popular exit route was financial engineering. Many science and engineering Ph.D.s ended up in big Wall Street banks; I chose to be the lead quant at a small hedge fund — it was a natural choice for many of us. Financial engineering was topically close to my academic interests, and working with traders meant access to resources and interesting problems.
Today, there are many more options for people with science and engineering doctorates. A few organizations take science and engineering Ph.D.s and, over the course of 8 to 12 weeks, prepare them to join the ranks of industrial data scientists and data engineers.
I recently sat down with Angie Ma, co-founder and president of ASI, a London startup that runs a carefully structured “finishing school” for science and engineering doctorates. We talked about how Angie and her co-founders (all ex-physicists) arrived at the concept of the ASI, the structure of their training programs, and the data and startup scene in the UK. [Full disclosure: I’m an advisor to the ASI.] Read more…
Like the Internet in 1994, virtual reality is about to cross the chasm from core technologists to the wider world.
When you’re an entrepreneur or investor struggling to bring a technology to market just a little before its time, being too early can feel exactly the same as being flat wrong. But with a bit more perspective, it’s clear that many of the hottest companies and products in today’s tech landscape are actually capitalizing on ideas that have been tried before — have, in some cases, been tackled repeatedly, and by very smart teams — but whose day has only now just arrived.
Virtual reality (VR) is one of those areas that has seduced many smart technologists in its long history, and its repeated commercial flameouts have left a lot of scar tissue in their wake. Despite its considerable ups and downs, though, the dream of VR has never died — far from it. The ultimate promise of the technology has been apparent for decades now, and many visionaries have devoted their careers to making it happen. But for almost 50 years, these dreams have outpaced the realities of price and performance.
To be fair, VR has come a long way in that time, though largely in specialized, under-the-radar domains that can support very high system costs and large installations; think military training and resource exploration. But the basic requirements for mass-market devices have never been met: low-power computing muscle; large, fast displays; and tiny, accurate sensors. Thanks to the smartphone supply chain, though, all of these components have evolved very rapidly in recent years — to the point where low-cost, high-quality, compact VR systems are now becoming available. Consumer VR really is coming on fast now, and things are getting very interesting. Read more…
The O'Reilly Radar Podcast: John Carnahan on holistic data analysis, engagement channels, and data science as an art form.
In this Radar Podcast episode, I sit down with John Carnahan, executive vice president of data science at Ticketmaster. At our recent Strata + Hadoop World Conference in San Jose, CA, Carnahan presented a session on using data science and machine learning to improve ticket sales and marketing at Ticketmaster.
I took the opportunity to chat with Carnahan about Ticketmaster’s evolving approach to data analysis, the avenues of user engagement they’re investigating, and how his genetics background is informing his work in the big data space.
When Carnahan took the job at Ticketmaster about three years ago, his strategy focused on small, concrete tasks aimed at solving distinct nagging problems: how do you address large numbers of tickets not sold at an event, how do you engage and market those undersold events to fans, and how do you stem abuse of ticket sales. This strategy has evolved, Carnahan explained, to a more holistic approach aimed at bridging the data silos within the company:
“We still want those concrete things, but we want to build a bed of data science assets that’s built on top of a company that’s been around almost 40 years and has a lot of data assets. How do we build the platform that will leverage those things into the future, beyond just those small niche products that we really want to build. We’re trying to bridge the gap between a lot of those products, too. Rather than think of each of those things as a vertical or a silo that’s trying to accomplish something, it’s how do you use something that you’ve built over here, over there to make that better?”
The growing complexity of design and architecture will require a new definition of design foundations, practice, and theory.
Editor’s note: This is an excerpt by Matt Nish-Lapidus from our recent book Designing for Emerging Technologies, a collection of works by several authors and edited by Jon Follett. This excerpt is included in our curated collection of chapters from the O’Reilly Design library. Download a free copy of the Designing for the Internet of Things ebook here.
Bruce Sterling wrote in Shaping Things that the world is becoming increasingly connected, and the devices by which we are connecting are becoming smarter and more self-aware. When every object in our environment contains data collection, communication, and interactive technology, how do we as human beings learn how to navigate all of this new information? We need new tools as designers, and as humans, to work with all of this information and the new devices that create, consume, and store it.
Today, there’s a good chance that your car can park itself. Your phone likely knows where you are. You can walk through the interiors of famous buildings on the web. Everything around us is constantly collecting data, running algorithms, calculating outcomes, and accumulating more raw data than we can handle.
We all carry minicomputers in our pockets, often more than one; public and private infrastructure collects terabytes of data every minute; and personal analytics has become so commonplace that it’s more conspicuous to not collect data about yourself than to record every waking moment. In many ways, we’ve moved beyond Malcolm McCullough’s ideas of ubiquitous computing put forth in Digital Ground and into a world in which computing isn’t only ubiquitous and invisible, but pervasive, constant, and deeply embedded in our everyday lives. Read more…
In the next decade, Year Zero will be how big data reaches everyone and will fundamentally change how we live.
Editor’s note: this post originally appeared on the author’s blog, Solve for Interesting. This lightly edited version is reprinted here with permission.
In 10 years, every human connected to the Internet will have a timeline. It will contain everything we’ve done since we started recording, and it will be the primary tool with which we administer our lives. This will fundamentally change how we live, love, work, and play. And we’ll look back at the time before our feed started — before Year Zero — as a huge, unknowable black hole.
This timeline — beginning for newborns at Year Zero — will be so intrinsic to life that it will quickly be taken for granted. Those without a timeline will be at a huge disadvantage. Those with a good one will have the tricks of a modern mentalist: perfect recall, suggestions for how to curry favor, ease maintaining friendships and influencing strangers, unthinkably higher Dunbar numbers — now, every interaction has a history.
This isn’t just about lifelogging health data, like your Fitbit or Jawbone. It isn’t about financial data, like Mint. It isn’t just your social graph or photo feed. It isn’t about commuting data like Waze or Maps. It’s about all of these, together, along with the tools and user interfaces and agents to make sense of it.
Every decade or so, something from military or enterprise technology finds its way, bent and twisted, into the mass market. The client-server computer gave us the PC; wide-area networks gave us the consumer web; pagers and cell phones gave us mobile devices. In the next decade, Year Zero will be how big data reaches everyone. Read more…