In the next decade, Year Zero will be how big data reaches everyone and will fundamentally change how we live.
Editor’s note: this post originally appeared on the author’s blog, Solve for Interesting. This lightly edited version is reprinted here with permission.
In 10 years, every human connected to the Internet will have a timeline. It will contain everything we’ve done since we started recording, and it will be the primary tool with which we administer our lives. This will fundamentally change how we live, love, work, and play. And we’ll look back at the time before our feed started — before Year Zero — as a huge, unknowable black hole.
This timeline, beginning for newborns at Year Zero, will be so intrinsic to life that it will quickly be taken for granted. Those without a timeline will be at a huge disadvantage. Those with a good one will have the tricks of a modern mentalist: perfect recall, suggestions for how to curry favor, ease in maintaining friendships and influencing strangers, and unthinkably higher Dunbar numbers. Every interaction will have a history.
This isn’t just about lifelogging health data, like your Fitbit or Jawbone. It isn’t about financial data, like Mint. It isn’t just your social graph or photo feed. It isn’t about commuting data like Waze or Maps. It’s about all of these, together, along with the tools and user interfaces and agents to make sense of it.
Every decade or so, something from military or enterprise technology finds its way, bent and twisted, into the mass market. The client-server computer gave us the PC; wide-area networks gave us the consumer web; pagers and cell phones gave us mobile devices. In the next decade, Year Zero will be how big data reaches everyone. Read more…
The Strata + Hadoop World 2015 Startup Showcase highlighted four important trends in the big data world.
At Strata + Hadoop World 2015 in San Jose last week, we ran an event for data-driven startups. This is the fourth year for the Startup Showcase, and it’s become a fixture of the conference. One of our early winners, MemSQL, has since raised $50 million in financing. The showcase is a good way for companies to get visibility with investors, analysts, and attendees.
This year’s winners underscore several important trends in the big data space at the moment: the maturity of management tools; the deployment of machine learning in other verticals; an increased focus on privacy and permissions; and the convergence of enterprise languages like SQL with distributed, schema-less data stacks. Read more…
The O'Reilly Data Show Podcast: David Blei, co-creator of one of the most popular tools in text mining and machine learning.
I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).
I recently sat down with David Blei, co-author of the seminal paper on topic models and still one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.
Generating features for other machine learning tasks
Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry often use topic models for “feature generation.” An added bonus is that topic models produce features that are easy to explain and interpret:
“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”
From data-driven government to our age of intelligence, here are key insights from Strata + Hadoop World in San Jose, CA, 2015.
Experts from across the big data world came together for Strata + Hadoop World in San Jose, CA, 2015. We’ve gathered insights from the event below.
U.S. chief data scientist
With a special recorded introduction from President Barack Obama, DJ Patil talks about his new role as the U.S. government’s first ever chief data scientist and the nature of the U.S.’s emerging data-driven government, and defines his mission in leading the data-driven initiative:
“Responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.”
Tips on how to build effective human-machine hybrids, from crowdsourcing expert Adam Marcus.
In a recent O’Reilly webcast, “Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd,” Adam Marcus explains how to mitigate common challenges of managing crowd workers, how to make the most of human-in-the-loop machine learning, and how to establish effective and mutually rewarding relationships with workers. Marcus is the director of data on the Locu team at GoDaddy, where the “Get Found” service provides businesses with a central platform for managing their online presence and content.
In the webcast, Marcus uses practical examples from his experience at GoDaddy to show how to:
- Offset the inevitability of wrong answers from the crowd
- Develop and train workers through a peer-review system
- Build a hierarchy of trusted workers
- Make crowd work inspiring and enable upward mobility
What to do when humans get it wrong
It turns out there is a simple way to offset human error: redundantly ask people the same questions. Marcus explains that when you ask five different people the same question, there are some creative ways to combine their responses, such as taking a majority vote. Read more…
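As a rough illustration of the majority-vote idea, here is a minimal Python sketch. It is not the approach Marcus’s team actually uses; the function name `majority_vote`, the tie handling, and the sample answers are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Combine redundant crowd answers to one question.

    Returns (winner, share), where share is the fraction of workers who
    gave the most common answer. If two or more answers tie for first
    place, winner is None so the question can be sent out again.
    (Hypothetical helper, not taken from the webcast.)
    """
    if not answers:
        raise ValueError("need at least one answer")

    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    share = top_count / len(answers)

    # A tie for first place means there is no majority to report.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, share
    return top_answer, share


# Five workers answer the same (made-up) question about a business.
responses = ["open", "open", "closed", "open", "closed"]
winner, share = majority_vote(responses)
print(winner, share)
```

Returning the vote share alongside the winner lets a caller treat a narrow win differently from a unanimous one, for example by asking more workers when agreement is low.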