"data science" entries

Year Zero: Our life timelines begin

In the next decade, Year Zero will be how big data reaches everyone and will fundamentally change how we live.


Editor’s note: this post originally appeared on the author’s blog, Solve for Interesting. This lightly edited version is reprinted here with permission.

In 10 years, every human connected to the Internet will have a timeline. It will contain everything we’ve done since we started recording, and it will be the primary tool with which we administer our lives. This will fundamentally change how we live, love, work, and play. And we’ll look back at the time before our feed started — before Year Zero — as a huge, unknowable black hole.

This timeline — beginning for newborns at Year Zero — will be so intrinsic to life that it will quickly be taken for granted. Those without a timeline will be at a huge disadvantage. Those with a good one will have the tricks of a modern mentalist: perfect recall, suggestions for how to curry favor, ease maintaining friendships and influencing strangers, unthinkably higher Dunbar numbers — now, every interaction has a history.

This isn’t just about lifelogging health data, like your Fitbit or Jawbone. It isn’t about financial data, like Mint. It isn’t just your social graph or photo feed. It isn’t about commuting data like Waze or Maps. It’s about all of these, together, along with the tools and user interfaces and agents to make sense of it.

Every decade or so, something from military or enterprise technology finds its way, bent and twisted, into the mass market. The client-server computer gave us the PC; wide-area networks gave us the consumer web; pagers and cell phones gave us mobile devices. In the next decade, Year Zero will be how big data reaches everyone. Read more…


Topic models: Past, present, and future

The O'Reilly Data Show Podcast: David Blei, co-creator of one of the most popular tools in text mining and machine learning.


I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC-based inference algorithm for topic models came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).

I recently sat down with David Blei, co-author of the seminal paper on topic models, and who remains one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.

Generating features for other machine learning tasks

Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry frequently use topic models for “feature generation.” The added bonus is that topic models produce features that are easy to explain and interpret:

“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”
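That pipeline is straightforward to prototype. Below is a minimal sketch (my own illustration, using scikit-learn’s LatentDirichletAllocation rather than the implementations from Blei’s group, with a hypothetical toy corpus and labels): fit a topic model, take each document’s topic mixture as its feature vector, and hand those features to a classifier.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy corpus with labels (1 = business-related, 0 = not)
    docs = [
        "the team won the championship game last night",
        "shares of the company rose after strong quarterly earnings",
        "the striker scored twice in the final match",
        "the merger values the startup at two billion dollars",
        "the coach praised the defense after the playoff win",
        "investors worried about rising interest rates and inflation",
    ]
    labels = [0, 1, 0, 1, 0, 1]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    # Each document becomes a mixture over discovered topics...
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    topic_features = lda.fit_transform(counts)

    # ...and that compact, interpretable representation feeds a downstream classifier
    clf = LogisticRegression().fit(topic_features, labels)
    new_doc = vectorizer.transform(["quarterly profits beat analyst expectations"])
    print(clf.predict(lda.transform(new_doc)))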

Read more…


An alternate perspective on data-driven decision making

The O'Reilly Radar Podcast: Tricia Wang on "thick data," purpose-driven problem solving, and building the ideal team.

In this week’s Radar Podcast episode, O’Reilly’s Roger Magoulas chatted with Tricia Wang, a global tech ethnographer and co-founder of PL Data, about how qualitative and quantitative data need to work together, reframing “data-driven decision making,” and building the ideal team.


Purpose-driven problem solving

Wang stressed that quantitative and qualitative data need to work together. Rather than focusing on data-driven decision making, we need to focus on the best way to identify and solve the problem at hand; the data alone won’t provide the answers:

“It’s been kind of a detriment to our field that there’s this phrase ‘data-driven decision making.’ I think oftentimes people expect that the data’s going to give you answers. Data does not give you answers; it gives you inputs. You still have to figure out how to do the translation work and figure out what the data is trying to explain, right? I think data-driven decision making does not accurately describe what data can do. Really what we should be talking about is purpose-driven problem solving with data.” Read more…


Signals from Strata + Hadoop World in San Jose, CA, 2015

From data-driven government to our age of intelligence, here are key insights from Strata + Hadoop World in San Jose, CA, 2015.

Experts from across the big data world came together for Strata + Hadoop World in San Jose, CA, 2015. We’ve gathered insights from the event below.

U.S. chief data scientist

With a special recorded introduction from President Barack Obama, DJ Patil talks about his new role as the U.S. government’s first-ever chief data scientist and the nature of the country’s emerging data-driven government, and he defines his mission in leading the data-driven initiative:

“Responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.”


Read more…


Four short links: 19 February 2015

Magical Interfaces, Automation Tax, Cyber Manhattan Project, and US Chief Data Scientist

  1. MAS S66: Indistinguishable From… Magic as Interface, Technology, and Tradition — MIT course taught by Greg Borenstein and Dan Novy. Further, magic is one of the central metaphors people use to understand the technology we build. From install wizards to voice commands and background daemons, the cultural tropes of magic permeate user interface design. Understanding the traditions and vocabularies behind these tropes can help us produce interfaces that use magic to empower users rather than merely obscuring their function. With a focus on the creation of functional prototypes and practicing real magical crafts, this class combines theatrical illusion, game design, sleight of hand, machine learning, camouflage, and neuroscience to explore how ideas from ancient magic and modern stage illusion can inform cutting edge technology.
  2. Maybe We Need an Automation Tax (RoboHub) — rather than saying “automation is bad,” move on to “how do we help those displaced by automation to retrain?”.
  3. America’s Cyber-Manhattan Project (Wired) — America already has a computer security Manhattan Project. We’ve had it since at least 2001. Like the original, it has been highly classified, spawned huge technological advances in secret, and drawn some of the best minds in the country. We didn’t recognize it before because the project is not aimed at defense, as advocates hoped. Instead, like the original, America’s cyber Manhattan Project is purely offensive. The difference between policemen and soldiers is that one serves justice and the other merely victory.
  4. White House Names DJ Patil First US Chief Data Scientist (Wired) — There is arguably no one better suited to help the country better embrace the relatively new discipline of data science than Patil.

Exploring methods in active learning

Tips on how to build effective human-machine hybrids, from crowdsourcing expert Adam Marcus.

In a recent O’Reilly webcast, “Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd,” Adam Marcus explains how to mitigate common challenges of managing crowd workers, how to make the most of human-in-the-loop machine learning, and how to establish effective and mutually rewarding relationships with workers. Marcus is the director of data on the Locu team at GoDaddy, where the “Get Found” service provides businesses with a central platform for managing their online presence and content.

In the webcast, Marcus uses practical examples from his experience at GoDaddy to reveal helpful methods for how to:

  • Offset the inevitability of wrong answers from the crowd
  • Develop and train workers through a peer-review system
  • Build a hierarchy of trusted workers
  • Make crowd work inspiring and enable upward mobility

What to do when humans get it wrong

It turns out there is a simple way to offset human error: redundantly ask people the same questions. Marcus explains that when you ask five different people the same question, there are some creative ways to combine their responses, such as taking a majority vote. Read more…
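As a rough illustration of that idea (my own sketch, not code from the webcast), here is how five redundant responses to the same question might be combined with a simple majority vote, along with a crude agreement score:

    from collections import Counter

    def majority_vote(responses):
        # Tally the redundant answers; return the winner and a rough agreement score
        counts = Counter(responses)
        answer, votes = counts.most_common(1)[0]
        return answer, votes / len(responses)

    # Hypothetical answers from five workers asked the same question
    responses = ["open", "open", "closed", "open", "open"]
    answer, agreement = majority_vote(responses)
    print(answer, agreement)  # -> open 0.8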


Recent performance improvements in Apache Spark

The goal is to offer a single platform where users can get the best distributed algorithms for any data processing task.

2014 has been the most active year of Spark development to date, with major improvements across the entire engine. One particular area where Spark made great strides was performance: it set a new world record in 100TB sorting, completing the benchmark three times faster than the previous record holder, Hadoop MapReduce, while using only one-tenth of the resources; it gained a new SQL query engine with a state-of-the-art optimizer; and many of its built-in algorithms became five times faster. In this post, I’ll cover some of the technology behind these improvements as well as new performance work the Apache Spark developer community has done to speed up Spark.

Back in 2010, we at the AMPLab at UC Berkeley designed Spark for interactive queries and iterative algorithms, as these were two major use cases not well served by batch frameworks like MapReduce. As a result, early users were drawn to Spark because of its significant performance advantage on these workloads. However, performance optimization is a never-ending process, and as Spark’s use cases have grown, so have the areas we target for further improvement. User feedback and detailed measurements have helped the Apache Spark developer community prioritize what to work on. Starting with the core engine, I’ll cover some of the recent optimizations that have been made. Read more…
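For readers who want something concrete to anchor the discussion, here is a minimal PySpark sketch (my own, with hypothetical file paths, and using the SparkSession entry point from later releases rather than the SparkContext/SQLContext style of this period) of the two workloads mentioned above: a distributed sort over key/value pairs and a DataFrame query that is planned by the SQL engine’s optimizer.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("perf-sketch").getOrCreate()
    sc = spark.sparkContext

    # Distributed sort over key/value pairs, the same primitive behind the 100TB benchmark
    pairs = sc.textFile("hdfs:///data/records.csv") \
              .map(lambda line: (line.split(",")[0], line))
    print(pairs.sortByKey().take(3))

    # A DataFrame query is planned by the SQL engine's optimizer before it runs
    events = spark.read.json("hdfs:///data/events.json")
    events.groupBy("country").count().orderBy("count", ascending=False).show(10)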


Processing frameworks for Hadoop

How to decide which framework is best for your particular use case.

Editor’s note: Mark Grover will be part of the team teaching the tutorial Architectural Considerations for Hadoop Applications at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

Hadoop has become the de facto platform for storing and processing large amounts of data and has found widespread application. In the Hadoop ecosystem, you can store your data in one of the storage managers (for example, HDFS, HBase, Solr, etc.) and then use a processing framework to process the stored data. Hadoop first shipped with only one processing framework: MapReduce. Today, there are many other open source tools in the Hadoop ecosystem that can be used to process data in Hadoop; a few common tools include the following Apache projects: Hive, Pig, Spark, Cascading, Crunch, Tez, and Drill, along with Impala and Presto. Some of these frameworks are built on top of each other. For example, you can write queries in Hive that can run on MapReduce or Tez. Another example currently under development is the ability to run Hive queries on Spark.

Amidst all of these options, two key questions arise for Hadoop users:

  1. Which processing frameworks are most commonly used?
  2. How do I choose which framework(s) to use for my specific use case?

This post will help you answer both of these questions, giving you enough context to make an educated decision regarding the best processing framework for your specific use case. Read more…
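To make the storage/processing split concrete, here is a small sketch (my own, with hypothetical HDFS paths) that uses Spark as the processing framework over data kept in HDFS; the storage manager stays the same no matter which engine does the work.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    # HDFS is the storage manager; Spark is just one of several frameworks that can process it
    counts = (sc.textFile("hdfs:///user/demo/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///user/demo/wordcounts")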


How data and connectivity are changing the nature of play

Now that technology has made its way into the playroom, there are a lot of important questions we should be asking.


Playing is how we learn. Through play, we develop large and fine motor skills, refine language and social interaction, and discover important facts about everything from the cycle of life and death to the laws of physics. When we play, we test the world around us, and share and grow.

But play is changing because it’s now filled with technology. In the coming months, I’m going to be looking at how data and connectivity are changing toys and the very nature of play. I’ll be talking to designers, inventors, technologists, and educators, and publishing the results in a report for O’Reilly Media.

Here’s my thinking so far:

Until very recent times, play was a purely tangible, real-world experience. Almost every adult alive has built a tower of blocks, climbed something, chased another person, used a skipping rope, and put together a puzzle. We all know the rules of tag and hide and seek. Some of our fondest memories include creative and imaginative play. We were pirates. We were princesses. We were explorers in a new, exciting land, limited only by our own imaginations and the loving cry of parents calling us home for dinner. Read more…


Forecasting events, from disease outbreaks to sales to cancer research

The O'Reilly Data Show Podcast: Kira Radinsky on predicting events using machine learning, NLP, and semantic analysis.

Editor’s note: One of the more popular speakers at Strata + Hadoop World, Kira Radinsky was recently profiled in the new O’Reilly Radar report, Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education.

When I first took over organizing Hardcore Data Science at Strata + Hadoop World, one of the first speakers I invited was Kira Radinsky. Radinsky had already garnered international recognition for her work forecasting real-world events (disease outbreaks, riots, etc.). She’s currently the CTO and co-founder of SalesPredict, a start-up using predictive analytics to “understand who’s ready to buy, who may buy more, and who is likely to churn.”

I recently had a conversation with Radinsky, and she took me through the many techniques and subject domains from her past and present research projects. In grad school, she helped build a predictive system that combined newspaper articles, Wikipedia, and other open data sets. Through fine-tuned semantic analysis and NLP, Radinsky and her collaborators devised new metrics of similarity between events. The techniques she developed for that predictive software system are now the foundation of applications across many areas. Read more…
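To give a feel for what a similarity metric between events involves, here is a toy bag-of-words illustration (my own sketch, not Radinsky’s actual method) that scores two event descriptions with cosine similarity:

    import math
    from collections import Counter

    def event_vector(description):
        # Crude bag-of-words stand-in for the richer semantic features real systems use
        return Counter(description.lower().split())

    def cosine_similarity(a, b):
        dot = sum(a[term] * b[term] for term in a if term in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    e1 = event_vector("cholera outbreak follows severe flooding in coastal region")
    e2 = event_vector("disease outbreak reported after flooding displaces residents")
    print(round(cosine_similarity(e1, e2), 3))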
