- MAS S66: Indistinguishable From… Magic as Interface, Technology, and Tradition — MIT course taught by Greg Borenstein and Dan Novy. Further, magic is one of the central metaphors people use to understand the technology we build. From install wizards to voice commands and background daemons, the cultural tropes of magic permeate user interface design. Understanding the traditions and vocabularies behind these tropes can help us produce interfaces that use magic to empower users rather than merely obscuring their function. With a focus on the creation of functional prototypes and practicing real magical crafts, this class combines theatrical illusion, game design, sleight of hand, machine learning, camouflage, and neuroscience to explore how ideas from ancient magic and modern stage illusion can inform cutting edge technology.
- Maybe We Need an Automation Tax (RoboHub) — rather than saying “automation is bad,” move on to “how do we help those displaced by automation to retrain?”.
- America’s Cyber-Manhattan Project (Wired) — America already has a computer security Manhattan Project. We’ve had it since at least 2001. Like the original, it has been highly classified, spawned huge technological advances in secret, and drawn some of the best minds in the country. We didn’t recognize it before because the project is not aimed at defense, as advocates hoped. Instead, like the original, America’s cyber Manhattan Project is purely offensive. The difference between policemen and soldiers is that one serves justice and the other merely victory.
- White House Names DJ Patil First US Chief Data Scientist (Wired) — There is arguably no one better suited to help the country better embrace the relatively new discipline of data science than Patil.
"data science" entries
From data-driven government to our age of intelligence, here are key insights from Strata + Hadoop World in San Jose, CA, 2015.
Experts from across the big data world came together for Strata + Hadoop World in San Jose, CA, 2015. We’ve gathered insights from the event below.
U.S. chief data scientist
With a special recorded introduction from President Barack Obama, DJ Patil talks about his new role as the U.S. government’s first ever chief data scientist and the nature of the U.S.’s emerging data-driven government, and defines his mission in leading the data-driven initiative:
“Responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.”
Tips on how to build effective human-machine hybrids, from crowdsourcing expert Adam Marcus.
In a recent O’Reilly webcast, “Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd,” Adam Marcus explains how to mitigate common challenges of managing crowd workers, how to make the most of human-in-the-loop machine learning, and how to establish effective and mutually rewarding relationships with workers. Marcus is the director of data on the Locu team at GoDaddy, where the “Get Found” service provides businesses with a central platform for managing their online presence and content.
In the webcast, Marcus uses practical examples from his experience at GoDaddy to reveal helpful methods for how to:
- Offset the inevitability of wrong answers from the crowd
- Develop and train workers through a peer-review system
- Build a hierarchy of trusted workers
- Make crowd work inspiring and enable upward mobility
What to do when humans get it wrong
It turns out there is a simple way to offset human error: redundantly ask people the same questions. Marcus explains that when you ask five different people the same question, there are some creative ways to combine their responses, the simplest of which is a majority vote. Read more…
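The majority-vote idea can be sketched in a few lines of Python. This is a minimal illustration, not the method used at GoDaddy: the function name `majority_vote`, the sample answers, and the choice to return no winner on a tie are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(answers):
    """Combine redundant crowd answers to one question.

    Returns (winner, agreement), where agreement is the fraction of
    workers who gave the most common answer. The winner is None when
    two or more answers tie for first place (an assumed policy; a real
    system might escalate such a question to a trusted reviewer).
    """
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    agreement = top_count / len(answers)
    # A tie for first place means there is no majority to report.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, agreement
    return top_answer, agreement


# Hypothetical example: five workers answer the same question.
winner, agreement = majority_vote(["open", "open", "closed", "open", "closed"])
print(winner, agreement)
```

Asking an odd number of workers, such as five, makes ties impossible for yes/no questions, and the agreement fraction gives a rough signal of which questions deserve a second look.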
The goal is to offer a single platform where users can get the best distributed algorithms for any data processing task.
2014 has been the most active year of Spark development to date, with major improvements across the entire engine. One particular area where it made great strides was performance: Spark set a new world record in 100TB sorting, beating the previous record held by Hadoop MapReduce by three times, using only one-tenth of the resources; it received a new SQL query engine with a state-of-the-art optimizer; and many of its built-in algorithms became five times faster. In this post, I’ll cover some of the technology behind these improvements as well as new performance work the Apache Spark developer community has done to speed up Spark.
Back in 2010, we at the AMPLab at UC Berkeley designed Spark for interactive queries and iterative algorithms, as these were two major use cases not well served by batch frameworks like MapReduce. As a result, early users were drawn to Spark because of the significant performance improvements in these workloads. However, performance optimization is a never-ending process, and as Spark’s use cases have grown, so have the areas looked at for further improvement. User feedback and detailed measurements helped the Apache Spark developer community to prioritize areas to work in. Starting with the core engine, I’ll cover some of the recent optimizations that have been made. Read more…
How to decide which framework is best for your particular use case.
Editor’s note: Mark Grover will be part of the team teaching the tutorial Architectural Considerations for Hadoop Applications at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.
Hadoop has become the de facto platform for storing and processing large amounts of data and has found widespread applications. In the Hadoop ecosystem, you can store your data in one of the storage managers (for example, HDFS, HBase, Solr, etc.) and then use a processing framework to process the stored data. Hadoop first shipped with only one processing framework: MapReduce. Today, there are many other open source tools in the Hadoop ecosystem that can be used to process data in Hadoop; a few common tools include the following Apache projects: Hive, Pig, Spark, Cascading, Crunch, Tez, and Drill, along with Impala and Presto. Some of these frameworks are built on top of each other. For example, you can write queries in Hive that can run on MapReduce or Tez. Another example currently under development is the ability to run Hive queries on Spark.
Amidst all of these options, two key questions arise for Hadoop users:
- Which processing frameworks are most commonly used?
- How do I choose which framework(s) to use for my specific use case?
This post will help you answer both of these questions, giving you enough context to make an educated decision regarding the best processing framework for your specific use case. Read more…
With Myriad, analytics can be performed on the same hardware that runs your production services.
This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely isolated to Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster. I break them up this way because Hadoop manages its own resources with Apache YARN (Yet Another Resource Negotiator). Which is nice for Hadoop, but all too often those resources are underutilized when there are no big data workloads in the queue. And then when a big data job comes in, those resources are stretched to the limit, and they are likely in need of more resources. That can be tough when you are on an island.
Hadoop was meant to tear down walls — albeit, data silo walls — but walls, nonetheless. What has happened is that while tearing some walls down, other types of walls have gone up in their place.
Another technology, Apache Mesos, is also meant to tear down walls, but Mesos has often been positioned to manage the “second cluster,” which is all of those other, non-Hadoop workloads.
This is where the story really starts, with these two silos of Mesos and YARN. They are often pitted against each other, as if they were incompatible. It turns out they work together, and therein lies my tale. Read more…
Changing your frame of reference when starting with SQL on Hadoop.
Editor’s note: John Russell will be one of the teachers of the tutorial Getting Started with Interactive SQL-On-Hadoop at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.
If you’re just getting started doing analytic work with SQL on Hadoop, a table with a million rows might seem like a good starting point for experimentation. Isn’t that a lot of data? While you can exercise the features of a traditional database with a million rows, for Hadoop it’s not nearly enough. Think billions of rows instead.
Let’s look at the ways a million-row table falls short. Understanding the data volumes involved with big data can help you avoid going down unproductive pathways based on misleading assumptions.
With a million-row table, every byte in each row represents a megabyte of total data volume. Let’s say your table represents people and has fields for name, address, occupation, salary, height, weight, number of children, and favorite food. Here’s what a sample field might look like, with a scale underneath to illustrate length:
This particular record takes up 78 characters, including the comma separators. A back-of-the-envelope calculation suggests that, if this is an average row, we’ll end up with about 78 megabytes of data in the table. (And don’t recycle that envelope just yet — doing analytics with Hadoop, you’ll do a lot of rough estimates like this to sanity-check your expectations about performance and scalability.) Read more…
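The back-of-the-envelope estimate above can be written as a short Python sketch. It assumes one byte per character (plain ASCII text), uncompressed storage, and decimal units (1 MB = 1,000,000 bytes); the helper names `estimate_table_bytes` and `human_readable` are made up for this illustration.

```python
def estimate_table_bytes(avg_row_bytes, row_count):
    """Rough table size: average row length times number of rows."""
    return avg_row_bytes * row_count


def human_readable(num_bytes):
    """Format a byte count using decimal units (assumed: 1 KB = 1000 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1000


AVG_ROW_BYTES = 78  # the 78-character record from the text, 1 byte per character

# Compare a million-row table with a billion-row one.
for rows in (1_000_000, 1_000_000_000):
    size = estimate_table_bytes(AVG_ROW_BYTES, rows)
    print(f"{rows:>13,} rows -> {human_readable(size)}")
```

Under these assumptions the million-row table comes to about 78 MB, while the same rows repeated a billion times come to about 78 GB, which is closer to the scale the article suggests thinking about.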