From data-driven government to our age of intelligence, here are key insights from Strata + Hadoop World in San Jose, CA, 2015.
Experts from across the big data world came together for Strata + Hadoop World in San Jose, CA, 2015. We’ve gathered insights from the event below.
U.S. chief data scientist
With a special recorded introduction from President Barack Obama, DJ Patil talks about his new role as the U.S. government’s first ever chief data scientist, the nature of the U.S.’s emerging data-driven government, and defines his mission in leading the data-driven initiative:
“Responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.”
Our things are getting wired together, and you're not secure if you can't control the destiny of your private information.
Editor’s note: The Electronic Frontier Foundation’s Cory Doctorow will be speaking at the Solid Conference in San Francisco June 23-25, 2015. Registration is now open — for more information on the program, visit the Solid website.
The digital world has been colonized by a dangerous idea: that we can and should solve problems by preventing computer owners from deciding how their computers should behave. I’m not talking about a computer that’s designed to say, “Are you sure?” when you do something unexpected — not even one that asks, “Are you really, really sure?” when you click “OK.” I’m talking about a computer designed to say, “I CAN’T LET YOU DO THAT DAVE” when you tell it to give you root, to let you modify the OS or the filesystem.
Case in point: the cell-phone “kill switch” laws in California and Minneapolis, which require manufacturers to design phones so that carriers or manufacturers can push an over-the-air update that bricks the phone without any user intervention, designed to deter cell-phone thieves. Early data suggests that the law is effective in preventing this kind of crime, but at a high and largely needless (and ill-considered) price.
To understand this price, we need to talk about what “security” is, from the perspective of a mobile device user: it’s a whole basket of risks, including the physical threat of violence from muggers; the financial cost of replacing a lost device; the opportunity cost of setting up a new device; and the threats to your privacy, finances, employment, and physical safety from having your data compromised. Read more…
The real challenge going forward: we can't trust anything.
A few weeks ago, I wrote about postmodern computing, and characterized it as the computing in a world of distrust.
This morning, I read Steve Bellovin’s blog post, What Must We Trust? — Bellovin explains that “modern” (my word) security is founded on the idea of a “Trusted Computing Base” (TCB), defined (in part) in the United States’ Defense Department’s Orange Book. There were parts of a system that you had to trust, and you had to guard their integrity vigilantly: the kernel, certainly, but also specific configuration files, executables, and so on.
The TCB has always been problematic, particularly since (at least initially) it did not consider the problem of network connections. But networking aside, Bellovin argues that recent events have blown the idea of a “trusted” system to bits. We’ve seen attacks against (Bellovin’s list) batteries, webcams, USB, and more. If Andromedans (Bellovin doesn’t want to say NSA) have managed to infiltrate our disk drives, what can trust mean? And it would be naive to think that this stops with devices that have disk drives. Our devices, from Fitbits to data centers, have been pwnd even before they’re built. Read more…
How to decide which framework is best for your particular use case.
Editor’s note: Mark Grover will be part of the team teaching the tutorial Architectural Considerations for Hadoop Applications at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.
Hadoop has become the de-facto platform for storing and processing large amounts of data and has found widespread applications. In the Hadoop ecosystem, you can store your data in one of the storage managers (for example, HDFS, HBase, Solr, etc.) and then use a processing framework to process the stored data. Hadoop first shipped with only one processing framework: MapReduce. Today, there are many other open source tools in the Hadoop ecosystem that can be used to process data in Hadoop; a few common tools include the following Apache projects: Hive, Pig, Spark, Cascading, Crunch, Tez, and Drill, along with Impala and Presto. Some of these frameworks are built on top of each other. For example, you can write queries in Hive that can run on MapReduce or Tez. Another example currently under development is the ability to run Hive queries on Spark.
Amidst all of these options, two key questions arise for Hadoop users:
- Which processing frameworks are most commonly used?
- How do I choose which framework(s) to use for my specific use case?
This post will you help answer both of these questions, giving you enough context to make an educated decision regarding the best processing framework for your specific use case. Read more…