"Big Data" entries

Signals from Strata + Hadoop World in Barcelona 2014

From the Internet of Things to data-driven fashion, here are key insights from Strata + Hadoop World in Barcelona 2014.

Experts from across the big data world came together for Strata + Hadoop World in Barcelona 2014. We’ve gathered insights from the event below.

#IoTH: The Internet of Things and Humans

“If we could start over with these capabilities we have now, how would we do it differently?” Tim O’Reilly continues to explore data and the Internet of Things through the lens of human empowerment and the ability to “use technology to give people superpowers.”

The science of moving dots: the O’Reilly Data Show Podcast

Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data.

Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.

Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques.

A few months ago, I sat down with Rajiv Maheswaran, founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:

“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”

Four short links: 18 November 2014

A Worm Mind Forever LEGO Voyaging, Automatic Caption Generator, ELK Stack, and Amazonian Deployment

  1. A Worm’s Mind in a Lego Body — the C. elegans worm’s 302 neurons have been mapped, modelled in open source code, and now hooked up to a Lego robot. It is claimed that the robot behaved in ways that are similar to observed C. elegans. Stimulation of the nose stopped forward motion. Touching the anterior and posterior touch sensors made the robot move forward and back accordingly. Stimulating the food sensor made the robot move forward. There is video.
  2. Show and Tell: A Neural Image Caption Generator — Google Research paper on generating captions like “Two pizzas sitting on top of a stove top oven” from a photo. Wow.
  3. Big Data with the ELK Stack — Elasticsearch, Logstash, and Kibana. Interesting and powerful combination of tools!
  4. Apollo: Amazon’s Deployment Engine — Apollo will stripe the rolling update to simultaneously deploy to an equivalent number of hosts in each location. This keeps the fleet balanced and maximizes redundancy in the case of any unexpected events. When the fleet scales up to handle higher load, Apollo automatically installs the latest version of the software on the newly added hosts. Lust.
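
The striping described in the Apollo item can be illustrated with a toy Python sketch. This is not Amazon's implementation: the function name `striped_waves`, the `per_location` parameter, and the host names are all hypothetical. The sketch only shows the general idea of taking the same number of hosts from every location in each wave of a rolling update.

```python
from itertools import zip_longest


def striped_waves(hosts_by_location, per_location):
    """Group hosts into deployment waves, taking up to `per_location`
    hosts from every location in each wave (hypothetical helper)."""
    # Split each location's host list into chunks of `per_location` hosts.
    chunks_per_location = [
        [hosts[i:i + per_location] for i in range(0, len(hosts), per_location)]
        for hosts in hosts_by_location.values()
    ]
    waves = []
    # Wave n is made of the n-th chunk from every location. Locations that
    # run out of hosts early contribute nothing to the later waves.
    for group in zip_longest(*chunks_per_location, fillvalue=[]):
        waves.append([host for chunk in group for host in chunk])
    return waves


# Hypothetical fleet: two locations with four hosts each.
fleet = {
    "loc-a": ["a1", "a2", "a3", "a4"],
    "loc-b": ["b1", "b2", "b3", "b4"],
}
waves = striped_waves(fleet, per_location=1)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

With one host per location per wave, each location always has the same number of hosts out of service at a time, which is the balance property the quoted description is pointing at.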

The big data sweet spot: policy that balances benefits and risks

Deciding what data to collect is hard when consequences are unpredictable.

A big reason why discussions of “big data” get complicated (and policy-makers resort to vague hand-waving, as in the well-known White House executive office report) is that its ripple effects travel fast and far. Your purchase, when recorded by a data broker, affects not only the ads and deals they offer you in the future, but the ones they offer innumerable people around the country who share some demographic with you.

Policy-making might be simple if data collectors or governments could say, “We’ll collect certain kinds of data for certain purposes and no others” — but the impacts of data collection are rarely predictable. And if one did restrict big data that way, its value would be seriously reduced.

Data collection will explode as we learn how to connect you to different places you’ve been by the way you walk or to recognize certain kinds of diseases by your breath.

When such data exhaust is being collected, you can’t evade consequences by paying cash and otherwise living off the grid. In fact, trying to do so may disadvantage you even more: people who lack access to sophisticated technologies leave fewer tracks and therefore may not be served by corporations or governments.

Four short links: 12 November 2014

Material Design, Inflatable Robots, Printable Awesome, and Graph Modelling

  1. CSS and React to Implement Material Design — as I said earlier, it will be interesting to see if Material Design becomes a common UI style for the web.
  2. Current State of Inflatable Robots — I’d missed the amazing steps forward in control that were made in pneumatic robots. Check out the OtherLab tentacle!
  3. Dinosaur Skull Showerhead — 3D-printable add-on to your shower. (via Archie McPhee)
  4. Data Modelling in Graph Databases — how to build the graph structure by working back from the questions you’ll ask of it.

Four short links: 11 November 2014

High-Volume Logs, Regulated Broadband, Oculus Web, and Personal Data Vacuums

  1. Infrastructure for Data Streams — describing the high-volume log data use case for Apache Kafka, and how it plays out in storage and infrastructure.
  2. Obama: Treat Broadband and Mobile as Utility (Ars Technica) — In short, Obama is siding with consumer advocates who have lobbied for months in favor of reclassification while the telecommunications industry lobbied against it.
  3. MozVR — a website, and the tools that made it, designed to be seen through the Oculus Rift.
  4. All Cameras are Police Cameras (James Bridle) — how the slippery slope is ridden: When the Wall was initially constructed, the public were informed that this [automatic license plate recognition] data would only be held, and regularly purged, by Transport for London, who oversee traffic matters in the city. However, within less than five years, the Home Secretary gave the Metropolitan Police full access to this system, which allowed them to take a complete copy of the data produced by the system. This permission to access the data was granted to the Police on the sole condition that they only used it when National Security was under threat. But since the data was now in their possession, the Police reclassified it as “Crime” data and now use it for general policing matters, despite the wording of the original permission. As this data is not considered to be “personal data” within the definition of the law, the Police are under no obligation to destroy it, and may retain their ongoing record of all vehicle movements within the city for as long as they desire.

Four short links: 6 November 2014

JavaScript Testing, Dark Data, Webapp Design, and Design Trumps Data

  1. Karma — kick-ass open source JavaScript test environment.
  2. The Dark Market for Personal Data (NYTimes) — you can buy lists of victims of sexual assault, of impulse buyers, of people with sexually transmitted diseases, etc. The cost of a false positive when those lists are used for marketing is less than the cost of a false positive when banks use the lists to decide whether you’re a credit risk. The lists fall between the cracks in privacy legislation; essentially, the compilation and use of lists of people are unregulated territory.
  3. 7 Principles of Rich Web Applications — “rich web applications” sounds like 2007 wants its ideas back, but the content is modern and useful. Predict behaviour for negative latency.
  4. Collaborative Filtering at LinkedIn (PDF) — This paper presents LinkedIn’s horizontal collaborative filtering infrastructure, known as browsemaps. Great lessons learned, including context and presentation of browsemaps or any recommendation is paramount for a truly relevant user experience. That is, design and presentation represents the largest ROI, with data engineering being a second, and algorithms last. (via Greg Linden)

The problem of managing schemas

Schemas inevitably will change — Apache Avro offers an elegant solution.

When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?

This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.

The usual first choice of file format is either comma-delimited text files, since these are easy to dump from many databases, or JSON, which is often used for event data or data arriving from a REST API.

There are many benefits to this approach: text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources, and all applications in the Hadoop ecosystem will be able to process them.
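
As a rough illustration of how the two text formats behave when a schema changes, here is a minimal Python sketch using only the standard library. The sample records, the added `country` column, and the helper names `names_by_position` and `names_by_field` are hypothetical, invented for this example rather than taken from the article.

```python
import csv
import io
import json

# Hypothetical export, version 1: columns are (id, name).
csv_v1 = "1,alice\n2,bob\n"
# Hypothetical export, version 2: a country column is inserted before name.
csv_v2 = "1,US,alice\n2,DE,bob\n"

# The same two versions as newline-delimited JSON.
json_v1 = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}\n'
json_v2 = (
    '{"id": 1, "country": "US", "name": "alice"}\n'
    '{"id": 2, "country": "DE", "name": "bob"}\n'
)


def names_by_position(text):
    """Read the name from column index 1, as a consumer written for v1 would."""
    return [row[1] for row in csv.reader(io.StringIO(text))]


def names_by_field(text):
    """Read the name by key; each JSON record carries its own field names."""
    return [json.loads(line)["name"] for line in text.splitlines()]


print(names_by_position(csv_v1))  # names, as intended
print(names_by_position(csv_v2))  # country codes: wrong values, no error
print(names_by_field(json_v1))
print(names_by_field(json_v2))    # still the names
```

The positional CSV reader keeps running after the change but returns the wrong column, while the JSON reader is unaffected because the field names travel with every record, at the cost of repeating them in every line.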

Four short links: 4 November 2014

3D Shares, Autonomous Golf Carts, Competitive Solar, and Interesting Data Problems

  1. Cooper-Hewitt Shows How to Share 3D Scan Data Right (Makezine) — important as we move to a web of physical models, maps, and designs.
  2. Singapore Tests Autonomous Golfcarts (Robohub) — a reminder that the future may not necessarily look like someone used the clone tool to paint Silicon Valley over the world.
  3. Solar Hits Parity in 10 States, 47 by 2016 (Bloomberg) — The reason solar-power generation will increasingly dominate: it’s a technology, not a fuel. As such, efficiency increases and prices fall as time goes on. The price of Earth’s limited fossil fuels tends to go the other direction.
  4. Facebook’s Top Open Data Problems (Facebook Research) — even if you’re not interested in Facebook’s Very First World Problems, this is full of factoids like Facebook’s social graph store TAO, for example, provides access to tens of petabytes of data, but answers most queries by checking a single page in a single machine. (via Greg Linden)

Four short links: 3 November 2014

LittleBits Cloud, Big Data Futures, Predictable Robots, and New OS

  1. LittleBits Adds Functionality (MakeZine) — That next big idea might come from one of the latest bits in the littleBits catalog, the cloudBit. The piece enables wi-fi control of your circuit in various configurations — from the Internet to the bit, from the bit to the internet, or from bit to bit.
  2. Big Data’s Big Ideas (Ben Lorica) — this is a lot of what’s on the O’Reilly radar at the moment. Excellent short summary, with links.
  3. Rodney Brooks and Robotics (Boston Magazine) — [The robot] Baxter’s LCD eyes will look at the spot where it’s about to reach, making its movements, from a human perspective, more predictable. “If you want a machine to be able to interact with people,” Brooks says, “it better not do things that are surprising to people.”
  4. FUZIX — new open source OS from Alan Cox. Runs on Z80s, mostly runs on 6502s, and in theory if it’s got 8 bits and banked RAM you can probably run Fuzix OS on it. (via Alan Cox)