ENTRIES TAGGED "Hadoop"

How to analyze 100 million images for $624

There's a lot of new ground to be explored in large-scale image processing.

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world —…
Read Full Post | Comments: 3 |

Predicting the future: Strata 2014 hot topics

Eleven areas of focus for deeper investigation.

Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude take a lot of planning, but they also take a decent amount of prediction: Strata needs to skate to where the puck is going. While Strata New York + Hadoop World 2013 is still a few months away, we’re…
Read Full Post | Comments: 3 |
Four short links: 12 November 2012

Motivated Learning, Better Hadoopery, Poignant Past Product, and Drone Imagery

  1. Teaching Programming to a Highly Motivated Beginner (CACM) — I don’t think there is any better way to internalize knowledge than first spending hours upon hours growing emotionally distraught over such struggles and only then being helped by a mentor. Me, too. Not struggle for struggle’s sake, but because you have built a strong mental map of the problem into which the solution can lock.
  2. Corona (GitHub) — Facebook open sources its improvements to Hadoop’s job tracking, in the name of scalability, latency, cluster utilization, and fairness. (via Chris Aniszczyk)
  3. One Man’s Trash (Bunnie Huang) — Bunnie finds a Chumby relic in a Shenzhen market stall.
  4. Dronestagram — posting pictures of drone strike locations to Instagram. (via The New Aesthetic)
Comment |
Four short links: 25 October 2012

Big Data's Big Picture, Real-Time Queries, Real-Time Queries, Single-Process Real-Time Queries

  1. Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: although Big Data is presented as part of the Gartner Hype Cycle, it’s an epoch of the Information Age which will have significant effects on the structure of corporations and the economy.
  2. Impala (github) — Cloudera’s open source (Apache) implementation of Google’s F1 (PDF), for realtime queries across clusters. Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Furthermore, Impala does not leverage MapReduce, allowing Impala to return results in real-time. (via Wired)
  3. druid (github) — an open source (GPLv2), distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
  4. Supersonic (Google Code) — an ultra-fast, column oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper pipelined CPUs. It is designed to work in a single process. Apache-licensed.
Comment |

Seven reasons why I like Spark

Spark is becoming a key part of a big data toolkit.

A large portion of this week’s Amp Camp at UC Berkeley is devoted to an introduction to Spark – an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I’ve come to consider it a key part of my big data toolkit. Here’s why: Hadoop integration: Spark can work with files stored in…
Read Full Post | Comments: 2 |

Heavy data and architectural convergence

Data is getting heavier relative to the networks that carry it around the data center.

Imagine a future where large clusters of like machines dynamically adapt between programming paradigms depending on a combination of the resident data and the required processing.

Read Full Post | Comment |
Four short links: 9 July 2012

Personalized Medicine, Reporting on Execution, Software-Defined Radio, and Beyond Hadoop

  1. Personalized Leukemia Treatment (NY Times) — sequenced the tumor’s DNA, found the misbehaving gene, realized there was an existing experimental treatment to tackle that gene, and it worked. Reminds me of My Daughter’s DNA, which had its origin in the poignant story of Hugh Rienhoff sequencing his daughter’s DNA to diagnose her condition. It’s all about medical professionals now, but that’s no different from the Internet starting with geeks and moving out to the masses.
  2. Bullseye HD — web app which allows you to make the most of the time you spend with your team, by focusing your attention on the projects and actions that are off-track or not getting enough focus, rather than wasting precious time on status updates. (via Rowan Simpson)
  3. Per Vices — selling software-defined radio boards (for Linux only at the moment). (via Ars Technica)
  4. Post-Hadoop (GigaOm) — Google have moved beyond the basic software that Hadoop was copying. Lots of interesting points in this article, including one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration.
Comment |
Strata Week: Data prospecting with Kaggle

Kaggle now accepting data before a contest, HP's Autonomy purchase comes into focus, Cloudera's new Hadoop distribution.

In this week's data news, Kaggle launches Prospect, HP unveils its big data plans, and Cloudera releases CDH4 (the latest version of its Hadoop distribution).

Read Full Post | Comment |
Strata Week: Visualizing a better life

A visualization tool from the OECD, concerns about open data and research, and updates to Hadoop.

In this week's data news, a visualization tool charts your "better life," researchers have concerns about access to data, and updates to Hadoop.

Read Full Post | Comment |
Top Stories: May 14-18, 2012

A coding judge, big data's enterprise conundrum, DIY education is on the move.

This week on O'Reilly: Coding is tied to cultural competence, not just a profession; Jim Stogdill wondered if solution vendors are waiting for broad Hadoop adoption before jumping in; and we learned how Schoolers, Edupunks and Makers are reshaping education.

Read Full Post | Comment |