- Teaching Programming to a Highly Motivated Beginner (CACM) — “I don’t think there is any better way to internalize knowledge than first spending hours upon hours growing emotionally distraught over such struggles and only then being helped by a mentor.” Me, too. Not struggle for struggle’s sake, but because by then you have built a strong mental map of the problem into which the solution can lock.
- Corona (GitHub) — Facebook open-sources their improvements to Hadoop’s job tracking, in the name of scalability, latency, cluster utilization, and fairness. (via Chris Aniszczyk)
- One Man’s Trash (Bunnie Huang) — Bunnie finds a Chumby relic in a Shenzhen market stall.
- Dronestagram — posting pictures of drone strike locations to Instagram. (via The New Aesthetic)
ENTRIES TAGGED "Hadoop"
There's a lot of new ground to be explored in large-scale image processing.
Eleven areas of focus for deeper investigation.
Motivated Learning, Better Hadoopery, Poignant Past Product, and Drone Imagery
Big Data's Big Picture, Real-Time Queries, Real-Time Queries, Single-Process Real-Time Queries
- Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: although Big Data is presented as part of the Gartner Hype Cycle, it’s an epoch of the Information Age which will have significant effects on the structure of corporations and the economy.
- Impala (github) — Cloudera’s open source (Apache) implementation of Google’s F1 (PDF), for realtime queries across clusters. Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Furthermore, Impala does not leverage MapReduce, allowing Impala to return results in real time. (via Wired)
- druid (github) — an open source (GPLv2) distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen when trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
- Supersonic (Google Code) — an ultra-fast, column-oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper-pipelined CPUs. It is designed to work in a single process. Apache-licensed.
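For readers new to the term, "column-oriented" in the druid and Supersonic entries above can be pictured with a toy example. This is only a sketch in plain Python with made-up records and hypothetical function names; it does not use or imitate either project's API.

```python
# Illustrative only: the same three made-up records, stored two ways.
rows = [
    {"user": "a", "country": "NZ", "latency_ms": 120},
    {"user": "b", "country": "US", "latency_ms": 80},
    {"user": "c", "country": "NZ", "latency_ms": 100},
]

# Column-oriented layout: one contiguous list per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def mean_latency_rows(records):
    # Row store: every record is visited in full just to read one field.
    return sum(r["latency_ms"] for r in records) / len(records)


def mean_latency_columns(cols):
    # Column store: only the latency_ms column is scanned.
    values = cols["latency_ms"]
    return sum(values) / len(values)
```

Both functions return the same answer. The difference is that the column version touches a single dense list, which is the kind of access pattern that cache-aware and SIMD-based engines are built to exploit.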
Spark is becoming a key part of a big data toolkit.
Data is getting heavier relative to the networks that carry it around the data center.
Imagine a future where large clusters of like machines dynamically adapt between programming paradigms depending on a combination of the resident data and the required processing.
Personalized Medicine, Reporting on Execution, Software-Defined Radio, and Beyond Hadoop
- Personalized Leukemia Treatment (NY Times) — researchers sequenced the tumor’s DNA, found the misbehaving gene, realized there was an existing experimental treatment to tackle that gene, and it worked. Reminds me of My Daughter’s DNA, which had its origin in the poignant story of Hugh Rienhoff sequencing his daughter’s DNA to diagnose her condition. It’s all about medical professionals now, but that’s no different from the Internet starting with geeks and moving out to the masses.
- Bullseye HD — web app which allows you to make the most of the time you spend with your team, by focusing your attention on the projects and actions that are off-track or not getting enough focus, rather than wasting precious time on status updates. (via Rowan Simpson)
- Per Vices — selling software-defined radio boards (for Linux only at the moment). (via Ars Technica)
- Post-Hadoop (GigaOm) — Google have moved beyond the basic software that Hadoop was copying. Lots of interesting points in this article, including one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration.
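The point that MapReduce is built around jobs rather than ad hoc exploration is easier to see in miniature. The following is a single-process sketch of the map, shuffle, and reduce phases as a word count, with hypothetical function names; it is not Hadoop's API and leaves out everything that makes the real thing distributed.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (key, value) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse each key's values into a single result.
    return key, sum(values)


def run_job(lines):
    # A "job" runs all three phases over the whole input before returning.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even in this toy form, nothing comes back until the whole input has been mapped, shuffled, and reduced, which is why the model suits scheduled workflows better than interactive questions.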
Kaggle now accepting data before a contest, HP's Autonomy purchase comes into focus, Cloudera's new Hadoop distribution.
In this week's data news, Kaggle launches Prospect, HP unveils its big data plans, and Cloudera releases CDH4 (the latest version of its Hadoop distribution).
A visualization tool from the OECD, concerns about open data and research, and updates to Hadoop.
In this week's data news, a visualization tool charts your "better life," researchers have concerns about access to data, and updates to Hadoop.
A coding judge, big data's enterprise conundrum, DIY education is on the move.
This week on O'Reilly: Coding is tied to cultural competence, not just a profession; Jim Stogdill wondered if solution vendors are waiting for broad Hadoop adoption before jumping in; and we learned how Schoolers, Edupunks and Makers are reshaping education.