Here are a few of the data stories that caught my attention this week.
Genomics data and the cloud
GigaOm’s Derrick Harris explores some of the big data obstacles and opportunities surrounding genome research. He notes that:
When the Human Genome Project successfully concluded in 2003, it had taken 13 years to complete its goal of fully sequencing the human genome. Earlier this month, two firms — Life Technologies and Illumina — announced instruments that can do the same thing in a day, one for only $1,000. That’s likely going to mean a lot of data.
But as Harris observes, the promise of quick and cheap genomics is leading to other problems, particularly as the data reaches a heady scale. A fully sequenced human genome is about 100GB of raw data. But citing DNAnexus founder Andreas Sundquist, Harris says that:
… volume increases to about 1TB by the time the genome has been analyzed. He [Sundquist] also says we’re on pace to have 1 million genomes sequenced within the next two years. If that holds true, there will be approximately 1 million terabytes (or 1,000 petabytes, or 1 exabyte) of genome data floating around by 2014.
That makes the promise of a $1,000 genome sequencing service challenging when it comes to storing and processing petabytes of data. Harris posits that cloud computing will come to the rescue here, providing the necessary infrastructure to handle all that data.
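As a quick sanity check on those figures, here is a minimal back-of-envelope sketch. The constants (100 GB raw per genome, roughly 1 TB once analyzed, 1 million genomes sequenced) are taken from the article; decimal units are assumed.

```python
# Back-of-envelope check of the genome storage figures cited above.
RAW_GB_PER_GENOME = 100        # raw sequence data per genome (from the article)
ANALYZED_TB_PER_GENOME = 1     # data volume after analysis (per Sundquist)
GENOMES = 1_000_000            # projected sequenced genomes within two years

total_tb = GENOMES * ANALYZED_TB_PER_GENOME
total_pb = total_tb / 1_000    # decimal units: 1 PB = 1,000 TB
total_eb = total_pb / 1_000    # 1 EB = 1,000 PB

print(f"{total_tb:,} TB = {total_pb:,.0f} PB = {total_eb:,.0f} EB")
# 1,000,000 TB = 1,000 PB = 1 EB
```

The arithmetic bears out the article's claim: a million analyzed genomes at a terabyte apiece is an exabyte of data.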
Stanley Fish versus the digital humanities
Literary critic and New York Times opinionator Stanley Fish has been on a bit of a rampage in recent weeks, taking on the growing field of the “digital humanities.” Prior to the annual Modern Language Association meeting, Fish cautioned that alongside the traditional panels and papers on Ezra Pound and William Shakespeare and the like, there was going to be a flood of sessions devoted to:
…’the digital humanities,’ an umbrella term for new and fast-moving developments across a range of topics: the organization and administration of libraries, the rethinking of peer review, the study of social networks, the expansion of digital archives, the refining of search engines, the production of scholarly editions, the restructuring of undergraduate instruction, the transformation of scholarly publishing, the re-conception of the doctoral dissertation, the teaching of foreign languages, the proliferation of online journals, the redefinition of what it means to be a text, the changing face of tenure — in short, everything.
That “everything” was narrowed down substantially in Fish’s editorial this week, in which he blasted the digital humanities for what he sees as its fixation “with matters of statistical frequency and pattern.” In other words: data and computational analysis.
According to Fish, the problem with the digital humanities is that this new scholarship relies heavily on the machine — and not the literary critic — for interpretation. Fish casts digital humanities scholars as teams of statisticians and positivists, busily digitizing texts so they can data-mine them, systematically and programmatically uncovering something of interest — something worthy of interpretation.
University of Illinois, Urbana-Champaign English professor Ted Underwood argues that Fish not only mischaracterizes what digital humanities scholars do, but also misrepresents how his own interpretive tradition works:
… by pretending that the act of interpretation is wholly contained in a single encounter with evidence. On his account, we normally begin with a hypothesis (which seems to have sprung, like Sin, fully-formed from our head), and test it against a single sentence.
One of the most interesting responses to Fish’s recent rants about the humanities’ digital turn comes from University of North Carolina English professor Daniel Anderson, who demonstrates in the following video a far fuller picture of what “digital” work with “data” — creation and interpretation alike — looks like:
Hadoop World merges with O’Reilly’s Strata New York conference
Two big data events announced this week that they’ll be merging: Hadoop World will now be part of the Strata Conference in New York this fall.
[Disclosure: The Strata events are run by O’Reilly Media.]
Cloudera first started Hadoop World back in 2009, and as Hadoop itself has seen increasing adoption, Hadoop World, too, has become more popular. Strata is a newer event — its first conference was held in Santa Clara, Calif., in February 2011, and it expanded to New York in September 2011.
With the merger, Hadoop World will be a featured program at Strata New York 2012 (Oct. 23-25).
In other Hadoop-related news this week, Strata chair Edd Dumbill took a close look at Microsoft’s Hadoop strategy. Although it might be surprising that Microsoft has opted to adopt an open source technology as the core of its big data plans, Dumbill argues that:
Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.
Also, Cloudera data scientist Josh Wills takes a closer look at one aspect of that ecosystem: the work of scientists whose research falls outside of statistics and machine learning. His blog post specifically addresses one use case for Hadoop — seismology, for which there is now Seismic Hadoop — but it also provides a broad look at what constitutes the practice of data science.
Got data news?
Feel free to email me.