<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>O&#039;Reilly Radar &#187; Bradley Voytek</title>
	<atom:link href="http://radar.oreilly.com/bradleyv/feed" rel="self" type="application/rss+xml" />
	<link>http://radar.oreilly.com</link>
	<description>Insight, analysis, and research about emerging technologies</description>
	<lastBuildDate>Tue, 21 May 2013 12:00:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>In defense of frivolities and open-ended experiments</title>
		<link>http://radar.oreilly.com/2012/06/experiments-learning-frivolities.html</link>
		<comments>http://radar.oreilly.com/2012/06/experiments-learning-frivolities.html#comments</comments>
		<pubDate>Fri, 08 Jun 2012 14:00:00 +0000</pubDate>
		<dc:creator>Bradley Voytek</dc:creator>
				<category><![CDATA[Edu 2.0]]></category>
		<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[build]]></category>
		<category><![CDATA[DIY]]></category>
		<category><![CDATA[education]]></category>
		<category><![CDATA[experiments]]></category>
		<category><![CDATA[learning]]></category>
		<category><![CDATA[testing]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/06/experiments-learning-frivolities.html</guid>
		<description><![CDATA[Before you scoff at the pointlessness of yet another social network, web app, or project, remember that we don&apos;t always do the research or build the company that is immediately useful or profitable. ]]></description>
				<content:encoded><![CDATA[<p>My first child was born just about nine months ago. From the hospital window on that memorable day, I could see that it was surprisingly sunny for a Berkeley autumn afternoon. At the time, I&#8217;d only slept about three of the last 38 hours. My mind was making up for the missing haze that usually fills the Berkeley sky. Despite my cloudy state, I can easily recall those moments following my first afternoon lying with my newborn son. In those minutes, he cleared my mind better than the sun had cleared the Berkeley skies.</p>
<p>While my wife slept and recovered, I talked to my boy, welcoming him into this strange world and his newfound existence. I told him how excited I was for him to learn about it all: the sky, planets, stars, galaxies, animals, happiness, sadness, laughter. As I talked, I came to realize how many concepts I understood that he lacked. For every new thing I mentioned, I realized there were 10 more that he would need to learn just to understand that one.</p>
<p>Of course, he need not know specific facts to appreciate the sun&#8217;s warmth, but to understand what the sun is, he must first learn the pyramid of knowledge that encapsulates our understanding of it: He must learn to distinguish self from other; he must learn about time, scale and distance and proportion, light and energy, motion, vision, sensation, and so on.</p>
<p class="image-box-580"><img src="http://cdn.oreilly.com/radar/images/posts/0612-anatomy-of-a-sunset.jpg" width="580" border="0" alt="Anatomy of a sunset"></p>
<p>I mentioned time. Ultimately, I regressed to talking about language, mathematics, history, ancient Egypt, and the Pyramids. It was the verbal equivalent of &#8220;<a href="http://tvtropes.org/pmwiki/pmwiki.php/Main/WikiWalk">wiki walking</a>,&#8221; wherein I go to Wikipedia to look up an innocuous fact, such as the density of gold, and find myself reading about <a href="http://en.wikipedia.org/wiki/Sumerian_religion">Mesopotamian religious practices</a> an hour later.</p>
<p>It struck me then how incredible human culture, science, and technology truly are. For billions of years, life was restricted to a nearly memoryless existence, at most relying upon brief <a href="http://www.nature.com/nrg/journal/v9/n12/full/nrg2480.html">changes in chemical gradients</a> to move closer to nutrient sources or farther from toxins.</p>
<p>With time, these basic chemo- and photo-sensory apparatuses evolved; creatures with longer memories &mdash; perhaps long enough to remember where food sources were richest &mdash; possessed an evolutionary advantage. Eventually, the time scales on which memory operates extended longer; short-term memory became long-term memory, and brains evolved the ability to maintain a memory across an entire biological lifetime. (In fact, how the brain coordinates such memories is a <a href="http://newscenter.berkeley.edu/2010/11/03/prefrontal_cortex_stroke/">core question of my neuroscientific research</a>.)</p>
<div align="center">
<p class="image-box-300"><img src="http://cdn.oreilly.com/radar/images/posts/0612-brain.png" width="300" border="0" alt="Brain"></p>
</div>
<p>However, memory did not stop there. Language permitted interpersonal communication, and primates finally overcame the memory limitations of a single lifespan. Writing and culture lent memory an increased permanence, removing the requirement that knowledge pass verbally, thus improving the fidelity of memory and minimizing the costs of the &#8220;<a href="http://en.wikipedia.org/wiki/Chinese_whispers">telephone game effect</a>.&#8221;</p>
<p>We are now in the digital age, where we are freed from the confines of needing to remember a phone number or other arbitrary facts. While I&#8217;d like to think that we&#8217;re using this &#8220;extra storage&#8221; for useful purposes, sadly I can tell you more about minutiae of the Marvel Universe and &#8220;Star Wars&#8221; canon than will ever be useful (short of an alien invasion in which our survival as a species is predicated on my ability to tell you that Nightcrawler doesn&#8217;t, strictly speaking, teleport, but rather he travels through another dimension, and when he reappears in our dimension the &#8220;BAMF&#8221; sound results from some sulfuric gasses entering our dimension upon his return).</p>
<p>But I wiki-walk digress.</p>
<p>So what does all of this extra memory gain us?</p>
<p>Accelerated innovation.</p>
<p>As a scientist, my (hopefully) novel research is built upon the unfathomable number of failures and successes of those who came before me. The common refrain is that we scientists stand on the shoulders of giants. It is for this reason that I&#8217;ve <a href="http://blog.ketyov.com/2012/02/basic-science-is-about-creating.html">previously argued</a> that research funding is so critical, even for apparently &#8220;frivolous&#8221; projects. I&#8217;ve got a <a href="https://docs.google.com/spreadsheet/ccc?key=0AsXhCu3oLBeWdHEwckdNMC0tNDhzajZmMVhNZGgzMnc">Google Doc</a> noting impressive breakthroughs that emerged from research that, on the surface, has no &#8220;practical&#8221; value:</p>
<ul>
<li> <a href="http://www.icrar.org/news/news_items/wireless_inventor_honoured">Research on black holes helped create Wi-Fi</a></li>
<li> <a href="http://www.apa.org/monitor/jan03/basic.aspx">Optometry saved lives on 9/11 via architecture</a></li>
<li> <a href="http://en.wikipedia.org/wiki/Penicillin">Growing bacteria in dirty petri dishes led to penicillin</a></li>
<li> <a href="http://www.radiolab.org/2011/nov/14/">Studying monkey social behaviors and eating habits led to insights into HIV</a></li>
</ul>
<p>Although you can&#8217;t legislate innovation or democratize a breakthrough, you can encourage a system that maximizes the probability that a breakthrough can occur. This is what science should be doing and this is, to a certain extent, what Silicon Valley is already doing.</p>
<p>The more data, information, software, tools, and knowledge available, the more we as a society can build upon previous work. (That said, even though I&#8217;m a <a href="http://radar.oreilly.com/2012/03/data-science-deep-data-information-paradox.html">huge proponent for more data</a>, the most transformational theory from biology came about from solid critical thinking, logic, and sparse data collection.)</p>
<p>Of course, I&#8217;m biased, but I&#8217;m going to talk about two projects in which I&#8217;m involved: one business and one scientific. The first is <a href="http://www.uber.com/">Uber</a>, an on-demand car service that allows users to request a private car via their smartphone or SMS. Uber is built using a variety of open software and tools such as <a href="http://www.python.org/">Python</a>, <a href="http://www.mysql.com/">MySQL</a>, <a href="http://nodejs.org/">node.js</a>, and others. These systems helped make Uber possible.</p>
<div align="center">
<p class="image-box-300"><img src="http://cdn.oreilly.com/radar/images/posts/0612-uber-screen.png" border="0" alt="Uber screenshot" width="300" /></p>
</div>
<p>As a non-engineer, it&#8217;s staggering to think of the complexity of the systems that make Uber work: GPS, accurate mapping tools, a reliable cellular/SMS system, automated dispatching system, and so on. But we as a culture become so quickly accustomed to certain advances that, should our system ever experience a service disruption, Louis C.K. would almost certainly be prophetic about the response:</p>
<div align="center">
<p><iframe width="480" height="360" src="http://www.youtube.com/embed/8r1CZTLk-Gk" frameborder="0" allowfullscreen></iframe></p>
</div>
<p>The other project in which I&#8217;m involved is <a href="http://www.brainscanr.com/">brainSCANr</a>. My <a href="http://www.voytekdesign.com/">wife</a> and I recently <a href="http://www.sciencedirect.com/science/article/pii/S0165027012001513">published a paper on this</a>, but the basic idea is that we mined the text of more than three million peer-reviewed neuroscience research articles to find associations between topics and search for potentially missing links (which we called &#8220;semi-automated hypothesis generation&#8221;).</p>
<p>We built the first version of the site in a week, using nothing but open data and tools. The National Library of Medicine, part of the National Institutes of Health, <a href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html">provides an API</a> to search all of these manuscripts in their massive, 20-million-paper-plus database. We used Python to process the associations, the <a href="http://thejit.org/">JavaScript InfoVis Toolkit</a> to plot the data, and <a href="https://appengine.google.com/">Google App Engine</a> to host it all. I&#8217;m positive when the NIH funded the creation of PubMed and its API, they didn&#8217;t have this kind of project in mind.</p>
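<p>As a rough illustration of what such a query can look like, here is a minimal sketch that builds a request URL for the NCBI E-utilities ESearch endpoint. The helper name <code>build_esearch_url</code> and the choice of example terms are assumptions made for this sketch, not the code behind brainSCANr; it only constructs the URL and does not contact the server.</p>

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term_a, term_b, retmax=0):
    """Build an ESearch URL asking PubMed for papers mentioning both terms.

    The [tiab] field tag restricts matching to titles and abstracts.
    build_esearch_url is a hypothetical helper written for this sketch.
    """
    query = '"{}"[tiab] AND "{}"[tiab]'.format(term_a, term_b)
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("serotonin", "migraine")
print(url)
```

<p>Fetching that URL should return a JSON document whose <code>esearchresult</code> object includes a <code>count</code> of matching papers, which is the only number a co-occurrence tool of this kind needs per pair of terms.</p>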
<p>That&#8217;s the great thing about making more tools available; it&#8217;s arrogant to think that we can anticipate the best ways to make use of our own creations. My hope is that brainSCANr is the weakest incarnation of this kind of scientific text mining, and that bigger and better things will come of it.</p>
<p>Twenty years ago, these projects would have been practically impossible, meaning that the amount of labor involved to make them would have been impractical. Now they can be built by a handful of people (or a guy and his pregnant wife) in a week.</p>
<p>Just as research into black holes can lead to a breakthrough in wireless communication, so too can seemingly benign software technologies open amazing and unpredictable frontiers. Who would have guessed that what began with a simple online bookstore would grow into <a href="http://aws.amazon.com/">Amazon Web Services</a>, a tool that is playing an ever more important role in innovation and scientific computing <a href="http://www.ncbi.nlm.nih.gov/pubmed/22492314">such as genetic sequencing</a>?</p>
<p>So, before you scoff at the &#8220;pointlessness&#8221; of social networks or the wastefulness of &#8220;another web service,&#8221; remember that we don&#8217;t always do the research that will lead to the best immediate applications or build the company that is immediately useful or profitable. Nor can we always anticipate how our products will be used. It&#8217;s easy to mock Twitter because you don&#8217;t care to hear about who ate what for lunch, but I guarantee that the <a href="http://www.thesun.co.uk/sol/homepage/news/2822899/Haiti-man-saved-by-Twitter.html">people whose lives were saved after the Haiti earthquake</a> or who <a href="http://en.wikipedia.org/wiki/Arab_Spring">coordinated the spark of the Arab Spring</a> are happy Twitter exists.</p>
<p>While we might have to justify ourselves to granting agencies, or venture capitalists, or our shareholders in order to do the work we want to do, sometimes the &#8220;real&#8221; reason we spend so much of our time working is the same reason people climb mountains: because it&#8217;s awesome that we can. That said, it&#8217;s nice to know that what we&#8217;re building now will be improved upon by our children in ways we can&#8217;t even conceive.</p>
<p>I can&#8217;t wait to have this conversation with my son when &mdash; after learning how to talk, of course &mdash; he&#8217;s had a chance to build on the frivolities of my generation.</p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2012/03/data-science-deep-data-information-paradox.html">Automated science, deep data and the paradox of information</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2012/06/experiments-learning-frivolities.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Automated science, deep data and the paradox of information</title>
		<link>http://radar.oreilly.com/2012/03/data-science-deep-data-information-paradox.html</link>
		<comments>http://radar.oreilly.com/2012/03/data-science-deep-data-information-paradox.html#comments</comments>
		<pubDate>Fri, 30 Mar 2012 18:30:00 +0000</pubDate>
		<dc:creator>Bradley Voytek</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[data conclusions]]></category>
		<category><![CDATA[data ethics]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[scientific method]]></category>
		<category><![CDATA[scientists]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/03/data-science-deep-data-information-paradox.html</guid>
		<description><![CDATA[Bradley Voytek: &#34;Our goal as data scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand.&#34; ]]></description>
				<content:encoded><![CDATA[<p>A <a href="http://radar.oreilly.com/2012/01/what-is-big-data.html">lot of great pieces have been written</a> about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, &#8220;What&#8217;s different here? What&#8217;s special about these outliers and what do they tell us about our models and assumptions?&#8221;</p>
<p>The reason that big data proponents are so excited about the burgeoning data revolution isn&#8217;t just because of the math. Don&#8217;t get me wrong, the math is fun, but we&#8217;re <em>excited</em> because we can begin to distill patterns that were <em>previously invisible</em> to us due to a lack of information.</p>
<p><em>That&#8217;s</em> big data.</p>
<p>Of course, data are just a collection of facts; bits of information that are only given context &mdash; assigned meaning and importance &mdash; by human minds. It&#8217;s not until we <em>do something</em> with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.</p>
<p>And therein lies the rub.</p>
<p>Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?</p>
<h2>(Semi)Automated science</h2>
<p>In 2009, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in &#8220;Science&#8221; titled <a href="http://www.sciencemag.org/content/324/5923/81.short">&#8220;Distilling Free-Form Natural Laws from Experimental Data&#8221;</a>. The premise was simple, and it essentially boiled down to the question, &#8220;can we algorithmically extract models to fit our data?&#8221;</p>
<p>So they hooked up a double pendulum &mdash; a seemingly chaotic system whose movements are governed by classical mechanics &mdash; and trained a machine learning algorithm on the motion data.</p>
<p><iframe width="600" height="407" src="http://www.youtube.com/embed/U39RMUzCjiU" frameborder="0" allowfullscreen></iframe></p>
<p>Their results were astounding.</p>
<p>In a matter of minutes the algorithm converged on Newton&#8217;s second law of motion: <em>f = ma</em>. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.</p>
<p>In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in &#8220;Nature Methods&#8221; titled <a href="http://www.nature.com/nmeth/journal/v8/n8/abs/nmeth.1635.html">&#8220;Large-scale automated synthesis of human functional neuroimaging data&#8221;</a>. In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.</p>
<p>To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.</p>
<p>In other words, you type in a word such as &#8220;learning&#8221; on their website search and visualization tool, <a href="http://neurosynth.org/">NeuroSynth</a>, and they give you back a pattern of brain activity that you should <em>expect</em> to see during a learning task.</p>
<p>But that&#8217;s not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, &#8220;given the data that I&#8217;m observing, what is the most probable behavioral state that this brain is in?&#8221;</p>
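<p>The logic of a reverse inference can be sketched with Bayes&#8217; rule. The snippet below is a toy illustration only: the two behavioral states, the prior, and the likelihood values are invented for the example, and the function is not the NeuroSynth implementation.</p>

```python
def reverse_inference(prior, likelihood):
    """Return P(state | activation) given P(state) and P(activation | state)."""
    unnormalised = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalised.values())
    return {s: v / total for s, v in unnormalised.items()}


# Invented numbers, for illustration only.
prior = {"learning": 0.5, "resting": 0.5}        # P(state)
likelihood = {"learning": 0.6, "resting": 0.2}   # P(activation | state)

posterior = reverse_inference(prior, likelihood)
most_probable = max(posterior, key=posterior.get)
print(most_probable, posterior)
```

<p>With these made-up inputs the observed activation pattern is three times as likely under &#8220;learning&#8221; as under &#8220;resting,&#8221; so the posterior favors learning. Real priors matter a great deal here, which is one reason reverse inference deserves caution.</p>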
<p>Similarly, in late 2010, my wife (<a href="http://www.voytekdesign.com/">Jessica Voytek</a>) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.</p>
<p>How many undergrads would I need to hire to read through that many papers? Any volunteers?</p>
<p>Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual <a href="http://www.sfn.org/">Society for Neuroscience</a> conference. If we assume that only two-thirds of those people actually <em>do</em> research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that&#8217;s around 40 <em>million</em> person-hours dedicated to but one branch of the sciences.</p>
<p>Annually.</p>
<p>This means that in the 10 years I&#8217;ve been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.</p>
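<p>For anyone who wants to check that back-of-the-envelope figure, here is the arithmetic spelled out. The 50 working weeks per year is an assumption of this sketch; the other numbers come from the paragraphs above.</p>

```python
# Figures from the text.
attendees = 30_000           # annual conference attendance
research_fraction = 2 / 3    # share assumed to actually do research
hours_per_week = 40

# Assumption (not stated in the text): 50 working weeks per year.
weeks_per_year = 50

researchers = round(attendees * research_fraction)
annual_person_hours = researchers * hours_per_week * weeks_per_year
decade_person_hours = annual_person_hours * 10

print(annual_person_hours)
print(decade_person_hours)
```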
<p>So my wife and I said to ourselves, &#8220;there has to be a better way&#8221;.</p>
<p>Which led us to create <a href="http://www.brainscanr.com/Search?term_a=Alzheimer%27s+disease">brainSCANr</a>, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.</p>
<p>For example, if 10,000 papers mention &#8220;Alzheimer&#8217;s disease&#8221; that <em>also</em> mention &#8220;dementia,&#8221; then Alzheimer&#8217;s disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer&#8217;s <em>and</em> dementia, whereas there are only 14 papers that mention Alzheimer&#8217;s and, for example, creativity.</p>
<p>From this, we built what we&#8217;re calling the &#8220;cognome&#8221;, a mapping between brain structure, function, and disease.</p>
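<p>The co-occurrence idea can be sketched in a few lines. This is a simplified illustration, not the brainSCANr code: the four &#8220;abstracts&#8221; are invented, the substring matching is naive, and the Jaccard-style ratio is just one assumed way to turn raw counts into an association score.</p>

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count how many documents mention each term and each pair of terms."""
    term_counts = Counter()
    pair_counts = Counter()
    for text in documents:
        lowered = text.lower()
        present = sorted(t for t in terms if t in lowered)
        term_counts.update(present)
        pair_counts.update(combinations(present, 2))
    return term_counts, pair_counts


def jaccard(term_a, term_b, term_counts, pair_counts):
    """Shared papers divided by papers mentioning either term."""
    both = pair_counts[tuple(sorted((term_a, term_b)))]
    either = term_counts[term_a] + term_counts[term_b] - both
    return both / either if either else 0.0


# Invented toy abstracts, for illustration only.
docs = [
    "Dementia progression in Alzheimer's disease",
    "Alzheimer's disease and dementia risk factors",
    "Creativity in healthy aging",
    "Dementia care outcomes",
]
terms = ["alzheimer's disease", "dementia", "creativity"]

term_counts, pair_counts = cooccurrence_counts(docs, terms)
strong = jaccard("alzheimer's disease", "dementia", term_counts, pair_counts)
weak = jaccard("alzheimer's disease", "creativity", term_counts, pair_counts)
print(strong, weak)
```

<p>Normalizing by how often each term appears on its own keeps very common terms from looking associated with everything.</p>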
<p>Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: <a href="http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html">text mining recipes to find cultural food taste preferences</a>, <a href="http://www.sciencemag.org/content/early/2010/12/15/science.1199644">analyzing cultural trends via word use in books (&#8220;culturomics&#8221;)</a>, <a href="http://www.sciencemag.org/content/333/6051/1878.abstract">identifying seasonality of mood from tweets</a>, and so on.</p>
<p>But so what?</p>
<h2>Deep data</h2>
<p>What those three studies show us is that it&#8217;s possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.</p>
<p>My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we&#8217;re calling &#8220;semi-automated hypothesis generation,&#8221; which is predicated on a basic &#8220;the friend of a friend should be a friend&#8221; concept.</p>
<p>In the example below, the neurotransmitter &#8220;serotonin&#8221; has thousands of shared publications with &#8220;migraine,&#8221; as well as with the brain region &#8220;striatum.&#8221; However, migraine and striatum only share 16 publications.</p>
<p class="image-box-580">
<a href="http://darb.ketyov.com/professional/publications/VoytekVoytek-brainSCANr-hypotheses.jpg"><img border="0" src="http://darb.ketyov.com/professional/publications/VoytekVoytek-brainSCANr-hypotheses.jpg" width="580" /></a></p>
<p>That&#8217;s very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?</p>
<p>Perhaps there&#8217;s a missing connection?</p>
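<p>The &#8220;friend of a friend&#8221; search can be sketched as follows. This is an assumed, simplified version rather than our published algorithm: the <code>strong</code> and <code>weak</code> thresholds are arbitrary, and the two serotonin counts are placeholders standing in for &#8220;thousands&#8221; (only the 16 shared migraine and striatum papers comes from the example above).</p>

```python
def missing_links(shared, strong=1000, weak=50):
    """Find term pairs that share a strong neighbour but few papers directly.

    shared maps frozenset({a, b}) to the number of papers mentioning both.
    The strong and weak thresholds are arbitrary choices for this sketch.
    """
    terms = sorted({t for pair in shared for t in pair})

    def count(a, b):
        return shared.get(frozenset((a, b)), 0)

    candidates = []
    for i, a in enumerate(terms):
        for c in terms[i + 1:]:
            if count(a, c) >= weak:
                continue  # already well studied together
            bridges = [b for b in terms
                       if b not in (a, c)
                       and count(a, b) >= strong
                       and count(b, c) >= strong]
            if bridges:
                candidates.append((a, c, bridges))
    return candidates


shared = {
    frozenset(("serotonin", "migraine")): 2000,   # placeholder for "thousands"
    frozenset(("serotonin", "striatum")): 3000,   # placeholder for "thousands"
    frozenset(("migraine", "striatum")): 16,      # figure from the text
}

hypotheses = missing_links(shared)
print(hypotheses)
```

<p>Each result is a pair of weakly connected terms plus the well-connected term that bridges them, which is a prompt for a human to investigate and not a finding in itself.</p>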
<p>Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren&#8217;t the only stories that our data can tell us.</p>
<p>For example, in my geoanalytics work as the data evangelist for <a href="https://www.uber.com/">Uber</a>, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to <a href="http://blog.uber.com/2012/01/09/uberdata-san-franciscomics/">figure out how people move from neighborhood to neighborhood in San Francisco</a>.</p>
<p class="image-box-580">
<a href="http://blog.uber.com/wp-content/uploads/2012/01/UberSanFranciscomics.jpg"><img border="0" src="http://blog.uber.com/wp-content/uploads/2012/01/UberSanFranciscomics.jpg" width="580" /></a></p>
<p>At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.</p>
<p>No big deal.</p>
<p>But what&#8217;s cool was seeing where the outliers were. When I looked at the model&#8217;s residuals, <em>that&#8217;s</em> where I found the far more interesting story. While it&#8217;s good to have a model that fits your data, knowing where the model <em>breaks down</em> is not only important for internal metrics, but it also makes for a more interesting story:</p>
<div align="center">
<p class="image-box-400"><a href="http://blog.uber.com/wp-content/uploads/2012/01/UberWeekendGender.jpg"><img border="0" src="http://blog.uber.com/wp-content/uploads/2012/01/UberWeekendGender.jpg" width="400" /></a></p>
</div>
<p>What&#8217;s happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?</p>
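<p>Looking for outliers in the residuals takes very little code. The sketch below fits an ordinary least squares line and flags the neighborhood that the model misses by the most. The neighborhoods &#8220;A&#8221; through &#8220;E&#8221; and their trip counts are invented for illustration; they are not Uber data.</p>

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Observed value minus the fitted line's prediction, per point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Invented trip counts for five hypothetical neighborhoods.
names = ["A", "B", "C", "D", "E"]
men = [100, 200, 300, 400, 500]
women = [100, 200, 450, 400, 500]

res = residuals(men, women)
outlier = max(zip(names, res), key=lambda pair: abs(pair[1]))[0]
print(outlier, res)
```

<p>The fit itself is unremarkable; the neighborhood with the largest residual is where the question worth asking lives.</p>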
<h2>The paradox of information</h2>
<p>The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that&#8217;s where AT&amp;T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.</p>
<p>While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don&#8217;t fall prey to <em>ad hoc</em>, <a href="http://en.wikipedia.org/wiki/Just-so_story">just-so stories</a> that sound perfectly reasonable and plausible, but which we cannot conclusively prove.</p>
<p>In 2008, psychologists David McCabe and Alan Castel published a paper in the journal &#8220;Cognition,&#8221; titled, <a href="http://www.sciencedirect.com/science/article/pii/S0010027707002053">&#8220;Seeing is believing: The effect of brain images on judgments of scientific reasoning&#8221;</a>. In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.</p>
<p>This should cause any data scientist serious concern. In fact, I&#8217;ve formulated <a href="http://en.wikipedia.org/wiki/Clarke's_three_laws">three laws</a> of statistical analyses:</p>
<ol>
<li> The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.</li>
<li> The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.</li>
<li> Any sufficiently advanced statistics can trick people into believing the results reflect truth.</li>
</ol>
<p>The first law is closely related to the &#8220;bike shed effect&#8221; (also known as <a href="http://en.wikipedia.org/wiki/Parkinson's_Law_of_Triviality">Parkinson&#8217;s Law of Triviality</a>) which states that, &#8220;the time spent on any item of the agenda will be in inverse proportion to the sum involved.&#8221;</p>
<p>In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant &mdash; a project so vast and complicated that most people can&#8217;t understand it &mdash; people will defer to expert opinion.</p>
<p>Such is the case with statistics.</p>
<p>If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, &#8220;correlation does not equal causation.&#8221;</p>
<p>We&#8217;ll go ahead and call that truism Voytek&#8217;s fourth law.</p>
<p>But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.</p>
<p>But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?</p>
<p>The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson &#8220;automated science&#8221; research in an episode titled <a href="http://www.radiolab.org/2010/apr/05/limits-of-science/">&#8220;Limits of Science&#8221;</a>. It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?</p>
<p>Well sometimes the stories we tell with data &#8230; they just don&#8217;t make sense to us.</p>
<p>They found, &#8220;two equations that describe the data.&#8221;</p>
<p>But they didn&#8217;t know what the equations <em>meant</em>. They had no context. Their variables had no meaning. Or, as Radiolab co-host <a href="https://twitter.com/jadabumrad">Jad Abumrad</a> put it, &#8220;the more we turn to computers with these big questions, the more they&#8217;ll give us answers that we just don&#8217;t understand.&#8221;</p>
<p>So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many <em>things</em> without knowing what those &#8220;things&#8221; are.</p>
<p>Because at some point, we&#8217;ll have so much data that we&#8217;ll stop being able to discern the <a href="http://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">map from the territory</a>. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.</p>
<p>Recently, <a href="https://twitter.com/stephen_wolfram">Stephen Wolfram</a> released the results of a 20-year-long experiment in personal data collection, including every keystroke he&#8217;s typed and every email he&#8217;s sent. <a href="http://www.npr.org/blogs/krulwich/2012/03/21/149095154/mirror-mirror-on-the-wall-do-the-data-tell-it-all">In response</a>, <a href="http://www.npr.org/people/5194672/robert-krulwich">Robert Krulwich</a>, the other co-host of Radiolab, concludes by saying &#8220;I&#8217;m looking at your data [Dr. Wolfram], and you know what&#8217;s amazing to me? How much of you is missing.&#8221;</p>
<p>Personally, I disagree; I believe that there&#8217;s a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. <a href="http://en.wikiquote.org/wiki/Carl_Sagan">Quoth</a> Dr. Sagan:</p>
<blockquote><p>&#8220;It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works &mdash; that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it.&#8221;</p>
</blockquote>
<div align="center">
<p class="image-box-300">
<a href="http://upload.wikimedia.org/wikipedia/commons/thumb/7/71/PaleBlueDot.jpg/300px-PaleBlueDot.jpg"><img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/71/PaleBlueDot.jpg/300px-PaleBlueDot.jpg" width="300" /></a></p>
</div>
<p>So go forth and create beautiful stories, my statistical friends. See you after peer-review.</p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2012/03/neuroscience-uber-bradley-voytek.html">Why Uber&#8217;s data fascinates a neuroscientist</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2012/03/data-science-deep-data-information-paradox.html/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
