<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>O&#039;Reilly Radar &#187; Pete Warden</title>
	<atom:link href="http://radar.oreilly.com/petew/feed" rel="self" type="application/rss+xml" />
	<link>http://radar.oreilly.com</link>
	<description>Insight, analysis, and research about emerging technologies</description>
	<lastBuildDate>Tue, 18 Jun 2013 18:59:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>How to create a visualization</title>
		<link>http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html</link>
		<comments>http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html#comments</comments>
		<pubDate>Mon, 13 Feb 2012 16:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@editpick]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[@top]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[visualization process]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2012/02/how-to-create-visualization-facebook-vacation.html</guid>
		<description><![CDATA[Creating a visualization requires more than just data and imagery. Pete Warden outlines the process and actions that drove his new Facebook visualization project. ]]></description>
				<content:encoded><![CDATA[<p>Over the last few years I&#8217;ve created a <a href="http://andrewsullivan.thedailybeast.com/2011/12/a-globe-of-window-views.html">few</a> <a href="http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html">popular</a> <a href="http://www.readwriteweb.com/archives/the_inner_circles_of_10_geek_heroes_on_twitter.php">visualizations</a>, <a href="http://petewarden.typepad.com/searchbrowser/2011/11/lessons-from-failed-visualizations.html">a lot of duds</a>, and I&#8217;ve learned a few lessons along the way. For <a href="https://www.jetpac.com/top">my latest analysis of where Facebook users go on vacation</a>, I decided to document the steps I follow to build my visualizations. It&#8217;s a very rough guide; these are just stages I&#8217;ve learned to follow by trial and error. Still, following these guidelines is a good way to start if you&#8217;re looking to create your first visualization.</p>
<h2>Play with your data</h2>
<p>I was lucky enough to spend a few hours with <a href="http://www.weigend.com/">Andreas Weigend</a> recently, head of the Stanford Social Data lab. He has nine rules of data, and the first is &#8220;Start with the problem, not the data.&#8221; What struck me about visualizations is that I actually take the opposite approach. I find the only way to begin is to explore what information is available and get a feeling for what stories it can tell.</p>
<p>In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I&#8217;ve been running Pig analytics jobs regularly to get a view of what we have in there. One of the reports we generate is a count of how many photos and users we have for particular places:</p>
<p class="image-box-580"><a href="http://radar.oreilly.com/assets_c/2012/02/howviz0.html"><img src="http://s.radar.oreilly.com/2012/02/12/1-0212-howviz0-580.png" width="580" border="0" alt="Data source example" style="margin-bottom: 15px" /></a><br /><em><a href="http://radar.oreilly.com/assets_c/2012/02/howviz0.html">Click to enlarge.</a></em></p>
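<p>As a rough illustration of that kind of report, here is a minimal Python sketch that counts photos and distinct users per place. It is not the Pig job described above; the record layout, the sample data, and the <code>summarize_places</code> name are all assumptions made for the example.</p>

```python
from collections import defaultdict

# Hypothetical (user_id, place) records, one per shared photo.
photos = [
    ("u1", "Paris"),
    ("u1", "Paris"),
    ("u2", "Paris"),
    ("u2", "Las Vegas"),
    ("u3", "Las Vegas"),
]


def summarize_places(records):
    """Return {place: (photo_count, distinct_user_count)}."""
    photo_counts = defaultdict(int)
    users = defaultdict(set)
    for user_id, place in records:
        photo_counts[place] += 1
        users[place].add(user_id)  # a set keeps each user once per place
    return {
        place: (photo_counts[place], len(users[place]))
        for place in photo_counts
    }


report = summarize_places(photos)
for place, (photo_count, user_count) in sorted(report.items()):
    print(place, photo_count, user_count)
```

<p>Keeping the photo count and the user count separate matters, because a single prolific photographer can otherwise make a place look more popular than it is.</p>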
<p>I was chatting with my colleague <a href="http://twitter.com/drtriumph">Chris Raynor</a> about this, and he asked me if we could tell where all the visitors to those places were coming from. This was something that had been at the back of my mind for a long time. Seeing how much information we had on each destination made me realize we had enough data to produce significant and meaningful answers.</p>
<p>When I was learning engineering, one of my favorite case studies was <a href="http://www.adammikeal.com/courses/ht/files/readings/p311-mackay.pdf">an investigation into an air-traffic control system</a>. Software engineers couldn&#8217;t understand why fully-computerized control rooms were actually less efficient and safe than more old-fashioned sites. What the researchers discovered was that the old process of passing around and arranging small cards that each represented a plane gave controllers a much stronger awareness of the situation than a screen that didn&#8217;t require their involvement for tasks, such as handing an aircraft to a colleague. I think the same is true of data. The more time you spend manipulating and examining the raw information, the more you understand it at a deep level. Knowing your data is the essential starting point for any visualization.</p>
<h2>Pick a question</h2>
<p>Now that I had a rough idea for what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization. I actually messed this up on one early map I created, giving the blog post the title &#8220;<a href="http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html">How to split up the US</a>.&#8221; Everyone subsequently described it as &#8220;The Five Nations of Facebook.&#8221; Since then, I&#8217;ve tried very hard to pick the most natural title for what I&#8217;m going to be presenting, and then ensure I can deliver on the promise of the headline.</p>
<p>In this case I had a clear idea of the question at the start: &#8220;Where do people go on vacation?&#8221; However, as I thought about it, I realized it needed to be a lot more specific and concrete. There are already a lot of &#8220;top travel destinations&#8221; lists out there, so what made mine different? It was the use of Facebook to gather much richer and more detailed information, so I refined it to &#8220;Where do Facebook users go on vacation?&#8221;</p>
<h2>Sketch out your presentation</h2>
<p>I now had the data and a question I wanted to answer. The next step was figuring out how to show the information in a visual form. I&#8217;m in love with network diagrams showing connections between thousands of objects, but so often they are completely baffling to the rest of the world. I still remember <a href="https://twitter.com/davidcohen">David Cohen</a> threatening to strangle me if I showed him another one of &#8220;those damn spider webs&#8221; instead of a business plan. However, network diagrams are a good way of hinting at how much data is available for querying; they can really give an idea of the sheer scale of what&#8217;s there.</p>
<p>One of my favorite recent visualizations was <a href="https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919">Paul Butler&#8217;s map of friendships on Facebook</a>, so I decided to use that as a visual reference:</p>
<p class="image-box-580"><a href="http://www.facebook.com/note.php?note_id=469716398919"><img src="http://s.radar.oreilly.com/2012/02/12/2-0212-paulbutlermap-580.png" width="580" border="0" alt="Paul Butler's Visualizing Friendships visualization" style="margin-bottom: 15px" /></a><br /><em><a href="http://www.facebook.com/note.php?note_id=469716398919">See the full version of Paul Butler&#8217;s &#8220;Visualizing Friendships&#8221; visualization.</a></em></p>
<p>I borrowed a couple of key ideas from his work: the general color palette of the blue lines on a dark background and the use of great circles to create flowing arcs for all connections.</p>
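<p>For readers who have not drawn great-circle arcs before, the idea can be sketched in a few lines of Python: convert both endpoints to unit vectors, interpolate along the sphere, and convert back to latitude and longitude. This is a generic spherical interpolation written for illustration, not code from either visualization; the function names and the <code>steps</code> parameter are assumptions.</p>

```python
import math


def to_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def great_circle_points(start, end, steps=32):
    """Return steps + 1 (lat, lon) points along the great circle.

    start and end are (lat, lon) tuples in degrees. Exactly antipodal
    endpoints have no unique arc and are not handled by this sketch.
    """
    a, b = to_xyz(*start), to_xyz(*end)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    omega = math.acos(dot)  # angle between the two points
    if omega < 1e-12:
        return [start] * (steps + 1)  # identical endpoints
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Spherical linear interpolation weights.
        wa = math.sin((1 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
        x, y, z = (wa * p + wb * q for p, q in zip(a, b))
        lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points


# Example: three points along the equator, from 0 to 90 degrees east.
arc = great_circle_points((0.0, 0.0), (0.0, 90.0), steps=2)
```

<p>Each list of points can then be projected onto the map and drawn as a connected line, which is what gives long-distance connections their flowing curve.</p>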
<p>As I thought about the presentation, I realized that I had to simplify what it would be showing. With sources and destinations plotted all over the world, both the visual look and the querying interface would be overwhelming. Our user-base is primarily American thanks to our reliance on English-only natural language processing, so with that in mind I decided to make life simpler by only showing data from people who lived in the U.S. Accordingly, I changed the question in my title to &#8220;Where do American Facebook users go on vacation?&#8221;.</p>
<p>While I&#8217;m mostly presenting this as a linear, waterfall process, what I&#8217;ve just described is a good example of how iterative cycles drive the real workflow. It&#8217;s hard to know how well a lot of things will work until you try them. As long as you&#8217;re still making some progress, don&#8217;t worry if you find yourself going in circles.</p>
<h2>Crunch the data</h2>
<p>If you know your data, and you have a good idea of the question you&#8217;re trying to answer, this should be the simplest stage. You&#8217;ll hopefully have a clear set of requirements and it&#8217;s just a matter of executing the right queries over your data.</p>
<p>In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was running into issues with some of the joins. The hard part, running the Hadoop job to gather the raw data from our Cassandra cluster, worked, so I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.</p>
<p>The next stage was turning the raw information into a form that could be displayed. For example, I needed to take all of the user locations from the unstructured text strings that Facebook gave me, and convert them into latitude-longitude coordinates for plotting on a map. For this sort of work I usually turn to a general-purpose scripting language, and most of <a href="https://www.jetpac.com/">Jetpac</a> is already written in Ruby, so that was an easy choice. I wrote a script that walked through the data, using the <a href="http://www.datasciencetoolkit.org/">Data Science Toolkit</a> to match coordinates with names, and then output it into a file containing a JSON array of all the information.</p>
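<p>The shape of that conversion step can be sketched as follows. This is a Python illustration rather than the Ruby script described above, and it swaps the Data Science Toolkit for a tiny hard-coded lookup table so that it runs on its own; the <code>GAZETTEER</code> table, its approximate coordinates, and the output field names are all assumptions.</p>

```python
import json

# Hypothetical lookup table standing in for a real geocoding service.
# Coordinates are approximate and only here for illustration.
GAZETTEER = {
    "west hollywood, california": (34.0900, -118.3617),
    "boulder, colorado": (40.0150, -105.2705),
}


def geocode(location_text):
    """Return (lat, lon) for a free-text location, or None if unknown."""
    key = " ".join(location_text.lower().split())  # normalize case and spacing
    return GAZETTEER.get(key)


def build_records(raw_locations):
    """Turn free-text locations into a list of plottable records."""
    records = []
    for text in raw_locations:
        coords = geocode(text)
        if coords is None:
            continue  # skip anything that cannot be placed on the map
        lat, lon = coords
        records.append({"name": text.strip(), "lat": lat, "lon": lon})
    return records


records = build_records(
    ["West Hollywood,  California", "Atlantis", " Boulder, Colorado "]
)
json_text = json.dumps(records, indent=2)  # a JSON array, ready to write to a file
```

<p>Dropping unmatched strings silently is the simplest choice for a sketch; in practice it is worth counting them, since a high miss rate usually means the normalization needs work.</p>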
<h2>Build an interface</h2>
<p>A lot of the best visualizations have no interactivity. They just tell a story with a static image. That&#8217;s why it&#8217;s worth considering whether you need an interface at all. I actually had the interactive site that I used to create the &#8220;Five Nations of Facebook&#8221; visualization up for several weeks before that post, and nobody used it because it was too confusing. It was only when I boiled it down into a single picture with labels that it became a hit.</p>
<p>My problem is that I want other people to have as much fun exploring the data as I&#8217;ve had, so I couldn&#8217;t resist adding some interaction to the vacation visualization. I still wanted to retain the immediate visual appeal of a static image, so I decided to create a background showing the full data to introduce the visualization at a first glance, and then overlay an interactive foreground once the user started exploring it more deeply.</p>
<p>In most cases you&#8217;re better off using one of the excellent off-the-shelf visualization frameworks like <a href="http://mbostock.github.com/d3/">D3</a>. Since I needed something client-side for interaction, and was working with both geographic and network rendering, I couldn&#8217;t find anything that met my requirements. Instead I cannibalized one of my own projects, <a href="https://github.com/petewarden/openheatmap/blob/master/static/scripts/jquery.openheatmap.js">the jQuery component from OpenHeatMap</a>, and combined it with HTML5 canvas rendering to produce a custom JavaScript renderer. I used it to pre-render a background containing all the possible connections between home towns and travel destinations, and saved that off as a static image. That&#8217;s useful to save rendering time on page load, and lets me fall back to a static visualization on older browsers that don&#8217;t support Canvas.</p>
<p class="image-box-580"><a href="http://radar.oreilly.com/assets_c/2012/02/topbackground.html"><img src="http://s.radar.oreilly.com/2012/02/12/3-0212-topbackground-580.png" width="580" border="0" alt="Background image of Facebook vacation visualization" style="margin-bottom: 15px" /></a><br /><em><a href="http://radar.oreilly.com/assets_c/2012/02/topbackground.html">Click to enlarge.</a></em></p>
<p>I then tied in rendering the connections of any places that the user was hovering their cursor over, so that they could quickly get a feel for the relationships expressed in the data. I also wanted to display the details underlying the picture, so to drill down I added a dialog listing the raw statistics about a place. Users can bring this dialog up by clicking.</p>
<p class="image-box-580"><a href="http://radar.oreilly.com/assets_c/2012/02/topdialog.html"><img src="http://s.radar.oreilly.com/2012/02/12/4-0212-topdialog-580.png" width="580" border="0" alt="Facebook vacation visualization dialog box" style="margin-bottom: 15px" /></a><br /><em><a href="http://radar.oreilly.com/assets_c/2012/02/topdialog.html">Click to enlarge.</a></em></p>
<p>One problem with that interaction is that a lot of different cities are in a very small area, so it becomes extremely difficult to pick the one you want with the mouse cursor. To make that a little better, I prioritized the most popular U.S. cities so that in case of a conflict, they&#8217;re chosen over their smaller neighbors. I realized I also needed to add a search box. Thankfully we&#8217;re heavy users of <a href="http://twitter.github.com/bootstrap/">Twitter&#8217;s Bootstrap framework</a>, so it was a simple matter to add a search field and tie it in with Twitter&#8217;s excellent autocomplete component.</p>
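<p>The conflict rule can be sketched like this: gather every city within a small radius of the cursor, then pick the most popular one. The sketch is in Python for brevity rather than the JavaScript the renderer uses, and the city list, pixel coordinates, visitor counts, and <code>radius</code> default are all invented for the example.</p>

```python
import math

# Hypothetical cities: (name, x, y, visitor_count), positions in screen pixels.
CITIES = [
    ("Los Angeles", 100.0, 200.0, 50000),
    ("West Hollywood", 102.0, 198.0, 4000),
    ("San Diego", 120.0, 240.0, 20000),
]


def pick_city(cursor_x, cursor_y, cities, radius=8.0):
    """Return the name of the most popular city near the cursor, or None."""
    candidates = [
        city
        for city in cities
        if math.hypot(city[1] - cursor_x, city[2] - cursor_y) <= radius
    ]
    if not candidates:
        return None
    # On a conflict, popularity wins over distance to the cursor.
    return max(candidates, key=lambda city: city[3])[0]


# The cursor sits exactly on West Hollywood, but Los Angeles is also in range.
hovered = pick_city(102.0, 198.0, CITIES)
```

<p>The trade-off is visible in the example: a smaller neighbor inside the radius of a big city can never be picked with the mouse, which is why a search box is needed as well.</p>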
<h2>Find the surprises!</h2>
<p>I build these visualizations so I can explore them myself, so my favorite part of the whole process is the chance to sit and play with the results. There are always unexpected stories hidden in there, and I love uncovering them. For example, who knew that the city that had the most visitors to Paris was West Hollywood? When I lived in Los Angeles I used to love popping by the wonderful patisseries. Now I know why they&#8217;re so good! These little details are the stories that catch people&#8217;s imagination and cause them to spread the word, so think about writing a few of them up to help visitors understand what the page can tell them.</p>
<p>You&#8217;ll never know whether one of your visualizations will become popular ahead of time, but the real reward is enjoying your own work. I hope this short guide gives you some ideas for visualizations you want to build. I look forward to seeing what you come up with.</p>
<p class="image-box-580"><a href="https://www.jetpac.com/top"><img src="http://s.radar.oreilly.com/2012/02/12/5-0212-topshot1-580.png" width="580" border="0" alt="See the full Facebook vacation visualization" style="margin-bottom: 15px" /></a><br /><em><a href="https://www.jetpac.com/top">See the full visualization.</a></em></p>
<div style="float: left;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px;clear: both"><a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-pete-warden-visualization-how-to"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/2011-strata-ca-promo.png" /></a><a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-pete-warden-visualization-how-to"><strong>Strata 2012</strong></a> &mdash;  The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.</p>
<p><a href="https://en.oreilly.com/strata2012/public/regwith/radar20?cmp=il-radar-st12-pete-warden-visualization-how-to"><strong>Save 20% on registration with the code RADAR20</strong></a></div>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/01/visualization-facebook-friendships.html">Visualization deconstructed: Mapping Facebook&#8217;s friendships</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/visualization-mapping-america.html">Visualization deconstructed: New York Times &#8220;Mapping America&#8221;</a></li>
<li> <a href="http://radar.oreilly.com/2011/10/animated-geo-data.html">Visualization deconstructed: Why animated geospatial data works</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>3 ideas you should steal from HubSpot</title>
		<link>http://radar.oreilly.com/2011/06/hubspot-data-products-marketing-customers.html</link>
		<comments>http://radar.oreilly.com/2011/06/hubspot-data-products-marketing-customers.html#comments</comments>
		<pubDate>Tue, 14 Jun 2011 13:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[customer service]]></category>
		<category><![CDATA[data product]]></category>
		<category><![CDATA[data service]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/06/hubspot-data-products-marketing-customers.html</guid>
		<description><![CDATA[HubSpot&apos;s location (near Boston) and its target market (small businesses) may keep it under the radar of Silicon Valley, but the company&apos;s approach to data products and customer empowerment are worthy of attention. ]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.hubspot.com/"><img src="http://s.radar.oreilly.com/2011/06/06/0611-hubspot.png" width="184" border="0" alt="HubSpot" style="float: right;margin: 3px 0 10px 10px" /></a>I&#8217;ve been following <a href="http://twitter.com/dharmesh">Dharmesh Shah&#8217;s</a> <a href="http://onstartups.com/">OnStartups blog</a> for years and I remember when he announced <a href="http://hubspot.com/">HubSpot</a>, the company he was starting. I&#8217;ve been fascinated to watch it grow and grow, so I was excited when I got to visit their offices a few months ago. Just after my visit they <a href="http://www.hubspot.com/blog/bid/10491/Sequoia-Google-Ventures-and-Salesforce-com-Invest-32-Million-in-HubSpot">closed a Series D funding round</a> for $32 million from Sequoia, Google Ventures and Salesforce.com, but despite its success almost nobody in the technology world has heard of HubSpot. I blame the combination of a location in Boston and a mainstream customer-base of small business owners for the lack of recognition. It&#8217;s a shame because there&#8217;s a lot to learn from their technology and process &mdash; they&#8217;ve solved some hard problems in thought-provoking ways.</p>
<h2>People are fascinated by mirrors</h2>
<p>There&#8217;s a good chance you&#8217;ve used their <a href="http://twittergrader.com/">Twitter Grader</a> tool, and its popularity shows one of the secrets to HubSpot&#8217;s success. The inspiration for the company came when Dharmesh realized that his own blog was driving a lot of traffic, and the startups he was helping out were all struggling to get anywhere near the same number of visitors. He built HubSpot by applying what he&#8217;d learned from blogging, and one of the key lessons was that people crave new information about their own lives and projects. If you can create a service that gives people interesting data about themselves and their organizations, they&#8217;ll spend time exploring it and they&#8217;ll share it with their friends.</p>
<p>With Twitter Grader, Dharmesh didn&#8217;t just create a source of free advertising for his company; the tool is also implicitly targeted at people who want to improve their presence on the social network. Many of these people will be the small business owners that are in his target market. Even better, by offering the statistics as a gift to users, he created a small sense of reciprocal obligation that will make them more likely to purchase his services. The approach started with their original <a href="http://websitegrader.com/">Website Grader</a> service, but they found it so powerful that HubSpot now has a whole range of <a href="http://www.hubspot.com/marketing-tools/">similar free tools</a> for analyzing everything from your Facebook page to your blog.</p>
<p>The lesson for me is that giving people data and visualizations about things they truly care about can be a powerful tool for drawing them in to your service. Do some creative thinking about your customers&#8217; problems, and see if there&#8217;s something you can offer them as a reward for their attention.</p>
<div style="height: 160px;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px"><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-hubspot"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/oscon-data-code-os11rad.png" /></a><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-hubspot"><strong>OSCON Data 2011</strong></a>, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with <a href="http://www.oscon.com/oscon2011?cmp=il-radar-os11-strataweek-060211">OSCON</a>.)</p>
<p><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-hubspot"><strong>Save 20% on registration with the code OS11RAD</strong></a></div>
<h2>You should kill unicorns and rainbows with science</h2>
<p>One of the most enjoyable conversations I had at HubSpot was with <a href="http://danzarrella.com">Dan Zarrella</a>, who describes himself as a social media scientist. I can already hear some physics PhDs grinding their teeth, but Dan has earned that title by applying a lot of much-needed rigor to the fluffy world of social media measurements. He&#8217;s crusading against &#8220;<a href="http://danzarrella.com/need-to-justify-social-media-use-real-numbers-about-real-money.html">unicorns and rainbows</a>&#8221; metrics that have no connection to the goals you want to achieve. Many businesses have focused on building up easy-to-measure numbers like fan or follower counts, but to use Eric Ries&#8217; term, those are just <a href="http://www.startuplessonslearned.com/2009/12/why-vanity-metrics-are-dangerous.html">vanity metrics</a>. You can gain a million friends without it leading to a penny in revenue.</p>
<p>Dan&#8217;s antidote is the relentless application of logic and analysis, working backwards from the business goals to evaluate everything you&#8217;re doing as objectively as possible. A fantastic example of this is <a href="http://danzarrella.com/all-about-retweets">his study looking at how minor content details, like punctuation, make a retweet more or less likely</a>. It&#8217;s possible to argue with particular conclusions he draws, but he&#8217;s transparently laid out the methods by which he arrived at them. Anybody with some technical knowledge and access to a decent chunk of Twitter data can try to reproduce and refine his results. This makes the report so much more useful than the opinions or impressions that dominate most discussions of social media, since we can actually have an evidence-based argument about it.</p>
<p>I came away from talking with Dan with a new appreciation of how powerful the scientific method can be in even the most unlikely situations. I&#8217;ll be taking a fresh look at some of the painful problems my projects are hitting, and seeing if there&#8217;s some way I can gather the right data to gain insights, even if they seem hopelessly qualitative at first glance.</p>
<h2>User education is painful but powerful</h2>
<p>HubSpot focuses on the sort of people who used to buy ads in the Yellow Pages to promote their businesses. These people know they now need to use the Internet to reach customers, but they aren&#8217;t sure how. To succeed, HubSpot has to help those people build useful websites and channels. Templates and other automated tools help, but a lot still comes down to people creating the right content for their own businesses and responding appropriately when customers get in touch through Twitter, Facebook or email. The only way to achieve that is to teach people how to do it, and so a lot of the company&#8217;s resources are put into education.</p>
<p>On a simple level, tools like HubSpot&#8217;s graders offer simple suggestions for improving websites and other content. Users of the service are sent regular emails that remind them of steps and actions they need to take, such as updating their blogs. HubSpot hosts a <a href="http://www.hubspot.tv/marketingupdate/">popular video cast</a> that covers all sorts of tips and horror stories from the last week in social media. All of these efforts really seem to help the company, judging from how enthusiastically users respond to all the material. On a deeper level, it also seems to help build a long-term relationship between the company and its customers, driving real loyalty.</p>
<p>One of the unwritten rules of the consumer technology world is that anything that requires educating users is a losing proposition. Anybody who has looked at their customer acquisition funnel knows how even minor usability problems can drive away vast swaths of people. What&#8217;s different about HubSpot is that their customers are a lot more motivated than your average consumer on the web. They&#8217;re using the service in the hope of actually making more money, so they&#8217;re willing to invest some time. It left me wondering if I should spend more time creating training material for my own projects, rather than always prioritizing interface work to make them easier to use. The people who use them to create content are already investing their own time, so is that perhaps another situation where education would pay off?</p>
<hr />
<p>HubSpot is a smart, practical company that&#8217;s very focused on using the data they&#8217;re gathering to understand what their customers really need. Maybe that&#8217;s precisely because the team isn&#8217;t in the Valley to be distracted by every shiny new idea? No matter what the cause, I&#8217;m grateful that they spent the time to show me what they&#8217;d learned, and I&#8217;m looking forward to applying these ideas to my own work.</p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://oreilly.com/catalog/9781449388485/">The Facebook Marketing Book</a> (Book)</li>
<li> <a href="http://oreilly.com/catalog/9780596806606/">The Social Media Marketing Book</a> (Book)</li>
<li> <a href="http://www.youtube.com/watch?v=nNotnFZCjes">HubSpot&#8217;s Dharmesh Shah discusses inbound marketing at Web 2.0 Expo SF 2010</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/facebook-marketing-tips.html">Pages before ads and other Facebook marketing tips</a></li>
<li> <a href="http://answers.oreilly.com/topic/315-some-of-the-best-tools-for-twitter-statistics/">Some of the best tools for Twitter statistics</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/06/hubspot-data-products-marketing-customers.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Lessons of the Victorian data revolution</title>
		<link>http://radar.oreilly.com/2011/05/victorian-data-lessons.html</link>
		<comments>http://radar.oreilly.com/2011/05/victorian-data-lessons.html#comments</comments>
		<pubDate>Mon, 23 May 2011 13:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data lessons]]></category>
		<category><![CDATA[history]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/05/victorian-data-lessons.html</guid>
		<description><![CDATA[Examples from the Victorian era show that if we&apos;re going to improve the world with data, it&apos;s absolutely essential we stay grounded in reality. ]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/steampunkfrankenstein/3339974377/" title="SteamPunk Frankenstein - By D. Mattocks by SteamPunk Frankenstein, on Flickr"><img src="http://s.radar.oreilly.com/2011/05/16/0511-steampunk.png" border="0" alt="SteamPunk Frankenstein - By D. Mattocks by SteamPunk Frankenstein, on Flickr" style="float: right;margin: 3px 0 10px 10px" /></a>Ken Cukier recently wrote about <a href="http://cukier.wordpress.com/2011/03/06/data-boring-but/">how useful analogies from the past are in explaining the potential of the current data revolution</a>. Science as we know it was consciously created in the 19th century, and in many ways the current wave of data techniques feels like an echo of that first flood of innovations. It&#8217;s fascinating to read histories of the era like &#8220;<a href="http://books.google.com/books?id=JckCvpOQDOoC">The Philosophical Breakfast Club</a>&#8221; and spot the parallels.</p>
<p>Take tides for example. You&#8217;ve probably never worried about the timing or height of the sea, but for Victorian sailors figuring out the tides was a life or death problem. Getting it wrong would mean a slipped schedule at best, or a shipwreck at worst. The only people who could accurately predict the tides were harbor masters, since conditions varied widely across different areas and required patient observation by locals. The harbor masters guarded their knowledge so carefully that even British naval captains had to pay them to get access to the information they needed to dock their vessels!</p>
<p>The harbor masters were data producers with a business model that excluded many potential users because the transaction costs were too high to be worthwhile. Sound familiar? That&#8217;s the state of many of the datasets I wish were openly available, from <a href="http://en.wikipedia.org/wiki/Multiple_Listing_Service#Policies_on_sharing_MLS_data_in_the_USA">real-estate listings</a> to <a href="http://zipboundary.com/">full zip-code boundaries</a>.</p>
<p>The Victorian solution was another familiar face &mdash; crowdsourcing. William Whewell arranged for hundreds of volunteers around the world to measure their local sea levels and send the numbers back to him. He then plotted the times of the tidal maximums on a map to create a visualization called a co-tidal chart. Below is a modern version <a href="http://en.wikipedia.org/wiki/File:M2_tidal_constituent.jpg">from NASA</a>:</p>
<div align="center">
<p class="image-box-480">
<a href="http://en.wikipedia.org/wiki/File:M2_tidal_constituent.jpg"><img src="http://s.radar.oreilly.com/assets_c/2011/05/M2_tidal_constituent-thumb-486x312.jpg" width="480" alt="M2_tidal_constituent.jpg" style="margin-bottom: 15px" /></a><br />
<a href="http://en.wikipedia.org/wiki/File:M2_tidal_constituent.jpg">Click to enlarge</a></p>
</div>
<p>Maps like these, along with more detailed tables, allowed navigators to make their journeys without being ambushed by the tides. This story could be a poster child for our own revolution, with open data fixing a painful real-world problem.</p>
<h2>The limits of data</h2>
<p>What&#8217;s really useful about historical analogies is that you can see how they played out in the long term. The villains of the tidal story were the harbor masters who hoarded their information, but in fact that was only a small part of the value they offered. Despite incredibly detailed maps of every port, we still rely on their descendants to pilot commercial ships into harbor. There&#8217;s a world of knowledge about currents, shifting sand banks and traffic patterns that it hasn&#8217;t been possible to compress into numbers or rules.</p>
<p>The lesson I draw from this is that in many new areas there are some problems that are easy to fix by gathering and applying data, but we need to keep a bit of humility. James Scott&#8217;s &#8220;<a href="http://books.google.com/books?id=W0seMALXWcQC">Seeing Like a State</a>&#8221; looks at the legacy of the Victorian scientific revolution, and shows how the very success of its ideas had a dark side. Creating datasets may help technical people like us to understand problems and propose solutions, but it also means that harbor masters and other people with deep, lived experience of the domains will be overruled. In the 20th century the prestige of the scientific toolkit was used to justify disasters like the collectivization of agriculture, as technocrats around the world wielded numbers to take power away from &#8220;inefficient&#8221; smallholders. Those figures were mostly proven bogus by reality, as plans with no knowledge of conditions on the ground failed when confronted with the wildly variable conditions of soil, weather and pests that farmers had spent a lifetime learning to cope with.</p>
<div style="height: 175px;border-top: thin gray solid;border-bottom: thin gray solid;padding: 20px;margin: 20px 2px"><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-victorian-data"><img style="float: left;border: none;padding-right: 10px" src="http://s.radar.oreilly.com/oscon-data-code-os11rad.png" /></a><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-victorian-data"><strong>OSCON Data 2011</strong></a>, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with <a href="http://www.oscon.com/oscon2011?cmp=il-radar-os11-victorian-data">OSCON</a>.)</p>
<p><a href="https://en.oreilly.com/oscon2011/public/regwith/os11rad?cmp=il-radar-os11-victorian-data"><strong>Save 20% on registration with the code OS11RAD</strong></a></div>
</p>
<h2>The way forward</h2>
</p>
<p>Specialists like us who can understand and interpret data are in a privileged position. Most people have an exaggerated respect for arguments expressed as numbers or visualizations, because they don&#8217;t understand how many assumptions and simplifications go into these creations. It&#8217;s our job to remember that and balance our enthusiasm about the power of our techniques with some humility about their limits. It also makes education and popularization even more important, since we need a <a href="http://radar.oreilly.com/2011/05/data-science-terminology.html">common language</a> to talk with domain specialists, so they can keep our work honest with their own deep knowledge. The Victorian example shows that if we&#8217;re going to improve the world with data, it&#8217;s absolutely essential we stay grounded in reality.</p>
<p><em>Photo: <a href="http://www.flickr.com/photos/steampunkfrankenstein/3339974377/" title="SteamPunk Frankenstein - By D. Mattocks by SteamPunk Frankenstein, on Flickr">SteamPunk Frankenstein &#8211; By D. Mattocks by SteamPunk Frankenstein, on Flickr</a></em></p>
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/05/data-science-terminology.html">Why the term &#8220;data science&#8221; is flawed but useful</a></li>
<li> <a href="http://radar.oreilly.com/2011/05/strataweek-avos-data-weapon-science-papers.html#data-weapon">Data and mathematical intimidation</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/05/victorian-data-lessons.html/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Why you can&apos;t really anonymize your data</title>
		<link>http://radar.oreilly.com/2011/05/anonymize-data-limits.html</link>
		<comments>http://radar.oreilly.com/2011/05/anonymize-data-limits.html#comments</comments>
		<pubDate>Tue, 17 May 2011 13:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/05/anonymize-data-limits.html</guid>
		<description><![CDATA[Because we now have so much data at our disposal, any dataset with a decent amount of information can be matched against identifiable public records. To keep datasets available, we must acknowledge that foolproof anonymization is an illusion. ]]></description>
				<content:encoded><![CDATA[<p>One of the joys of the last few years has been the flood of real-world datasets being released by all sorts of organizations. These usually involve some record of individuals&#8217; activities, so to assuage privacy fears, the distributors will claim that any personally-identifying information (PII) has been stripped. The idea is that this makes it impossible to match any record with the person it&#8217;s recording.</p>
<p>Something that my friend <a href="http://33bits.org/">Arvind Narayanan</a> has taught me, both with theoretical papers and repeated practical demonstrations, is that this anonymization process is an illusion. Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone&#8217;s actions has a good chance of matching identifiable public records. Arvind first showed this when he and a fellow researcher took the &#8220;anonymous&#8221; dataset released as part of the first Netflix prize, and <a href="http://33bits.org/about/netflix-paper-home-page/">demonstrated how they could correlate the movie rentals listed with public IMDB reviews</a>. That let them identify some named individuals, and then gave them access to those individuals&#8217; complete rental histories. More recently, he and his collaborators used the same approach to win a <a href="http://www.kaggle.com/">Kaggle</a> contest by <a href="http://33bits.org/2011/03/09/link-prediction-by-de-anonymization-how-we-won-the-kaggle-social-network-challenge/">matching the topology of the anonymized and publicly crawled versions of the social connections on Flickr</a>. They were able to take two partial social graphs and, like piecing together a jigsaw puzzle, figure out fragments that matched and represented the same users in both.</p>
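<p>To make the cross-referencing idea concrete, here is a minimal sketch of a linkage attack on toy data. It is not the method from the Netflix paper, which scores approximate matches on ratings and dates; this version simply counts exact shared (item, date) events. The function name <code>link_records</code>, the <code>min_overlap</code> threshold and all of the records are made up for illustration.</p>

```python
from collections import defaultdict


def link_records(anonymous, public, min_overlap=3):
    """Match anonymized IDs to public names by counting shared (item, date) events.

    Hypothetical helper for illustration only: real attacks tolerate noisy
    dates and ratings, while this sketch requires exact matches.
    """
    # Index the public records: event -> set of names that posted it
    index = defaultdict(set)
    for name, events in public.items():
        for event in events:
            index[event].add(name)

    matches = {}
    for anon_id, events in anonymous.items():
        # Count how many events each public name shares with this record
        counts = defaultdict(int)
        for event in events:
            for name in index.get(event, ()):
                counts[name] += 1
        if not counts:
            continue
        best = max(counts, key=counts.get)
        top = counts[best]
        tied = [n for n, c in counts.items() if c == top]
        # Accept a match only when the overlap is large enough and unambiguous
        if top >= min_overlap and len(tied) == 1:
            matches[anon_id] = best
    return matches


# Entirely made-up toy data
anonymous = {
    "user_17": {
        ("Film A", "2005-03-01"),
        ("Film B", "2005-03-04"),
        ("Film C", "2005-04-11"),
        ("Film D", "2005-05-02"),
    },
    "user_42": {("Film A", "2005-03-01"), ("Film E", "2005-06-09")},
}
public = {
    "alice": {
        ("Film A", "2005-03-01"),
        ("Film B", "2005-03-04"),
        ("Film C", "2005-04-11"),
    },
    "bob": {("Film A", "2005-03-01"), ("Film F", "2005-07-20")},
}

matches = link_records(anonymous, public)
print(matches)
```

<p>In this toy run only <code>user_17</code> is linked to a name. The fourth rental, which never appeared in any public review, is exposed as a side effect, which is the kind of leak described above.</p>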
<p>All the known examples of this type of identification are from the research world &mdash; no commercial or malicious uses have yet come to light &mdash; but they prove that anonymization is not an absolute protection. In fact, it creates a false sense of security. Any dataset that has enough information on people to be interesting to researchers also has enough information to be de-anonymized. This is important because I want to see our tools applied to problems that really matter in areas like health and crime. This means releasing detailed datasets on those areas to researchers, and those are bound to contain data more sensitive than movie rentals or photo logs. If just one of those sets is de-anonymized and causes a user backlash, we&#8217;ll lose access to all of them.</p>
<p>So, what should we do? Accepting that anonymization is not a complete solution doesn&#8217;t mean giving up; it just means we have to be smarter about our data releases. Below I outline four suggestions.</p>
</p>
<h2>Keep the anonymization</h2>
</p>
<p>Just because it&#8217;s not totally reliable, don&#8217;t stop stripping out PII. It&#8217;s a good first step, and makes the reconstruction process much harder for any attacker.</p>
</p>
<h2>Acknowledge there&#8217;s a risk of de-anonymization</h2>
</p>
<p>Don&#8217;t make false promises to users about how anonymous their data is. Make the case to them that you&#8217;re minimizing the risk and possible harm of any data leaks, sell them on the benefits (either for themselves or the wider world) and get their permission to go ahead. This is a painful slog, but the more organizations that take this approach, the easier it will be. A great model is Reddit, which asked its users to opt in to sharing their data. They <a href="http://www.reddit.com/r/redditdev/comments/bubhl/csv_dump_of_reddit_voting_data/">got a great response</a>.</p>
</p>
<h2>Limit the detail</h2>
</p>
<p>Look at the records you&#8217;re getting ready to open up to the world, and imagine that they can be linked back to named people. Are there parts of it that are more sensitive than others, and maybe less important to the sort of applications you have in mind? Can you aggregate multiple people together into cohorts that represent the average behavior of small groups?</p>
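<p>One way to picture the cohort idea is the rough sketch below, which groups records by a couple of coarse attributes, publishes only a count and an average per group, and drops any group that is too small to hide an individual. The field names, the <code>aggregate_cohorts</code> helper and the <code>min_size</code> threshold are assumptions chosen for illustration, not a recommended standard.</p>

```python
from collections import defaultdict
from statistics import mean


def aggregate_cohorts(records, key_fields, value_field, min_size=3):
    """Replace individual rows with per-cohort averages.

    Hypothetical helper: cohorts with fewer than min_size members are
    suppressed so that no published figure describes just one or two people.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        groups[key].append(rec[value_field])

    return {
        key: {"count": len(values), "mean": mean(values)}
        for key, values in groups.items()
        if len(values) >= min_size
    }


# Made-up example rows
records = [
    {"age_band": "20-29", "region": "north", "visits": 4},
    {"age_band": "20-29", "region": "north", "visits": 6},
    {"age_band": "20-29", "region": "north", "visits": 8},
    {"age_band": "30-39", "region": "south", "visits": 2},
]

cohorts = aggregate_cohorts(records, ["age_band", "region"], "visits")
print(cohorts)
```

<p>The single person in the 30-39/south group disappears from the output entirely. That loses some data, but it is the sort of trade-off this suggestion is asking you to weigh.</p>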
</p>
<h2>Learn from the experts</h2>
</p>
<p>There are many decades of experience of dealing with highly sensitive and personal data in sociology and economics departments across the globe. They&#8217;ve developed <a href="http://www.ihsn.org/home/index.php?q=tools/anonymization/techniques">techniques</a> that could prove useful to the emerging community of data scientists, such as subtle distortions of the information to prevent identification of individuals, or even the sort of locked-down clean-room conditions that are required to access detailed IRS data.</p>
<p>There&#8217;s so much good that can be accomplished using open datasets, it would be a tragedy if we let this slip through our fingers with preventable errors. With a bit of care up front, and an acknowledgement of the challenges we face, I really believe we can deliver concrete benefits without destroying people&#8217;s privacy.</p>
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/05/data-science-terminology.html">Why the term &#8220;data science&#8221; is flawed but useful</a></li>
<li> <a href="http://radar.oreilly.com/2011/04/iphone-tracking-apple-response.html">The iPhone tracking story, one week later</a></li>
<li> <a href="http://radar.oreilly.com/2010/11/open-question-how-much-locatio.html">Open question: How much location information are you willing to share?</a></li>
<li> <a href="http://radar.oreilly.com/2010/08/online-privacy-debates-heat-up.html">Online privacy debates heat up in Washington</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/05/anonymize-data-limits.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Why the term &quot;data science&quot; is flawed but useful</title>
		<link>http://radar.oreilly.com/2011/05/data-science-terminology.html</link>
		<comments>http://radar.oreilly.com/2011/05/data-science-terminology.html#comments</comments>
		<pubDate>Mon, 09 May 2011 10:30:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/05/data-science-terminology.html</guid>
		<description><![CDATA[While formal boundaries and professional criteria for &#34;data science&#34; remain undefined, here&apos;s why we should keep using the term. ]]></description>
				<content:encoded><![CDATA[<p>Mention &#8220;data science&#8221; to a lot of the high-profile people you might think practice it and you&#8217;re likely to see rolling eyes and shaking heads. It has taken me a while, but I&#8217;ve learned to love the term, despite my doubts. The key reason is that the rest of the world understands roughly what I mean when I use it. After years of stumbling through long-winded explanations about what I do, I can now say &#8220;I&#8217;m a data scientist&#8221; and move on. It is still an incredibly hazy definition, but my former descriptions left people confused as well, so this approach is no worse and at least saves time.</p>
<p>With that in mind, here are the arguments I&#8217;ve heard against the term, and why I don&#8217;t think they should stop its adoption.</p>
</p>
<h2>It&#8217;s not a real science</h2>
</p>
<p>I just finished reading &#8220;<a href="http://books.google.com/books?id=JckCvpOQDOoC">The Philosophical Breakfast Club</a>,&#8221; the story of four Victorian friends who created the modern structure of science, as well as inventing the word &#8220;scientist.&#8221; I grew up with the idea that physics, chemistry and biology were the only real sciences and every other subject using the term was just stealing their clothes (&#8220;Anything that needs science in the name is not a real science&#8221;). The book shows that from the beginning the label was never restricted to just the hard experimental sciences. It was chosen to promote a disciplined approach to reasoning that relied on data rather than the poorly-supported logical deductions many contemporaries favored. Data science fits comfortably in this more open tradition.</p>
</p>
<h2>It&#8217;s an unnecessary label</h2>
</p>
<p>To me, it&#8217;s obvious that there has been a massive change in the landscape over the last few years. Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques. We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can&#8217;t communicate.</p>
</p>
<h2>The name doesn&#8217;t even make sense</h2>
</p>
<p>As a friend said, &#8220;show me a science that doesn&#8217;t involve data.&#8221; I hate the name myself, but I also know it could be a lot worse. Just look at other fields that suffer under terms like &#8220;<a href="http://en.wikipedia.org/wiki/Processual_archaeology">new archaeology</a>&#8221; (now more than 50 years old) or &#8220;<a href="http://en.wikipedia.org/wiki/Modern_art">modernist art</a>&#8221; (pushing a century). I learned from teenage bands that the naming process is the most divisive part of any new venture, so my philosophy has always been to take the name you&#8217;re given, and rely on time and hard work to give it the right associations. Apple and Microsoft (n&eacute;e <a href="http://www.wired.com/science/discoveries/news/2008/04/dayintech_0404">Micro-soft</a>) are terrible startup names by any objective measure, but they&#8217;ve earned their mindshare. People are calling what we&#8217;re doing &#8220;data science,&#8221; so let&#8217;s accept that and focus on moving the subject forward.</p>
</p>
<h2>There&#8217;s no definition</h2>
</p>
<p>This is probably the deepest objection, and the one with the most teeth. There is no widely accepted boundary for what&#8217;s inside and outside of data science&#8217;s scope. Is it just a faddish rebranding of statistics? I don&#8217;t think so, but I also don&#8217;t have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don&#8217;t fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data and processing it at scale to visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist&#8217;s approach of choosing the problem first and then finding data to shed light on it. I don&#8217;t know what the eventual consensus will be on the limits of data science, but we&#8217;re starting to see some outlines emerge.</p>
</p>
<h2>Time for the community to rally</h2>
</p>
<p>I&#8217;m betting a lot on the persistence of the term. If I&#8217;m wrong the <a href="http://www.datasciencetoolkit.org/">Data Science Toolkit</a> will end up sounding as dated as &#8220;surfing the information super-highway.&#8221; I think data science, as a phrase, is here to stay though, whether we like it or not. That means we as a community can either step up and steer its future, or let others exploit its current name recognition and dilute it beyond usefulness. If we don&#8217;t rally around a workable definition to replace the current vagueness, we&#8217;ll have lost a powerful tool for explaining our work.</p>
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a></li>
<li> <a href="http://radar.oreilly.com/2011/01/3-skills-of-data-scientists.html">3 skills a data scientist needs</a></li>
<li> <a href="http://radar.oreilly.com/2010/09/data-week-becoming-a-data-scie.html">Becoming a data scientist</a></li>
<li> <a href="http://radar.oreilly.com/2010/07/data-science-democratized.html">Data science democratized</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/05/data-science-terminology.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The iPhone tracking story, one week later</title>
		<link>http://radar.oreilly.com/2011/04/iphone-tracking-apple-response.html</link>
		<comments>http://radar.oreilly.com/2011/04/iphone-tracking-apple-response.html#comments</comments>
		<pubDate>Wed, 27 Apr 2011 16:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Mobile]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[@top]]></category>
		<category><![CDATA[ios]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/04/iphone-tracking-apple-response.html</guid>
		<description><![CDATA[Apple announces fixes and sheds more light on location data. Plus, a look at some of the reporting and potential applications that have popped up. ]]></description>
				<content:encoded><![CDATA[<p><em>By <a href="http://about.me/alasdairallan">Alasdair Allan</a> and <a href="http://twitter.com/petewarden">Pete Warden</a></em></p>
<p>It&#8217;s now been a week since we published the <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">iPhone tracking story</a>, so it seemed a good time to cover what we&#8217;ve learned. </p>
</p>
<h2>The fix</h2>
</p>
<p><a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html"><img src="http://s.radar.oreilly.com/2011/04/21/042111-iphone-track.png" border="0" alt="iPhone track" style="float: right;margin: 3px 0 10px 10px" /></a>Apple has just released <a href="http://www.apple.com/pr/library/2011/04/27location_qa.html">a Q&amp;A covering this problem</a> and they will be fixing the issues we spotted with a software update. &#8220;The reason the iPhone stores so much data is a bug we uncovered,&#8221; Apple notes in the statement.</p>
<p>Apple explains that nearby locations are pulled down from an Apple database and stored on the phone. These locations are from a &#8220;crowd-sourced database of Wi-Fi hotspot and cell tower data.&#8221; This matches the picture that was emerging from research. It explains why there are lots of locations that don&#8217;t match towers, and also why the accuracy is within a few hundred meters, since we&#8217;ve learned that &#8220;micro-cells&#8221; in urban areas are clustered closely together.</p>
<p>The Q&amp;A explains the technical workings behind the log and reassures us that only anonymous data is sent back. Our <a href="http://petewarden.github.com/iPhoneTracker/#faq">conclusions</a> still apply.</p>
<p>Apple doesn&#8217;t address our claim that this reveals sensitive information about your travels. At this point we&#8217;re just relieved to get an explanation and a fix, but people <a href="http://petewarden.github.com/iPhoneTracker/">can examine their own data</a> and decide for themselves how happy they would be sharing it with strangers.</p>
</p>
<h2>Forensics</h2>
</p>
<p><a href="http://www.theatlantic.com/technology/archive/2011/04/what-does-your-phone-know-about-you-more-than-you-think/237786/">What Does Your iPhone Know About You? More Than You Think</a> &mdash; Alexis Madrigal has written a fascinating follow-up piece covering the data that professionals can read from your phone. Using forensics tools like <a href="http://katanaforensics.com/">the Lantern program</a> that <a href="http://alexlevinson.wordpress.com/">Alex Levinson</a> helped build, anyone with physical access to the device can construct a picture of the user&#8217;s life. It&#8217;s eye-opening what the &#8220;law enforcement, government, and corporate examiners&#8221; who purchase the system can uncover about your behavior.</p>
<p>The <a href="http://www.zeit.de/datenschutz/malte-spitz-data-retention">Tell-all telephone visualization</a> also makes for thoughtful viewing. It&#8217;s built from details that a German politician forced his cell phone provider to share after it was caught storing six months of location data on its subscribers. I think one of the reasons that the iPhone Tracker application has had so much use is that it shows people their own data in an understandable way. Unfortunately, that means that similar information that&#8217;s harder to access behind a company&#8217;s firewall may not get the same scrutiny, just because it&#8217;s harder to show in a way that connects with people.</p>
</p>
<h2>Uses for good</h2>
</p>
<p>I&#8217;ve long been a fan of <a href="http://geoloqi.com">Geoloqi&#8217;s</a> opt-in service for recording and sharing your travels, but several other projects in the same area have appeared in my inbox over the last few days. Maria Scileppi has created the <a href="http://livingbrushstroke.tumblr.com/">Living Brushstroke</a> project (see <a href="http://vimeo.com/20246863">video</a> below) to capture people&#8217;s movements at events, and turn the data into art. Intriguing and beautiful patterns emerge as people cross paths. It&#8217;s a very fresh way to look at our lives.</p>
<p align="center">
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://www.apple.com/pr/library/2011/04/27location_qa.html">Apple Q&amp;A on Location Data</a></li>
<li> <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">Got an iPhone or 3G iPad? Apple is recording your moves</a></li>
<li> <a href="http://radar.oreilly.com/2011/04/iphone-tracking-followup.html">iPhone tracking: The day after</a></li>
<li> <a href="http://radar.oreilly.com/2011/04/more-iphone-tracking-research.html">Additional iPhone tracking research</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/04/iphone-tracking-apple-response.html/feed</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Additional iPhone tracking research</title>
		<link>http://radar.oreilly.com/2011/04/more-iphone-tracking-research.html</link>
		<comments>http://radar.oreilly.com/2011/04/more-iphone-tracking-research.html#comments</comments>
		<pubDate>Sun, 24 Apr 2011 17:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Mobile]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/04/more-iphone-tracking-research.html</guid>
		<description><![CDATA[The iPhone tracking story led to a host of related investigations. Here&apos;s a look at some of the latest developments. ]]></description>
				<content:encoded><![CDATA[<p><strong>Update, 4/27/11</strong> &mdash; Apple has posted <a href="http://www.apple.com/pr/library/2011/04/27location_qa.html">a response</a> to questions raised in <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">this report</a> and others.</p>
<p><em>By <a href="http://about.me/alasdairallan">Alasdair Allan</a> and <a href="http://twitter.com/petewarden">Pete Warden</a></em></p>
<p>Here are the latest developments on <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">iPhone tracking</a>.</p>
</p>
<h2>Android records a short log</h2>
</p>
<p>The Guardian has <a href="http://www.guardian.co.uk/technology/2011/apr/21/android-phones-record-user-locations">a good overview</a> of Android&#8217;s equivalent to consolidated.db. It records the last 50 cell locations, and the last 200 Wi-Fi networks, but older entries are overwritten. As we mentioned in <a href="http://twitter.com/petewarden">our original video</a>, this was what we expected on the iPhone when we found the file, and it was the sheer scale and duration of the recording that floored us, along with how easy it was to access on your computer. Android doesn&#8217;t appear to copy the file over when you sync, so you&#8217;d need physical access to the phone to read it.</p>
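<p>The difference between a short, self-overwriting log and an ever-growing one is easy to see in code. The sketch below is only an illustration of the bounded behavior described above, using the 50-entry figure from the article; it is not Android&#8217;s actual implementation, and the class and field names are invented.</p>

```python
from collections import deque


class BoundedLocationLog:
    """Keep only the most recent max_entries records (hypothetical class)."""

    def __init__(self, max_entries=50):
        # A deque with maxlen silently discards the oldest item when full
        self._entries = deque(maxlen=max_entries)

    def record(self, timestamp, cell_id):
        self._entries.append((timestamp, cell_id))

    def entries(self):
        return list(self._entries)


log = BoundedLocationLog(max_entries=50)
for t in range(120):
    log.record(t, f"cell-{t}")

kept = log.entries()
print(len(kept), kept[0], kept[-1])
```

<p>After 120 sightings only the last 50 survive, so the log can never describe more than a short window of movement. An unbounded list in place of the deque would keep all 120, and would keep growing.</p>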
</p>
<h2>Phoning home your location</h2>
</p>
<p>In the Wall Street Journal there&#8217;s <a href="http://online.wsj.com/article/SB10001424052748703983704576277101723453610.html?loc=interstitialskip">a good story</a> covering how phones often send your location back to servers at both Apple and Google. We&#8217;ve known that cell companies are gathering this kind of data, because they need it for their basic operations, but the most interesting question for me is how it&#8217;s actually stored by these software companies. If it&#8217;s truly just for improving their location services, it could be anonymized so that it would be hard to figure out an individual&#8217;s movements if you had the data. Even if it&#8217;s not, the data is somewhat protected when it&#8217;s on a company&#8217;s internal network, since that keeps it further out of reach than a file that&#8217;s held on your machine.</p>
</p>
<h2>Better for tracking travel than home or office locations</h2>
</p>
<p><a href="http://blog.geoiq.com/2011/04/22/liberating-my-data-from-the-iphone/">Sean Gorman</a> and my friend <a href="http://geothought.blogspot.com/2011/04/more-on-apple-recording-your-iphone.html">Peter Batty</a> have done some impressive work digging into the details of the location data. Their conclusion is that it&#8217;s hard to spot locations where you spend a lot of time in the same place, like your house or place of work. It&#8217;s almost as if re-visiting the same spot overwrites a lot of the older data for that place, which would fit with a lot of what we&#8217;ve seen. They also try to quantify the accuracy of the location, pointing out how many outliers appear.</p>
<p>Even just showing where you&#8217;ve been traveling to is pretty concerning, but it&#8217;s good to rule out some malicious uses. The work they&#8217;ve done tells us a lot more about the characteristics of the data, and I&#8217;m looking forward to seeing more of this kind of analysis.</p>
<p>Intriguingly, their work also has some support for <a href="http://www.willclarke.net/?p=247">Will Clarke&#8217;s idea</a> that the locations are associated with cell towers. Peter&#8217;s data shows a cluster around Mile High Stadium, which he hasn&#8217;t visited recently but which does have a lot of cell infrastructure. Sean has another map that overlays actual tower locations with his points, and it&#8217;s clear they don&#8217;t coincide, but could well be triangulated from multiple towers. Sean&#8217;s observation fits with our initial hypothesis that the locations are the result of sometimes-inaccurate triangulation from towers, but Peter&#8217;s is evidence that there&#8217;s a bias in the data to clustering around tower positions.</p>
<p>Peter is investigating the WiFiLocation table. This typically contains a lot more points than the cell version, with 219,000 entries in Alasdair&#8217;s data versus only 29,000 cell points. We didn&#8217;t visualize this in the application because the derived lat/long points are a lot noisier, but that may be an issue with the quality of the location-lookup tables Apple are using since they switched away from SkyHook. It appears to record the ID of many of the WiFi networks you&#8217;ve come into range of, so I&#8217;ll be interested to see what Peter and others discover about this data.</p>
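<p>For anyone who wants to compare table sizes in their own copy of the file, a sketch along the following lines may be a starting point. It builds a small in-memory SQLite database as a stand-in rather than opening a real consolidated.db. The <code>CellLocation</code> table name and the three-column schema are assumptions made for the example, and the real tables may well differ.</p>

```python
import sqlite3

# Table names we are willing to query. "WiFiLocation" is named in the text;
# "CellLocation" is an assumed name for the cell-tower table.
ALLOWED_TABLES = ("CellLocation", "WiFiLocation")


def count_rows(conn, table):
    """Return the number of rows in one of the allowed tables."""
    # Table names cannot be bound as SQL parameters, so check them explicitly
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Stand-in database with a hypothetical schema and made-up points
conn = sqlite3.connect(":memory:")
for table in ALLOWED_TABLES:
    conn.execute(
        f"CREATE TABLE {table} (Timestamp REAL, Latitude REAL, Longitude REAL)"
    )
conn.executemany(
    "INSERT INTO CellLocation VALUES (?, ?, ?)",
    [(1.0, 51.5, -0.1), (2.0, 51.6, -0.2)],
)
conn.executemany(
    "INSERT INTO WiFiLocation VALUES (?, ?, ?)",
    [(1.0, 51.5, -0.1), (1.5, 51.5, -0.1), (2.0, 51.6, -0.2)],
)

counts = {table: count_rows(conn, table) for table in ALLOWED_TABLES}
print(counts)
```

<p>Pointing <code>sqlite3.connect</code> at a copy of a real file instead of <code>":memory:"</code>, and dropping the setup statements, should give the same kind of comparison as the 219,000 versus 29,000 figures quoted above, provided the table names match.</p>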
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">Got an iPhone or 3G iPad? Apple is recording your moves</a></li>
<li> <a href="http://radar.oreilly.com/2011/04/iphone-tracking-followup.html">iPhone tracking: The day after</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/04/more-iphone-tracking-research.html/feed</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>iPhone tracking: The day after</title>
		<link>http://radar.oreilly.com/2011/04/iphone-tracking-followup.html</link>
		<comments>http://radar.oreilly.com/2011/04/iphone-tracking-followup.html#comments</comments>
		<pubDate>Fri, 22 Apr 2011 02:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Mobile]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[location]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/04/iphone-tracking-followup.html</guid>
		<description><![CDATA[The iPhone tracking story published here a few days ago struck an unexpected nerve. Here&apos;s a selection of the most interesting immediate reactions. ]]></description>
				<content:encoded><![CDATA[<p><strong>Update, 4/27/11</strong> &mdash; Apple has posted <a href="http://www.apple.com/pr/library/2011/04/27location_qa.html">a response</a> to questions raised in <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">this report</a> and others.</p>
<p><em>By <a href="http://about.me/alasdairallan">Alasdair Allan</a> and <a href="http://twitter.com/petewarden">Pete Warden</a></em></p>
<p><a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html"><img src="http://s.radar.oreilly.com/2011/04/21/042111-iphone-track.png" border="0" alt="iPhone track" style="float: right;margin: 3px 0 10px 10px" /></a>I don&#8217;t think either of us was expecting to see <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">this story</a> strike such a nerve. There&#8217;s been some amazing detective work from researchers across the web, so here&#8217;s a selection of the most interesting immediate reactions.</p>
<p><a href="https://alexlevinson.wordpress.com/2011/04/21/3-major-issues-with-the-latest-iphone-tracking-discovery/"><strong>Alex Levinson</strong></a> &mdash; Right from launch, we had <a href="http://petewarden.github.com/iPhoneTracker/#8">an FAQ</a> pointing to articles by people like <a href="http://ryanneal.wordpress.com/2011/03/18/war-against-the-iphones-consolidated-db/">Ryan Neal</a> and <a href="http://www.courbis.fr/spip.php?page=article&amp;id_article=255">Paul Courbis</a> who had found this file (consolidated.db) before, but hadn&#8217;t understood or been able to communicate its significance. The main reason we went public with this was exactly because it already seemed to be an open secret among people who make their living doing forensic phone analysis, but not among the general public &mdash; even pretty geeky people like Alasdair and me. We were freaked out by the implications of this data and how unprotected it was, but most of the forensics community seemed to miss quite how creepy ordinary people would find it.</p>
<p>I do appreciate how frustrating this must be for Alex though, and would like to apologize personally to him that we didn&#8217;t include his article among the prior research we cited. Unlike the others, it didn&#8217;t show up in web searches or the books we referenced. It also didn&#8217;t help that most of the follow-up articles by other people left out the details that we&#8217;d tried to make clear about who found it first. We obviously didn&#8217;t communicate it as well as we thought we had, which is completely our fault.</p>
<p><a href="http://www.theatlantic.com/technology/archive/2011/04/my-life-according-to-the-iphones-secret-tracking-log/237636/"><strong>My Life According to the iPhone&#8217;s Secret Tracking Log</strong></a> &mdash; Alexis Madrigal has a far more interesting life than me, judging by his map. I especially like the points from a flight with Jim Fallows somewhere over West Virginia. As he says, this data can be incredibly interesting, and as data geeks we were just as fascinated as he is. I actually look forward to a future where we can use this sort of information, but with the user&#8217;s permission.</p>
<p><a href="http://www.willclarke.net/?p=247"><strong>Apple is not &#8220;recording your moves&#8221;</strong></a> &mdash; Both of us have been following Will Clarke&#8217;s blog for a while and we liked this article. It&#8217;s good to look skeptically at the accuracy of the data both in space and time. We do disagree about one of the conclusions though: that the points are just the locations of cell towers. That was one of our first thoughts when we saw the data. But the fact that there are thousands of different points scattered across small areas, all in slightly different places, seems like pretty strong evidence that they&#8217;re not just the locations of cell towers. Another way of putting that is that there are a lot more points than there are towers. There are also lots of points with the same tower ID code that are in different locations. That all led to our conclusion that it was trying to figure out the device&#8217;s position, even if it wasn&#8217;t very good at it.</p>
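<p>As a rough illustration of that last check, here is a minimal Python sketch. It assumes the points have already been exported as (tower ID, latitude, longitude) tuples. The function name, the tuple layout and the sample coordinates are all made up for illustration and are not the actual consolidated.db schema.</p>

```python
from collections import defaultdict


def towers_with_multiple_locations(points, precision=4):
    """Return tower IDs that appear at more than one distinct position.

    `points` is an iterable of (tower_id, lat, lon) tuples. This layout is
    an assumption for the sketch, not the real table format.
    """
    positions = defaultdict(set)
    for tower_id, lat, lon in points:
        # Round so that tiny floating-point differences are not counted
        # as separate locations.
        positions[tower_id].add((round(lat, precision), round(lon, precision)))
    return {tid: locs for tid, locs in positions.items() if len(locs) > 1}


# Hypothetical sample data: tower-A shows up in two places, tower-B in one.
sample = [
    ("tower-A", 37.7749, -122.4194),
    ("tower-A", 37.7801, -122.4102),
    ("tower-B", 40.7128, -74.0060),
    ("tower-B", 40.7128, -74.0060),
]

moving = towers_with_multiple_locations(sample)
print(sorted(moving))
```

<p>If the points really were fixed tower locations, a check like this should come back empty, or nearly so, on real data.</p>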
<p>Until we get a deeper analysis, that&#8217;s just a provisional conclusion of course. But getting smart folks like Will to dig into this and correct anything we&#8217;ve got wrong is exactly why we open-sourced it. He also picks up on the <a href="http://www.willclarke.net/?p=264">Las Vegas Anomaly</a>. Multiple people have reported seeing a phantom trip to the city show up, and one theory (other than a lot of lost weekends) is that Apple has an unpacking or testing facility there. Alasdair&#8217;s phone, which shipped with iOS 4, shows this, whereas my older device, which originally had iOS 3, doesn&#8217;t, which is suggestive. I wonder if Will&#8217;s device is a newer one, too?</p>
<p><a href="http://www.openstreetmap.org/"><strong>OpenStreetMap</strong></a> &mdash; The <a href="http://petewarden.github.com/iPhoneTracker/">application</a> we released relies on this volunteer-run site to render the background map tiles. We ended up tripling their usual load, according to a team member. They actually fired up extra servers to cope, so I made sure to add a link to <a href="http://donate.openstreetmap.org/">their donation page</a> from our main site. If you got something out of the application, please do consider giving something to them, or even getting involved. It&#8217;s a fantastic team and community. How many other organizations would have responded to heavy usage by a free client by paying for more servers themselves? I even messed up their credit text on the initial version of the application, but they were very understanding about that too.</p>
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2011/04/apple-location-tracking.html">Got an iPhone or 3G iPad? Apple is recording your moves</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/04/iphone-tracking-followup.html/feed</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>Will data be too cheap to meter?</title>
		<link>http://radar.oreilly.com/2011/02/crunchbase-cheap-data.html</link>
		<comments>http://radar.oreilly.com/2011/02/crunchbase-cheap-data.html#comments</comments>
		<pubDate>Tue, 08 Feb 2011 14:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[@top]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[strataconf]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/02/crunchbase-cheap-data.html</guid>
		<description><![CDATA[The data acquisition process should be increasingly automatic, and so increasingly cheap. I&apos;m hoping for a world where information producers are paid for extracting value from that data. ]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.crunchbase.com"><img src="http://s.radar.oreilly.com/2011/02/07/0207-crunchbase.png" border="0" alt="CrunchBase" style="float: right;margin: 3px 0 10px 10px" /></a>Last week at <a href="http://strataconf.com/strata2011">Strata</a> I got into an argument with a journalist over the future of <a href="http://www.crunchbase.com/">CrunchBase</a>. His position was that we were just in a &#8220;pre-commercial&#8221; world, that creating the database required a reporter&#8217;s time, and so after the current aberration had passed we&#8217;d return to the old status quo where this kind of information was only available through paid services. I wasn&#8217;t so sure.</p>
<p>When I explain to people why the Big Data movement is important &mdash; why it&#8217;s a real change instead of a fad &mdash; I point to price as the fundamental difference between the old and new worlds. Until a few years ago, the state of the art for doing meaningful analysis of multi-gigabyte data sets was the data warehouse. These custom systems were very capable, but could easily cost millions of dollars. Today I can hire a hundred-machine Hadoop cluster from Amazon for just $10 an hour, and process thousands of gigabytes a day.</p>
<p>This represents a massive discontinuity in price, and it&#8217;s why Big Data is so disruptive. Suddenly we can imagine a group of kids in their garage building Google-scale systems practically on pocket money. While the drop in the cost of data storage and transmission has been less dramatic, it has followed a steady downward trend over the decades. Now that processing has become cheap too, a whole universe of poverty-stricken hackers, academics, makers, reporters, and startups can do interesting things with massive data sets.</p>
<p>What does this have to do with CrunchBase? The reporter had some implicit assumptions about the cost of the data collection process. He argued that it required extra effort from the journalists to create the additional value captured in the database. To paraphrase him: &#8220;It&#8217;s time they&#8217;d rather spend at home playing with their kids, and so we&#8217;ll end up compensating them for their work if we want them to continue producing it.&#8221; What I felt was missing from this is that CrunchBase might actually be just a side-effect of work they&#8217;d be doing even if it wasn&#8217;t released for public consumption.</p>
<p>Many news organizations are taking advantage of the dropping cost of data handling by heavily automating their news-gathering and publishing workflows. This can be as simple as setting up Google Alerts or scanning large collections of RSS feeds, or it can involve <a href="http://marshallk.com/how-to-use-twitter-plus-needlebase-to-discover-fabulous-things">using scraping tools to gather public web data</a>, and there&#8217;s <a href="http://radar.oreilly.com/2011/01/journalist-data-tools.html">a myriad of other information-processing techniques</a> out there. Internally there&#8217;s a need to keep track of the results of manual or automated research, and so the most advanced organizations are using some kind of structured database to capture the information for future use.</p>
<p>That means that the only extra effort required to release something like CrunchBase is publishing it to the web. Assuming that there are some benefits to doing so (that <a href="http://techcrunch.com/">TechCrunch&#8217;s</a> reputation as the site-of-record for technology company news is enhanced, for example) and that there are multiple companies with the data available, then the low cost of the release will mean it makes sense to give it away.</p>
<p>I actually don&#8217;t know if all these assumptions are true, and CrunchBase&#8217;s approach may not be sustainable, but I hope it illustrates how a truly radical change in price can upset the traditional rules. Even on a competitive, commercial, free-market playing field it sometimes makes sense to behave in ways that appear hopelessly altruistic. We&#8217;ve seen this play out with open-source software. I expect to see pricing forces do something similar to open up more and more sources of data.</p>
<p>I&#8217;m usually the contrarian guy in the room arguing that <a href="http://petewarden.typepad.com/searchbrowser/2010/10/information-wants-to-be-paid.html">information wants to be paid</a>, so I don&#8217;t actually believe (as Lewis Strauss <a href="http://en.wikipedia.org/wiki/Too_cheap_to_meter">famously said about electricity</a>) all data will be too cheap to meter. Instead I&#8217;m hoping we&#8217;ll head toward a world where producers of information are paid for adding real value. Too many &#8220;premium&#8221; data sets are collated and merged from other computerized sources, and that process should be increasingly automatic, and so increasingly cheap. Give me a raw CrunchBase culled from press releases and filings for free, then charge me for your informed opinion on how likely the companies are to pay their bills if I extend them credit. Just as free, open-source software has served as the foundation for <a href="http://store.apple.com/">some very lucrative businesses</a>, <a href="http://oreilly.com/catalog/0636920018254">the new world of free public data</a> will trigger a flood of innovations that will end up generating value in ways we can&#8217;t foresee, and that we&#8217;ll be happy to pay for.</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/02/crunchbase-cheap-data.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>4 free data tools for journalists (and snoops)</title>
		<link>http://radar.oreilly.com/2011/01/journalist-data-tools.html</link>
		<comments>http://radar.oreilly.com/2011/01/journalist-data-tools.html#comments</comments>
		<pubDate>Thu, 06 Jan 2011 16:00:00 +0000</pubDate>
		<dc:creator>Pete Warden</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data tools]]></category>
		<category><![CDATA[journalism]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2011/01/journalist-data-tools.html</guid>
		<description><![CDATA[You no longer have to be a technical specialist to find exciting and surprising data. In this excerpt from Pete Warden&apos;s ebook, &#34;Where are the bodies buried on the web? Big data for journalists,&#34; Pete looks at four services that reveal underlying information about web pages and domains. ]]></description>
				<content:encoded><![CDATA[<p><em>Note: The following is an excerpt from Pete Warden&#8217;s <a href="http://web.mailana.com/labs/bigdataforjournalists.pdf">free ebook</a> &#8220;Where are the bodies buried on the web? Big data for journalists.&#8221;</em></p>
<hr />
<p>There&#8217;s been a revolution in data over the last few years, driven by an astonishing drop in the price of gathering and analyzing massive amounts of information. It only cost me $120 to <a href="http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html">gather, analyze and visualize 220 million public Facebook profiles</a>, and you can use <a href="http://80legs.com/">80legs</a> to download a million web pages <a href="http://80legs.com/plans.html">for just $2.20</a>. Those are just two examples.</p>
<p>The technology is also getting easier to use. Companies like <a href="http://extractiv.com">Extractiv</a> and <a href="http://needlebase.com/">Needlebase</a> are creating point-and-click tools for gathering data from almost any site on the web, and every other stage of the analysis process is getting radically simpler too.</p>
<p>What does this mean for journalists? You no longer have to be a technical specialist to find exciting, convincing and surprising data for your stories. For example, the following four services all easily reveal underlying data about web pages and domains.</p>
<h2>WHOIS</h2>
<p>Many of you will already be familiar with WHOIS, but it&#8217;s so useful for research it&#8217;s still worth pointing out. If you go to <a href="http://whois.domaintools.com">this site</a> (or just type &#8220;whois <a href="http://www.example.com">www.example.com</a>&#8221; in Terminal.app on a Mac) you can get the basic registration information for any website. In recent years, some owners have chosen &#8220;private&#8221; registration, which hides their details from view, but in many cases you&#8217;ll see a name, address, email and phone number for the person who registered the site.</p>
<p>You can also enter numerical IP addresses here and get data on the organization or individual that owns that server. This is especially handy when you&#8217;re trying to track down more information on an abusive or malicious user of a service, since most websites record an IP address for everyone who accesses them.</p>
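<p>If you would rather script these lookups than type them by hand, here is a minimal Python sketch. It assumes the plain-text WHOIS protocol described in RFC 3912: send the query followed by a line break to port 43, then read until the server closes the connection. The function names, the default server (whois.iana.org) and the simple key/value parsing are illustrative choices of mine. Real WHOIS replies vary a lot between registries, so treat the parser as a starting point only.</p>

```python
import socket


def whois_query(query, server="whois.iana.org", port=43, timeout=10):
    """Send one WHOIS query (domain or IP) and return the raw text reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_fields(reply):
    """Collect 'key: value' lines into a dict mapping key to a list of values.

    Blank lines and comment lines starting with % or # are skipped.
    """
    fields = {}
    for line in reply.splitlines():
        line = line.strip()
        if not line or line[0] in "%#" or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields.setdefault(key.strip().lower(), []).append(value.strip())
    return fields


# Example use (needs network access, so it is left commented out):
# reply = whois_query("example.com")
# print(parse_fields(reply).get("refer"))
```

<p>The reply from a top-level server often just points to another WHOIS server that holds the full record, so in practice you may need to repeat the query against the server it names.</p>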
<h2>Blekko</h2>
<p><a href="http://blekko.com">Blekko</a> is the newest search engine in town, and one of its selling points is the richness of the data it offers. If you type in a domain name followed by /seo, you&#8217;ll receive a page of statistics on that URL:</p>
<div align="center">
<p class="image-box-450">
<img src="http://s.radar.oreilly.com/assets_c/2011/01/pic0-thumb-486x99.png" width="450" alt="Blekko" /></p>
</div>
<div align="center">
<p class="image-box-450">
<img alt="Blekko statistics page" src="http://s.radar.oreilly.com/assets_c/2011/01/pic1-thumb-486x252.png" width="450" /></p>
</div>
<p>The first tab shows other sites that are linking to the current domain, in popularity order. This can be extremely useful when you&#8217;re trying to understand what coverage a site is receiving, and if you want to understand why it&#8217;s ranking highly in Google&#8217;s search results, since they&#8217;re based on those inbound links. Inclusion of this information would have been an interesting addition to <a href="http://www.nytimes.com/2010/11/28/business/28borker.html">the recent DecorMyEyes story</a>, for example.</p>
<p>The other handy tab is &#8220;Crawl stats,&#8221; especially the &#8220;Cohosted with&#8221; section:</p>
<div align="center">
<p class="image-box-450"><img alt="Cohosted with section on Blekko" src="http://s.radar.oreilly.com/assets_c/2011/01/pic2-thumb-486x246.png" width="450" /></p>
</div>
<p>This tells you which other websites are running from the same machine. It&#8217;s common for scammers and spammers to astroturf their way toward legitimacy by building multiple sites that review and link to each other. They look like independent domains, and may even have different registration details, but often they&#8217;ll actually live on the same server because that&#8217;s a lot cheaper. These statistics give you an insight into the hidden business structure of shady operators.</p>
<h2>bit.ly</h2>
<p>I always turn to <a href="http://bit.ly">bit.ly</a> when I want to know how people are sharing a particular link. To use it, enter the URL you&#8217;re interested in:</p>
<div align="center">
<p class="image-box-450"><img alt="Bitly link shortening box" src="http://s.radar.oreilly.com/assets_c/2011/01/pic3-thumb-486x66.png" width="450" /></p>
</div>
<p>Then click on the &#8216;Info Page+&#8217; link:</p>
<div align="center">
<p class="image-box-450"><img alt="pic4.png" src="http://s.radar.oreilly.com/assets_c/2011/01/pic4-thumb-486x30.png" width="450" /></p>
</div>
<p>That takes you to the full statistics page (though you may need to choose &#8220;aggregate bit.ly link&#8221; first if you&#8217;re signed in to the service).</p>
<div align="center">
<p class="image-box-450">
<img alt="pic5.png" src="http://s.radar.oreilly.com/assets_c/2011/01/pic5-thumb-486x104.png" width="450" /></p>
</div>
<p>This will give you an idea of how popular the page is, including activity on Facebook and Twitter. Below that you&#8217;ll see public conversations about the link provided by <a href="http://backtype.com/">backtype.com</a>.</p>
<div align="center">
<p class="image-box-450">
<img alt="Facebook and Twitter activity on Bitly" src="http://s.radar.oreilly.com/assets_c/2011/01/pic6-thumb-486x156.png" width="450" height="156" class="mt-image-none" /></p>
</div>
<p>I find this combination of traffic data and conversations very helpful when I&#8217;m trying to understand why a site or page is popular, and who exactly its fans are. For example, it provided me with strong evidence that the prevailing narrative about grassroots sharing and Sarah Palin was <a href="http://www.thenation.com/blog/37462/new-data-shows-sarah-palin-paper-grizzly">wrong</a>.</p>
<p><em>[Disclosure: O'Reilly AlphaTech Ventures is an <a href="http://oatv.com/investments/">investor in bit.ly</a>.]</em></p>
<h2>Compete</h2>
<p>By surveying a cross-section of American consumers, <a href="http://www.compete.com">Compete</a> builds up detailed usage statistics for most websites, and they make some basic details freely available.</p>
<p>Choose the &#8220;Site Profile&#8221; tab and enter a domain:</p>
<div align="center">
<p class="image-box-450">
<img alt="Compete site profile box" src="http://s.radar.oreilly.com/assets_c/2011/01/pic7-thumb-486x114.png" width="450" /></p>
</div>
<p>You&#8217;ll then see a graph of the site&#8217;s traffic over the last year, together with figures for how many people visited, and how often.</p>
<div align="center">
<p class="image-box-450">
<img alt="Compete Traffic" src="http://s.radar.oreilly.com/assets_c/2011/01/pic8-thumb-486x204.png" width="450" /></p>
</div>
<p>Since they&#8217;re based on surveys, Compete&#8217;s numbers are only approximate. Nonetheless, I&#8217;ve found them reasonably accurate when I&#8217;ve been able to compare them against internal analytics.</p>
<p>Compete&#8217;s stats are a good source when comparing two sites. While the absolute numbers may be off for both sites, Compete still offers a decent representation of the sites&#8217; relative difference in popularity.</p>
<p>One caveat: Compete only surveys U.S. consumers, so the data will be poor for predominantly international sites.</p>
<hr />
<p><em>Additional data resources and tools are discussed in <a href="http://web.mailana.com/labs/bigdataforjournalists.pdf">Pete&#8217;s free ebook</a>.</em></p>
<p></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2010/12/data-journalism.html">The growing importance of data journalism</a></li>
<li> <a href="http://radar.oreilly.com/2010/07/data-science-democratized.html">Data science democratized</a></li>
<li> <a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a></li>
</ul>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2011/01/journalist-data-tools.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>