<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>O&#039;Reilly Radar &#187; Ben Lorica</title>
	<atom:link href="http://radar.oreilly.com/ben/feed" rel="self" type="application/rss+xml" />
	<link>http://radar.oreilly.com</link>
	<description>Insight, analysis, and research about emerging technologies</description>
	<lastBuildDate>Mon, 20 May 2013 11:00:26 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>The re-emergence of time-series</title>
		<link>http://radar.oreilly.com/2013/04/the-re-emergence-of-time-series.html</link>
		<comments>http://radar.oreilly.com/2013/04/the-re-emergence-of-time-series.html#comments</comments>
		<pubDate>Wed, 10 Apr 2013 13:00:03 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[finance]]></category>
		<category><![CDATA[time series]]></category>

		<guid isPermaLink="false">http://radar.oreilly.com/?p=56817</guid>
		<description><![CDATA[My first job after leaving academia was as a quant for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability &#38; statistics, econometrics, and &#8230; ]]></description>
				<content:encoded><![CDATA[<p>My first job after leaving academia was as a quant <sup><a href="#1">1</a></sup> for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability &amp; statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I&#8217;ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.</p>
<h2>Time-series and big data</h2>
<p>Over the last six months I&#8217;ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported, the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition &amp; user interface design), sensors (apps for &#8220;self-tracking&#8221;), machines (systems in data centers), and health care. In fact some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points. <span id="more-56817"></span></p>
<h2>Search and machine-learning at scale</h2>
<p>Before doing anything else, one has to be able to run queries at scale. Last year <a href="http://practicalquant.blogspot.com/2012/10/mining-time-series-with-trillions-of.html">I wrote about a team of researchers at UC Riverside</a> who took an existing search algorithm (<a href="http://web.science.mq.edu.au/~cassidy/comp449/html/ch11s02.html">dynamic time-warping</a> <sup><a href="#2">2</a></sup>) and got it to scale to time-series with trillions of points. There are many potential applications of their research; one I highlighted is from health care:</p>
<blockquote><p>&#8230; a doctor who needs to search through EEG data (with hundreds of billions of points), for a &#8220;prototypical epileptic spike&#8221;, where the input query is a time-series snippet with thousands of points.</p></blockquote>
<p>As the size of data grows, the UCR dynamic time-warping algorithm takes longer to finish (a few hours for time-series with trillions of points). In general, (academic) researchers who&#8217;ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g. web companies) may not be as patient. Fortunately, &#8220;search&#8221; is an active research area, and faster (distributed) pattern recognition systems will likely emerge soon.</p>
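<p>For readers unfamiliar with the technique, here is a minimal sketch of textbook dynamic time-warping in Python. It is the plain quadratic-time dynamic program, not the UCR team's optimized search, and the function name <code>dtw_distance</code> and the absolute-difference cost are assumptions made for illustration.</p>
<pre>
```python
import math


def dtw_distance(a, b):
    """Textbook dynamic time-warping distance between two numeric sequences.

    Hypothetical helper for illustration only; runs in O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: advance in a only, advance in b only, advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```
</pre>
<p>Two series that trace the same shape at different speeds, such as [1, 2, 3] and [1, 2, 2, 3], get a distance of zero under this measure, which is why it suits queries like the epileptic-spike search quoted above.</p>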
<p>Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif <sup><a href="#3">3</a></sup> discovery. Other teams are investigating techniques from <a href="http://www.giss.nasa.gov/staff/mway/book/">signal-processing</a>, <a href="http://www.fast-lab.org/structuredcomplex.html">pattern recognition</a>, and <a href="http://www.fast-lab.org/">trajectory tracking</a>.</p>
<h2>Some data management tools that target time-series</h2>
<p>One of the more popular sessions at <a href="http://practicalquant.blogspot.com/2012/05/much-to-like-about-hbasecon.html">last year&#8217;s HBase Conference</a> was on <a href="http://opentsdb.net/index.html">OpenTSDB</a>, a distributed time-series database built on top of HBase. It&#8217;s used to store and serve time series metrics, and comes with tools (based on <a href="http://www.gnuplot.info/">GNUPlot</a>) for charting. <a href="https://groups.google.com/forum/?fromgroups=#!topic/opentsdb/3HrW9pTl1cc">Originally named</a> OpenTSDB2, <a href="https://code.google.com/p/kairosdb/">KairosDB</a> was written primarily for Cassandra (but also works with HBase). OpenTSDB emphasizes tools for <a href="https://code.google.com/p/kairosdb/wiki/FAQ">readying data for charts</a> (interpolating to fill in missing values), whereas KairosDB distinguishes between data and the presentation of data.</p>
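<p>To make &#8220;interpolating to fill in missing values&#8221; concrete, here is a small Python sketch of linear interpolation over gaps in a series. It is a generic illustration, not the code either database uses; the function name <code>fill_missing</code> and the use of <code>None</code> to mark a gap are assumptions.</p>
<pre>
```python
def fill_missing(values):
    """Linearly interpolate None gaps that lie between two known points.

    Hypothetical helper for illustration; leading and trailing gaps are left as None
    because there is no neighbour on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of consecutive known points and fill the gap between them.
    for left, right in zip(known, known[1:]):
        span = right - left
        for k in range(left + 1, right):
            weight = (k - left) / span
            filled[k] = filled[left] + weight * (filled[right] - filled[left])
    return filled
```
</pre>
<p>Filling values this way changes what is stored or shown, which is one reason a system might prefer to keep raw data separate from its presentation.</p>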
<p>Startup <a href="https://tempo-db.com/features/">TempoDB</a> offers a <a href="https://tempo-db.com/pricing/">reasonably priced</a>, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, <a href="http://www.scidb.org/Documents/SciDB-Summary.pdf">SciDB</a> is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.</p>
<hr />
<p><small></p>
<p id="1">
<p>(1) I worked on trading strategies for derivatives, portfolio &amp; risk management, and option pricing.</p>
<p id="2">
<p>(2) From my <a href="http://practicalquant.blogspot.com/2012/10/mining-time-series-with-trillions-of.html">earlier post</a>: In a recent paper, the UCR team noted that <em>&#8220;&#8230; after an exhaustive literature search of more than 800 papers, we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments&#8221;</em>.</p>
<p id="3">
<p>(3) <em>Motifs</em> are similar subsequences of a long time series; <em>shapelets</em> are time series primitives that can be used to speed up automatic classification (by reducing the number of &#8220;features&#8221;).<br />
</small></p>
<p><em>This post was originally published on <a href="http://strata.oreilly.com/2013/04/the-re-emergence-of-time-series.html">strata.oreilly.com</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2013/04/the-re-emergence-of-time-series.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An update on in-memory data management</title>
		<link>http://radar.oreilly.com/2013/02/an-update-on-in-memory-data-management.html</link>
		<comments>http://radar.oreilly.com/2013/02/an-update-on-in-memory-data-management.html#comments</comments>
		<pubDate>Fri, 22 Feb 2013 14:00:25 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data tools]]></category>
		<category><![CDATA[in-memory data]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://radar.oreilly.com/?p=55954</guid>
		<description><![CDATA[By Ben Lorica and Roger Magoulas We wanted to give you a brief update on what we&#8217;ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have &#8230; ]]></description>
				<content:encoded><![CDATA[<p><strong>By <a href="http://radar.oreilly.com/ben">Ben Lorica</a> and <a href="http://radar.oreilly.com/rogerm">Roger Magoulas</a></strong></p>
<p>We wanted to give you a brief <a href="http://radar.oreilly.com/2013/01/in-memory-data-management.html">update</a> on what we&#8217;ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have emerged, some expected, others surprising. </p>
<p>Performance improves as you put data as close to the computation as possible. We talked to people in systems, data management, web applications, and scientific computing who have embraced this concept. Some solutions go to the lowest level of hardware (L1, L2 cache). The next generation of SSDs will have latency closer to that of main memory, potentially <a href="http://www.snia.org/about/news/newsroom/pr/snia-announces-non-volatile-memory-nvm-programming-technical-work-group">blurring the distinction between storage and memory</a>. For performance and power consumption considerations, we can imagine a future where the primary way systems are sized will be based on the amount of non-volatile memory<sup>*</sup> deployed. </p>
<p>Putting data in-memory does not negate the importance of distributed computing environments. Data size and the ability to leverage parallel environments are frequently cited reasons. The same characteristics that make the distributed environments compelling also apply to in-memory systems: fault-tolerance and parallelism for performance. An additional consideration is the ability to gracefully spill over to disk when main memory is full.<span id="more-55954"></span></p>
<p>There is no general purpose solution that can deliver optimal performance for all workloads. The drive for low latency requires different strategies depending on write or read intensity, fault-tolerance, and consistency. Database vendors we talked with have different approaches for transactional and analytic workloads, in some cases integrating in-memory into existing or new products. People who specialize in write-intensive systems identify <em>hot data</em> (i.e., frequently accessed) and put those in-memory. </p>
<p>Hadoop has emerged as an ingestion layer and the place to store data you <em>might</em> use. The next layer identifies and extracts high-value data that can be stored in-memory for low-latency interactive queries. Due to resource constraints of main memory, using columnar stores to compress data becomes important to speed I/O and store more in a limited space.</p>
<p>While it may be difficult to make in-memory systems completely transparent, the people we talked with emphasized programming interfaces that are as simple as possible.</p>
<p>Our conversations to date have revealed a wide range of solutions and strategies. We remain excited about the topic, and we&#8217;re continuing our investigation. If you haven&#8217;t yet, feel free to reach out to us on Twitter (Ben is <a href="http://twitter.com/bigdata">@bigdata</a> and Roger is <a href="http://twitter.com/rogerm">@rogerm</a>) or leave a comment on this post.</p>
<p><em>* By non-volatile memory we mean the <a href="http://www.snia.org/about/news/newsroom/pr/snia-announces-non-volatile-memory-nvm-programming-technical-work-group">next-generation SSDs</a>. In the rest of the post &#8220;memory&#8221; refers to traditional volatile main memory.</em></p>
<p><strong>Related:</strong></p>
<ul>
<li> <a href="http://radar.oreilly.com/2013/01/in-memory-data-management.html">Need speed for big data? Think in-memory data management</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2013/02/an-update-on-in-memory-data-management.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Need speed for big data? Think in-memory data management</title>
		<link>http://radar.oreilly.com/2013/01/in-memory-data-management.html</link>
		<comments>http://radar.oreilly.com/2013/01/in-memory-data-management.html#comments</comments>
		<pubDate>Fri, 18 Jan 2013 14:00:31 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[data management]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[data scientist]]></category>
		<category><![CDATA[data technology]]></category>
		<category><![CDATA[data tool]]></category>
		<category><![CDATA[in-memory data]]></category>

		<guid isPermaLink="false">http://radar.oreilly.com/?p=55287</guid>
		<description><![CDATA[By Ben Lorica and Roger Magoulas In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries &#8230; ]]></description>
				<content:encoded><![CDATA[<p><strong>By <a href="http://radar.oreilly.com/ben">Ben Lorica</a> and <a href="http://radar.oreilly.com/rogerm">Roger Magoulas</a></strong></p>
<p>In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.</p>
<p>An example we frequently hear about is the demand for tools that support <em>interactive</em> query performance. Faster query response times translate to more engaged and productive analysts, and real-time reports. Over the past two years several in-memory solutions emerged to deliver 5X-100X faster response times. A <a href="http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf">recent paper from Microsoft Research</a> noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets several new systems use a combination of distributed computing (in-memory grids), compression, and (columnar) storage technologies. </p>
<p>Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, Kognitio, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, <a href="https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/disk-irrelevant_hotos2011.pdf">memory <em>locality</em></a>). </p>
<p>We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning.  Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is <a href="http://twitter.com/bigdata">@bigdata</a> and Roger is <a href="http://twitter.com/rogerm">@rogerm</a>) or leave a comment on this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2013/01/in-memory-data-management.html/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Seven reasons why I like Spark</title>
		<link>http://radar.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html</link>
		<comments>http://radar.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html#comments</comments>
		<pubDate>Tue, 21 Aug 2012 18:45:31 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://radar.oreilly.com/?p=51037</guid>
		<description><![CDATA[A large portion of this week&#8217;s Amp Camp at UC Berkeley, is devoted to an introduction to Spark &#8211; an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I&#8217;ve come to consider it a &#8230; ]]></description>
				<content:encoded><![CDATA[<p>A large portion of this week&#8217;s <a href="http://ampcamp.berkeley.edu/">Amp Camp</a> at UC Berkeley, is devoted to an introduction to <a href="http://spark-project.org/">Spark</a> &#8211; an open source, in-memory, cluster computing framework. After playing with Spark over the last month, I&#8217;ve come to consider it a key part of my big data toolkit. Here&#8217;s why:</p>
<p><strong>Hadoop integration</strong>: Spark can work with files stored in HDFS, an important feature given the amount of investment in the Hadoop Ecosystem. Getting Spark to work <a href="https://groups.google.com/forum/?fromgroups=#!searchin/spark-users/mapr/spark-users/LqG7kf3tkdI/5M9ThGWUMLEJ">with MapR</a> is straightforward.</p>
<p><strong>The Spark interactive Shell</strong>: Spark is written in Scala, and has its own version of the Scala interpreter. I find this extremely convenient for testing short snippets of code.</p>
<p><strong>The Spark Analytic Suite</strong>:</p>
<p><a href="http://radar.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html/spark-stack" rel="attachment wp-att-51038"><img class="size-medium wp-image-51038 aligncenter" src="http://s.radar.oreilly.com/wp-files/2/2012/08/spark-stack-300x153.jpg" alt="" width="300" height="153" /></a><br />
(Figure courtesy of <a href="http://www.cs.berkeley.edu/~matei/">Matei Zaharia</a>)</p>
<p>Spark comes with tools for interactive query analysis (Shark), large-scale <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">graph processing and analysis</a> (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive. If you want to run Spark in the cloud, there is a set of <a href="https://github.com/mesos/spark/wiki/EC2-Scripts">EC2 scripts</a> available.</p>
<p><span id="more-51037"></span><strong>Resilient Distributed Data sets (RDD&#8217;s):<br />
</strong><a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">RDD&#8217;s</a> are <em>distributed</em> objects that can be cached in-memory, across a cluster of compute nodes. They are the fundamental data objects used in Spark. The crucial thing is that fault-tolerance is built-in: RDD&#8217;s are <em>automatically</em> rebuilt if something goes wrong. If you need to test something out, RDD&#8217;s can even be used interactively from the Spark interactive shell.</p>
<p><strong>Distributed Operators:<br />
</strong>Aside from Map and Reduce, there are <a href="https://github.com/mesos/spark/wiki/Spark-Programming-Guide">many other operators one can use on RDD&#8217;s</a>. Once I familiarized myself with how they work, I began converting a few basic machine-learning and data processing algorithms into this framework.</p>
<p><strong>Once you get past the learning curve &#8230; iterative programs<br />
</strong>It takes some effort to become productive in anything, and Spark is no exception. I was a complete <em>Scala</em> newbie so I first had to get comfortable with a new language (apparently, they like underscores &#8211; see <a href="http://www.slideshare.net/normation/scala-dreaded">here</a>, <a href="http://www.codecommit.com/blog/scala/quick-explanation-of-scalas-syntax">here</a>, and <a href="http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html">here</a>). Beyond Scala, one can use Shark (&#8220;SQL&#8221; on Spark), and relatively new Java and Python API&#8217;s.</p>
<p>You can use <a href="http://www.spark-project.org/examples.html">the examples</a> that come with Spark to get started, but I found the essential thing is to get comfortable with the built-in distributed operators. Once I learned RDD&#8217;s and the operators, I started writing <em>iterative</em> programs to implement a few machine-learning and data processing algorithms. (Since Spark distributes &amp; caches data in-memory, you can write pretty <em>fast</em> machine-learning programs on <em>massive</em> data sets.)</p>
<p><strong>It&#8217;s already used in production<br />
</strong>Is anyone <em>really</em> using Spark? While the list of companies is still small, judging from the size of the <a href="http://www.meetup.com/spark-users/">SF Spark Meetup</a> and <a href="http://ampcamp.berkeley.edu/">Amp Camp</a>, I expect many more companies to start deploying Spark. (If you&#8217;re in the SF Bay Area, we are starting a new <a href="http://www.meetup.com/Distributed-data-processing-with-Mesos/">Distributed Data Processing</a> Meetup with <a href="http://www.airbnb.com">Airbnb</a>, and Spark is one of the topics we&#8217;ll cover.)</p>
<hr />
<p>
<strong>Update (8/23/2012)</strong>: Here&#8217;s another important reason to like Spark &#8211; at 14,000 lines of code it&#8217;s much simpler than other software used for Big Data. </p>
<p><strong>The Spark codebase is <u><em>small</em></u>, extensible, and hackable.</strong><br />
Matei&#8217;s <a href="http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf">last presentation</a> at <a href="http://ampcamp.berkeley.edu/">Amp Camp</a> included the diagram below (<em>LOC</em> = lines of code): </p>
<p><a href="http://radar.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html/spark-codebase" rel="attachment wp-att-51200"><img class="size-medium wp-image-51200 aligncenter" src="http://s.radar.oreilly.com/wp-files/2/2012/08/spark-codebase-300x154.jpg" alt="" width="300" height="154" /></a></p>
<p>(Figure courtesy of <a href="http://www.cs.berkeley.edu/~matei/">Matei Zaharia</a>)</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Active Facebook users by region: November, 2010</title>
		<link>http://radar.oreilly.com/2010/11/active-facebook-users-by-region-nov-2010.html</link>
		<comments>http://radar.oreilly.com/2010/11/active-facebook-users-by-region-nov-2010.html#comments</comments>
		<pubDate>Tue, 16 Nov 2010 23:00:00 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[social networking]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/11/active-facebook-users-by-region-nov-2010.html</guid>
		<description><![CDATA[With Facebook unveiling an integrated messaging system for its more than 500 million users, I decided to update a few charts that break down its users by region. ]]></description>
				<content:encoded><![CDATA[<p>With Facebook unveiling an <a href="http://blog.facebook.com/blog.php?post=452288242130">integrated messaging system</a> for its <a href="http://www.facebook.com/press/info.php?statistics">more than 500 million users</a>, I decided to <a href="http://radar.oreilly.com/2010/07/facebook-reaches-half-a-billion.html">update a few charts</a> that break down its users by region.</p>
<p><strong>I.</strong> Percentage share of active users (weekly): note the steady rise in the share of users from Asia.</p>
<p align="center">
<img src="http://s.radar.oreilly.com/fbook_20101114_1.jpg" width="501" height="406" border="1" align="center" hspace="4" vspace="4" alt="pathint" /></p>
<p>
<strong>II.</strong> Market Penetration: Less than 3% penetration in Facebook&#8217;s high-growth regions in Asia &amp; Africa.</p>
<p align="center">
<img src="http://s.radar.oreilly.com/fbook_20101114_2.jpg" width="600" height="257" border="1" align="center" hspace="4" vspace="4" alt="pathint" /></p>
<p><strong>III.</strong> Percent Share of each Age Group (within each Region): Relative to the U.S. (33%), the share of users ages 18-25 <a href="http://radar.oreilly.com/2010/07/facebook-reaches-half-a-billion.html">continues to be higher</a> in Asia (44%), Africa (41%), and the Middle East / N. Africa (39%).</p>
<p align="center">
<img src="http://s.radar.oreilly.com/fbook_20101114_region_and_age_share.jpg" width="600" height="429" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/11/active-facebook-users-by-region-nov-2010.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hiring trends among the major platform players</title>
		<link>http://radar.oreilly.com/2010/11/hiring-trends-among-the-major-platform-players.html</link>
		<comments>http://radar.oreilly.com/2010/11/hiring-trends-among-the-major-platform-players.html#comments</comments>
		<pubDate>Mon, 15 Nov 2010 20:35:05 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[internet operating system]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[Web 2 Summit]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/11/hiring-trends-among-the-major-platform-players.html</guid>
		<description><![CDATA[Consistent with the recent flurry of articles about hiring wars, many platform companies have increased their number of job postings. Winning the battle for the Internet&apos;s points of control requires amassing talent. ]]></description>
				<content:encoded><![CDATA[<p>After re-reading <a href="http://radar.oreilly.com/2010/04/handicapping-internet-platform-wars.html">Tim&#8217;s post on the major internet platform players</a>, I looked at recent hiring trends* among the companies he highlighted. First I examined year-over-year changes in number of job postings (from Aug to Oct 2009 vs. Aug to Oct 2010). Consistent with the recent flurry of articles about hiring wars, all the companies (except for Yahoo) increased** their number of job postings. Winning the battle for the <a href="http://map.web2summit.com/">Internet&#8217;s points of control</a> requires amassing talent:</p>
<p align="center">
<img src="http://s.radar.oreilly.com/w2s2010_trend.jpg" width="537" height="279" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p>Below is the breakdown by most popular <a href="http://www.onetcenter.org/overview.html">occupations</a> over the last three months:</p>
<p align="center">
<img src="http://s.radar.oreilly.com/w2s2010_occupations.jpg" width="600" height="556" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p>[For a similar breakdown by location (most popular metro areas), click <a href="http://s.radar.oreilly.com/w2s2010_metroarea.jpg">HERE</a>.]</p>
<hr />
<p>(*) Using data from a partnership with <a href="http://www.simplyhired.com">SimplyHired</a>, we maintain a data warehouse that includes most U.S. online job postings dating back to late 2005. Since there are no standard data formats for job postings across employment sites, algorithms are used to detect duplicate job postings, companies, occupations, and metro areas. The algorithms are far from perfect, so the above results are at best <strong>extremely rough estimates</strong>.</p>
<p>
(**) As a benchmark, the total number of job postings in our entire data warehouse of jobs grew 68% from Aug/Oct 2009 to Aug/Oct 2010.</p>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/11/hiring-trends-among-the-major-platform-players.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Windows Mobile apps are more expensive than iPhone apps</title>
		<link>http://radar.oreilly.com/2010/11/windows-marketplace-for-mobile-apps.html</link>
		<comments>http://radar.oreilly.com/2010/11/windows-marketplace-for-mobile-apps.html#comments</comments>
		<pubDate>Fri, 05 Nov 2010 11:00:00 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Mobile]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[app store]]></category>
		<category><![CDATA[appstore]]></category>
		<category><![CDATA[game]]></category>
		<category><![CDATA[iphone app]]></category>
		<category><![CDATA[itunes]]></category>
		<category><![CDATA[windows phone]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/11/windows-marketplace-for-mobile-apps.html</guid>
		<description><![CDATA[The Windows Marketplace for Mobile now has about 1,400 apps spread across 16 categories. In this short post I&apos;ll provide some basic statistics and compare it with the granddaddy of app stores: the U.S. iTunes store. ]]></description>
				<content:encoded><![CDATA[<p>
[<strong>Update</strong>: Several readers have correctly pointed out that the <a href="http://marketplace.windowsphone.com">source of the data I used for the post</a> was for Windows <em>Mobile</em> apps, so I decided to tweak the title to reflect that. The goal of this post is to examine the marketplace for Windows smartphone apps prior to the much-anticipated launch of Windows <em>Phone</em> 7. I hope to do an update a few months post-launch.]
</p>
<p>The <a href="http://marketplace.windowsphone.com">Windows Marketplace for Mobile</a> now has about 1,400 apps spread across 16 categories. In this short post I&#8217;ll provide some basic statistics* and compare it with the granddaddy of app stores &#8211; the U.S. iTunes store.</p>
<p align="center">
<a href="http://marketplace.windowsphone.com"><img src="http://s.radar.oreilly.com/winphone_market_logo.jpg" width="350" height="62" border="1" align="center" hspace="4" vspace="4" alt="pathint" /></a>
</p>
<p>First let&#8217;s look at the distribution of apps across categories. <a href="http://radar.oreilly.com/2009/11/games-top-the-charts-iphone-android-markets.html">Like the iPhone and Android platforms</a>, Windows Phone 6.x is rich in game apps. Given that there are far fewer Windows Phone apps, it may take some time before we see the variety of <em>categories</em> found in iTunes. There are large iPhone categories (medical**, education, sports &#8230; ) that aren&#8217;t part of the taxonomy for Windows Marketplace for Mobile.</p>
<p align="center">
<img src="http://s.radar.oreilly.com/winphone1.jpg" width="600" height="730" border="0" align="center" alt="pathint" />
</p>
<p>
More than 90% of the 280,000+ iTunes apps aren&#8217;t free, compared to 78% of apps available on Windows Marketplace for Mobile. Below is the share of free/paid apps across the different categories.</p>
<p align="center">
<img src="http://s.radar.oreilly.com/winphone2.jpg" width="570" height="399" border="0" alt="pathint" />
</p>
<p>
At least for now, Windows Phone 6.x apps are pricier than iPhone apps. The <em>mean</em> price of a paid iPhone app is $3.43, compared to $6.16 for paid apps available on Windows Marketplace for Mobile. Welcome news for the <a href="http://arstechnica.com/gadgets/news/2010/11/windows-phone-7-to-rival-ipad-for-developer-attention-in-2011.ars">many developers gearing up to produce apps for Windows Phone 7</a>!</p>
<p align="center">
<img src="http://s.radar.oreilly.com/winphone3.jpg" width="600" height="702" border="0" align="center" alt="pathint" />
</p>
<p></p>
<hr />
<p>(*) Data for this post: <strong>U.S.</strong> iTunes store through 10/31/2010, limited to <em>iPhone</em> apps; Windows Marketplace for Mobile through 11/3/2010.</p>
<p>(**) The <a href="http://radar.oreilly.com/2008/12/iphone-app-store-first-five-mo.html">Medical category was added several months after the launch</a> of the iTunes app store.</p>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/11/windows-marketplace-for-mobile-apps.html/feed</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Crowdsourcing specific microtasks</title>
		<link>http://radar.oreilly.com/2010/10/crowdsourcing-specific-microtasks.html</link>
		<comments>http://radar.oreilly.com/2010/10/crowdsourcing-specific-microtasks.html#comments</comments>
		<pubDate>Mon, 25 Oct 2010 22:00:00 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[mechanical turk]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/10/crowdsourcing-specific-microtasks.html</guid>
		<description><![CDATA[Since the first-ever Mechanical Turk meetup a year ago, there has been an explosion in crowdsourcing services and a well-attended conference in San Francisco. I remain enthusiastic about crowdsourcing, but the number of companies has me worried about quality of work. Fortunately specialization is already occurring, so for particular tasks there are companies out there ready to provide high-quality service.... ]]></description>
				<content:encoded><![CDATA[<p>Since <a href="http://radar.oreilly.com/2009/06/mechanical-turk-best-practices.html">the first-ever Mechanical Turk meetup</a> a year ago, there has been <a href="http://behind-the-enemy-lines.blogspot.com/2010/10/explosion-of-micro-crowdsourcing.html">an explosion in crowdsourcing services</a> and a well-attended <a href="http://crowdconf.com/vids.html">conference</a> in San Francisco. I remain <a href="http://radar.oreilly.com/2009/06/mechanical-turk-best-practices.html">enthusiastic</a> about crowdsourcing, but the number of companies has me worried about the quality of work. Fortunately, specialization is already occurring, so for particular tasks there are companies out there ready to provide high-quality service.
</p>
<p>
One company that recently caught my eye is Helsinki- (and SF-) based <a href="http://www.microtask.com/">Microtask</a>. Founded by Computer Graphics (CG) and Computer Vision (CV) veterans, Microtask has chosen to focus on a few areas where CG and CV are relevant. Aside from speech transcription, they currently provide form-processing (digitizing hand-written forms) and archive digitization services, and have plans to expand into image categorization and video indexing in the near future. By initially focusing on a few specific tasks, Microtask is able to refine its platform while simultaneously leveraging prior skills in areas such as <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a>.</p>
<p align="center">
<img src="http://s.radar.oreilly.com/microtask.jpg" width="600" height="351" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p>
A few things about the Microtask platform are worth highlighting. To protect the intellectual property of its customers, Microtask never sends a complex task in its entirety to a single service provider. Rather, tasks get broken up into pieces and scattered across multiple providers. This is fairly easy to do for the types of digitization services it offers. Customers who are wary of sending data to outside servers can run Microtask&#8217;s software on their own servers. (Ordinarily, customers use Microtask through a set of APIs.) Finally, Microtask can guarantee quality and delivery time because it has long-term relationships with (labor) service providers. Microtask contracts with call centers throughout the world, and tasks* are performed by workers between service calls.</p>
<p>
Using call-center workers is novel, but crowdsourcing seems increasingly tied to social gaming and virtual currencies. Having come from Computer Graphics and Computer Vision, the founders of Microtask have experience and connections in the gaming industry. <a href="http://www.microtask.com/company/management/">CEO Ville Miettinen</a> admitted that social gaming integration is a high priority for them over the next few years. The key is that they want Microtask to fit seamlessly into the gaming experience: gamers should be able to stay &#8220;in the flow of the game&#8221; while performing crowdsourcing tasks. I&#8217;m looking forward to what they and game designers come up with. A modern equivalent of <a href="http://www.youtube.com/watch?v=sNfQ_B6_xy8">Typing of the Dead</a>, perhaps?</p>
<p></p>
<hr />
<p>(*) It&#8217;s useful to remember that these are simple tasks (e.g., <a href="http://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a>) involving the validation of outputs generated using machine learning. Microtask uses confidence scores generated by its algorithms to rank and prioritize validation tasks.</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/10/crowdsourcing-specific-microtasks.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Amazon&apos;s cloud platform still the largest, but others are closing the gap</title>
		<link>http://radar.oreilly.com/2010/08/amazon-cloud-platform-still-the-largest-but-others-are-closing-the-gap.html</link>
		<comments>http://radar.oreilly.com/2010/08/amazon-cloud-platform-still-the-largest-but-others-are-closing-the-gap.html#comments</comments>
		<pubDate>Tue, 31 Aug 2010 22:30:18 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Web 2.0]]></category>
		<category><![CDATA[@home]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/08/amazon-cloud-platform-still-the-largest-but-others-are-closing-the-gap.html</guid>
		<description><![CDATA[Measured in terms of (U.S.) job postings, Amazon&apos;s Cloud Computing platform is still larger than Google&apos;s App Engine. What&apos;s interesting is that the gap has narrowed over the past year. ]]></description>
				<content:encoded><![CDATA[<p>Tim&#8217;s recent <a href="https://twitter.com/timoreilly/status/22635050946">tweet</a> on the growing demand for Google App Engine skills inspired me to measure the popularity of the major cloud computing platforms. Elance is one of many job boards in our data warehouse of U.S. job postings<sup>1</sup>, and I wanted to measure <a href="http://www.elance.com/p/online-employment-report-it.html">demand</a> across many more job sites.</p>
<p>
Measured in terms of (U.S.) job postings, Amazon&#8217;s Cloud Computing platform is still larger than Google&#8217;s App Engine. What&#8217;s interesting is that the gap has narrowed over the past year<sup>2</sup>:</p>
<p align="center">
<img src="http://s.radar.oreilly.com/cloud_platforms201008.jpg" width="600" height="315" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p>
Over the past two months, the other cloud platforms were roughly one-third (Google), one-fourth (Microsoft), and one-sixth (Rackspace) the size of Amazon. During the same period last year, these platforms were much smaller: Google was one-fifth, Microsoft was one-seventh, and Rackspace one-tenth the size of Amazon.</p>
<p></p>
<hr />
<p>(1) Data for this post is for U.S. online job postings through 8/21/2010 and is maintained in partnership with <a href="http://www.SimplyHired.com">SimplyHired.com</a>. We use algorithms to deduplicate job posts: a single job posting can contain multiple jobs and appear on multiple job sites.</p>
<p>(2) I counted the number of unique job posts that mention each of the cloud computing platforms.</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/08/amazon-cloud-platform-still-the-largest-but-others-are-closing-the-gap.html/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The number of Hadoop jobs continues to rise</title>
		<link>http://radar.oreilly.com/2010/08/number-of-hadoop-jobs-continue-to-rise.html</link>
		<comments>http://radar.oreilly.com/2010/08/number-of-hadoop-jobs-continue-to-rise.html#comments</comments>
		<pubDate>Sun, 08 Aug 2010 21:16:37 +0000</pubDate>
		<dc:creator>Ben Lorica</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://blogs.oreilly.com/radar/2010/08/number-of-hadoop-jobs-continue-to-rise.html</guid>
		<description><![CDATA[While still a small fraction of data management job postings, the number of job posts that mention &#34;hadoop&#34; continues to grow steadily. Year-over-year, there were 300% more such job posts in the first seven months of 2010 than in the same period in 2009. The fraction of &#34;hadoop&#34; jobs posted by California companies remains high, but is definitely lower than what it was last year. ]]></description>
				<content:encoded><![CDATA[<p>While still a small fraction<sup>1</sup> of data management job postings, the number of job posts that mention &#8220;hadoop&#8221; continues to grow steadily. Year-over-year, there were 300% more such job posts<sup>2</sup> in the first seven months of 2010 than in the same period in 2009:</p>
<p align="center">
<img src="http://s.radar.oreilly.com/hadoop_jobs1.jpg" width="450" height="385" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p>
The fraction of &#8220;hadoop&#8221; jobs posted by California companies remains high, but is definitely <a href="http://radar.oreilly.com/2009/06/most-hadoop-jobs-are-in-california.html">lower than what it was last year</a>:</p>
<p align="center">
<img src="http://s.radar.oreilly.com/hadoop_jobs2.jpg" width="565" height="534" border="1" align="center" hspace="4" vspace="4" alt="pathint" />
</p>
<p></p>
<hr />
<p>(1) Over the last three months, job posts that mention &#8220;hadoop&#8221; were inching towards 8-10% of the number of job posts that mention &#8220;mysql&#8221;.</p>
<p>(2) Data for this post is for U.S. online job postings through 7/31/2010 and is maintained in partnership with <a href="http://www.SimplyHired.com">SimplyHired.com</a>. We use algorithms to deduplicate job posts: a single job posting can contain multiple jobs and appear on multiple job sites.</p>
]]></content:encoded>
			<wfw:commentRss>http://radar.oreilly.com/2010/08/number-of-hadoop-jobs-continue-to-rise.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
