Ben Lorica
Ben Lorica is a Senior Analyst in the Research Group at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. At O'Reilly, Ben works on custom research and consulting projects, open source data warehousing and analytics.
Sun
Jan 17
2010
Manifold Learning, Calculus & Friendship, and Other Math Links
by Ben Lorica | @dliman | comments: 3
One of the largest gatherings of mathematicians, the joint meetings of the AMS/MAA/SIAM, took place last week in San Francisco. Knowing that there were going to be over 6,000 pure and applied mathematicians at Moscone West, I took some time off from work and attended several sessions. Below are a few (somewhat technical) highlights. (It's the only conference I've attended where the person managing the press room was also working on some equations in between helping the media.)
The Machine-Learning Bubble in Computational Medicine (Challenges in Computational Medicine and Biology)
Donald Geman gave a nice survey of the problems and mathematical techniques frequently used in computational biology. He also raised something that struck a chord with me. While computational biology has things in common with other fields (the "small n, large d" problem: small samples relative to the number of dimensions), techniques that work in fields like computer vision don't automatically translate to biology. First, sample sizes in biology and medicine are orders of magnitude smaller than in other fields. Second, while black boxes (think SVMs or neural nets) are acceptable in other fields, biologists want accurate predictions and explanations for why/how algorithms work. Finally, it isn't clear whether there are underlying low-dimensional structures in biological data. Taken together, Geman wonders if machine-learning's possible role in biology and medicine has been overhyped.
Using Unlabeled Data To Identify Optimal Classifiers (A Geometric Perspective on Learning Theory and Algorithms)
Revisiting the "small n, large d" problem, Partha Niyogi gave an overview of recent geometric approaches to machine-learning. In order to mitigate the curse of dimensionality, Niyogi and his fellow researchers exploit the tendency of (natural) data to be non-uniformly distributed. In particular, they use the shape of the data to determine optimal machine-learning classifiers. In their version of manifold learning, they assume that the space of target functions (e.g. all possible classifiers) consists of functions supported on a submanifold of the original high-dimensional Euclidean/feature space. One of the most interesting features of their geometric approach is their use of both labeled and unlabeled data to identify optimal classifiers. The traditional approaches to training classifiers require labeled data. So while one can use mechanical turks to increase the amount of labeled data for learning purposes, the geometric techniques outlined by Dr. Niyogi actually take advantage of any unlabeled data you may already have. Lest you think that these are purely academic/theoretical techniques, Dr. Niyogi cites a company that uses these algorithms to analyze and classify child speech patterns. With so much Data Exhaust available, I can't help but think that techniques that can leverage unlabeled data will prove useful in many domains. (Niyogi and his collaborators have many papers on Manifold Learning, including one that describes the algorithms, and another that provides the theoretical foundations.)
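To make the idea of exploiting unlabeled data concrete, here is a minimal sketch of graph-based label propagation. It is a simple relative of the manifold methods described above, not Niyogi's actual algorithm: the sample points, the neighborhood size `k`, and the function names are all assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbors (symmetric adjacency)."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        by_distance = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in by_distance[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def propagate_labels(points, labels, k=2, iterations=100):
    """labels maps a few point indices to +1.0 or -1.0; the rest are unlabeled.

    Each unlabeled point repeatedly takes the average score of its graph
    neighbors, so labels spread along the shape of the data.
    """
    neighbors = knn_graph(points, k)
    scores = [labels.get(i, 0.0) for i in range(len(points))]
    for _ in range(iterations):
        updated = []
        for i in range(len(points)):
            if i in labels:
                updated.append(labels[i])  # labeled points stay clamped
            else:
                avg = sum(scores[j] for j in neighbors[i]) / len(neighbors[i])
                updated.append(avg)
        scores = updated
    return [1 if s > 0 else -1 for s in scores]

# Two well-separated clusters on a line; only one point per cluster is labeled.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.3,)]
labels = {0: 1.0, 7: -1.0}
predicted = propagate_labels(points, labels)
print(predicted)
```

With just two labeled points, the six unlabeled ones inherit the label of the cluster they sit in, because the neighbor graph never links the two clusters.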
tags: algorithms, big data, geometry, machine-learning, math, mathematics
Thu
Jan 14
2010
Collecting, Aggregating, and Analyzing Data Exhaust
by Ben Lorica | @dliman | comments: 1
Next week, O'Reilly's Research Director Roger Magoulas will lead an exciting panel discussion on Big Data. The focus will be on the piles of data that companies have been collecting, and are just beginning to analyze:
The internet and social media create a mountain of random, unstructured, and at times ephemeral data by-products, which may appear to be trash. Yet, one person’s trash is another’s treasure. From FaceBook to Netflix, people are spending more time sharing their thoughts, opinions, plans and perspectives as they socialize and conduct business online. With each of these Internet exchanges traces of information, or Data Exhaust, are left behind. When correlated or combined, these snippets can provide insight into political views, professional achievements, purchasing behaviors, and demographic information—pinpointing trend setters and leading indicators. Brilliant innovators now re-purpose this data stream, aggregating and analyzing the data to provide new products or services.
Next Tuesday's panel discussion and networking event will be held at the Stanford Business School. Further details are available on the VLAB web site.
() Recent Radar posts on Big Data: (1) Counting Unique Users in Real-time with Streaming Databases, (2) Pipelining and Real-time Analytics with MapReduce Online
tags: analytics, big data, sensors, streams
Mon
Dec 14
2009
Apps Per Seller Across the US iTunes Categories
by Ben Lorica | @dliman | comments: 3
Measured in terms of number of unique apps, the Top 5 categories in the U.S. app store have been Games, Books, Entertainment, Travel and Utilities. But comparing categories in terms of number of apps doesn't capture the challenge of developing applications in different categories. As I noted in an earlier post, it's much easier to develop a Book app than an interactive game.
One crude measure for the relative complexity of developing apps across categories is to compare the number of apps per seller. The Top 5 categories in Nov/2009 were Books (17 apps per seller), Travel (6 apps per seller), Education (4 per seller), and Reference and Sports (3 per seller each). There were also 3 apps per seller in the Games and Entertainment categories in Nov/2009:
() Data for this post runs through 12/10/2009, and covers the U.S. iTunes App store.
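The apps-per-seller ratio above is simple to compute once each listing is tagged with its category and seller. The sketch below assumes a hypothetical list of `(category, seller, app_name)` tuples with made-up values; it is not the data or the exact counting method used for this post.

```python
from collections import defaultdict

def apps_per_seller(listings):
    """listings: iterable of (category, seller, app_name) tuples."""
    apps = defaultdict(set)
    sellers = defaultdict(set)
    for category, seller, app_name in listings:
        apps[category].add(app_name)       # unique apps in the category
        sellers[category].add(seller)      # unique sellers in the category
    return {c: len(apps[c]) / len(sellers[c]) for c in apps}

# Made-up listings, for illustration only.
listings = [
    ("Books", "PublisherA", "Novel 1"),
    ("Books", "PublisherA", "Novel 2"),
    ("Books", "PublisherA", "Novel 3"),
    ("Books", "PublisherB", "Novel 4"),
    ("Games", "StudioX", "Puzzle"),
    ("Games", "StudioY", "Racer"),
]
ratios = apps_per_seller(listings)
print(ratios)
```

In this toy input one publisher ships three of the four Book apps, so Books comes out at 2 apps per seller versus 1 for Games.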
Fri
Nov 20
2009
Asia Continues to be Facebook's Strongest Growth Region
by Ben Lorica | @dliman | comments: 1
With Facebook topping 330 million active users over the past week, the company's strongest growth region continues to be Asia. Over the last 12 weeks, Facebook added close to 17M active users in Asia alone. Since my previous post, the share of active users from Asia grew by 2% (to 13.5% of all users), and roughly 1 in 7 users now come from the region. With a market penetration under 2%, Facebook is poised to add many more users in Asia (and Africa).
Compared to the U.S., the proportion of Facebook users in their teens (13-17) or in the 18-25 age group is much higher in Asia:
As was the case in other parts of the world, expect the share of users 45 and older to climb as Facebook becomes more mainstream in Asia. Growth was strong across all age groups in Asia over the last 12 weeks, particularly among teens (+90%) and the 18-25 age group (+60%).
In closing I want to highlight countries (within several regions) where Facebook has been growing rapidly:
tags: facebook, hard numbers, platforms, research, social networking
Wed
Nov 11
2009
Counting Unique Users in Real-time with Streaming Databases
by Ben Lorica | @dliman | comments: 6
As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.
In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:
Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.
Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
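As a rough illustration of how bitmaps and set operations can answer distinct-count queries on a stream, here is a toy sketch. It is not Truviso's implementation: the class name, the attribute names, and the use of uncompressed Python integers as bitmaps are all assumptions made for the example.

```python
class UniqueUserCounter:
    """Toy bitmap index: one bitmap per attribute value, one bit per user."""

    def __init__(self):
        self.user_bit = {}   # user id -> bit position
        self.bitmaps = {}    # (attribute, value) -> int used as a bitmap

    def observe(self, user_id, **attributes):
        """Update the bitmaps as each event streams in."""
        position = self.user_bit.setdefault(user_id, len(self.user_bit))
        for key_value in attributes.items():
            current = self.bitmaps.get(key_value, 0)
            self.bitmaps[key_value] = current | (1 << position)

    def count_distinct(self, **filters):
        """Distinct users matching every filter: AND the bitmaps, count set bits."""
        result = (1 << len(self.user_bit)) - 1   # start with all known users
        for key_value in filters.items():
            result &= self.bitmaps.get(key_value, 0)
        return bin(result).count("1")

counter = UniqueUserCounter()
counter.observe("u1", country="US", age="18-25")
counter.observe("u2", country="US", age="26-35")
counter.observe("u1", country="US", age="18-25")  # repeat visit reuses the same bit
counter.observe("u3", country="JP", age="18-25")

print(counter.count_distinct())                            # all distinct users
print(counter.count_distinct(country="US"))                # distinct US users
print(counter.count_distinct(country="US", age="18-25"))   # both filters applied
```

Because a repeat visit sets a bit that is already set, the counts stay correct without any deduplication pass, and each demographic filter is just one more bitwise AND.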
Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for "on-the-fly predictive modeling".
The other main enhancement in this release is Truviso's move towards parallel processing. Their new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso's parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.
[For my previous posts on real-time analytic tools see here and here.]
UPDATE (11/23/2009): Google acquires Teracent, a San Mateo startup that specializes in real-time analytics for optimizing online banner/display ads.
tags: a/b testing, analytics, big data, real-time, sensors, sql, streams
Tue
Nov 3
2009
Games Top the Charts in the iPhone and Android App Markets
by Ben Lorica | @dliman | comments: 2
While it might be true that the number of Book apps is growing at a faster rate, Games continue to dominate the list of popular U.S. iTunes Apps. Games accounted for about a fifth of all iTunes apps over the past week, but the category continued to have a disproportionate share of the Top 100 charts, accounting for 52% of the Top Grossing, 56% of the Top Paid, and 50% of the Top Free apps:
Since most Book apps are actually individual e-books, the Gaming category would have a hard time keeping up with the ever increasing number of Books. Once publishers figured out how to turn their titles into iPhone apps, the number of Book apps started growing faster than Games. Nevertheless Games continue to rule the Top 100 charts.
A similar story is playing out on the Android platform: the most popular Android apps are primarily Games. (In the Android taxonomy, most Books are in the Reference category.)
Returning to the top iPhone apps, the price of the Top Grossing apps stabilized somewhat last week. Except for the top decile (rank 1 through 10) for which the median price was about $7, the median price across the other deciles was around $5.
Over the last week, the Top Paid Games were slightly more expensive than apps that made the overall Top 100 Paid list. iPhone Game developers will tell you that (visually) compelling and engaging iPhone Games are far from trivial to design and market. So it's no surprise that the creators of the most popular Games are starting to charge a little more for their software.
() Data for this post was for the week ending 11/1/2009.
() First, designing for such a small screen poses a major challenge. Secondly, the sheer number of Game apps (close to 20K last week) makes it hard to create something that turns into a long-running top-seller.
tags: android, iphone, mobile, platform, smartphone
Wed
Oct 28
2009
Twitter Users Most Followed by the Web 2.0 Summit Crowd
by Ben Lorica | @dliman | comments: 7
I took the set of users who posted tweets containing the hashtag #w2s and determined who those users followed. Unlike the list of the most followed users in all of Twitter, the list isn't dominated by celebrities. (A few coders landed in the top 50.) Regular Radar readers will be familiar with many of the users listed below: over 20 of the top 50 are based in the SF Bay Area. Of the over 700 users I identified, a third follow Tim:
() Data for this post was pulled on 10/27/2009. Using the Twitter search API, I was able to identify 1,500 relevant tweets and over 700 unique users responsible for those tweets. Given that I likely omitted earlier tweets, the results are at best an approximation of the true top 50 list.
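The ranking step described above boils down to counting how often each account appears in the sampled users' follow lists. A minimal sketch, assuming the follow lists have already been collected into a dictionary (the account names below are placeholders, and no Twitter API calls are shown):

```python
from collections import Counter

def most_followed(following, top_n):
    """following maps each sampled user to the set of accounts they follow."""
    counts = Counter()
    for followed_accounts in following.values():
        counts.update(followed_accounts)
    sample_size = len(following)
    # Each entry: (account, followers within the sample, share of the sample)
    return [(account, n, n / sample_size) for account, n in counts.most_common(top_n)]

# Placeholder data, for illustration only.
following = {
    "user_a": {"account_x", "account_y"},
    "user_b": {"account_x"},
    "user_c": {"account_y", "account_x", "account_z"},
}
top = most_followed(following, top_n=2)
print(top)
```

The third element of each tuple is the share of the sample that follows the account, which is the kind of figure behind a statement like "a third follow Tim".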
tags: twitter, web 2.0 summit, web squared, web2summit
Tue
Oct 20
2009
Pipelining and Real-time Analytics with MapReduce Online
by Ben Lorica | @dliman | comments: 2
Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented protocols (pubsubhubbub, rsscloud) designed to reduce latency in web publishing. Less discussed are the analytic tools that are capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low latency analytic tools.
Some organizations create their own real-time analysis tools, while others turn to specialized solutions. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.
In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and streambase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.
At least for MapReduce/Hadoop systems things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution is to modify MapReduce to allow intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:
A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.
Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue using Hive or Pig.
Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.
Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.
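The online aggregation idea quoted above can be illustrated with plain Python generators: the consumer starts refining its answer while the producer is still emitting records. This is a single-process sketch of the concept only, not HOP's implementation; the `mapper` stand-in, the `latency_ms` field, and the `report_every` parameter are assumptions made for the example.

```python
def mapper(records):
    """Stand-in for a map task that emits values one at a time."""
    for record in records:
        yield record["latency_ms"]

def online_mean(stream, report_every=2):
    """Yield (records_seen, running_mean) while the stream is still being read."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        if count % report_every == 0:
            yield count, total / count   # early, approximate answer
    if count and count % report_every != 0:
        yield count, total / count       # final answer over all records

records = [{"latency_ms": v} for v in (10, 30, 20, 40, 50)]
estimates = list(online_mean(mapper(records)))
print(estimates)
```

Each intermediate tuple is an estimate available before the input is exhausted; only the last one reflects every record.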
In a recent conversation with lead authors Tyson Condie and Neil Conway, they highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task, or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow running tasks) by using previous results to kick off smart restarts. Finally, they are currently incorporating a continuous and adaptive optimizer that, for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers & reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.
For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.
So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is on the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. OTOH, solutions like Truviso and streambase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and streambase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".
UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.
() Traditional pull-oriented systems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
() For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.
Tue
Oct 13
2009
Mechanical Turk app on the iPhone Provides Work for Refugees
by Ben Lorica | @dliman | comments: 7
Mechanical Turk service provider CrowdFlower and microwork non-profit Samasource have teamed up to make their services available to iPhone users. Users of CrowdFlower's mechanical turk platform can now opt to send their tasks to iPhone users. Previously, CrowdFlower users could choose between Amazon mechanical turks or CrowdFlower's stable of turks.
The Give Work iPhone app takes tasks (created by real companies) and sends them to iPhone users who volunteer to complete them. Meanwhile, workers in a Kenyan refugee camp perform the same tasks using CrowdFlower's regular web interface. In essence, Kenyan refugees work to increase the accuracy of the results provided by the army of volunteer iPhone mechanical turks. In a previous post on Mechanical Turk Best Practices, I highlighted recent research that suggested that for a large set of tasks, the aggregate work of 4-6 turks compares favorably with a single (domain) expert.
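One common way to aggregate the work of several turks on the same task is a simple majority vote. The sketch below shows that approach with made-up task IDs and labels; it is an assumption for illustration and not a description of how CrowdFlower combines judgments.

```python
from collections import Counter

def majority_vote(judgments):
    """Return (winning_label, agreement) for one task's worker judgments."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]   # ties go to the label seen first
    return label, votes / len(judgments)

# Made-up judgments from 4-5 workers per task.
task_judgments = {
    "photo_17": ["cat", "cat", "dog", "cat", "cat"],
    "photo_18": ["dog", "cat", "dog", "dog"],
}
aggregated = {task: majority_vote(votes) for task, votes in task_judgments.items()}
print(aggregated)
```

The agreement fraction gives a cheap confidence signal: tasks with low agreement can be routed to additional workers or to an expert.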
The payment for tasks sent to CrowdFlower's iPhone app goes entirely to the workers in the Kenyan refugee camp. In addition, Samasource has negotiated with money transfer services, so the payment goes through with zero transaction costs.
The turks in the refugee camps are recent graduates of Samasource's computer training program. Rather than sitting idly while they wait to be employed, they earn money performing simple computer tasks for real companies. On the other hand, Give Work app users volunteer to perform simple tasks on their iPhone knowing that refugees in Africa are benefiting. CrowdFlower founder Lukas Biewald notes that their work with Samasource opens up their platform to companies who want to tap into and help micro-workers in developing countries.
There are other mechanical turk services that employ workers in developing countries (see for example txteagle). What distinguishes CrowdFlower is an innovative web interface that lets companies easily upload/define their projects and choose the set of turks they want to use: Amazon, CrowdFlower, and now iPhone users + Kenyan refugees. CrowdFlower has many other features worth noting including analytics and reporting, tools to increase accuracy, and a services team that works with companies interested in custom solutions.
When I talk to companies about using mechanical turks, many are still unaware of what they even are, and most don't quite know how to use them. In our work, we routinely use turks to build machine-learning training sets, and for tasks that require the levels of accuracy that algorithms are unable to deliver. Thanks to companies like CrowdFlower, it's now really easy for companies to dip their toes, and experiment with integrating mechanical turks. And with the launch of their Give Work iPhone app, companies can simultaneously opt to provide income to workers in developing countries.
() We are users of CrowdFlower's mechanical turk platform.
() Actually nervous laughter is a common response!
tags: africa, developing world, iphone_app, mechanical turk
Thu
Oct 8
2009
The iPhone as a Gaming Platform: Share of Top Apps By Category
by Ben Lorica | @dliman | comments: 4
As a follow-up to my recent post on the Top Grossing Apps list on iTunes, I examined three lists highlighted in the app store: the Top Paid, Top Free, and Top Grossing Apps. Believing that many users scan these lists, developers covet a spot on any of these Top 100 charts.
In my previous posts, I've highlighted that Games is the largest category, accounting for about 20% of unique apps. The graphs below show that the gaming category has a much larger share in each of the three Top 100 lists:
68% of the Top Paid, 67% of the Top Free, and 50% of the Top Grossing apps were Games. Other categories that had a disproportionate share of apps in the Top 100 rankings include Social Networking, Photography, and (to a lesser extent) Sports and Utilities.
In contrast, three of the five largest categories (Books, Travel, Education) were severely underrepresented in each of the U.S. iTunes Top 100 Charts.
() Size of a category is measured in terms of unique apps.
() Data for this post was from the two weeks ending 10/4/2009. I consider an app as being in the Top 100, if it was listed among the most popular (free, paid or grossing) apps, sometime during those two weeks.
Recent Posts
- The Price of The Top Grossing iTunes Apps on October 6, 2009
- There are Over a Million People Actively Using Facebook Right Now on September 24, 2009
- Mobile Banks in the Developing World Prove Simpler is Better on September 17, 2009
- Resetting Expectations: Some Augmented Reality Links on September 9, 2009
- The Most Popular iTunes Apps Aren't Always The Cheapest on August 27, 2009
- Compared to the US, Facebook is Younger in Asia and the Middle East on August 18, 2009
- Big Data and Real-time Structured Data Analytics on August 13, 2009
- The iTunes App Store Rolls with the Travel Season on August 10, 2009
- Infographic of the Day: Who Came to the US in 2008 on August 6, 2009
- The US Online Job Market Improved Slightly in July on August 5, 2009















