Ben Lorica

Ben Lorica is a Senior Analyst in the Research Group at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. At O'Reilly, Ben works on custom research and consulting projects, open source data warehousing and analytics.

 

Fri, Nov 20, 2009

Asia Continues to be Facebook's Strongest Growth Region

by Ben Lorica (@dliman)

With Facebook topping 330 million active users over the past week, the company's strongest growth region continues to be Asia. Over the last 12 weeks, Facebook added close to 17M active users in Asia alone. Since my previous post, the share of active users from Asia grew by 2% (to 13.5% of all users), and roughly 1 in 7 users now come from the region. With a market penetration under 2%, Facebook is poised to add many more users in Asia (and Africa).

Compared to the U.S., the proportion of Facebook users in their teens (13-17) or in the 18-25 age group is much higher in Asia:

As was the case in other parts of the world, expect the share of users 45 and older to climb as Facebook becomes more mainstream in Asia. Growth was strong across all age groups in Asia over the last 12 weeks, particularly among teens (+90%) and the 18-25 age group (+60%).

In other regions, notably North America, Europe, the Middle East, and South America, growth in the 18-25 age bracket lagged behind growth among users 45 and older.

In closing I want to highlight countries (within several regions) where Facebook has been growing rapidly:

tags: facebook, hard numbers, platforms, research, social networking

 

Wed, Nov 11, 2009

Counting Unique Users in Real-time with Streaming Databases

by Ben Lorica (@dliman)

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
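To make the bitmap idea concrete, here is a minimal sketch in Python. It is not Truviso's implementation: the class name, method names, and sample events are all hypothetical, and plain Python integers stand in for the compressed bitmaps a real system would use. Each user gets one bit position, each (attribute, value) pair gets its own bitmap, and a filtered distinct count becomes a bitwise AND followed by a count of the set bits.

```python
from collections import defaultdict


class UniqueUserBitmaps:
    """Toy bitmap index: one Python int per (attribute, value) pair, used as a bitset."""

    def __init__(self):
        self.user_slot = {}               # user id -> bit position
        self.bitmaps = defaultdict(int)   # (attribute, value) -> bitset of users
        self.all_users = 0                # bitset of every user seen so far

    def observe(self, user_id, **attributes):
        """Fold one streamed event into the bitmaps as it arrives."""
        slot = self.user_slot.setdefault(user_id, len(self.user_slot))
        bit = 1 << slot
        self.all_users |= bit
        for key, value in attributes.items():
            self.bitmaps[(key, value)] |= bit

    def count_distinct(self, **filters):
        """Count unique users matching every filter (set intersection via bitwise AND)."""
        result = self.all_users
        for key, value in filters.items():
            result &= self.bitmaps.get((key, value), 0)
        return bin(result).count("1")


# Hypothetical event stream; "u1" appears twice but is counted once.
index = UniqueUserBitmaps()
events = [
    ("u1", {"region": "asia", "age": "18-25"}),
    ("u2", {"region": "asia", "age": "45+"}),
    ("u1", {"region": "asia", "age": "18-25"}),
    ("u3", {"region": "europe", "age": "18-25"}),
]
for user_id, attrs in events:
    index.observe(user_id, **attrs)

print(index.count_distinct())                            # all unique users
print(index.count_distinct(region="asia"))               # one filter
print(index.count_distinct(region="asia", age="18-25"))  # two filters combined
```

Because the bitmaps are updated on every event, the counts are available at any moment without a batch pass over the logs, which is the property the paragraph above describes.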

Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for "on-the-fly predictive modeling".

The other main enhancement in this release is Truviso's move towards parallel processing. Their new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso's parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.

[For my previous posts on real-time analytic tools see here and here.]

tags: a/b testing, analytics, big data, real-time, sensors, sql, streams

 

Tue, Nov 3, 2009

Games Top the Charts in the iPhone and Android App Markets

by Ben Lorica (@dliman)

While it might be true that the number of Book apps is growing at a faster rate, Games continue to dominate the list of popular U.S. iTunes Apps. Games accounted for about a fifth of all iTunes apps over the past week, but the category continued to have a disproportionate share of the Top 100 charts, accounting for 52% of the Top Grossing, 56% of the Top Paid, and 50% of the Top Free apps:

Since most Book apps are actually individual e-books, the Gaming category would have a hard time keeping up with the ever increasing number of Books. Once publishers figured out how to turn their titles into iPhone apps, the number of Book apps started growing faster than Games. Nevertheless Games continue to rule the Top 100 charts.

A similar story is playing out on the Android platform: the most popular Android apps are primarily Games. (In the Android taxonomy, most Books are in the Reference category.)

Returning to the top iPhone apps, the price of the Top Grossing apps stabilized somewhat last week. Except for the top decile (rank 1 through 10) for which the median price was about $7, the median price across the other deciles was around $5.

Over the last week, the Top Paid Games were slightly more expensive than apps that made the overall Top 100 Paid list. iPhone Game developers will tell you that (visually) compelling and engaging iPhone Games are far from trivial to design and market††. So it's no surprise that the creators of the most popular Games are starting to charge a little more for their software.

(†) Data for this post was for the week ending 11/1/2009.
(††) First, designing for such a small screen poses a major challenge. Second, the sheer number of Game apps (close to 20K last week) makes it hard to create something that turns into a long-running top-seller.

tags: android, iphone, mobile, platform, smartphone

 

Wed, Oct 28, 2009

Twitter Users Most Followed by the Web 2.0 Summit Crowd

by Ben Lorica (@dliman)

I took the set of users who posted tweets containing the hashtag #w2s and determined who those users followed. Unlike the list of the most followed users in all of Twitter, the list isn't dominated by celebrities. (A few coders landed in the top 50.) Regular Radar readers will be familiar with many of the users listed below: over 20 of the top 50 are based in the SF Bay Area. Of the over 700 users I identified, a third follow Tim:

UPDATE: Pete Warden has been doing similar analysis to help conference organizers and attendees. He goes a step further and monitors conversations (one Twitter user mentioning another user, and vice-versa). Here is Pete's network graph of the recent Web 2.0 Summit.

(†) Data for this post was pulled on 10/27/2009. Using the Twitter search API, I was able to identify 1,500 relevant tweets and over 700 unique users responsible for those tweets. Given that I likely omitted earlier tweets, the results are at best an approximation of the true top 50 list.
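
The ranking step described in this post can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script I used: the `following` dictionary and the account names in it are made up, and the step that pulls tweets and follow lists from Twitter is left out entirely. The sketch only shows the counting: for each user in the sample, tally every account they follow, then rank accounts by how many sampled users follow them.

```python
from collections import Counter


def most_followed(following, top_n):
    """Rank accounts by how many users in the sample follow them."""
    counts = Counter()
    for followed_accounts in following.values():
        counts.update(set(followed_accounts))   # count each follower at most once
    sample_size = len(following)
    return [(account, n, n / sample_size) for account, n in counts.most_common(top_n)]


# Hypothetical sample: user -> accounts that user follows.
following = {
    "alice": ["timoreilly", "radar", "bob"],
    "bob": ["timoreilly", "alice"],
    "carol": ["radar", "timoreilly"],
}
ranking = most_followed(following, top_n=2)
for account, followers, share in ranking:
    print(f"{account}: {followers} followers in sample ({share:.0%})")
```

The third element of each tuple is the share of the sample following that account, the same kind of figure as "a third follow Tim" above.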

tags: twitter, web 2.0 summit, web squared, web2summit

 

Tue, Oct 20, 2009

Pipelining and Real-time Analytics with MapReduce Online

by Ben Lorica (@dliman)

Most of the news related to the real-time web these days centers around the adoption of decentralized, push-oriented protocols (PubSubHubbub, rssCloud) designed to reduce latency in web publishing. Less discussed are the analytic tools that are capable of crunching through data in real-time. As more of the web moves towards these types of publishing tools, data-driven organizations will demand low-latency analytic tools.

Some organizations create their own real-time analysis tools, while others turn to specialized solutions††. The Huffington Post developed in-house tools that let editors optimize headlines in near real-time. In some domains, the need for real-time analytics isn't new and companies have moved in with targeted products: SF-based Splunk is a popular real-time analytic tool for IT organizations.

In a previous post, I highlighted SQL-based real-time analytic tools that can handle large amounts of data. Tools like Truviso (based on the Postgres database) and StreamBase are attractive in that they require little adjustment for developers already familiar with SQL. In the same post, I noted that other big data management systems such as MPP databases and MapReduce/Hadoop were too batch-oriented (load all the data, then analyze) to deliver analysis in near real-time.

At least for MapReduce/Hadoop systems, things may have changed slightly since my last post. A group of researchers from UC Berkeley and Yahoo recently modified MapReduce to allow for pipelining between operators. Rather than waiting for a Map or Reduce operator to complete (or "materialize to stable storage") before kicking off a subsequent operation, their solution allows intermediate data to be pipelined between operators. As they noted in their paper, pipelining holds several advantages:

A downstream dataflow element can begin consuming data before a producer element has finished execution, which can increase opportunities for parallelism, improve utilization, and reduce response time.

Since reducers begin processing data as soon as it is produced by mappers, they can generate and refine an approximation of their final answer during the course of execution. This technique, known as online aggregation, can reduce the turnaround time for data analysis by several orders of magnitude.

Pipelining widens the domain of problems to which MapReduce can be applied. This allows MapReduce to be applied to domains such as system monitoring and stream processing.
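
To illustrate pipelining and online aggregation, here is a single-process analogy in Python. It is not HOP code and uses no Hadoop APIs: the function names, the `snapshot_every` parameter, and the sample log lines are all hypothetical. The mapper is a generator, so the reducer starts consuming pairs before the mapper has finished, and the reducer periodically emits a running snapshot that gets refined as more data arrives.

```python
from collections import Counter


def mapper(lines):
    """Emit (word, 1) pairs one at a time instead of materializing the full output."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def online_reducer(pairs, snapshot_every):
    """Consume pairs as they arrive and yield running snapshots of the counts."""
    totals = Counter()
    seen = 0
    for key, value in pairs:
        totals[key] += value
        seen += 1
        if seen % snapshot_every == 0:
            yield dict(totals)   # early, approximate answer
    yield dict(totals)           # final answer once the input is exhausted


log_lines = ["error timeout", "ok", "error disk", "ok ok"]
snapshots = list(online_reducer(mapper(log_lines), snapshot_every=3))
for number, snapshot in enumerate(snapshots, start=1):
    print(number, snapshot)
```

In a batch-style job only the last snapshot would exist. Here the earlier snapshots play the role of the preliminary results that online aggregation makes available during execution.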

Much like the stream databases I described previously, their approach to pipelining allows MapReduce jobs to "run continuously" and analyze new data as it arrives, enabling MapReduce/Hadoop to handle real-time monitoring and analysis tasks. The kicker is that their method of pipelining preserves the fault-tolerance and programming interfaces developers have come to associate with MapReduce frameworks. As an example, users of their Hadoop Online Prototype (or HOP) can continue using Hive or Pig.

In a recent conversation, lead authors Tyson Condie and Neil Conway highlighted a few other features of HOP that would make it attractive to current Hadoop users. First, HOP not only preserves Hadoop's public interfaces, it also allows for jobs to be co-scheduled and pipelined, thus reducing the need to write results to HDFS. Second, pipelining leads to preliminary results and early feedback, resulting in faster debugging cycles. Upon seeing early results, a developer can either kill a task or toggle between pipeline and block mode. Third, HOP does a better job of handling stragglers (slow-running tasks) by using previous results to kick off smart restarts. Finally, they are currently incorporating a continuous and adaptive optimizer that, for a given task, will let HOP converge to the optimal degree of parallelism. The optimizer will allow HOP to scale up/down, dynamically adding/dropping mappers and reducers, based on data being pipelined. In preliminary experiments, they found that superior cluster utilization via pipelining can mean substantial reductions in job completion times.

For those interested in performing real-time analytics within Hadoop, Tyson and Neil informed us that they will make the HOP code publicly available within a month. When asked if HOP can handle large data sets, they confirmed that researchers inside Yahoo have ongoing (successful) experiments using HOP on "Hadoop scale" data. Over the long-term, they predict some form of pipelining will become standard within Hadoop.

So how does HOP compare with the real-time SQL databases I described in an earlier post? For domains where the latency required is on the order of (sub) milliseconds (e.g. algorithmic trading), HOP probably won't help. On the other hand, solutions like Truviso and StreamBase have shown they can handle those types of problems. But for a broader class of problems where a delay of a few seconds is acceptable, HOP will be a suitable analytic engine. In terms of usability, tools like Truviso and StreamBase look and work like standard SQL, making them fairly accessible to a broad class of users. To make HOP more accessible, Tyson and Neil noted that one interesting side project is to modify equivalent MapReduce tools (Hive and Pig) to incorporate "continuous and real-time queries".

UPDATE (11/12/2009): Neil Conway just announced that the source code for HOP (Hadoop Online Prototype) is now available.

(†) Traditional pull-oriented systems require subscribers to nag publishers regularly ("Do you have something new?"). Push models deliver content to clients automatically as soon as new content is published ("Don't call us, we'll call you.").
(††) For real-time structured data analysis, enterprises favor the term complex event-processing (CEP). An example is TIBCO's CEP software.

tags: analytics, big data, cep, hadoop, hive, mapreduce, mpp, real-time, streams

 

Tue, Oct 13, 2009

Mechanical Turk app on the iPhone Provides Work for Refugees

by Ben Lorica (@dliman)

Mechanical Turk service provider CrowdFlower and microwork non-profit Samasource have teamed up to make their services available to iPhone users. Users of CrowdFlower's mechanical turk platform can now opt to send their tasks to iPhone users. Previously, CrowdFlower users could choose between Amazon mechanical turks or CrowdFlower's stable of turks.

The Give Work iPhone app takes tasks (created by real companies) and sends them to iPhone users who volunteer to complete them. Meanwhile, workers in a Kenyan refugee camp perform the same tasks using CrowdFlower's regular web interface. In essence, Kenyan refugees work to increase the accuracy of the results provided by the army of volunteer iPhone mechanical turks. In a previous post on Mechanical Turk Best Practices, I highlighted recent research that suggested that for a large set of tasks, the aggregate work of 4-6 turks compares favorably with a single (domain) expert.
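
One simple way to combine the work of several turks on the same task is a majority vote. The sketch below assumes that rule purely for illustration: I am not describing how CrowdFlower or the research mentioned above aggregates results, and the task IDs and labels are invented.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Return the most common label per task and the share of workers who chose it."""
    results = {}
    for task_id, labels in judgments.items():
        label, votes = Counter(labels).most_common(1)[0]
        results[task_id] = (label, votes / len(labels))
    return results


# Hypothetical judgments: task id -> labels submitted by different workers.
judgments = {
    "img-1": ["cat", "cat", "dog", "cat", "cat"],
    "img-2": ["dog", "dog", "dog", "cat"],
}
consensus = aggregate_labels(judgments)
for task_id, (label, agreement) in consensus.items():
    print(f"{task_id}: {label} (agreement {agreement:.0%})")
```

The agreement share gives a rough signal of which tasks deserve a second look; when two labels tie, this sketch keeps whichever label was submitted first.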

The payment for tasks sent to CrowdFlower's iPhone app goes entirely to the workers in the Kenyan refugee camp. In addition, Samasource has negotiated with money transfer services, so the payment goes through with zero transaction costs.

The turks in the refugee camps are recent graduates of Samasource's computer training program. Rather than sitting idly while they wait to be employed, they earn money performing simple computer tasks for real companies. On the other hand, Give Work app users volunteer to perform simple tasks on their iPhones knowing that refugees in Africa are benefiting. CrowdFlower founder Lukas Biewald notes that their work with Samasource opens up their platform to companies who want to tap into and help micro-workers in developing countries.

There are other mechanical turk services that employ workers in developing countries (see for example txteagle). What distinguishes CrowdFlower is an innovative web interface that lets companies easily upload/define their projects and choose the set of turks they want to use: Amazon, CrowdFlower, and now iPhone users + Kenyan refugees. CrowdFlower has many other features worth noting including analytics and reporting, tools to increase accuracy, and a services team that works with companies interested in custom solutions.

When I talk to companies about using mechanical turks, many are still unaware†† of what they even are, and most don't quite know how to use them. In our work, we routinely use turks to build machine-learning training sets, and for tasks that require the levels of accuracy that algorithms are unable to deliver. Thanks to companies like CrowdFlower, it's now really easy for companies to dip their toes, and experiment with integrating mechanical turks. And with the launch of their Give Work iPhone app, companies can simultaneously opt to provide income to workers in developing countries.

(†) We are users of CrowdFlower's mechanical turk platform.
(††) Actually nervous laughter is a common response!

tags: africa, developing world, iphone_app, mechanical turk

 

Thu, Oct 8, 2009

The iPhone as a Gaming Platform: Share of Top Apps By Category

by Ben Lorica (@dliman)

As a follow-up to my recent post on the Top Grossing Apps list on iTunes, I examined three lists highlighted in the app store: the Top Paid, Top Free, and Top Grossing Apps. Believing that many users scan these lists, developers covet a spot on any of these Top 100 charts.

In my previous posts, I've highlighted that Games is the largest category, accounting for about 20% of unique apps. The graphs below show that the gaming category has a much larger share†† in each of the three Top 100 lists:

68% of the Top Paid, 67% of the Top Free, and 50% of the Top Grossing apps were Games. Other categories that had a disproportionate share of apps in the Top 100 rankings include Social Networking, Photography, and (to a lesser extent) Sports and Utilities.

In contrast, three of the five largest categories (Books, Travel, Education) were severely underrepresented in each of the U.S. iTunes Top 100 Charts.

(†) Size of a category is measured in terms of unique apps.
(††) Data for this post was from the two weeks ending 10/4/2009. I consider an app as being in the Top 100, if it was listed among the most popular (free, paid or grossing) apps, sometime during those two weeks.

tags: games, gaming, iphone, mobile, platform

 

Tue, Oct 6, 2009

The Price of The Top Grossing iTunes Apps

by Ben Lorica (@dliman)

In response to developer complaints that more expensive apps were getting buried at the bottom of popularity rankings, Apple recently introduced a separate ranking based on revenue. (The Top 100 Paid list ranks apps by number of downloads.) In this post, I'll validate that, compared to the download-based ranking, the Top 100 ranking based on revenue does contain pricier apps.

For each decile, I calculated the MEAN price of the Top 100 Apps over the 2 most recent weeks. Notice that for the most recent week, the MEAN price for each decile of the Top 100 Grossing apps is more than $5. In contrast, none of the deciles for the Top 100 Paid apps had a mean of $4 or more. There isn't much of a relationship between rank and price although there was a slight downward trend in the price of the Top Grossing apps over the most recent week: except for the blip in the 5th decile of apps ranked 41-50, the top deciles tended to have higher MEAN prices.

The same situation holds when one looks at MEDIAN price during the most recent week: each decile of the Top Grossing apps had a MEDIAN price of $3, while no decile in the Top 100 Paid apps had a MEDIAN price of $2.
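
For readers who want to run the same kind of decile summary on their own data, here is a minimal Python sketch. The prices are made up and are not the iTunes data behind the charts, and the sketch assumes a single rank-ordered snapshot of the Top 100 rather than the "appeared sometime during the week" definition used in this post.

```python
from statistics import mean, median


def price_by_decile(ranked_prices):
    """Split a rank-ordered price list into blocks of ten and summarize each block."""
    summary = []
    for start in range(0, len(ranked_prices), 10):
        block = ranked_prices[start:start + 10]
        summary.append({
            "ranks": (start + 1, start + len(block)),
            "mean": round(mean(block), 2),
            "median": median(block),
        })
    return summary


# Made-up prices, best-ranked first (the same ten prices repeated for brevity).
ranked_prices = [9.99, 6.99, 4.99, 4.99, 2.99, 0.99, 0.99, 0.99, 0.99, 0.99] * 10
deciles = price_by_decile(ranked_prices)
for row in deciles[:2]:
    print(row)
```

A handful of expensive apps pulls the mean well above the median in this toy data, which is one reason to look at both statistics side by side.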

Unique Apps by Category: About two weeks ago, the U.S. iTunes store crossed 90,000 apps††. Last week, the Travel and Education categories displaced Utilities, to claim spots in the Top 4 largest categories:

(†) I refer to an app as being in the Top N, if it was listed among the N most popular (paid or grossing) apps, sometime during the given week.
(††) Since inception, 90K different apps have appeared at some point in time. Over the most recent week, more than 85,000 apps appeared in the U.S. iTunes store.

tags: iphone, mobile, platform

 

Thu, Sep 24, 2009

There are Over a Million People Actively Using Facebook Right Now

by Ben Lorica (@dliman)

A little over a week ago Facebook reached a major milestone: 300 million active users. The fastest-growth region continues to be Asia, but growth in other overseas regions such as the Americas and Africa has also been strong. Currently reaching only 1% of potential users in Asia and Africa, Facebook has barely scratched the surface in both regions:

Growth in the U.S. remains fastest among those age 45 and older, and the share of those users is higher in the U.S. than overseas. In other regions recent growth tended to be more evenly divided among age groups. One notable exception has been the teen group in Asia, which grew over 80% in the last 12 weeks.

Of the 300 million users, how many are actively using Facebook right now? (For the rest of this post active means not just logged in, but actually engaged.) By treating the previous question as a Fermi problem, I can probably derive a decent estimate. First, I assume that the average fraction of people actively using Facebook at any moment, equals the fraction of time an average Facebook user is active on the site††. Without access to any usage stats, I'll throw out the following guesstimate: a typical Facebook user spends 4 hours per month (or 48 per year) actively using the site.

Depending on how accurate you want to be, there are 1.6 to 6 million people actively using Facebook right now. If the average Facebook user spends considerably more than 4 hours per month (actively) using the site, the estimate would be much higher than 1.6 million. I do have an escape clause: in classic Fermi problems, being within a factor of 10 is considered acceptable.
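
The low end of that range follows directly from the two numbers given above: 300 million users and 4 hours per month. The short Python sketch below reproduces only that lower figure and assumes usage is spread over all 8,760 hours of the year; the variable names are mine, and I make no attempt to reconstruct the upper end of the range.

```python
ACTIVE_USERS = 300_000_000   # Facebook's active-user count, from the post
HOURS_PER_MONTH = 4          # the guesstimate for a typical user
HOURS_PER_YEAR = 365 * 24    # 8,760 hours, counting every hour of the day

# Fraction of the year a typical user spends on the site (48 hours out of 8,760).
fraction_active = (HOURS_PER_MONTH * 12) / HOURS_PER_YEAR

# By the Fermi assumption, the same fraction of all users is active at any moment.
active_now = ACTIVE_USERS * fraction_active

print(f"{fraction_active:.2%} of users, about {active_now / 1e6:.1f} million people")
```

Doubling the hours-per-month guess doubles the estimate, so the result is only as good as that one input.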

(†) Increasingly popular in the business world, Fermi problems have long been staples in Physics (and Math) departments.
(††) In other words, if the average Facebook user spends 1% of her time actively using the site, on average 1% of all Facebook users are actively using the site at any given moment.

tags: facebook, fermi problem, hard numbers, platforms, research, social networking

 

Thu, Sep 17, 2009

Mobile Banks in the Developing World Prove Simpler is Better

by Ben Lorica (@dliman)

Recent initiatives designed to make U.S. consumer financial products simpler and more intelligible to customers remind me of a study we did on Mobile Banks in the developing world. Designed to work on the simplest mobile devices and originally targeting the unbanked, mobile banks evolved from simple services (transfer of mobile air time) to become widely used money-transfer and mobile payment systems. In the Philippines, over $100M flows through the GCASH system daily. GCASH and rival SmartMoney are accepted in establishments that take credit cards, giving the unbanked the ability to conduct cashless transactions, a benefit previously limited to credit card customers. In Kenya, the number of transactions that flow through M-PESA is comparable to the number of all ATM transactions in the country.

A key observation we gleaned when we studied Mobile Banks in the developing world is that the most successful services not only have easy-to-use products with low transaction fees, they also spell out the terms and fees involved clearly. The financial products they offer are by design easy for consumers to understand. A recent CGAP survey found that 1 in 6 mobile banking users in the Philippines previously had traditional bank accounts, and 7 in 10 viewed mobile banking services as easy to use.

Among other things, the proposed Consumer Financial Protection Agency will work to ensure "... consumers get information that is clear and concise, and to prevent the worst kinds of abuses." It's unfortunate that large financial services companies have to be strong-armed into simpler offerings, when there is a large market for such products. Fortunately smaller companies aren't waiting for regulatory changes and are beginning to offer simpler products.

There's more to the successful mobile banks than meets the eye: some of the large players have become world-class financial services providers. While it's technically easy to roll out a rudimentary mobile payment system, the most successful mobile banks in the developing world use complex software systems that handle more (near) real-time transactions than traditional banking systems. Unencumbered by legacy software systems, business rules and practices, mobile banks are innovating at a much faster pace than traditional financial services companies. At the height of the banking crisis, Clayton Christensen offered the following advice to JP Morgan CEO Jamie Dimon: "Go to the developing world and buy a phone company!" Not surprisingly, traditional banks in the developing world are eagerly forging alliances with fast-growing mobile banks. GCASH has agreements with several Philippine banks allowing fund transfers (and other forms of inter-operability) between their customers.

Over the long-term, mobile banks have, in many countries, become the first step†† towards financial inclusion. Once unbanked consumers get comfortable using mobile banks, they become more likely to adopt other products such as micro-insurance and (micro) loans.

In a recent survey article, I discuss in detail the profound impact mobile banks have had in the developing world, as well as some of the main challenges they face. But let me highlight the following statistic from a recent CGAP survey of M-PESA users in Kenya: the income of rural recipients increased 30% since they started using M-PESA.

(†) Insiders like to distinguish between mobile banking (mobile phone access to existing bank customers) and mobile banks (financial institutions that arose with mobile phones).
(††) Moving into the realm of science-fiction, some technocrats in Japan recently "speculated" that mobile payment services could open the door towards eliminating cash altogether.

tags: africa, developing world, disruptive innovation, financial reform, gcash, kenya, m-pesa, mobile, mobile banks, philippines, smartmoney

 
