Building recommendation platforms with Hadoop

Preview of upcoming session at the Strata Conference

Recommendations are making their way into more and more products, and larger datasets are significantly improving their quality. Hadoop is increasingly used to build out recommendation platforms. Examples include product, merchant, content and social recommendations, query recommendations, and display and search ads.

With the number of options available to users ever increasing, customers' attention spans are shrinking fast. Customers are getting used to seeing their best choices right in front of them at any given moment. In such a scenario, recommendations power more and more product features and drive user interaction, so companies are looking for more ways to target customers precisely and at the right time. This brings big data into the picture. Succeeding with data, whether by building new markets or changing existing ones, is the game being played in many high-stakes scenarios. Some companies have found a way to build a big data recommendation and machine learning platform that gives them an edge in bringing better products to market faster. Hence there is a strong case for treating recommendations and machine learning on big data as a platform within a company, rather than as a black box that magically produces the right results. The same platform lets us build other features, such as fraud detection, spam detection, and content enrichment and serving, which makes it viable in the long run. It is not just about recommendations.

The more data we give our algorithms, the better targeted the results. If that is the case, it is not practical to have separate big data platforms for different purposes, to say nothing of the requirement for a disaster recovery (DR) cluster.

A recommendation platform built on Hadoop would have the following components: ETL, feature generation, feature selection, recommendation algorithms, A/B testing, serving, tracking and reporting. In the presentation at Strata Santa Clara, we will dive into the design of each component. We will also go over real use cases and how to solve them in the Hadoop ecosystem, and specifically cover a set of machine learning algorithms for the various recommendation use cases. While Mahout fits well with the Hadoop MapReduce framework, there are also elegant ways of plugging other non-distributed systems and algorithms into Hadoop.
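To make the "recommendation algorithms" component more concrete, here is a minimal sketch of item co-occurrence counting written in a map/reduce style. It is plain Python rather than Hadoop or Mahout code, and the function names and the sample interaction log are hypothetical, chosen only for illustration. On a real cluster the mapper and reducer would run as distributed tasks over the ETL output.

```python
from collections import defaultdict
from itertools import combinations


def map_user_basket(user_id, items):
    """Mapper: emit ((item_a, item_b), 1) for every item pair in one user's history."""
    for a, b in combinations(sorted(set(items)), 2):
        yield (a, b), 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each item pair."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def recommend(item, cooccurrence, top_n=3):
    """Rank the items most often seen together with `item`."""
    scores = {}
    for (a, b), count in cooccurrence.items():
        if a == item:
            scores[b] = count
        elif b == item:
            scores[a] = count
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


# Hypothetical interaction log: user -> items viewed or bought.
history = {
    "u1": ["camera", "tripod", "sd_card"],
    "u2": ["camera", "sd_card"],
    "u3": ["camera", "tripod"],
    "u4": ["laptop", "sd_card"],
}

emitted = [kv for user, items in history.items()
           for kv in map_user_basket(user, items)]
cooccurrence = reduce_counts(emitted)
print(recommend("camera", cooccurrence))  # ['sd_card', 'tripod']
```

The mapper only ever looks at one user's history and the reducer only sums counts per key, which is why this kind of computation spreads naturally across many machines.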

Below is a very high-level view of the architecture.


Mahout has an extensive set of algorithms that run on Hadoop, including clustering, collaborative filtering and classification. Hadoop provides an ideal platform for training and testing the models, and automating this process with Hadoop brings large savings in development and operational cost. Tracking and reporting the performance of the various models shows how well the system is operating. It also means that we have built an enterprise data warehouse (EDW) on Hadoop as part of the process, and with it the analytical ability to understand users, markets, advertisers and so on in depth.

A system that scales in terms of data processing, flexibility of development, integration of various sources and analytics goes a long way toward making an organization agile, innovative and able to deliver quality systems in production.
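As a rough illustration of how A/B testing and tracking fit together, the sketch below buckets users deterministically into experiment arms and then aggregates logged events into a click-through rate per arm. The variant names, the experiment label and the event records are all assumptions made for this example; a production system would read the events from the tracking logs stored on Hadoop.

```python
import hashlib
from collections import defaultdict

VARIANTS = ("control", "new_model")  # hypothetical experiment arms


def assign_variant(user_id, experiment="rec_v2"):
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def ctr_report(events):
    """Aggregate (variant, event_type) records into click-through rates.

    Assumes event_type is either "impression" or "click".
    """
    counts = defaultdict(lambda: {"impression": 0, "click": 0})
    for variant, event_type in events:
        counts[variant][event_type] += 1
    return {
        variant: (c["click"] / c["impression"] if c["impression"] else 0.0)
        for variant, c in counts.items()
    }


# Hypothetical tracking log.
events = [
    ("control", "impression"), ("control", "impression"),
    ("control", "impression"), ("control", "impression"),
    ("control", "click"),
    ("new_model", "impression"), ("new_model", "impression"),
    ("new_model", "click"),
]
report = ctr_report(events)
print(report)  # {'control': 0.25, 'new_model': 0.5}
```

Hashing the user ID together with the experiment name keeps each user's assignment stable across sessions without storing it anywhere, and lets different experiments split the same users independently.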
