Eleven areas of focus for deeper investigation.
Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude takes a lot of planning, but it also takes a decent amount of prediction: Strata needs to skate to where the puck is going.
While Strata New York + Hadoop World 2013 is still a few months away, we’re already guessing at what next year’s Santa Clara event will hold. Recently, the team got together to identify some of the hot topics in big data, ubiquitous computing, and new interfaces. We selected eleven big topics for deeper investigation.
- Deep learning
- Time-series data
- The big data “app stack”
- Cultural barriers to change
- Design patterns
- Laggards and Luddites
- The convergence of two databases
- The other stacks
- Mobile data
- The analytic life-cycle
- Data anthropology
Here’s a bit more detail on each of them. Read more…
Collaborative filtering with Neo4j
By this time, chances are very likely that you’ve heard of NoSQL, and of graph databases like Neo4j.
NoSQL databases address important challenges that we face today, in terms of data size and data complexity. They offer a valuable solution by providing particular data models to address these dimensions.
On one side of the spectrum, these databases resolve issues for scaling out and high data values using compounded aggregate values, on the other side is a relationship based data model that allows us to model real world information containing high fidelity and complexity.
Neo4j, like many other graph databases, builds upon the property graph model; labeled nodes (for informational entities) are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (key-value pairs). There is no rigid schema, but with node-labels and relationship-types we can have as much meta-information as we like. When importing data into a graph database, the relationships are treated with as much value as the database records themselves. This allows the engine to navigate your connections between nodes in constant time. That compares favorably to the exponential slowdown of many-JOIN SQL-queries in a relational database.
How can you use a graph database?
Graph databases are well suited to model rich domains. Both object models and ER-diagrams are already graphs and provide a hint at the whiteboard-friendliness of the data model and the low-friction mapping of objects into graphs.
Instead of de-normalizing for performance, you would normalize interesting attributes into their own nodes, making it much easier to move, filter and aggregate along these lines. Content and asset management, job-finding, recommendations based on weighted relationships to relevant attribute-nodes are some use cases that fit this model very well.
Many people use graph databases because of their high performance online query capabilities. They process large amounts or high volumes of raw data with Map/Reduce in Hadoop or Event-Processing (like Storm, Esper, etc.) and project the computation results into a graph. We’ve seen examples of this from many domains from financial (fraud detection in money flow graphs), biotech (protein analysis on genome sequencing data) to telco (mobile network optimizations on signal-strength-measurements).
Graph databases shine when you can express your queries as a local search using a few starting points (e.g., people, products, places, orders). From there, you can follow relevant relationships to accumulate interesting information, or project visited nodes and relationships into a suitable result.
A new set of analytic engines make the case for convenience over performance
The choice of tools for data science includes1 factors like scalability, performance, and convenience. A while back I noted that data scientists tended to fall into two camps: those who used an integrated stack, and others who tended to stitch together frameworks. Being able to stick with the same programming language and environment is a definite productivity boost since it requires less setup time and context-switching.
More recently I highlighted the emergence of composable analytic engines, that leverage data stored in HDFS (or HBase and Accumulo). These engines may not be the fastest available, but they scale to data sizes that cover most workloads, and most importantly they can operate on data stored in popular distributed data stores. The fastest and most complete set of algorithms will still come in handy, but I suspect that users will opt for slightly slower2, but more convenient tools, for many routine analytic tasks.
OSCON 2013 Speaker Series
If you have delved into Apache Hadoop and related projects, you know that installing and configuring Hadoop is hard. Often, a minor mistake during installation or configuration with messy tarballs will lurk for a long time until some otherwise innocuous change to the system or workload causes difficulties. Moreover, there is little to no integration testing among different projects (e.g. Hadoop, Hive, HBase, Zookeeper, etc.) in the ecosystem. Apache Bigtop is an open source project aimed at bridging exactly those gaps by:
1. Making it easier for users to deploy and configure Hadoop and related projects on their bare metal or virtualized clusters.
2. Performing integration testing among various components in the Hadoop ecosystem.
More about Apache Bigtop
The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of Hadoop related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.
The latest released version of Apache Bigtop is Bigtop 0.5 which integrates the latest versions of various projects including Hadoop, Hive, HBase, Flume, Sqoop, Oozie and many more! The supported platforms include CentOS/RHEL 5 and 6, Fedora 16 and 17, SuSE Linux Enterprise 11, OpenSuSE 12.2, Ubuntu LTS Lucid and Precise, and Ubuntu Quantal.
Who uses Bigtop?
Folks who use Bigtop can be divided into two major categories. The first category of users are those who leverage Bigtop to power their own Hadoop Distributions. The second category of users are those who use Bigtop for deployment purposes.
In alphabetical order, they are:
OSCON 2013 Speaker Series
Paco Nathan (@pacoid) is Director of Data Science at Concurrent, O’Reilly Author, and OSCON 2013 Speaker. In this interview we talk about creating enterprise data workflow with Cascading. Be sure to check out Paco’s book on the subject here
NOTE: If you are interested in attending OSCON to check out Paco’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% your registration fee.
Key highlights include:
- Cascading is an abstraction layer on top of Hadoop [Discussed at 0:23]
- Define your business logic at a high level [Discussed at 1:21]
- Is Cascading good for enterprise? [Discussed at 2:31]
- Test-driven development at scale [Discussed at 3:35]
- Cascalog and the City of Palo Alto Open Data portal [Discussed at 7:39]
You can view the full interview here:
OSCON 2013 Speaker Series
NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.
Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.
We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.
In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn't have the team to do any of the required data modeling.
So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. In order to give us the largest talent pool possible to recruit from, we had to choose a tech stack that is both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.
Python is easy to learn. We found that people coding in R, Matlab, Java, PHP, and even those who have never programmed before could quickly learn and get up to speed with Python. This opened us up to hiring a tremendous pool of talent who we could train in Python once they joined AppNexus. To top it off, there’s a great community for hiring engineers and the PyData community is full of programmers who specialize in modeling and automation.
Additionally, Python has great libraries for data modeling. It offers great analytical tools for analysts and quants and when combined, Pandas, IPython, and Matplotlib give you a lot of the functionality of Matlab or R. This made it easy to hire and onboard our quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.
Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.
First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.
Imagine you have a MySQL database. You don’t want to hardcode all of your connection information because you want to have a different config for different users, or for different environments. Link allows you to define your “environment” in a JSON config file, and then reference it in code as if it is a Python object.
Now, with only three lines of code you have a database connection and a data frame straight from your mysql database. This same methodology works for Vertica, Netezza, Postgres, Sqlite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not how to connect to all these weird data sources.
In : from link import lnk
In : my_db = lnk.dbs.my_db
In : df = my_db.select('select * from my_table').as_dataframe()
Int64Index: 325 entries, 0 to 324
id 325 non-null values
user_id 323 non-null values
app_id 325 non-null values
name 325 non-null values
body 325 non-null values
created 324 non-null values
By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.
Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.
Spark, Storm, HBase, and YARN power large-scale, real-time models.
My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big data systems. Notably many of their production systems now run on MapReduce 2.0 (MRv2) or YARN – a resource manager that lets multiple frameworks share the same cluster.
Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being, computing their web index for search). More recently, many of its big data models require low latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is a term Yahoo! uses to refer to systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to be able to quickly move from raw data, to information, to knowledge:
On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular I’m seeing many companies – notably Twitter – use Mesos1 (instead of YARN) to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.
Going back to Bruno’s presentation, here are some interesting bits – current big data systems at Yahoo! by the numbers:
Data stores are rolling out easy-to-use analysis tools
Originated by the NSA, Apache Accumulo is a BigTable inspired data store known for being highly scalable and for its interesting security model. Federal agencies and Defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells1.
What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem2. It includes support for interactive SQL, fulltext search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, (query & analytic) results incorporate those access levels. Analysts interact with data as they normally would. For example Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase3, SQL queries4 in Sqrrl are converted into optimized Accumulo iterators.
Analytic engines on top of Hadoop simplify the creation of interesting, low-cost, scalable applications
Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop dominating batch processing is permeating other types of analytics.
Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark), and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine-learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking at Tachyon closely.)
A new, open source benchmark can be used to track performance improvements over time
As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into the cloud are beginning to explore Amazon Redshift1, Google BigQuery, and Qubole.
A variety of analytic engines2 built for Hadoop are allowing companies to bring its low-cost, scale-out architecture to a wider audience. In particular, companies are rediscovering that SQL makes data accessible to lots of users, and many prefer3 not having to move data to a separate (MPP) cluster. There are many new tools that seek to provide an interactive SQL interface to Hadoop, including Cloudera’s Impala, Shark, Hadapt, CitusDB, Pivotal-HD, PolyBase4, and SQL-H.
An open source benchmark from UC Berkeley’s Amplab
A benchmark for tracking the progress5 of scalable query engines has just been released. It’s a worthy first effort, and its creators hope to grow the list of tools to include other open source (Drill, Stinger) and commercial6 systems. As these query engines mature and features get added, data from this benchmark can provide a quick synopsis of performance improvements over time.
The initial release includes Redshift, Hive, Impala, and Shark (Hive, Impala, Shark were configured to run on AWS). Hive 0.10 and the most recent versions7 of Impala and Shark were used (Hive 0.11 was released in mid-May and has not yet been included). Data came from Intel’s Hadoop Benchmark Suite and CommonCrawl. In the case of Hive/Impala/Shark, data was stored in compressed SequenceFile format using CDH 4.2.0.
At least for the queries included in the benchmark, Redshift is about 2-3x faster than Shark/on-disk, and 0.3-2x faster than Shark/in-memory. Given that it’s built on top of a general purpose engine (Spark), it’s encouraging that Shark’s performance is within range of MPP8 databases (such as Redshift) that are highly optimized for interactive SQL queries. With new frameworks like Shark and Impala providing speedups comparable to those observed in MPP databases, organizations now have the option of using a single system (Hadoop/Spark) instead of two (Hadoop/Spark + MPP database).
Let’s look at some of the results in detail: