GraphChi: Graph analytics over billions of edges using your laptop

GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine-learning.

Designed specifically to run on a single computer with limited memory¹ (DRAM), since its release a few months ago GraphChi has been used to analyze graphs with billions of edges. Running on a single machine means deployment and debugging are simpler. In addition it is no longer necessary to find (optimal) graph partitions that minimize communication between compute nodes – the starting point for many distributed graph computations.

The stated goal of GraphChi is to “Compute on graphs with billions of edges, in a reasonable time, on a single PC.” One way to define “reasonable amount of computation time” is to compare against the results produced by other graph processing systems. That’s exactly what GraphChi’s creators did in a recent paper. They found that GraphChi compared favorably to graph analytics packages such as Pegasus and Stanford GPS. While GraphChi was 2-3X slower² in some cases, it is easier to deploy, easier to debug, and way more energy efficient.

In particular see line 7 of the following table: GraphChi (on a single Mac Mini) was 7X faster than Hadoop (which ran on a 1636-node cluster).

Disk-based graph computation using Parallel Sliding Windows
GraphChi is written in 8,000 lines of C++ — there is a Java implementation, with a Scala API, that’s 2-3X slower. It uses a system called Parallel Sliding Windows (PSW), which shards a graph and processes it one subgraph³ at a time. PSW can process edges efficiently from disk, requiring only a small number of non-sequential disk access, while allowing for asynchronous⁴ iterative computations⁵.

But can PSW be used to implement algorithms to perform common tasks? To demonstrate the flexibility of PSW, the GraphChi team has implemented many popular algorithms⁶:

Currently most GraphChi users deploy the collaborative filtering toolkit written by Danny Bickson.

Evolving Graphs
In real-world applications graphs change continuously (as an example users add/delete friends on social networks). PSW supports both the addition and removal of edges. Tests on twitter data (with 1.5B edges) indicate that PSW can comfortably ingest 200,000 edges/second (using a Mac Mini computer).

In summary: Big Graphs on a Single Machine

As GraphChi founder Aapo Kyrölä notes, while Big Graphs need not translate to extremely large data sets, they usually lead to very challenging computations. GraphChi is a relatively new system that runs on a single machine, and scales many popular algorithms to billions of edges. Currently GraphChi and GraphLab are separate codebases and systems. A future version of GraphLab will combine them into a single system.

Related posts:

Seven reasons why I like Spark

Shark: Real-time queries and analytics for big data

(1) GraphChi assumes that there “… is enough memory to contain the edges and their associated values of any single vertex in the graph.”
(2) In some cases GraphChi was faster than a given distributed system.
(3) In PSW a subgraph is an interval of vertices (graph vertices are labeled 1, 2, …, |V|=total vertices). Each interval is a shard on disk.
(4) There are many cases where asynchronous procedures outperform synchronous ones. From the Distributed GraphLab paper: “Many important … algorithms iteratively update a large set of parameters. Because of the underlying graph structure, parameter updates (on vertices or edges) depend … on the values of other parameters. In contrast to synchronous systems, which update all parameters simultaneously (in parallel) using parameter values from the previous time step as input, asynchronous systems update parameters using the most recent parameter values as input.”
(5) For iterative computations in PSW, the whole graph is processed before a new iteration begins.
(6) The GraphLab/GraphChi team placed 5th in last year’s ACM KDD CUP track 1, and 4th (out of 192 groups) in this year’s ACM KDD CUP track 2.

GraphChi: Graph analytics over billions of edges using your laptop

A disk-based, single-node, graph analytics system that scales to massive graphs