GraphChi: Graph analytics over billions of edges using your laptop

A disk-based, single-node, graph analytics system that scales to massive graphs

GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine-learning.

Designed specifically to run on a single computer with limited memory1 (DRAM), since its release a few months ago GraphChi has been used to analyze graphs with billions of edges. Running on a single machine means deployment and debugging are simpler. In addition it is no longer necessary to find (optimal) graph partitions that minimize communication between compute nodes – the starting point for many distributed graph computations.

The stated goal of GraphChi is to “Compute on graphs with billions of edges, in a reasonable time, on a single PC.” One way to define “reasonable amount of computation time” is to compare against the results produced by other graph processing systems. That’s exactly what GraphChi’s creators did in a recent paper. They found that GraphChi compared favorably to graph analytics packages such as Pegasus and Stanford GPS. While GraphChi was 2-3X slower2 in some cases, it is easier to deploy, easier to debug, and way more energy efficient.

In particular see line 7 of the following table: GraphChi (on a single Mac Mini) was 7X faster than Hadoop (which ran on a 1636-node cluster).

Disk-based graph computation using Parallel Sliding Windows
GraphChi is written in 8,000 lines of C++ — there is a Java implementation, with a Scala API, that’s 2-3X slower. It uses a system called Parallel Sliding Windows (PSW), which shards a graph and processes it one subgraph3 at a time. PSW can process edges efficiently from disk, requiring only a small number of non-sequential disk access, while allowing for asynchronous4 iterative computations5.

But can PSW be used to implement algorithms to perform common tasks? To demonstrate the flexibility of PSW, the GraphChi team has implemented many popular algorithms6:

Currently most GraphChi users deploy the collaborative filtering toolkit written by Danny Bickson.

Evolving Graphs
In real-world applications graphs change continuously (as an example users add/delete friends on social networks). PSW supports both the addition and removal of edges. Tests on twitter data (with 1.5B edges) indicate that PSW can comfortably ingest 200,000 edges/second (using a Mac Mini computer).

In summary: Big Graphs on a Single Machine

As GraphChi founder Aapo Kyrölä notes, while Big Graphs need not translate to extremely large data sets, they usually lead to very challenging computations. GraphChi is a relatively new system that runs on a single machine, and scales many popular algorithms to billions of edges. Currently GraphChi and GraphLab are separate codebases and systems. A future version of GraphLab will combine them into a single system.

Related posts:

  • Seven reasons why I like Spark
  • Shark: Real-time queries and analytics for big data

  • (1) GraphChi assumes that there “… is enough memory to contain the edges and their associated values of any single vertex in the graph.”
    (2) In some cases GraphChi was faster than a given distributed system.
    (3) In PSW a subgraph is an interval of vertices (graph vertices are labeled 1, 2, …, |V|=total vertices). Each interval is a shard on disk.
    (4) There are many cases where asynchronous procedures outperform synchronous ones. From the Distributed GraphLab paper: “Many important … algorithms iteratively update a large set of parameters. Because of the underlying graph structure, parameter updates (on vertices or edges) depend … on the values of other parameters. In contrast to synchronous systems, which update all parameters simultaneously (in parallel) using parameter values from the previous time step as input, asynchronous systems update parameters using the most recent parameter values as input.”
    (5) For iterative computations in PSW, the whole graph is processed before a new iteration begins.
    (6) The GraphLab/GraphChi team placed 5th in last year’s ACM KDD CUP track 1, and 4th (out of 192 groups) in this year’s ACM KDD CUP track 2.

    tags: , , , , , ,

    Get the O’Reilly Data Newsletter

    Stay informed. Receive weekly insight from industry insiders.

    Get the O’Reilly Web Ops and Performance Newsletter

    Weekly insight from industry insiders. Plus exclusive content and offers.

    Get the O’Reilly Programming Newsletter

    Weekly insight from industry insiders. Plus exclusive content and offers.

    Get the O’Reilly Hardware Newsletter

    Get weekly insight and knowledge on how to design, prototype, manufacture, and market great connected devices.

    Get Four Short Links in Your Inbox

    Sign up to receive Nat’s eclectic collection of curated links every weekday.

    Get the O’Reilly Design Newsletter

    Stay informed. Receive weekly insight from industry insiders.

    Get the O’Reilly Web Platform Newsletter

    Stay informed. Receive weekly insight from industry insiders—plus exclusive content and offers.