Why Choose a Graph Database

Collaborative filtering with Neo4j

By now, chances are you’ve heard of NoSQL and of graph databases like Neo4j.

NoSQL databases address important challenges that we face today, in terms of data size and data complexity. They offer a valuable solution by providing particular data models to address these dimensions.

On one side of the spectrum, these databases resolve issues of scaling out and high data volumes by using compound aggregate values. On the other side is a relationship-based data model that allows us to model real-world information with high fidelity and complexity.

Neo4j, like many other graph databases, builds upon the property graph model: labeled nodes (for informational entities) are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (key-value pairs). There is no rigid schema, but with node labels and relationship types we can have as much meta-information as we like. When importing data into a graph database, the relationships are treated with as much value as the database records themselves. This allows the engine to navigate the connections between nodes in constant time per relationship. That compares favorably to the exponential slowdown of many-JOIN SQL queries in a relational database.
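To make the property graph model concrete, here is a minimal sketch in Cypher. The names and the since value are made-up sample data, and the statements assume an empty database and a shell that separates statements with semicolons.

```cypher
// Hypothetical sample data: two labeled nodes, each with a property.
CREATE (alice:Person {name: "Alice"}),
       (bob:Person {name: "Bob"}),
       // A directed, typed relationship that holds a property of its own.
       (alice)-[:FRIEND {since: 2012}]->(bob);

// Follow the stored relationship from Alice to her friends.
MATCH (alice:Person)-[f:FRIEND]->(friend)
WHERE alice.name = "Alice"
RETURN friend.name, f.since;
```

The relationship is a stored record, so the second statement walks it directly instead of computing a join.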

[Figure: property graph]

How can you use a graph database?

Graph databases are well suited to model rich domains. Both object models and ER-diagrams are already graphs and provide a hint at the whiteboard-friendliness of the data model and the low-friction mapping of objects into graphs.

Instead of de-normalizing for performance, you would normalize interesting attributes into their own nodes, making it much easier to move, filter and aggregate along these lines. Content and asset management, job-finding, and recommendations based on weighted relationships to relevant attribute nodes are some use cases that fit this model very well.
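As a rough sketch of what normalizing an attribute into its own node could look like, the example below stores a skill as a node rather than as a string property on each person. The Skill label, the HAS_SKILL relationship type, the level weight and all names are assumptions made up for illustration.

```cypher
// Hypothetical data: the skill is a shared node, and each
// relationship to it carries a weight (level).
CREATE (graphs:Skill {name: "Graph Databases"}),
       (alice:Person {name: "Alice"}),
       (bob:Person {name: "Bob"}),
       (alice)-[:HAS_SKILL {level: 4}]->(graphs),
       (bob)-[:HAS_SKILL {level: 2}]->(graphs);

// Filter and aggregate by starting at the attribute node
// and following its incoming relationships.
MATCH (skill:Skill)<-[r:HAS_SKILL]-(person)
WHERE skill.name = "Graph Databases"
RETURN skill.name, count(person) AS people, avg(r.level) AS avg_level;
```

Because everyone with the skill points at the same node, a job-finding or recommendation query can start there and rank people by the relationship weight.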

Many people use graph databases because of their high-performance online query capabilities. They process large volumes of raw data with Map/Reduce in Hadoop or event processing (like Storm, Esper, etc.) and project the computation results into a graph. We’ve seen examples of this in many domains, from finance (fraud detection in money flow graphs) and biotech (protein analysis on genome sequencing data) to telco (mobile network optimizations on signal-strength measurements).

Graph databases shine when you can express your queries as a local search using a few starting points (e.g., people, products, places, orders). From there, you can follow relevant relationships to accumulate interesting information, or project visited nodes and relationships into a suitable result.

How does graph querying in Neo4j work?

Neo4j’s query language Cypher aims to be a user-friendly language that is designed to be read and understood easily. It allows you to declare patterns (MATCH) that you want to find in the graph and then apply filters (WHERE), projection (RETURN), and sorting and paging (ORDER BY, SKIP, LIMIT) to your result data. To make it possible to declare the visual graph patterns in a textual query language, we went back to our roots and felt that ASCII art would be the obvious choice (it was inspired by the graphviz dot language). In the examples we’ll use the new Neo4j 2.0 query syntax without a START clause.

A simple query to find friends to take to the conference would look like this:

// Pattern: me -> a friend -> that friend's job
MATCH (me:Person)-[f:FRIEND]->(friend)-[:WORKS_AT]->(job)
WHERE me.name = "Michael" AND job.name = "Programmer"
RETURN friend.name
// Oldest friendships first, at most ten results
ORDER BY f.since
LIMIT 10

Let’s do a quick example of how to model a specific domain, using OSCON. Here is a quick whiteboard drawing of the core domain:

[Figure: whiteboard graph of the OSCON domain]

From this whiteboard graph, we can now ask some interesting questions:

  • How well are the rooms filled?
  • I am interested in NoSQL, what other sessions can you recommend?
  • How much do I have to move if I want to attend all the JavaScript sessions?
  • Who is the most prolific speaker?

Let’s just answer one of them, and you can have fun figuring out the others.

I am interested in NoSQL, what other sessions can you recommend?

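// Sessions with the tag, the people who favorited them,
// and everything else those people favorited
MATCH (tag:Tag)<-[:TAGGED]-(session)<-[:FAVORITED]-(someone),
      (someone)-[:FAVORITED]->(other_session)
WHERE tag.name = "NoSQL"
// Count how often each other session was reached
RETURN other_session, count(*) AS cnt
ORDER BY cnt DESC
LIMIT 5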

This is also called collaborative filtering: I first find others who are similar to me (people who also favorited NoSQL sessions), then look at which other sessions they favorited, and show the ones that were hit most often.
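If you want to try the recommendation query yourself, a small sample dataset along these lines should be enough. Only the Tag label and the TAGGED and FAVORITED relationship types come from the query above. The Session and Attendee labels, the title property and all names are assumptions, since the whiteboard model is not spelled out in the text.

```cypher
// Hypothetical sample data for an empty database.
CREATE (nosql:Tag {name: "NoSQL"}),
       (js:Tag {name: "JavaScript"}),
       (s1:Session {title: "Intro to Graph Databases"}),
       (s2:Session {title: "Scaling Document Stores"}),
       (s3:Session {title: "Modern JavaScript"}),
       // Sessions point at their tags.
       (s1)-[:TAGGED]->(nosql),
       (s2)-[:TAGGED]->(nosql),
       (s3)-[:TAGGED]->(js),
       (ann:Attendee {name: "Ann"}),
       (ben:Attendee {name: "Ben"}),
       // Attendees point at the sessions they favorited.
       (ann)-[:FAVORITED]->(s1),
       (ann)-[:FAVORITED]->(s3),
       (ben)-[:FAVORITED]->(s1),
       (ben)-[:FAVORITED]->(s2),
       (ben)-[:FAVORITED]->(s3);
```

With this data the query ranks "Modern JavaScript" first, because three paths lead to it: from Ann through one NoSQL session and from Ben through two. The other NoSQL sessions also appear in the result, since the query does not exclude sessions that already carry the tag.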

If you like the ideas I discussed here, you can also check out our OSCON tutorial, or feel free to grab us at the conference.

OSCON 2013 attendees: look for one of the Neo4j Graphistas to get a complimentary copy of Graph Databases.
