Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.
While most people associate graphs with social media analysis, there is a wide range of applications (including recommendations, fraud detection, IT operations, and security) that are routinely framed using graphs. This wide variety of use cases has given rise to many interesting tools for storing, managing, visualizing, and analyzing massive graphs. The important thing to note is that graph databases are not limited to reporting and analytics; they are also being used to power mission-critical applications.
In this episode of the O’Reilly Data Show, I sat down with Emil Eifrem, CEO and co-founder of Neo Technology. We talked about the early days of NoSQL, applications of graph databases, cloud computing, and company culture in the U.S. and Sweden.
Graph and NoSQL databases
The relational database had been an accelerator, and here it’s really slowing us down. What we ended up concluding was that the problem was this mismatch between the shape of the data and the abstractions that were exposed by our infrastructure. At that point, we said, okay, what if we had a database that just exposed these amazing network-oriented data structures or graph-oriented data structures, but other than that, had all the properties of a relational database. Wouldn’t that be great? … Ultimately, we said the famous last words: ‘Hey, let’s just build it ourselves. How hard can it be?’ It turns out it’s 15 years later!
2007 is when both the Dynamo paper had been published and the BigTable paper had been published out of Amazon and Google, respectively. That’s when, in early adopter circles, the discourse started to change … maybe the era of the one-size-fits-all database is over. Maybe our job isn’t to take all of our data and shove it through a relational database. Maybe there are some other tools and technologies and abstractions out there that make better sense for some data. That was in ’07. I really think it was as if lightning struck in the community. … [Dynamo and BigTable were announced] and the next day, 12 open source projects, implementing it, and then the next day, 24 new ones. It was just crazy back then.
Operational and analytic systems
The key trigger point for that is what’s called graph global operations, when you have operations that touch the entire graph. Let’s say that … you have a billion-plus node graph and you say ‘give me all the nodes in my graph sorted by the number of relationships.’ Give me the top 50 nodes with the highest degree, the most connections to them. That operation has to touch the entire graph database.
At that point, you want to move to a graph compute system, and these are systems like Giraph, for example, or GraphX, part of Spark. [These systems are] amazing at scaling [graph computations] out across many machines and crunching it really in a more batch-oriented, offline analytics-type pipeline.
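To make the idea of a graph global operation concrete, here is a minimal sketch of the "top nodes by degree" query described above. It assumes the graph fits in memory as a plain list of (source, target) pairs; the function name `top_k_by_degree` and the sample edges are hypothetical, and this is neither how Neo4j nor how a graph compute system such as Giraph or GraphX implements it.

```python
import heapq
from collections import Counter


def top_k_by_degree(edges, k=50):
    """Return the k nodes with the most relationships as (node, degree) pairs."""
    degree = Counter()
    # Every edge has to be visited once: this is what makes the operation "global".
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Keep only the k largest degrees instead of sorting every node.
    return heapq.nlargest(k, degree.items(), key=lambda item: item[1])


# Hypothetical toy graph; a real graph of this kind could have a billion-plus nodes.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("b", "d")]
print(top_k_by_degree(edges, k=2))
```

Even in this toy form, the loop has to read every relationship before it can return a single result, which is why this kind of query tends to be a better fit for a batch-oriented pipeline than for a system serving requests in real time.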
Neo4j’s home inside of the enterprise today is on the operational side, which means that … we live where the applications live. The applications that serve customers in real time … we have a lot of applications where if we go down for a minute or two, it’s going to show up in the next call for that CEO, the next quarter. It’s really revenue impacting. We’re really on the operational side of the family.
I actually think there’s exactly one point in time when you can take a new type of database to the market. I’m focusing on the operational side. If you look back in history, we’ve really only been able to build database companies when there’s been a platform shift. Oracle was built on the platform shift of going from mainframe to client-server. Then the next big platform shift of the client-server to Web, and that’s when we got MySQL. Now, it’s very clear what the big platform shift is right now. It’s, of course, from whatever Web, LAMP stack, on-premise systems, to the cloud. When you have big platform shifts for whatever reason, good or bad, people re-evaluate their stack. … That’s what’s happening right now as the world moves to more cloud platforms as a method of delivering their applications. I think that’s one of the key enablers, actually, for why we have this explosion of databases on the operational side.
It’s never going to be 100-0, like 100% in the public cloud versus 0% on-prem. That’s probably never, ever going to happen, but it’s very clear it won’t be 0-100, either. The question is whether it’s going to be 90% cloud and 10% on-prem. … I think that the absolute vast majority of data and applications are going to be running off of a public cloud.
- Graph Databases book
- Network structure and dynamics in online social systems
- Data modeling with multi-model databases
- Building web apps with Flask and Neo4j
Image by Xenon54 on Wikimedia Commons.