To paraphrase Seth Godin, businesses need to stop merely collecting data and start finding the connections within it. In other words, the relationships between data points matter almost more than the individual points themselves. Following the advice of a professional mentor, I realized that learning about graph databases would be an important skill for an aspiring data scientist like myself. One of the leading graph databases appears to be Neo4j, which uses a query language called Cypher.
The good news is that there appears to be a great variety of resources out there for people wanting to learn Neo4j quickly. There’s a great white paper that details five possible use cases for Neo4j and presents some of its advantages. For example, a graph database can store complex, densely connected data points. Its data model supports both hierarchical and non-hierarchical structures and can capture rich metadata. It also has a query engine that can look through millions of relationships between data points per second, reducing processing time from minutes or hours to milliseconds.
I want to explore each of these advantages in a little more detail.
- A graph database can store complex, densely connected data.
- A graph database supports hierarchical and non-hierarchical data structures.
- A graph database performs quick searches.
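One of the great advantages of Neo4j and graph databases is that they can store complex and densely connected data. Unlike a traditional relational database management system, which reconstructs connections between records through joins, Neo4j stores the relationships themselves as part of the data, and a single graph can hold millions of nodes, edges and labels. For more information on the relevant definitions, visit https://www.slideshare.net/neo4j/data-modeling-with-neo4j-25767444. Because the connections are stored alongside the data points, the data being modeled can be very complicated and still be represented directly.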
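To make the vocabulary of nodes, edges and labels a little more concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how Neo4j is implemented internally, and every name in it (`PropertyGraph`, `add_node`, the people and the company) is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data point with a set of labels and free-form properties."""
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge from one node to another."""
    start: int
    end: int
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory graph; relationships are stored with their start node."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of Relationship objects

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)
        self.outgoing.setdefault(node_id, [])

    def add_relationship(self, start, end, rel_type, **properties):
        self.outgoing[start].append(Relationship(start, end, rel_type, properties))

    def neighbors(self, node_id, rel_type=None):
        """Follow outgoing edges directly, optionally filtered by type."""
        return [
            self.nodes[rel.end]
            for rel in self.outgoing[node_id]
            if rel_type is None or rel.rel_type == rel_type
        ]


# Made-up sample data: two people and a company.
graph = PropertyGraph()
graph.add_node(1, ["Person"], name="Alice")
graph.add_node(2, ["Person"], name="Bob")
graph.add_node(3, ["Company"], name="Acme")
graph.add_relationship(1, 2, "KNOWS", since=2015)
graph.add_relationship(1, 3, "WORKS_AT")

friend_names = [n.properties["name"] for n in graph.neighbors(1, "KNOWS")]
```

Notice that finding Alice's friends means following the edges attached to her node, rather than searching a separate table for matching rows.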
In addition to having the ability to store complex and connected data, the Neo4j graph database supports hierarchical and non-hierarchical data structures. In a hierarchical data structure, a parent node has child nodes beneath it, forming a shape like a tree with branches. For example, in the logistics structure described in Ben Stopford’s Thinking in Graphs, the package network node (pink) has a parcel center node (light green) underneath it. The delivery base node (blue) belongs to the parcel center, the delivery area (purple) belongs to the delivery base, and so on down the structure. In a non-hierarchical structure, these parent-child dependencies between the nodes do not exist, and the nodes are more freely interconnected.
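The difference between the two shapes can be sketched in a few lines of plain Python. The hierarchy reuses the node names from the logistics example above; the non-hierarchical part uses made-up people, and the helper `ancestors` is my own illustration rather than anything from Neo4j.

```python
# Hierarchical: every node except the root has exactly one parent.
parent_of = {
    "parcel center": "package network",
    "delivery base": "parcel center",
    "delivery area": "delivery base",
}


def ancestors(node):
    """Walk up the tree and return the chain of parents, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


# Non-hierarchical: nodes connect to each other with no single parent.
# (Hypothetical data, only here to show the shape.)
knows = {
    "Ann": {"Bo", "Cy"},
    "Bo": {"Ann", "Cy"},
    "Cy": {"Ann", "Bo"},
}

path_to_root = ancestors("delivery area")
```

In the hierarchy, every node has one unambiguous path up to the root. In the second structure nobody is the "parent" of anybody, which is exactly the kind of data a tree cannot express but a graph can.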
The third advantage of graph databases is that they perform quick searches. As Robinson and his colleagues explain in Graph Databases: New Opportunities for Connected Data, a relational database has to perform join queries to connect pieces of data together, and performance slows down as the dataset grows. A graph database's performance, in contrast, tends to remain relatively constant, because queries are localized to a portion of the graph. As a result, the execution time for each query is proportional only to the size of the part of the graph traversed to satisfy that query, rather than the size of the overall graph.
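A small experiment in plain Python illustrates the idea of a localized query. This is a sketch under simplifying assumptions, not a benchmark of Neo4j: the graph is a made-up chain where each node links to the next one, and `build_chain_graph` and `traverse` are hypothetical helpers written for this post.

```python
from collections import deque


def build_chain_graph(size):
    """Made-up test graph: node i links to node i + 1 (adjacency lists)."""
    return {i: [i + 1] if i + 1 < size else [] for i in range(size)}


def traverse(graph, start, max_depth):
    """Breadth-first traversal up to max_depth hops; returns the visited nodes."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested number of hops
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited


# Same query (three hops from node 0) against a small and a large graph.
small = traverse(build_chain_graph(1_000), start=0, max_depth=3)
large = traverse(build_chain_graph(100_000), start=0, max_depth=3)
```

Both traversals touch the same handful of nodes even though the second graph is a hundred times larger, because the work depends on the neighborhood explored and not on the total number of nodes.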
As I begin my penultimate semester at Indiana University’s School of Informatics today, I will take a break from Neo4j and Cypher. But I promise we will return for at least one more discussion before I graduate in May 2018.