Graph Data Modeling 101

10th October, 2016

Data modeling – the translation of your data’s conceptual view to a logical model – is the first step towards great graph visualization.

During the data-modeling process you determine which entities in your dataset should be nodes, which should be links and which should be discarded. The result is a blueprint of your data’s entities, relationships and properties. You can use that blueprint to create a visualization model.

The process is iterative and often relies on trial and error, but it’s worth doing correctly. There are many ways to model a single dataset, and some are more useful than others. Getting it right makes life easier for both the developer and the end user.

What are nodes, links and properties?

First let’s look at the graph model. If you use a graph database, you’ll already be familiar with nodes and edges (known as nodes and links in a visualization environment):

The basic graph model

Nodes are the fundamental units of our data. They’re the entities around which we design our entire model.

Links are the relationships between nodes. They can be single, directed or multiple:

  • Single indicates an on/off type relationship, where only the existence of a link is important.
  • Directed implies the directional flow of information, communication or commodities between two nodes.
  • Multiple represents more than one relationship that needs to be visualized separately, not amalgamated in a single link.

Properties are descriptive characteristics of nodes and links that aren’t important enough to become nodes themselves.
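To make the three terms concrete, here is a minimal sketch of how nodes, links and properties might be represented in code. The class names, field names and the sample people are all hypothetical and chosen only for illustration; they are not the format of any particular product.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """An entity in the model, identified by a unique id."""
    id: str
    kind: str                       # e.g. "Person" or "Vehicle"
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Link:
    """A relationship between two nodes."""
    source: str                     # id of the node the link starts at
    target: str                     # id of the node the link ends at
    kind: str                       # e.g. "CALLED"
    directed: bool = False          # True when the direction of flow matters
    properties: dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data.
alice = Node("p1", "Person", {"name": "Alice"})
bob = Node("p2", "Person", {"name": "Bob"})

# Multiple links between the same pair are kept separate, not amalgamated.
links = [
    Link("p1", "p2", "CALLED", directed=True, properties={"minutes": 12}),
    Link("p1", "p2", "EMAILED", directed=True),
    Link("p1", "p2", "KNOWS"),      # single, undirected: only existence matters
]
```

Here the name and the call duration are properties: they describe a node or a link, but nothing in the model needs to connect to them.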

There is no formula for deriving a graph model from your data, but we can give you some guidance. Let’s walk through two examples: first data in a relational format, then data in a Key-Value format.

Modeling Relational Data as a Graph

Relational databases have been around for decades. They are a familiar and reliable option for digital data storage, and virtually every organization either has one, or relies on cloud services that use one.

As the name suggests, they store related data in two-dimensional tables of columns and rows. Translating this relational data into a graph format takes some work.

For example, here’s a set of relational tables containing vehicle data:


And here’s a list of people with car insurance policies:

Name | Phone | Address
Tracy Freeman | 216-555-0192 | 904 Riverside Street, Ashtabula, OH 44004
Ramiro Rowe | 201-555-0107 | 980 Linden St., Bayonne, NJ 07002
Justin Park | 309-555-0157 | 9727 Cedar Dr., Rolling Meadows, IL 60008
Sadie Medina | 239-555-0157 | 9033 Yukon Street, Ponte Vedra Beach, FL 32082
Frankie Chavez | 212-555-0198 | 7549 Eagle Dr., Buffalo, NY 14215
Josephine Wu | 404-555-0163 | 87 Maiden Street, Riverdale, GA 30274

And a separate table of insurance policies that connect vehicles to owners:

Policy_ID | Owner
G3Q7T35 | Tracy Freeman
JLL8R2Y | Ramiro Rowe
MV3GJ7S | Justin Park
SL28KTB | Sadie Medina
TDNSN4C | Frankie Chavez
X7SHNVW | Josephine Wu

The tables describe our data model: Vehicles, Policies and Owners. From there, it’s easy to decide what the links should be:

Our three tables become three nodes
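The table-to-node mapping can be sketched in a few lines. This is only an illustration: the dictionary keys (`name`, `phone`, `policy_id`, `owner`) and the `HOLDS` link label are assumptions, only two rows per table are shown, and vehicle rows are left out but would follow the same pattern.

```python
# Two rows from each of the owner and policy tables
# (addresses omitted for brevity).
owner_rows = [
    {"name": "Tracy Freeman", "phone": "216-555-0192"},
    {"name": "Ramiro Rowe", "phone": "201-555-0107"},
]
policy_rows = [
    {"policy_id": "G3Q7T35", "owner": "Tracy Freeman"},
    {"policy_id": "JLL8R2Y", "owner": "Ramiro Rowe"},
]

nodes = {}   # (kind, key) -> properties
links = []   # (source node, target node, label)

# Each owner row becomes an Owner node. The name is used as a
# temporary key here, although names are not reliable identifiers.
for row in owner_rows:
    nodes[("Owner", row["name"])] = {"phone": row["phone"]}

# Each policy row becomes a Policy node plus a link to its owner.
# "HOLDS" is a hypothetical label.
for row in policy_rows:
    policy = ("Policy", row["policy_id"])
    nodes[policy] = {}
    links.append((("Owner", row["owner"]), policy, "HOLDS"))
```

Each table contributes one kind of node, and the column that references another table (here, the owner on a policy row) becomes a link rather than a property.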

To validate this model, think about your users. What questions do they need to answer? If Vehicle_Year is an important part of the investigation, it should be a node. If not, it should be a property of the Vehicle node.
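The difference between the two choices is easier to see side by side. The sketch below uses invented vehicle rows (the VINs, registrations and years are placeholders, not real data) and a hypothetical `BUILT_IN` link label.

```python
# Hypothetical vehicle rows; every value is a placeholder.
vehicles = [
    {"vin": "VIN-A", "registration": "REG-1", "year": 2012},
    {"vin": "VIN-B", "registration": "REG-2", "year": 2012},
    {"vin": "VIN-C", "registration": "REG-3", "year": 2015},
]


def year_as_property(rows):
    """Keep the year on the Vehicle node; no extra nodes or links."""
    nodes = {
        r["vin"]: {"kind": "Vehicle",
                   "registration": r["registration"],
                   "year": r["year"]}
        for r in rows
    }
    return nodes, []


def year_as_node(rows):
    """Promote the year to its own node and link each vehicle to it."""
    nodes, links = {}, []
    for r in rows:
        nodes[r["vin"]] = {"kind": "Vehicle",
                           "registration": r["registration"]}
        year_id = f"year:{r['year']}"
        nodes[year_id] = {"kind": "Year"}     # shared by same-year vehicles
        links.append((r["vin"], year_id, "BUILT_IN"))
    return nodes, links


prop_nodes, prop_links = year_as_property(vehicles)
node_nodes, node_links = year_as_node(vehicles)
```

With the year as a node, vehicles built in the same year become visibly connected through it, at the cost of extra nodes and links. With the year as a property, the chart stays smaller and the year is still available for filtering or labels.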

The next step is to select a unique identifier for each node. The Policy_ID already identifies each policy uniquely, so that is a straightforward decision.

For the Vehicle node we have three options: VIN, Registration and Year. Millions of cars are built each year, and a registration can be reassigned to a different automobile. The only attribute we know is unique to each vehicle is the VIN, sometimes called the chassis number, so that’s the property we should use to identify our vehicle.
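A quick sanity check on candidate identifiers is to test whether each column contains duplicates. The sketch below is a hedged illustration: the `is_unique` helper and the vehicle rows are hypothetical, with placeholder values.

```python
def is_unique(rows, column):
    """Return True if no value in `column` appears more than once."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


# Hypothetical vehicle rows: one registration has been reused,
# and two vehicles share a build year.
vehicles = [
    {"vin": "VIN-A", "registration": "REG-1", "year": 2012},
    {"vin": "VIN-B", "registration": "REG-2", "year": 2012},
    {"vin": "VIN-C", "registration": "REG-1", "year": 2015},
]

candidates = {col: is_unique(vehicles, col)
              for col in ("vin", "registration", "year")}
```

A check like this can only rule a column out. A column with no duplicates in today's data may still collide tomorrow, so domain knowledge (such as knowing that a VIN is assigned once per vehicle) remains the deciding factor.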

Finding a unique attribute for the Person node is not as simple. Addresses and names are not unique identifiers.

This is a challenge known as Identity Resolution. One way around it is to assign a new attribute to each person – e.g. a customer ID number:

Our three nodes with their unique identifiers

It’s important to solve the Identity Resolution problem, especially if you plan to visualize your graph. Our products merge nodes with identical IDs. If the merged nodes actually represent different entities, you may misrepresent important patterns.
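The effect of merging on a non-unique key can be shown with a small sketch. The records, the helper names and the `C0001`-style ID format are all hypothetical. The sketch also assumes each record is already known to describe a distinct person, which real identity resolution would need to establish first.

```python
def build_nodes(records, key):
    """Group records by `key`; records sharing a key collapse into one node."""
    nodes = {}
    for rec in records:
        nodes.setdefault(rec[key], []).append(rec)
    return nodes


def assign_customer_ids(records, prefix="C"):
    """Return copies of the records with a new, unique customer_id each."""
    return [{**rec, "customer_id": f"{prefix}{i:04d}"}
            for i, rec in enumerate(records, start=1)]


# Hypothetical records: two different people share the same name.
records = [
    {"name": "Sam Lee", "city": "Leeds"},
    {"name": "Sam Lee", "city": "York"},
    {"name": "Ana Ruiz", "city": "Bath"},
]

by_name = build_nodes(records, "name")                       # 2 nodes
by_id = build_nodes(assign_customer_ids(records), "customer_id")  # 3 nodes
```

Keyed by name, the two people called Sam Lee collapse into a single node that appears to hold both records. Keyed by customer ID, they stay separate.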

All remaining columns can be added as properties – provided they are going to be useful. Don’t add properties to your model just because they are in your database.

Our final data model

Modeling Key-Value as a Graph

Data modeling is more complicated if you’re working with Key-Value data stores such as Couchbase, DynamoDB or FoundationDB. Relationships in Key-Value datasets aren’t stored in interconnected tables, so there’s no obvious way to translate from a physical model to a logical model.

Instead, they’re stored in rows as associative arrays:


As you can see, the data isn’t as structured. New columns can be added at any time, introducing new data points and relationships.

Despite this, the same two rules apply:

  1. Nodes should be the core objects your users need to understand
  2. Nodes should be unique

You can then infer relationships and add any remaining – and useful – columns as properties to the nodes or links.
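One possible way to infer relationships from Key-Value records is to treat any value that matches another record's key as a link, and everything else as a property. The sketch below assumes that convention and uses an invented in-memory store; real Key-Value databases each have their own client APIs, which are not shown here.

```python
# Hypothetical store: each key maps to an associative array whose
# fields vary from record to record.
store = {
    "person:1": {"name": "Sam Lee", "policy": "policy:9"},
    "policy:9": {"vehicle": "vehicle:4", "premium": 420},
    "vehicle:4": {"year": 2012},
}


def to_graph(store):
    """Turn every record into a node; values naming another key become links."""
    nodes, links = {}, []
    for key, record in store.items():
        props = {}
        for field_name, value in record.items():
            if isinstance(value, str) and value in store:
                # The value points at another record: infer a relationship.
                links.append((key, value, field_name))
            else:
                props[field_name] = value
        nodes[key] = props
    return nodes, links


nodes, links = to_graph(store)
```

Because keys in a Key-Value store are unique by construction, using them as node IDs satisfies the second rule. Whether every record deserves to be a node is still a judgment about what your users need to understand.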

Create a visual model

Once you’ve chosen a winning data model that’s both simple and practical, you can start translating it into your visualization model.

Cambridge Intelligence products have plenty of features to make this part of the process easier. Take a look at our demos to see what techniques are available, and view the source code to see how they are created.

Finally, before you start, read the 10 Rules of Great Graph Design and Pitfalls of Network Visualization. It would be a shame if all your hard work was overshadowed by an overly enthusiastic approach to glyphs.

The diagrams for this post were created using Alistair Jones’ excellent Arrow tool.
