Data modeling – the translation of your data’s conceptual view to a logical model – is the first step towards great graph visualization.
During the data-modeling process you determine which entities in your dataset should be nodes, which should be links and which should be discarded. The result is a blueprint of your data’s entities, relationships and properties. You can use that blueprint to create a visualization model.
The process is iterative and often relies on trial and error, but it’s worth doing correctly. There are many different ways to model a single dataset, and some are more useful than others. Getting it right will make life easier for both the developer and the end user.
First let’s look at the graph model. If you use a graph database, you’ll already be familiar with nodes and edges (known as nodes and links in a visualization environment):
Nodes are the fundamental units of our data. They’re the entities around which we design our entire model.
Links are the relationships between nodes. They can be single, directed or multiple:
Properties are descriptive characteristics of nodes and links, but aren’t important enough to become nodes themselves.
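As a rough illustration, the three building blocks above can be sketched as plain Python data structures. The class names, field names and the `HOLDS` link type are assumptions made for this example, not the API of any particular graph database or visualization toolkit.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """An entity in the model, identified by a unique id."""
    id: str
    kind: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Link:
    """A relationship between two nodes, referenced by their ids."""
    source: str
    target: str
    kind: str
    directed: bool = True
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical ids and link type, chosen only for illustration.
person = Node(id="person-1", kind="Person", properties={"name": "Tracy Freeman"})
policy = Node(id="policy-1", kind="Policy")
holds = Link(source=person.id, target=policy.id, kind="HOLDS")
```

Keeping descriptive values such as the name in `properties` keeps the node count down; promoting a value to its own `Node` only makes sense when users need to see connections through it.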
There is no formula for deriving a graph model from your data, but we can give you some guidance. Let’s walk through two examples: first data in a relational format, then data in a Key-Value format.
Relational databases have been around for decades. They are a familiar and reliable option for digital data storage, and virtually every organization either has one, or relies on cloud services that use one.
As the name suggests, they store related data in two-dimensional tables of columns and rows. Translating this relational data into a graph format takes some work.
For example, here’s a set of relational tables containing vehicle data:
And here’s a list of people with car insurance policies:
| Name | Phone | Address |
|---|---|---|
| Tracy Freeman | 216-555-0192 | 904 Riverside Street, Ashtabula, OH 44004 |
| Ramiro Rowe | 201-555-0107 | 980 Linden St., Bayonne, NJ 07002 |
| Justin Park | 309-555-0157 | 9727 Cedar Dr., Rolling Meadows, IL 60008 |
| Sadie Medina | 239-555-0157 | 9033 Yukon Street, Ponte Vedra Beach, FL 32082 |
| Frankie Chavez | 212-555-0198 | 7549 Eagle Dr., Buffalo, NY 14215 |
| Josephine Wu | 404-555-0163 | 87 Maiden Street, Riverdale, GA 30274 |
And a separate table of insurance policies that connects vehicles to owners:
The tables describe our data model: Vehicles, Policies and Owners. From there, it’s easy to decide what the links should be:
To validate this model, think about your users. What questions do they need to answer? If Vehicle_Year is an important part of the investigation, it should be a node. If not, it should be a property of the Vehicle node.
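To make the trade-off concrete, here is a minimal sketch that models the same vehicle rows both ways. The function names, dictionary keys and the third row are made up for illustration; only the first two VINs and years come from the sample data in this post.

```python
vehicles = [
    {"vin": "19UYA1234L0002133", "year": 2014},
    {"vin": "1G4AB37X0DW483037", "year": 2015},
    {"vin": "EXAMPLE-VIN-3", "year": 2014},  # made-up row for illustration
]


def year_as_property(rows):
    """Keep the year on the Vehicle node: one node per vehicle, no extra links."""
    nodes = {r["vin"]: {"type": "Vehicle", "year": r["year"]} for r in rows}
    return nodes, []


def year_as_node(rows):
    """Promote the year to its own node: vehicles built in the same year share it."""
    nodes, links = {}, []
    for r in rows:
        year_id = f"year:{r['year']}"
        nodes[r["vin"]] = {"type": "Vehicle"}
        nodes[year_id] = {"type": "Year"}
        links.append((r["vin"], year_id))
    return nodes, links
```

With the year as a node, "which vehicles were built in 2014?" becomes a neighbor lookup on a shared node, at the cost of extra nodes and links on screen. With the year as a property, the chart stays smaller and the value is still available for filtering or styling.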
The next step is to select unique identifiers for each node.
For the vehicle node we have three candidate identifiers: VIN, Registration and Year. Millions of cars are built each year, and a registration can be reassigned to a different automobile. The only attribute we know is unique to each vehicle is the VIN, sometimes called the chassis number. So that’s the property we should choose to identify our vehicle node.
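A quick sanity check on candidate identifiers is to look for repeated values in each column. The sketch below assumes the rows are available as dictionaries. The helper name, the column names and the third row are hypothetical.

```python
from collections import Counter


def duplicated_values(rows, column):
    """Return the values that appear more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


vehicles = [
    {"vin": "19UYA1234L0002133", "registration": "4PFE120", "year": 2014},
    {"vin": "1G4AB37X0DW483037", "registration": "3338 NB", "year": 2015},
    # Made-up row: a reassigned registration on a different vehicle.
    {"vin": "EXAMPLE-VIN-3", "registration": "4PFE120", "year": 2014},
]

for column in ("vin", "registration", "year"):
    print(column, duplicated_values(vehicles, column))
```

A column with no duplicates in a sample is only a candidate: the check can rule an identifier out, but it takes domain knowledge (such as how VINs are issued) to rule one in.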
The Policy_ID uniquely identifies each policy, so that is a straightforward decision.
Finding a unique attribute for the Person node is not as simple. Addresses and names are not unique identifiers.
This is a challenge known as Identity Resolution. One way around it is to assign a new attribute to each person – e.g. a customer ID number:
It’s important to solve the Identity Resolution problem, especially if you plan to visualize your graph. Our products merge nodes with identical IDs. If those nodes actually represent different people, the merged node will misrepresent important patterns.
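The effect of a poor identifier can be sketched in a few lines. This example assumes a simple "same ID means same node" merge rule, implemented here with a dictionary. The second Justin Park and the `customer_id` format are hypothetical.

```python
import itertools

people = [
    {"name": "Justin Park", "phone": "309-555-0157"},
    {"name": "Justin Park", "phone": "000-555-0000"},  # hypothetical namesake
    {"name": "Sadie Medina", "phone": "239-555-0157"},
]


def build_nodes(rows, key):
    """Group rows by node ID; rows that share an ID collapse into one node."""
    nodes = {}
    for row in rows:
        nodes.setdefault(key(row), []).append(row)
    return nodes


# Using the name as the ID merges two different people into one node.
by_name = build_nodes(people, key=lambda r: r["name"])

# Assigning a surrogate customer ID keeps them apart.
counter = itertools.count(1)
for row in people:
    row["customer_id"] = f"C{next(counter):04d}"
by_customer_id = build_nodes(people, key=lambda r: r["customer_id"])

print(len(by_name), len(by_customer_id))
```

Numbering every row like this assumes each row really is a distinct person. In real data, deciding which rows refer to the same individual is the hard part of identity resolution, and it has to happen before IDs are assigned.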
All remaining columns can be added as properties, provided they are going to be useful. Don’t add properties to your model just because they are in your database.
Data modeling is more complicated if you’re working with Key-Value data stores such as Couchbase, DynamoDB or FoundationDB. Relationships in Key-Value datasets aren’t stored in interconnected tables, so there’s no obvious way to translate from a physical model to a logical model.
Instead, they’re stored in rows as associative arrays:
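| VIN | Registration | Year | Policy_ID | Name | Phone | Address |
|---|---|---|---|---|---|---|
| 19UYA1234L0002133 | 4PFE120 | 2014 | G3Q7T35 | Tracy Freeman | 216-555-0192 | 904 Riverside Street, Ashtabula, OH 44004 |
| 1G4AB37X0DW483037 | 3338 NB | 2015 | JLL8R2Y | Ramiro Rowe | 201-555-0107 | 980 Linden St., Bayonne, NJ 07002 |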
As you can see, the data isn’t as structured. New columns can be added at any time, introducing new data points and relationships.
Despite this, the same two rules apply:
You can then infer relationships and add any remaining – and useful – columns as properties to the nodes or links.
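Here is a minimal sketch of that translation for the two sample rows above, assuming each row is available as a dictionary. The key names, the `HOLDS` and `COVERS` link types, and the name-plus-phone person ID are all assumptions for illustration; a real customer ID would be a safer person identifier.

```python
rows = [
    {"vin": "19UYA1234L0002133", "registration": "4PFE120", "year": 2014,
     "policy_id": "G3Q7T35", "name": "Tracy Freeman", "phone": "216-555-0192",
     "address": "904 Riverside Street, Ashtabula, OH 44004"},
    {"vin": "1G4AB37X0DW483037", "registration": "3338 NB", "year": 2015,
     "policy_id": "JLL8R2Y", "name": "Ramiro Rowe", "phone": "201-555-0107",
     "address": "980 Linden St., Bayonne, NJ 07002"},
]


def to_graph(rows):
    """Split each flat row into Vehicle, Policy and Person nodes plus links."""
    nodes = {}
    links = set()
    for row in rows:
        vehicle_id = f"vehicle:{row['vin']}"
        policy_id = f"policy:{row['policy_id']}"
        # Stand-in ID: name plus phone is not guaranteed to be unique.
        person_id = f"person:{row['name']}|{row['phone']}"

        nodes[vehicle_id] = {"type": "Vehicle",
                             "registration": row["registration"],
                             "year": row["year"]}
        nodes[policy_id] = {"type": "Policy"}
        nodes[person_id] = {"type": "Person", "name": row["name"],
                            "phone": row["phone"], "address": row["address"]}

        # Inferred relationships: a person holds a policy, a policy covers a vehicle.
        links.add((person_id, "HOLDS", policy_id))
        links.add((policy_id, "COVERS", vehicle_id))
    return nodes, links


nodes, links = to_graph(rows)
```

Because nodes are stored in a dictionary keyed by ID and links in a set, a vehicle, policy or person that appears in several rows still produces a single node and a single link.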
Once you’ve chosen a winning data model that’s both simple and practical, you can start translating it into your visualization model.
Cambridge Intelligence products have plenty of features to make this part of the process easier. Take a look at our demos to see what techniques are available, and view the source code to see how they are created.
Finally, before you start, read the 10 Rules of Great Graph Design and Pitfalls of Network Visualization. It would be a shame if all your hard work was overshadowed by an overly enthusiastic approach to glyphs.
The diagrams for this post were created using Alistair Jones’ excellent Arrow tool.