Five steps to tackle big graph data visualization

21st September, 2018

Rather watch a video? This blog is also available as a webinar recording.

Understanding big graph data requires two things: a robust database and a powerful graph visualization engine. That’s why hundreds of developers have combined a graph database with our technology to create effective, interactive tools to make sense of their graph data.

But humans are not big data creatures. Most adults can store between 4 and 7 items in their short-term memory. That presents a challenge when building graph analysis tools. How do you decide which subset of data to present to your users? How do they find the most important patterns and connections?

That’s the question we’ll try to answer in this blog post. You’ll discover that, with some thoughtful planning, big data doesn’t have to be a big problem.

20,000 nodes visualized in KeyLines. Pretty, but pretty useless if you want to understand specific node behavior. Data from The Cosmic Web Project.

The challenge of massive graph visualization

For many organizations, ‘big data’ means collecting every bit of information available, then figuring out how to use it. That makes it incredibly challenging to go beyond aggregated analysis to understand individual elements.

To provide your users with something more useful, think about the data funnel. Using back-end data management and front-end interactions, the funnel reduces billions of data points into something a user can comprehend.

The data funnel to bring big data down to a human scale

Let’s focus on the key techniques you’ll apply at each stage.

1. Filtering in the back-end: ~1,000,000+ nodes

There’s no point visualizing your entire database instance. You want to remove as much noise as possible, as early as possible. Filtering with database queries is an incredibly effective way to do this.

Our toolkits are flexible enough to let you give users visual ways to create custom filtering queries, like sliders, tick-boxes or selecting from a list of cases. In this example, we’re using queries to power a ‘search and expand’ interaction:

This search and expand model lets users explore huge volumes of graph data in a manageable way

There’s no guarantee that filtering through search will be enough to keep data points at a manageable level. Multiple searches might return excessive amounts of information that’s hard to analyze. Filtering is effective, but it shouldn’t be the only technique you use.
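As a sketch, the slider and tick-box controls described above can be translated into a parameterized query on the back end. The schema, property names and thresholds here are illustrative assumptions, not from a real dataset:

```javascript
// Build a parameterized Cypher-style query from front-end filter controls.
// The TRANSACTION schema and its properties are hypothetical examples.
function buildFilterQuery(filters) {
  const clauses = [];
  const params = {};
  if (filters.minAmount != null) {
    clauses.push('t.amount >= $minAmount'); // driven by a slider
    params.minAmount = filters.minAmount;
  }
  if (filters.categories && filters.categories.length) {
    clauses.push('t.category IN $categories'); // driven by tick-boxes
    params.categories = filters.categories;
  }
  const where = clauses.length ? 'WHERE ' + clauses.join(' AND ') : '';
  const query =
    `MATCH (a)-[t:TRANSACTION]->(b) ${where} RETURN a, t, b LIMIT $limit`;
  params.limit = filters.limit || 1000;
  return { query, params };
}

// Example: a slider sets the minimum amount, tick-boxes pick categories
const { query, params } = buildFilterQuery({
  minAmount: 500,
  categories: ['wire'],
});
```

Passing values as parameters, rather than concatenating them into the query string, avoids injection risks and lets the database reuse cached query plans.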

2. Aggregating in the back-end: ~100,000 nodes

Once filtering techniques are in place, you should consider aggregation. There are two ways to approach this.

Firstly, there’s data cleansing to remove duplicates and errors. This can be time-consuming but, again, queries are your friend. Functions like Cypher’s ‘count’ make it really easy to aggregate nodes in the back end.

Secondly, there’s a data modeling step to stop unnecessary clutter from entering your chart in the first place. Can multiple nodes be merged? Can multiple links be collapsed into one?

It’s worth taking some time to get this right. With a few simple aggregation decisions, it’s possible to reduce tens of thousands of nodes into a few thousand.

Using link aggregation, we’ve reduced 22,000 nodes and links into a much more manageable chart
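The link-aggregation idea can be sketched as collapsing parallel links between the same pair of nodes into a single link that carries a count. The link format here is a simplified assumption:

```javascript
// Collapse parallel links between the same node pair into one
// aggregated link carrying a count - a common back-end aggregation step.
function aggregateLinks(links) {
  const grouped = new Map();
  for (const { from, to } of links) {
    // treat links as undirected: a->b and b->a share one key
    const key = from < to ? `${from}|${to}` : `${to}|${from}`;
    const agg = grouped.get(key) || { from, to, count: 0 };
    agg.count += 1;
    grouped.set(key, agg);
  }
  return [...grouped.values()];
}

const raw = [
  { from: 'a', to: 'b' },
  { from: 'b', to: 'a' },
  { from: 'a', to: 'c' },
];
const aggregated = aggregateLinks(raw);
// three raw links collapse to two aggregated links; a-b carries count 2
```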

3. Create a clever visual model: ~10,000 – 1,000 nodes

By now, you should have reduced 1,000,000+ nodes to a few thousand. This is where the power of visualization really shines. Your user’s visualization relies on a small proportion of what’s in the database, but we can use visual modeling to simplify it further.

This chart shows graph data relating to car insurance claims. Our schema includes cars, policyholders, phone numbers, insurance claims, claimants, third parties, garages and accidents:

A typical example of an insurance fraud detection dataset

Loading the full data model can be useful, but with some carefully considered re-modelling, the user can select an alternative approach suited to the insight they need. Perhaps they want to see direct connections between policyholders and garages:

The same insurance fraud detection dataset, using a different graph data model to highlight different connections in the data

Or a view to remove unnecessary intermediate nodes and show connections between the people involved:

A third possible data model, omitting unnecessary nodes and links for an even clearer view

The ideal visual model will depend on the questions your users are trying to answer.
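One way to sketch the re-modelling step: where links pass through an intermediate node type (claims, in this hypothetical shape), derive direct links between the endpoints so the intermediate nodes can be dropped. The node and link shapes are assumptions for illustration:

```javascript
// Re-model the graph: replace X -> intermediate -> Y paths with direct
// X -> Y links, so intermediate nodes (e.g. claims) can be omitted.
function directConnections(links, intermediateType, nodeTypes) {
  const bySource = new Map();
  for (const l of links) {
    if (!bySource.has(l.from)) bySource.set(l.from, []);
    bySource.get(l.from).push(l);
  }
  const out = [];
  for (const first of links) {
    // only follow paths whose middle node is the intermediate type
    if (nodeTypes[first.to] !== intermediateType) continue;
    for (const second of bySource.get(first.to) || []) {
      out.push({ from: first.from, to: second.to });
    }
  }
  return out;
}

// Hypothetical policyholder -> claim -> garage path
const nodeTypes = { p1: 'policyholder', c1: 'claim', g1: 'garage' };
const modelLinks = [
  { from: 'p1', to: 'c1' },
  { from: 'c1', to: 'g1' },
];
const direct = directConnections(modelLinks, 'claim', nodeTypes);
// direct now links the policyholder straight to the garage
```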

4. Filters, combining and pruning: ~1,000 nodes

Now that your users have the relevant nodes and links in their chart, you should give them the tools to declutter and focus on their insight.

Filters are great for this. For a smoother user experience (and better performance), consider presenting users with an already-filtered view, with the option to bring in more data.

Another option is to use our Combos functionality to group nodes and links, giving a clearer view of a large dataset without actually removing anything from the chart. It’s an effective way to simplify complexity, but also to offer a ‘detail on demand’ user experience that makes graph insight easier to find.

Combos are a great way to tidy an otherwise messy or overwhelming graph visualization
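The grouping behind a combo can be sketched as bucketing nodes by a shared property, with each bucket rendered as one openable node. The `city` property and the combo format here are illustrative assumptions, not the actual Combos API:

```javascript
// Group nodes into combos by a shared property (e.g. city), so each
// group can be drawn as a single expandable node.
function comboize(nodes, keyFn) {
  const combos = new Map();
  for (const node of nodes) {
    const key = keyFn(node);
    if (!combos.has(key)) {
      combos.set(key, { id: `combo-${key}`, children: [] });
    }
    combos.get(key).children.push(node.id);
  }
  return [...combos.values()];
}

const cityNodes = [
  { id: 'n1', city: 'London' },
  { id: 'n2', city: 'London' },
  { id: 'n3', city: 'Paris' },
];
const combos = comboize(cityNodes, n => n.city);
// two combos: combo-London holds two nodes, combo-Paris holds one
```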

A third option is to remove unnecessary distractions from a chart. This might mean giving users a way to ‘prune’ leaf nodes or making it easy to hide ‘supernodes’ that clutter the chart and obscure insight.

Leaf, orphan and super nodes rarely add anything to your graph data understanding, so give users an easy way to remove them

We have plenty of tools to help with this critical part of your graph data analysis. This video on managing chart clutter explains a few more.
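The pruning step above can be sketched as repeatedly removing nodes of degree 1 (leaves) or 0 (orphans) until only the connected core of the chart remains. The graph format is a simplified assumption:

```javascript
// Repeatedly prune leaf (degree 1) and orphan (degree 0) nodes until
// none remain, leaving only the connected core of the chart.
function pruneLeaves(nodes, links) {
  const keep = new Set(nodes);
  let remaining = links.slice();
  let changed = true;
  while (changed) {
    changed = false;
    const degree = new Map();
    for (const { from, to } of remaining) {
      degree.set(from, (degree.get(from) || 0) + 1);
      degree.set(to, (degree.get(to) || 0) + 1);
    }
    for (const n of [...keep]) {
      if ((degree.get(n) || 0) <= 1) {
        keep.delete(n);
        changed = true; // removal may expose new leaves, so loop again
      }
    }
    remaining = remaining.filter(l => keep.has(l.from) && keep.has(l.to));
  }
  return { nodes: [...keep], links: remaining };
}

// Triangle a-b-c with a leaf d and an orphan e
const pruned = pruneLeaves(
  ['a', 'b', 'c', 'd', 'e'],
  [
    { from: 'a', to: 'b' },
    { from: 'b', to: 'c' },
    { from: 'c', to: 'a' },
    { from: 'a', to: 'd' },
  ]
);
// only the triangle survives
```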

5. Run a layout: ~100 nodes

By this point, your users should have a tiny subset of your original graph data in their chart. The final step is to help them uncover insight. Automated graph layouts are great for this.

A good force-directed layout goes beyond simply detangling links. It should also reveal the patterns, anomalies and clusters that direct users towards the answers they’re looking for.

The organic graph layout
Our organic layout reveals the underlying structure in your graph data
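For intuition, here is a heavily simplified force-directed layout: repulsion between every pair of nodes, spring attraction along links, iterated towards equilibrium. Production layouts (including our organic layout) add far more (cooling schedules, cluster handling, overlap removal), so treat this as a sketch only, with illustrative constants:

```javascript
// Minimal force-directed layout sketch: pairwise repulsion plus spring
// attraction along links, applied for a fixed number of iterations.
function forceLayout(nodes, links, iterations = 200) {
  // start nodes evenly spaced on a unit circle
  const pos = new Map(nodes.map((id, i) => [id, {
    x: Math.cos((2 * Math.PI * i) / nodes.length),
    y: Math.sin((2 * Math.PI * i) / nodes.length),
  }]));
  const k = 0.1;          // spring constant (illustrative)
  const repulsion = 0.01; // repulsion strength (illustrative)
  for (let it = 0; it < iterations; it++) {
    const force = new Map(nodes.map(id => [id, { x: 0, y: 0 }]));
    // repulsive force between every pair of nodes
    for (const a of nodes) for (const b of nodes) {
      if (a === b) continue;
      const dx = pos.get(a).x - pos.get(b).x;
      const dy = pos.get(a).y - pos.get(b).y;
      const d2 = dx * dx + dy * dy + 1e-6; // avoid divide-by-zero
      force.get(a).x += (repulsion * dx) / d2;
      force.get(a).y += (repulsion * dy) / d2;
    }
    // attractive spring force along each link
    for (const { from, to } of links) {
      const dx = pos.get(to).x - pos.get(from).x;
      const dy = pos.get(to).y - pos.get(from).y;
      force.get(from).x += k * dx;
      force.get(from).y += k * dy;
      force.get(to).x -= k * dx;
      force.get(to).y -= k * dy;
    }
    // small step towards equilibrium
    for (const id of nodes) {
      pos.get(id).x += force.get(id).x * 0.1;
      pos.get(id).y += force.get(id).y * 0.1;
    }
  }
  return pos;
}

const layout = forceLayout(
  ['a', 'b', 'c', 'd'],
  [{ from: 'a', to: 'b' }, { from: 'b', to: 'c' }]
);
```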

With an effective, consistent and powerful graph layout, your users will find that answers start to jump out of the chart.

Bonus tip: talk to your users

This blog post is really just a starting point. There are plenty of other tips and techniques to help you solve the big graph data challenge (we’ve not even started on temporal analysis or geospatial visualization).

But probably the most important tip of all is this: take time to talk to your users.

Find out what they need to see and the questions they’re trying to answer. Use the data funnel to make their processes as simple and fast as possible, and use the power of graph visualization to turn their biggest datasets into something genuinely insightful.

If you’re looking for a graph visualization technology to help, request a trial of our toolkits. We’d be happy to help.
