Five steps to tackle big graph data visualization

21st September, 2018

Understanding big graph data requires two things: a robust database and a powerful graph visualization engine. That’s why hundreds of developers have combined a graph database with KeyLines to create effective, interactive tools to explore and make sense of their graph data.

But humans are not big data creatures. Given that most adults can hold only around four to seven items in short-term memory, loading an overwhelming quantity of densely-connected items into a chart won't generate insight.

That presents a challenge for those of us building graph analysis tools. How do you decide which subset of data to present to users? How do they find the most important patterns and connections? That’s what we’ll explore in this blog post. You’ll discover that, with some thoughtful planning, big data doesn’t have to be a big problem.

The challenge of massive graph visualization

For many organizations, ‘big data’ means collecting every bit of information available, then figuring out how to use it later. One of the many problems with this approach is that it’s incredibly challenging to go beyond aggregated analysis to understand individual elements.

20,000 nodes visualized in KeyLines. Pretty, but pretty useless if you want to understand specific node behavior. Data from The Cosmic Web Project.

To provide your users with something more useful, you need to think about the data funnel. Through back-end data management and front-end interactions, the funnel reduces billions of data points into something a user can comprehend.

The data funnel to bring big data down to a human scale

Let’s focus on the key techniques you’ll apply at each stage.

1. Filtering on the back-end: ~1,000,000+ nodes

There’s no point visualizing your entire database instance. You want to remove as much noise as possible, as early as possible. Filtering with database queries is an incredibly effective way to do this.

KeyLines’ flexibility means you can give users visual ways to build custom filtering queries – sliders, tick boxes or a selectable list of cases. In this example, we’re using queries to power a ‘search and expand’ interaction in KeyLines:


There’s no guarantee that filtering through search will be enough to keep data points at a manageable level. Multiple searches might return excessive amounts of information that’s hard to analyze. Filtering is effective, but it shouldn’t be the only technique you use.
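To make that concrete, here’s a minimal sketch of a back-end filtering query, assuming a Neo4j database queried with Cypher via the official neo4j-driver package. The Person label, property names and connection details are illustrative, not taken from a real schema.

```typescript
import neo4j from 'neo4j-driver';

// Hypothetical connection details – point these at your own database.
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password')
);

// Fetch a seed node by name plus its immediate neighbours only,
// rather than pulling the whole graph into the chart.
async function searchAndExpand(name: string) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (p:Person {name: $name})-[r]-(neighbour)
       RETURN p, r, neighbour
       LIMIT 200`,
      { name }
    );
    return result.records;
  } finally {
    await session.close();
  }
}
```

The LIMIT clause and the single-hop pattern are doing the real work: the chart only ever receives a node’s immediate neighbourhood, never the whole graph.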

2. Aggregating in the back-end: ~100,000 nodes

Once filtering techniques are in place, you should consider aggregation. There are two ways to approach this.

Firstly, there’s data cleansing to remove duplicates and errors. This can be time-consuming but, again, queries are your friend. Functions like Cypher’s ‘count’ make it really easy to aggregate nodes in the back end.

Secondly, there’s a data modelling step to stop unnecessary clutter entering the KeyLines chart in the first place. Can multiple nodes be merged? Can multiple links be collapsed into one?

It’s worth taking some time to get this right. With a few simple aggregation decisions, it’s possible to reduce tens of thousands of nodes into a few hundred.
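As a rough illustration of that modelling step, here’s a sketch of link aggregation done in application code. The Link shape is an assumption made for the example, not KeyLines’ own format, and the same roll-up could equally be done with an aggregating database query.

```typescript
interface Link {
  id: string;
  from: string;
  to: string;
}

interface AggregatedLink extends Link {
  count: number;
}

// Collapse parallel links between the same pair of nodes into a single
// link, keeping a count that can later drive link width or a label.
function aggregateLinks(links: Link[]): AggregatedLink[] {
  const grouped = new Map<string, AggregatedLink>();
  for (const link of links) {
    // Treat links as undirected: A->B and B->A share the same key.
    const key = [link.from, link.to].sort().join('|');
    const existing = grouped.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      grouped.set(key, { ...link, count: 1 });
    }
  }
  return [...grouped.values()];
}
```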

Using link aggregation, we’ve reduced 22,000 nodes and links into a much more manageable chart

3. Create a clever visual model: ~10,000 – 1,000 nodes

By now, you should have reduced 1,000,000+ nodes to a few thousand. This is where the power of visualization really shines. Your users’ view relies on a small proportion of what’s in the database, but visual modelling can simplify it further.

This chart shows graph data relating to car insurance claims. Our schema includes car and policyholders, phone numbers, insurance claims, claimants, third parties, garages and accidents:


Loading the full data model can be useful, but with some carefully considered re-modelling, the user can select an alternative approach suited to the insight they need. Perhaps they want to see direct connections between policyholders and garages:


Or a view to remove unnecessary intermediate nodes and show connections between the people involved:


The ideal visual model will depend on the questions your users are trying to answer.
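To illustrate the kind of re-modelling involved, here’s a hedged sketch that turns raw insurance claim records into direct policyholder-to-garage links, dropping the intermediate claim nodes from the view entirely. The Claim shape and field names are hypothetical.

```typescript
interface Claim {
  id: string;
  policyholderId: string;
  garageId: string;
}

interface ModelLink {
  id: string;
  from: string;
  to: string;
}

// Re-model claim records as direct policyholder–garage links, so the
// chart shows who used which garage without drawing the claims themselves.
function policyholderToGarageLinks(claims: Claim[]): ModelLink[] {
  return claims.map(claim => ({
    id: `ph-garage-${claim.id}`,
    from: claim.policyholderId,
    to: claim.garageId,
  }));
}
```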

4. Filtering, combining and pruning: ~1,000 nodes

Now your users have the relevant nodes and links in their chart, you should give them tools to declutter the view and focus on finding insight.

A great way to do this is filtering: adding or removing subsets of the data on demand. For better performance, present users with a filtered view first, then give them controls to bring in more data as they need it. There are plenty of ways to do this – tick boxes, sliders, the time bar, or ‘expand and load’.

Another option is KeyLines’ combos functionality. Combos let users group related nodes, giving a clearer view of a large dataset without actually removing anything from the chart. It’s an effective way to simplify complexity, and it offers a ‘detail on demand’ user experience that makes graph insight easier to find.
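The grouping logic behind a combo view can be very simple. This sketch is not the KeyLines combos API itself, just the preparation step: it groups node ids by a shared attribute, here a hypothetical claimId, so each group can then be handed to the toolkit’s combo functionality.

```typescript
interface NodeItem {
  id: string;
  type: string;
  claimId?: string; // hypothetical grouping attribute
}

// Group node ids by the claim they belong to. Each entry becomes one combo.
function buildCombos(nodes: NodeItem[]): Map<string, string[]> {
  const combos = new Map<string, string[]>();
  for (const node of nodes) {
    if (!node.claimId) continue; // ungrouped nodes stay as they are
    const members = combos.get(node.claimId) ?? [];
    members.push(node.id);
    combos.set(node.claimId, members);
  }
  return combos;
}
```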


A third example of best practice is to remove unnecessary distractions from a chart. This might mean giving users a way to ‘prune’ leaf nodes, or making it easy to hide ‘super nodes’ that clutter the chart and obscure insight.

Leaf, orphan and super nodes rarely add anything to your graph data understanding, so give users an easy way to remove them
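A degree count is usually all you need to find those nodes. Here’s a minimal pruning sketch, assuming plain node and link arrays rather than any particular toolkit format.

```typescript
interface GraphNode {
  id: string;
}

interface GraphLink {
  from: string;
  to: string;
}

// Remove orphan nodes (degree 0) and leaf nodes (degree 1), along with any
// links touching them. Call it again if pruning exposes a new layer of leaves.
function pruneLeaves(nodes: GraphNode[], links: GraphLink[]) {
  const degree = new Map<string, number>();
  for (const node of nodes) degree.set(node.id, 0);
  for (const link of links) {
    degree.set(link.from, (degree.get(link.from) ?? 0) + 1);
    degree.set(link.to, (degree.get(link.to) ?? 0) + 1);
  }
  const keep = new Set(
    nodes.filter(n => (degree.get(n.id) ?? 0) > 1).map(n => n.id)
  );
  return {
    nodes: nodes.filter(n => keep.has(n.id)),
    links: links.filter(l => keep.has(l.from) && keep.has(l.to)),
  };
}
```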

KeyLines offers plenty of tools to help with this critical part of your graph data analysis. This video on managing chart clutter explains a few more.

5. Run a layout: ~100 nodes

By this point, your users should have a tiny subset of your original graph data in their chart. The final step is to help them uncover insight. Automated graph layouts are great for this.

A good force-directed layout goes beyond simply detangling links. It should also reveal the patterns, anomalies and clusters that direct users towards the answers they’re looking for.
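For a feel of what a force-directed layout does under the hood, here’s a small head-less sketch using d3-force as a stand-in for KeyLines’ own layout engine – in a KeyLines app you’d call its layout API instead. The node and link shapes are assumptions for the example.

```typescript
import { forceSimulation, forceLink, forceManyBody, forceCenter } from 'd3-force';
import type { SimulationNodeDatum } from 'd3-force';

interface LayoutNode extends SimulationNodeDatum {
  id: string;
}

interface LayoutLink {
  source: string;
  target: string;
}

// Run a basic force-directed layout without rendering anything,
// then read back the computed x/y position of every node.
function runLayout(nodes: LayoutNode[], links: LayoutLink[]) {
  const simulation = forceSimulation(nodes)
    .force('link', forceLink<LayoutNode, LayoutLink>(links).id(n => n.id))
    .force('charge', forceManyBody().strength(-50)) // nodes repel each other
    .force('center', forceCenter(0, 0))             // keep the layout centred
    .stop();                                        // tick manually, no animation

  for (let i = 0; i < 300; i++) simulation.tick();

  return nodes.map(n => ({ id: n.id, x: n.x, y: n.y }));
}
```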

KeyLines’ organic layout. By spreading nodes and links apart in a distinctive fan-like pattern, it makes the underlying structure much clearer.

With an effective, consistent and powerful graph layout, your users will find that answers start to jump out of the chart.

Bonus tip: talk to your users

This blog post is really just a starting point. There are plenty of other tips and techniques to help you solve the big graph data challenge (we’ve not even started on temporal analysis or geospatial visualization).

But probably the most important tip of all is this: take time to talk to your users.

Find out what data they need to see and the questions they’re trying to answer. Use the data funnel to make that process as simple and fast as possible, and use the power of KeyLines to turn the biggest graph datasets into something genuinely insightful.

Request a trial here to get started with the KeyLines toolkit.
