Graph visualization: fixing data hairballs

This is the first of three blog posts on common graph visualization challenges.

Part 2 looks at ‘the snowstorm’, and part 3 explores ‘the starburst’, but first we’ll focus on ‘the hairball’ – a problem that affects many large datasets.

A familiar problem with graph visualization

A typical venture into the world of graph visualization goes like this…

You’ve invested in the latest and greatest big data technology stack. You’ve curated a fantastic data source. You’re absolutely convinced that within this data lie insights that’ll give your end users ultimate power. All they need to do is load it into a visualization platform to reveal beautiful interconnected structures that blur the boundaries between data science and art.

This Protein Homology Graph, by Edward Marcotte and Alex Adai, from New York’s Museum of Modern Art (MOMA) 2008 exhibit

So you evaluate a technology like our powerful graph visualization SDKs, load your data, and wait with bated breath for the results.

And when the chart loads, there’s a strong chance you’ll see one of three usual suspects:

  1. The hairball – showing connections that are so dense, they can’t be usefully visualized.
  2. The snowstorm – packed with many small, separate components where nothing stands out.
  3. The starburst – where almost every connection is between a single central node and every other node.

The three usual suspects: the hairball, the snowstorm and the starburst

None of these will help your users uncover threats or find insight. The difference between success and failure hinges on what you do next.

Do you accept that this is the nature of your data, advise your users to boost their hardware and leave them to it? Or do you spend hours trying to understand the root causes of the problems in your data and designing a visual investigation tool that really works? Why do these beasts appear? What aspects of the underlying data give rise to them? And what can you do about it when the underlying data is not yours to control?

We’ll look at snowstorms and starbursts in future blog posts. Now, we’ll focus on the infamous hairball.

If you see a tangled mass of nodes and links that are impossible to analyze, you’ve got a hairball

Where do hairballs come from?

Ironically, the desire to connect information – the heart of all things graph – is ultimately responsible for these issues. Let’s see how the hairball problem builds up.

We’ll take an example from the world of vehicle insurance claims fraud, and imagine building a knowledge graph from scratch. We start with a database of people – our insurance policyholders, named drivers, witnesses, etc.

As you can see, visualizing the node/link structure is a little uninspiring.

This visualization shows plenty of people but no connections

Now some of these people own insurance policies, so the next step is to add those policies to our knowledge graph.

Adding policies gives basic structure to the data

This view tells us something. We’ve done a little bit of custom styling of our graph by color-coding the different node types. It’s easy to spot which people have policies – but it’s hardly justification for a knowledge graph project, so let’s push on.

People have phone numbers and addresses; policies cover vehicles and have claims logged against them. Let’s see what happens when we add those details.

Connected components emerge from the chart once we add further details

Notice how our automatic organic graph layout makes it easier to recognize structures. But our quest for knowledge continues.

Next, we add records for the types of damage claimed by policyholders, and the details of the mechanics who fixed those damages. The visualization starts to take shape.

We can see definite subnetworks and well-connected nodes

And finally we add nodes representing the country where claimants live, or where their vehicles are registered.

And this is where it all goes wrong.

Hairball alert: connections are so dense they can’t be usefully visualized

You can see the problem. This dataset is exactly what you want from an underlying knowledge graph. It’s rich and well connected, and it answers questions like “Are there patterns of insurance fraud that vary by country?” and “How far on average do people travel to have their cars fixed?” But as a graph visualization, the result is close to useless. Quick and easy analysis is impossible.

We have a hairball.
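
To see why that last step tips the chart over, it can help to count connected components before and after the country nodes go in. Here’s a minimal sketch in plain Python. The edge list, the node names and the `count_components` helper are all made up for illustration; they are not taken from the dataset shown above or from any toolkit API.

```python
from collections import defaultdict


def count_components(edges):
    """Count connected components in an undirected graph given as (a, b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        # Each unseen start node opens a new component; walk all of it.
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


# Hypothetical sample: three separate person -> policy -> claim chains.
base_edges = [
    ("person:ann", "policy:1"), ("policy:1", "claim:1"),
    ("person:bob", "policy:2"), ("policy:2", "claim:2"),
    ("person:cat", "policy:3"), ("policy:3", "claim:3"),
]

# Hypothetical assumption: every person lives in the same country.
country_edges = [
    ("person:ann", "country:uk"),
    ("person:bob", "country:uk"),
    ("person:cat", "country:uk"),
]

before = count_components(base_edges)
after = count_components(base_edges + country_edges)
print(before, after)
```

In this toy example, a single shared country node is enough to merge every separate chain into one component. With real data, a handful of such hub nodes can pull most of the chart into a single dense mass.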

The solution: focus on the workflow, not the data model

So, of all the insurance claim charts above, which one do you think is the best graph visualization?

It’s a trick question of course, although most people will pick the one with the most detail just before the hairball appears.

The smart response is: “None of these visualizations is useful, because you haven’t told me what the end user is actually trying to do.”

There are many tricks to removing hairballs, including graph filtering and node and link aggregation, but I’d recommend that you don’t try to literally visualize everything in your underlying knowledge graph. Instead, start working backwards from the job your end users need to get done.
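
For reference, the two tricks mentioned above could look roughly like this in plain Python. This is only a sketch under assumptions: node ids carry a type prefix such as `person:`, and the helper names `filter_edges` and `aggregate_edges` are hypothetical rather than part of any visualization toolkit.

```python
from collections import Counter


def node_type(node_id):
    """Return the type prefix of an id such as 'person:ann' (assumed id scheme)."""
    return node_id.split(":", 1)[0]


def filter_edges(edges, hidden_types):
    """Filtering: drop every link that touches a node of a hidden type."""
    return [
        (a, b) for a, b in edges
        if node_type(a) not in hidden_types and node_type(b) not in hidden_types
    ]


def aggregate_edges(edges, group_of):
    """Aggregation: merge nodes into groups and count the links between groups."""
    counts = Counter()
    for a, b in edges:
        group_a, group_b = group_of(a), group_of(b)
        if group_a != group_b:
            # Sort the pair so (x, y) and (y, x) land in the same bucket.
            counts[tuple(sorted((group_a, group_b)))] += 1
    return dict(counts)


# Hypothetical sample data.
edges = [
    ("person:ann", "policy:1"),
    ("person:bob", "policy:2"),
    ("person:ann", "country:uk"),
    ("person:bob", "country:uk"),
    ("policy:1", "claim:1"),
]

filtered = filter_edges(edges, hidden_types={"country"})
summary = aggregate_edges(filtered, group_of=node_type)
print(filtered)
print(summary)
```

Both helpers shrink the chart, but neither knows what the user is trying to achieve, which is why working backwards from the task tends to give better results.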

In this use case, our goal is to identify suspicious individuals with unusual levels of connectivity. All we really care about is the people nodes – everything else is metadata that helps to single out some people over others.

Let’s see what happens if we create a new visual representation that is derived from our raw graph as follows:

  • Represent people as nodes.
  • Put links between two people who are connected via a path that includes an insurance claim (graph databases are really good at this kind of query).
  • Size the people nodes based on their betweenness centrality – one of our powerful social network analysis centrality algorithms.

A customized user experience powered by advanced graph algorithms reveals the insight users need

The result looks much nicer, and we can immediately spot a couple of interesting individuals who would have been impossible to see inside that original hairball. The user can now do their job by focusing on those people and making a decision about whether to investigate further.
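
The three steps listed above can be sketched in plain Python. Treat this as an illustration under stated assumptions, not as the code behind the chart: the sample data and the `person:` and `claim:` id prefixes are made up, “connected via a path that includes an insurance claim” is interpreted here as “both people are within `max_hops` of the same claim, without passing through another person”, and betweenness is computed with a textbook Brandes algorithm for unweighted graphs rather than any toolkit function.

```python
from collections import defaultdict, deque
from itertools import combinations


def project_people(edges, max_hops=2):
    """Link two people when both sit within max_hops of the same claim node."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    people = {n for n in adjacency if n.startswith("person:")}
    links = {p: set() for p in people}
    for claim in [n for n in adjacency if n.startswith("claim:")]:
        reached, seen = set(), {claim}
        frontier = deque([(claim, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if node in people:
                reached.add(node)  # people are endpoints: do not walk through them
                continue
            if hops == max_hops:
                continue
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, hops + 1))
        for a, b in combinations(sorted(reached), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def betweenness(adjacency):
    """Brandes betweenness centrality for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []
        preds = {v: [] for v in adjacency}
        sigma = dict.fromkeys(adjacency, 0)   # number of shortest paths from source
        dist = dict.fromkeys(adjacency, -1)   # -1 marks "not reached yet"
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adjacency, 0.0)
        while order:
            w = order.pop()  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical raw graph: Bob is a witness on Ann's claim and the
# policyholder behind a claim that Cat is also attached to.
raw_edges = [
    ("person:ann", "policy:1"), ("policy:1", "claim:1"), ("person:bob", "claim:1"),
    ("person:bob", "policy:2"), ("policy:2", "claim:2"), ("person:cat", "claim:2"),
]

people_links = project_people(raw_edges)
scores = betweenness(people_links)
print(scores)
```

In this toy data, Bob is the only person who sits between two others, so he gets the highest score and would be drawn as the largest node. Note that `max_hops` is a tuning choice: set it too high and the derived people-to-people chart can turn back into a hairball.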

By remodeling the data in this way, we’ve made full use of our efforts to build a tightly connected graph database. We wouldn’t have known all of those links existed without first seeing the original hairball, but by paring back the data, we’ve given end users something they can actually use.

In other words, hairballs in your knowledge graph are a good thing. Just don’t let them anywhere near your UI.

Try it for yourself

If you have the data, we have the visualization toolkits that can bring it to life.

For more information, take a look at our detailed white papers, case studies and webinars.

Or if you’re ready to get started, request a free evaluation of our powerful KeyLines and ReGraph toolkits.

Registered in England and Wales with Company Number 07625370 | VAT Number 113 1740 61
6-8 Hills Road, Cambridge, CB2 1JP. All material © Cambridge Intelligence 2024.