This is the first of three blog posts from Head of Product Management, Dan Williams, exploring the typical challenges you face when visualizing graph data. Here he focuses on how to avoid ‘the hairball’. It’s a problem that affects many large datasets, particularly knowledge graph visualizations.
A familiar problem with graph visualization
A typical venture into the world of graph visualization goes like this…
You’ve invested in the latest and greatest big data technology stack. You’ve curated a fantastic data source. You’re absolutely convinced that within this data lie insights that’ll give your end users ultimate power. All they need to do is load it into a visualization platform to reveal beautiful interconnected structures that blur the boundaries between data science and art.
But when the chart loads, there’s a strong chance you’ll see one of three usual suspects:
- The hairball – showing connections that are so dense, they can’t be usefully visualized.
- The snowstorm – packed with many small, separate components where nothing stands out.
- The starburst – where almost every link joins a single central node to one of the other nodes.
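One rough way to tell which of the three you are looking at is to compute a few whole-graph statistics before drawing anything. The sketch below is an illustration only and is not taken from any toolkit mentioned in this post. The function name `shape_stats` and the three statistics it reports are assumptions chosen for this example. It deliberately sets no thresholds, since sensible cut-offs depend on your data.

```python
def shape_stats(nodes, links):
    """Return density, component count and hub share for an undirected graph."""
    # Build an adjacency map, ignoring self-links and duplicate links.
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)

    node_count = len(adjacency)
    link_count = sum(len(nbrs) for nbrs in adjacency.values()) // 2

    # Density: share of all possible links that exist (high hints at a hairball).
    possible = node_count * (node_count - 1) / 2
    density = link_count / possible if possible else 0.0

    # Component count: many small components hint at a snowstorm.
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

    # Hub share: share of links touching the best-connected node
    # (close to 1 hints at a starburst).
    max_degree = max((len(nbrs) for nbrs in adjacency.values()), default=0)
    hub_share = max_degree / link_count if link_count else 0.0

    return {"density": density, "components": components, "hub_share": hub_share}


# A six-node star: every link touches node "a".
star = shape_stats(list("abcdef"), [("a", x) for x in "bcdef"])
# Three disconnected pairs.
flakes = shape_stats(list("abcdef"), [("a", "b"), ("c", "d"), ("e", "f")])
```

Treat the numbers as hints to investigate rather than a verdict. A chart can be a mix of all three shapes.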
None of these will help your users uncover threats or find insight. The difference between success and failure hinges on what you do next.
Do you accept that this is the nature of your data, advise your users to boost their hardware and leave them to it? Or do you spend hours trying to understand the root causes of the problems in your data and designing a visual investigation tool that really works? Why do these beasts appear? What aspects of the underlying data give rise to them? And what can you do about it when the underlying data is not yours to control?
We’ll look at snowstorms and starbursts in future blog posts. Here, we’ll focus on the infamous hairball.
Where do hairballs come from?
It’s ironic that the desire to connect information, which is at the heart of all things graph, is ultimately responsible for these issues. Let’s see how the hairball problem builds up.
We’ll take an example from the world of vehicle insurance claims fraud, and imagine building a knowledge graph from scratch. We start with a database of people – our insurance policyholders, named drivers, witnesses, etc.
As you can see, visualizing the node/link structure is a little uninspiring.
Now some of these people own insurance policies, so the next step is to add those policies to our knowledge graph.
This view tells us something. We’ve done a little bit of custom styling by color-coding the different node types. It’s easy to spot which people have policies – but it’s hardly justification for a knowledge graph project, so let’s push on.
People have phone numbers and addresses; policies cover vehicles and have claims logged against them. Let’s see what happens when we add those details.
Notice how our automatic organic layout makes it easier to recognize structures. But our quest for knowledge continues.
Next, we add records for the types of damage claimed by policyholders, and the details of the mechanics who fixed those damages. The visualization starts to take shape.
And finally we add nodes representing the country where claimants live, or where their vehicles are registered.
And this is where it all goes wrong.
You can see the problem. This dataset is exactly what you want from an underlying knowledge graph. It’s rich and well connected, and it answers questions like “Are there patterns of insurance fraud that vary by country?” and “How far on average do people travel to have their cars fixed?”. But as a graph visualization, the result is close to useless. Quick and easy analysis is impossible.
We have a hairball.
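If you want to check which node types are pulling a chart together, one option is to summarize node degree by type. This is a minimal sketch under assumptions: the post does not say why the country step tips the chart over, and the idea that a few country nodes each link to a large share of the graph is our guess. The function name `degree_by_type` and the tiny sample data are hypothetical.

```python
from collections import Counter, defaultdict


def degree_by_type(node_type, links):
    """Summarize how many links nodes of each type carry.

    node_type maps a node id to its type, e.g. {"alice": "person"}.
    links is a list of (node_id, node_id) pairs.
    """
    degree = Counter()
    for a, b in links:
        degree[a] += 1
        degree[b] += 1

    degrees_for = defaultdict(list)
    for node, kind in node_type.items():
        degrees_for[kind].append(degree[node])  # 0 for unlinked nodes

    return {
        kind: {
            "count": len(values),
            "max_degree": max(values),
            "mean_degree": sum(values) / len(values),
        }
        for kind, values in degrees_for.items()
    }


# Hypothetical sample: four people, all living in the same country.
sample_types = {
    "alice": "person", "bob": "person", "carol": "person", "dave": "person",
    "UK": "country",
}
sample_links = [
    ("alice", "UK"), ("bob", "UK"), ("carol", "UK"), ("dave", "UK"),
    ("alice", "bob"),
]
summary = degree_by_type(sample_types, sample_links)
```

A type with very few nodes but a very high maximum degree is a candidate for being shown as a property or a filter rather than as a node.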
The solution: focus on the workflow, not the data model
So, of all the insurance claim charts above, which one do you think is the best graph visualization?
It’s a trick question of course, although most people will pick the one with the most detail just before the hairball appears.
The smart response is: “None of these visualizations is useful, because you haven’t told me what the end user is actually trying to do.”
There are many tricks to removing hairballs, including filtering and aggregation, but I’d recommend that you don’t try to literally visualize everything in your underlying knowledge graph. Instead, start working backwards from the job your end users need to get done.
In this use case, our goal is to identify suspicious individuals with unusual levels of connectivity. All we really care about is the people nodes – everything else is metadata that helps to single out some people over others.
Let’s see what happens if we create a new visual representation which is derived from our raw graph as follows:
- Represent people as nodes.
- Put links between two people who are connected via a path that includes an insurance claim (graph databases are really good at this kind of query).
- Size the people nodes based on their betweenness centrality – one of our social network analysis algorithms, which scores a node by how often it lies on the shortest paths between other nodes.
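The three steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the query or the toolkit call used to produce the chart. Here "connected via a path that includes an insurance claim" is simplified to "both people are within `max_hops` links of the same claim, without passing through another person". The node type names, `max_hops`, and both function names are hypothetical. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs.

```python
from collections import deque
from itertools import combinations


def project_people(node_type, links, max_hops=2):
    """Build a person-to-person adjacency map from the raw graph."""
    raw = {node: set() for node in node_type}
    for a, b in links:
        raw[a].add(b)
        raw[b].add(a)

    people = {p: set() for p, kind in node_type.items() if kind == "person"}
    for claim, kind in node_type.items():
        if kind != "claim":
            continue
        # Breadth-first search outward from the claim, stopping at people.
        found, seen, queue = set(), {claim}, deque([(claim, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for neighbor in raw[node]:
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                if node_type[neighbor] == "person":
                    found.add(neighbor)  # a path ends at a person
                else:
                    queue.append((neighbor, depth + 1))
        for a, b in combinations(found, 2):
            people[a].add(b)
            people[b].add(a)
    return people


def betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    score = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        order = []
        preds = {v: [] for v in adjacency}
        paths = dict.fromkeys(adjacency, 0)
        dist = dict.fromkeys(adjacency, -1)
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted from both ends.
    return {v: total / 2 for v, total in score.items()}


# Hypothetical sample data: three policies, three claims, one shared country.
types = {"alice": "person", "bob": "person", "carol": "person", "dave": "person",
         "p1": "policy", "p2": "policy", "p3": "policy",
         "c1": "claim", "c2": "claim", "c3": "claim", "UK": "country"}
raw_links = [("alice", "p1"), ("p1", "c1"), ("bob", "c1"),
             ("bob", "p2"), ("p2", "c2"), ("carol", "c2"),
             ("carol", "p3"), ("p3", "c3"), ("dave", "c3"),
             ("alice", "UK"), ("bob", "UK"), ("carol", "UK"), ("dave", "UK")]
people_graph = project_people(types, raw_links)
scores = betweenness(people_graph)
```

In this sample the country node links all four people in the raw graph, but it never appears in the projection, because the search starts at claims and stops at the first person it reaches. The scores can then be mapped to node sizes in whatever way your charting layer expects.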
The result looks much nicer, and we can immediately spot a couple of interesting individuals who would have been impossible to see inside that original hairball. The user can now do their job by focusing on those people and making a decision about whether to investigate further.
By remodeling the data in this way, we’ve made full use of our efforts to build a tightly connected graph database. We wouldn’t have known all of those links existed without first seeing the original hairball, but by paring back the data, we’ve given end users something they can actually use.
In other words, hairballs in your knowledge graph are a good thing. Just don’t let them anywhere near your UI.
Try it for yourself
If you have the data, we have the visualization toolkits that can bring it to life.
For more information, take a look at our detailed downloadable resources.
Or if you’re ready to get started, request a free evaluation of our powerful KeyLines and ReGraph toolkits.