In this final post in our three-parter exploring the typical challenges you face when visualizing your graph data for the first time. In part one, we looked at ‘the hairball‘, and covered ‘the snowstorm‘ in part 2.
This time, we look at ‘the starburst’ – a graph visualization pattern where one heavily connected node dominates the chart.
Where do starbursts come from?
In previous posts, we saw how the underlying data source plays a significant role in determining whether you see a chart dominated by snowstorms or hairballs. But in both those cases, the main focus was on visualizing a dataset in its entirety.
When working with large datasets, it’s more common for users to progressively reveal data. They do this by expanding in additional information from the backend on demand. A popular technique is to expand outwards from a node of interest, and ask the backend to bring in the neighbors of that node.
Since there’s no way to predict what a user is going to double-click on, we risk creating starbursts – sudden and unexpected explosions of data radiating from a node.
Remodeling central characters
Sometimes the nature of your data queries make it easier to predict and mitigate a starburst.
In our TrumpWorld dataset visualization we investigated organizations and individuals connected to the Trump administration. Obviously, Donald himself would be a ‘supernode’ in this network.
Ironic as it seems, to find insight in his connections, it’s better to remove him from the dataset altogether. Instead, we used a simple glyph or decoration – one of many effective visual styling options – to highlight nodes directly connected to Trump, rather than show him as a central starburst.
This kind of data remodeling plays an important role in delivering a great user experience from a visualized dataset. It sometimes makes sense to do the opposite of what seems natural.
Let’s take another example. If you want to view email activity as a graph, you’d probably start by creating nodes representing email accounts and links for emails. We did this to great effect in our visualization of the Enron emails released in the wake of that company’s collapse.
But now consider this Kaggle dataset containing a release by the State Department of around 7000 heavily redacted emails sent or received by Hillary Clinton. Almost all of these 7000 emails will connect to the Hillary node – a totally avoidable starburst. A much more effective network model here might be to remove Hillary as a node and instead link two email accounts if they both appear cc’d in the same email thread.
Giving users the right choice
Another way to prevent starbursts is by limiting the number of nodes brought in from the backend. It’s a simple technique but it does have drawbacks. You have to decide carefully which nodes you do and don’t bring in. Visualizing only part of the data can result in a misleading chart, which may cause analysts to draw incorrect conclusions.
If you’re going to throttle data at the point of request, here are some tips.
Be transparent & give users control
If you’re clear upfront, there shouldn’t be any big surprises. A message that says “This will bring in 7329 nodes – are you sure?” is better than just silently bringing in the first 50 nodes and not explaining where the rest are. Better still, let the user choose whether they want to import all nodes, some nodes or none at all.
If you have many different node types, it may be better to bring in data one type at a time.
Filtering is a powerful way to reduce noise and direct focus. For example, expanding a node representing a company might bring in 2000 employee nodes, five office location nodes and seven subsidiary company nodes. If your user is only interested in company-to-company relationships, let them choose only to expand in the company nodes and ignore the rest.
Use backend network analysis
When deciding which nodes to bring in, you can learn a lot from the underlying network. Some nodes are more interesting than others, and if you’re going to limit starbursts by choosing a subset of nodes to import, you definitely want the interesting ones.
Social network analysis centrality measures can help here. These graph algorithms identify important nodes in a network. The best measures include:
- Degree – finds nodes with the highest (and lowest) number of links. Nodes with only one neighbor (leaf nodes) can often cause starbursts and are possibly the least useful. More on this in the next section, but a simple graph query can identify a leaf node in the backend and prevent it from being added to the chart.
- Betweenness centrality – nodes with a high betweenness score are the ‘gatekeeper’ nodes that most frequently act as bridges between other nodes. Your users will want to see these.
If your graph database backend comes with the ability to calculate centrality scores and other social network measures, your backend analysis task is easy. If you don’t, KeyLines and ReGraph come with a graph analysis engine which can run in a nodeJS server-side environment to score your nodes before importing them.
Sweeping up the leaf nodes
You can’t always remodel your data to remove starbursts, particularly if your visualization is littered with leaf nodes. They connect to the central hub but have no other neighbors in the dataset.
The best way to tackle these is using combos, our intuitive way to combine nodes. You can group the leaf nodes as combo nodes which don’t take up much screen space, but can be opened if needed to see what’s inside.
Here’s an example looking at a database of aviation incidents involving DC-9 and DC-10 aircraft. These are long-serving aircraft with a high number of historic incidents associated with them. Showing everything gives us a starburst of airline operators and locations, so I’ve grouped nodes by type into combos to keep things much simpler.
I can now see through the clutter that there are only three airlines who have had incidents involving both these aircraft types.
If all else fails and you’re unable to avoid the occasional starburst, layouts can help. Here are two of our favourites from KeyLines and ReGraph:
- organic layout helps reveal the global and local structure of a graph, which reduces the impact of starburst-prone data. It’s also fast, even for very large datasets.
- structural layout identifies leaf nodes and arranges them in tight circles away from their parent, so you can isolate them visually (and potentially allow your users to easily select and delete them).
- sequential layout can give you the best of both worlds if you have a hierarchy or level-based structure with some big starbursts. Use the stacking function to keep the starbursts tidy by putting leaf nodes into neat grids. Find out more about the stacking function