This is the second in a series of posts exploring the typical challenges you face when visualizing your graph data for the first time.
Here, we’ll look at the ‘snowstorm’ – a common pattern of disconnected entities that often results from attempting to visualize tables from a spreadsheet or a relational database.
In our previous post on hairballs, we looked at the tangle of overconnected data that results from visualizing a typical knowledge graph database without focusing on the user’s workflow.
Graph databases are hugely popular, but we know that many of our customers still rely on more traditional relational databases, often augmented with data imported directly from CSV files or spreadsheets. This kind of data suffers from a very different problem – it is typically underconnected rather than overconnected. Attempting to view it as a collection of nodes and links will lead you inevitably to a snowstorm.
Let’s look at a typical example. Here’s a small fragment of the ransomware tracker dataset from abuse.ch. It lists observations of the top three ransomware families from January to April 2016.
Each row of the spreadsheet corresponds to the identification of a particular ransomware botnet being controlled from a host site with an associated IP address. The dataset runs into the thousands of entries.
It’s a fairly standard table, and at first it’s not clear how you’d translate it into a graph. We might start by attempting to link each botnet controller host with its observed IP address. Putting this into a graph visualization engine such as KeyLines or ReGraph gives us our first snowstorm.
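To make that first step concrete, here is a minimal sketch of how rows from a table like this could be turned into nodes and links. It is not the KeyLines or ReGraph API: the `Observation` fields, the `rows_to_graph` helper and the sample rows are all hypothetical, and the hosts and addresses are reserved example values rather than entries from the tracker. "FamilyB" and "FamilyC" are placeholder family names.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One spreadsheet row (field names are assumptions for this sketch)."""
    family: str
    host: str
    ip: str


# Made-up rows using reserved example domains and documentation IP ranges.
ROWS = [
    Observation("Locky", "a.example.com", "192.0.2.10"),
    Observation("Locky", "b.example.com", "192.0.2.10"),
    Observation("FamilyB", "c.example.net", "198.51.100.7"),
    Observation("FamilyC", "d.example.org", "203.0.113.5"),
]


def rows_to_graph(rows):
    """Return (nodes, links): one node per distinct host and IP, one link per row."""
    nodes = {}
    links = []
    for row in rows:
        host_id = f"host:{row.host}"
        ip_id = f"ip:{row.ip}"
        # Re-assigning an existing key keeps a single node per host or IP.
        nodes[host_id] = {"type": "host", "label": row.host, "family": row.family}
        nodes[ip_id] = {"type": "ip", "label": row.ip}
        links.append((host_id, ip_id))
    return nodes, links


nodes, links = rows_to_graph(ROWS)
print(len(nodes), "nodes,", len(links), "links")
```

Most hosts end up with exactly one link, which is why a chart built this way scatters into separate pairs.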
The only patterns that stand out are the star-like snowflakes on the left-hand edge. These correspond to cases where more than one host website shares a common IP address. In effect, we’ve drawn a DNS lookup table as a picture, and there’s little of interest to see.
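The snowflakes can also be found without drawing anything. The sketch below groups hosts by IP address and keeps the addresses shared by two or more hosts. The link list, the helper names and the `min_hosts` threshold are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (host, ip) links; all values are reserved example names.
LINKS = [
    ("a.example.com", "192.0.2.10"),
    ("b.example.com", "192.0.2.10"),
    ("c.example.com", "192.0.2.10"),
    ("d.example.net", "198.51.100.7"),
    ("e.example.org", "203.0.113.5"),
]


def hosts_by_ip(links):
    """Map each IP address to the set of hosts observed on it."""
    grouped = defaultdict(set)
    for host, ip in links:
        grouped[ip].add(host)
    return grouped


def snowflakes(links, min_hosts=2):
    """Return the IPs shared by at least min_hosts hosts: the star shapes."""
    return {
        ip: sorted(hosts)
        for ip, hosts in hosts_by_ip(links).items()
        if len(hosts) >= min_hosts
    }


shared = snowflakes(LINKS)
isolated_pairs = len(hosts_by_ip(LINKS)) - len(shared)
print(shared)
print(isolated_pairs, "isolated host-IP pairs")
```

Everything that is not a snowflake is an isolated host-IP pair, and those pairs are what make up the bulk of the storm.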
We see this situation a lot. Companies and organizations are never short of data, but they tell us it’s not the right kind of data for graph visualization. All too often, people give up at the first sign of a snowstorm.
There are many solutions to the snowstorm problem, but we’ll focus on three of the most common: enrichment, aggregation and entity resolution.
To enrich a datasource, you need to combine it with information gathered from one or more secondary sources. It’s one of the principal tools in any investigative workflow.
In some cases, where organizations have big data aggregation projects, this enrichment may have been done on the backend in advance. Often it can be a manual or semi-manual process. In many cases the multiple data sources don’t come together until the analyst connects them in their user interface.
An investigative organization’s enrichment sources are among its most prized possessions. In the case of our botnet controllers, we’ll use an obvious open-source option: we’ll enrich our list of IP addresses with the countries they’re associated with.
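In code, this kind of enrichment is a lookup join. The sketch below assumes the secondary source has already been loaded into a plain dictionary. In practice it would come from a geolocation database, which is not shown here. The rows, the `XA` and `XB` placeholder country codes and the `enrich_with_country` helper are all invented for illustration.

```python
# Made-up observations using reserved documentation IP ranges.
ROWS = [
    {"family": "Locky", "host": "a.example.com", "ip": "192.0.2.10"},
    {"family": "Locky", "host": "b.example.com", "ip": "198.51.100.7"},
    {"family": "FamilyB", "host": "c.example.net", "ip": "203.0.113.5"},
]

# Hypothetical secondary source: IP address -> placeholder country code.
IP_TO_COUNTRY = {
    "192.0.2.10": "XA",
    "198.51.100.7": "XB",
}


def enrich_with_country(rows, ip_to_country, unknown="unknown"):
    """Return copies of the rows with a 'country' field added from the lookup."""
    return [
        {**row, "country": ip_to_country.get(row["ip"], unknown)}
        for row in rows
    ]


enriched = enrich_with_country(ROWS, IP_TO_COUNTRY)
for row in enriched:
    print(row["ip"], row["country"])
```

Addresses missing from the lookup are marked as unknown rather than dropped, so gaps in the secondary source stay visible to the analyst.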
We can visualize this in a number of ways. Let’s keep it simple and display the country next to the IP address using a glyph. This styling option offers a quick and easy way to communicate different characteristics and add decorations to items in your network.
It’s not bad, but it hasn’t helped our snowstorm. So let’s combine enrichment with our second technique – aggregation.
We use combos, the advanced grouping feature in our toolkits, to combine the host nodes by ransomware family, and the IP addresses by our newly enriched country information. We also size the combos based on the number of items they contain.
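The grouping itself can be sketched independently of any toolkit. The example below is not the combos API. It only shows the underlying bookkeeping: which hosts fall into each family group, which IP addresses fall into each country group, and how many observations connect each pair of groups. The rows, the placeholder country codes and the `aggregate` helper are assumptions.

```python
from collections import Counter

# Made-up, already-enriched rows (placeholder family and country values).
ROWS = [
    {"family": "Locky", "host": "a.example.com", "ip": "192.0.2.10", "country": "XA"},
    {"family": "Locky", "host": "b.example.com", "ip": "192.0.2.11", "country": "XA"},
    {"family": "Locky", "host": "c.example.com", "ip": "198.51.100.7", "country": "XB"},
    {"family": "FamilyB", "host": "d.example.net", "ip": "198.51.100.8", "country": "XB"},
]


def aggregate(rows):
    """Group hosts by family and IPs by country; count rows between each group pair."""
    family_members = {}
    country_members = {}
    link_weights = Counter()
    for row in rows:
        family_members.setdefault(row["family"], set()).add(row["host"])
        country_members.setdefault(row["country"], set()).add(row["ip"])
        link_weights[(row["family"], row["country"])] += 1
    # Group size = number of distinct items inside it, used to scale the group.
    family_sizes = {name: len(items) for name, items in family_members.items()}
    country_sizes = {name: len(items) for name, items in country_members.items()}
    return family_sizes, country_sizes, dict(link_weights)


family_sizes, country_sizes, group_links = aggregate(ROWS)
print(family_sizes)
print(country_sizes)
print(group_links)
```

Four host-IP pairs collapse into two family groups, two country groups and three links between them, which is the reduction that clears the storm.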
We now have a very different – and much more valuable – chart.
At a glance we can see the relationships between countries and ransomware families. We can see which countries had the most botnet controllers (US and Russia) and which countries have all three families in common (Spain, France and UK).
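The same two questions can be asked of the aggregated links directly. This sketch assumes a dictionary of (family, country) link counts like the one an aggregation step might produce. The counts and country codes are invented placeholders and do not reflect the real dataset.

```python
from collections import defaultdict

# Invented (family, country) -> number of controllers; placeholder values only.
GROUP_LINKS = {
    ("Locky", "XA"): 2, ("FamilyB", "XA"): 1, ("FamilyC", "XA"): 1,
    ("Locky", "XB"): 5, ("FamilyB", "XB"): 4,
    ("Locky", "XC"): 6,
}


def controllers_per_country(links):
    """Total controller count for each country across all families."""
    totals = defaultdict(int)
    for (_family, country), count in links.items():
        totals[country] += count
    return dict(totals)


def countries_with_all_families(links):
    """Countries linked to every family that appears in the data."""
    all_families = {family for family, _country in links}
    seen = defaultdict(set)
    for family, country in links:
        seen[country].add(family)
    return sorted(c for c, families in seen.items() if families == all_families)


totals = controllers_per_country(GROUP_LINKS)
print(max(totals, key=totals.get), "has the most controllers")
print(countries_with_all_families(GROUP_LINKS))
```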
If we need the underlying data back, no problem. We can open up a combo of interest and look inside. Here I’ve opened the Russia combo node, and by playing back the data sequence using our time bar feature we can watch how the Locky Ransomware botnet infection evolved.
When we used an IP address to aggregate those hosts into countries, we were lucky to have what the intelligence community sometimes calls a strong selector – IP addresses provide (fairly) unambiguous information about the point on the network where the software was located.
We’re not always that fortunate. Take a look at this excerpt from the US Federal Election Commission website, which lists political donations and only tracks the donor’s name and state. I’ve searched for a common name, John Williams, and filtered by California.
On its own this data is a recipe for the perfect snowstorm. Grouping by state would be an easy win, but it’s tempting to combine these contributions by Contributor Name, so we can see who the most prolific contributors to various organizations are.
A person’s name, though, is very much a soft selector. We have no reason to assume that the John Williams who donated $300 to Donald J Trump For President is the same John Williams who donated $25 to Bernie 2020. We don’t even know how many unique John Williamses this dataset contains.
There are also a lot of inconsistencies in this data: “John H Williams”, “Williams, John Douglas Mr” and so on. This kind of realistic ‘dirty’ data is one of the biggest contributors to the snowstorm problem.
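A common first step with dirty names is to normalize them into a comparison key. The sketch below is one simple way to do that, assuming names arrive either as "First Middle Last" or "Last, First Middle Title". The `TITLES` list and the `name_key` helper are hypothetical. The function deliberately throws away middle names, so matching keys mark candidates for the same person, not confirmed matches.

```python
# Hypothetical list of titles and suffixes to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "dr", "jr", "sr"}


def name_key(raw):
    """Reduce a contributor name to a (first, last) comparison key."""
    text = raw.lower().replace(".", "")
    if "," in text:
        # "Last, First Middle Title" -> move the surname to the end.
        last, _, rest = text.partition(",")
        tokens = rest.split() + last.split()
    else:
        tokens = text.split()
    tokens = [token for token in tokens if token not in TITLES]
    if not tokens:
        return ("", "")
    return (tokens[0], tokens[-1])


for variant in ["John H Williams", "Williams, John Douglas Mr", "JOHN WILLIAMS"]:
    print(variant, "->", name_key(variant))
```

All three variants reduce to the same key, which tidies the chart but also shows how easily a soft selector merges people who may be unrelated.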
The science (or art) of entity resolution – how we resolve these multiple fragments into the correct aggregations – is a fascinating one. If it sounds like a challenge, that’s because it is – and the prize for getting it right is huge.
If enriching the data isn’t an option, you’re left with educated guesswork. It’s here that techniques like machine learning can play an interesting role.
Take a look at some of the patterns in the FEC data. Some individuals make small numbers of large donations, while others make regular donations of the same amount at the same time of the month. On this basis, we might be able to form hypotheses about which contributors are really the same person and explore the results using the powerful combos feature of KeyLines and ReGraph.
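As a rough illustration of that kind of hypothesis, the sketch below uses a hand-written rule in place of a learned model. It assumes the records have already been narrowed to one candidate name, then looks for an amount that recurs on roughly the same day of the month. The donations, the thresholds and the `recurring_patterns` helper are all invented.

```python
from collections import defaultdict
from datetime import date

# Invented donations: (name as written, date, amount in USD).
DONATIONS = [
    ("John H Williams", date(2019, 1, 15), 25),
    ("JOHN WILLIAMS", date(2019, 2, 15), 25),
    ("John Williams", date(2019, 3, 14), 25),
    ("John Williams", date(2019, 6, 2), 300),
]


def recurring_patterns(donations, min_count=3, day_tolerance=2):
    """Find amounts donated at least min_count times on a similar day of the month.

    Returns {amount: sorted name variants}. Each entry is only a hypothesis that
    the variants belong to one person. Days that wrap around the end of a month
    (the 30th and the 1st, say) are not treated as close in this simple version.
    """
    by_amount = defaultdict(list)
    for name, when, amount in donations:
        by_amount[amount].append((name, when))
    patterns = {}
    for amount, items in by_amount.items():
        days = [when.day for _name, when in items]
        if len(items) >= min_count and max(days) - min(days) <= day_tolerance:
            patterns[amount] = sorted({name for name, _when in items})
    return patterns


patterns = recurring_patterns(DONATIONS)
print(patterns)
```

The three $25 donations in mid-month form one candidate group, while the single $300 donation stays on its own until more evidence turns up.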
But that’s for another blog…
To turn the connected data in your spreadsheets and relational databases into insightful and beautiful visualizations, simply request a free trial of KeyLines and ReGraph.