Network Clustering

11th January, 2016

In our last post, we took a closer look at EigenCentrality and Page Rank.

This time we look at another exciting new way to understand your connected data: Clustering!

A Cambridge Intelligence visualization showing community structures in company ownership data. Here, we have complemented the clustering algorithm with lens layout, producing a circular presentation.

Currently in beta, the clustering function can be used to identify communities in your networks. It has been carefully optimized to balance speed and quality, providing insight into potential community structures.

Let’s take a closer look…

What is clustering?

To understand clustering, we need to understand a network (or ‘graph’) concept called modularity.

Modularity is a way to measure how readily a network can be divided into sub-networks, which we call modules. A high modularity score means there are tightly connected modules, with relatively few links connecting the modules together. A low modularity score indicates the opposite – or a relatively even distribution of links between nodes in the network.

How do we calculate modularity?

In our products, we calculate network modularity as the fraction of the links whose ends fall inside a group, minus the expected fraction if links were distributed at random. This gives us a score between 0 and 1.

For example, if we imagine a network with 100 nodes and 200 links. If one cluster has 25 nodes and 100 links – i.e. a quarter of the nodes, but half of the links – the modularity would be ½ -¼ = ¼.

Our clustering algorithm works by finding network partitions that will minimize the modularity score. At the beginning of the algorithm, it takes each node as a cluster. We then run through every permutation by moving nodes into clusters, keeping the configuration if the modularity score increases.

The result is ‘optimal’ partitions for different numbers of modules:

Here we are steadily increasing the number of modules. With each change we see smaller clusters, but more of them. The ‘cluster factor’ can be adjusted by the user, giving them a simple way to explore network modularity.

Note: This works for both connected and disconnected graphs, and can also take link weightings into account.

Why is it useful?

Uncovering and understanding communities is a great way of gaining network insight.

Some useful use cases include:

  • Cyber security – studying clusters can help you to model how something can move through a network. For example, malicious software will propagate more quickly through a dense community, compared to a sparse one.
  • Security / Intelligence – law enforcement and security agencies often need to extrapolate insight about organizational structures from complex communications meta-data. Clustering algorithms make it much easier to find communities and start learning more about the organizations being investigated.
  • Anti-fraud – inevitably, organized gangs perform the most damaging and costly fraud. By looking at clusters of fraudulent activity, investigators can find and shut down fraud rings.

Of course, these three use cases are just a tiny fraction of the potential ways clustering can help you find insight in your complex connected data.

Request a free trial and learn more about our clustering capabilities.

Subscribe to our newsletter

Get occasional data visualization updates, stories and best practice tips by email