This beginner’s guide explains what makes graph data visualization useful – particularly when your data has many-to-many relationships that are hard to understand in tables. It covers how to model data as nodes and links, why interactivity (like query-and-expand, filtering, and grouping) matters, and how to use visual styling, analytics, and timelines to reveal patterns, highlight important entities, and avoid clutter. The goal is to help users explore connected data intuitively and uncover insights faster.
In this webinar you’ll learn how to:
- define and model a graph
- work with graph data
- follow graph data visualization best practice
Transcript
Corey Lanum: Hello. Welcome to the video that’s called A Beginner’s Guide to Graph Visualization. Today, we’re going to take things back to basics and really drill down into graphs, why they are useful, and how we visualize them. We first recorded this video in 2017, and this current one in 2022 is an update on that with some new technologies, some new concepts, and new techniques for how to visualize graphs.
My name is Corey Lanum. I’ve been doing this for about 20 years, working with graphs, and 20 years ago, graph ideas were a very niche academic subject, and I’ve been really excited over the course of the last 20 years or so to watch the explosion in popularity of graphs and graph concepts to a variety of different business domains.
And I’ve been excited to be a part of that, helping my customers explore the opportunity to use graphs and graph visualizations across their business, regardless of what that might actually be. I’ve written a book on the subject called Visualizing Graph Data, published in 2016. I’m going to use some of the examples from that book in the talk today, but they’ve been updated because now in 2022, we have access to a wider variety of technologies that we can utilize when visualizing graphs than we did in 2016.
So here’s what we’re going to talk about today. I do think it’s really important to drill down and talk about why graphs are useful. Why does it make sense to think of our data as a graph, and what benefits does that give us when we visualize them? We’re going to talk about modeling a little bit. So how do I take data that might be stored or thought of as columns and tables and rows and turn that into nodes and links and properties?
And then we’re going to talk about why we visualize graphs. So it may make sense to model and think of your data as a graph and to use it that way, but not necessarily to produce a visualization that you present to business users of that graph data. Then the bulk of the time that we’re going to spend is going to be on the concepts around graph visualization.
What sorts of techniques can we employ to allow our visualizations to improve the value to our end users who are looking at this? Then in the last two concepts, we’ll discuss what do we do with graphs that vary over time. How do we show changes in the graph between now and some point in the future?
Or what do we do with graphs that have time data embedded in the data itself as one of the properties of the nodes or links? And then we’ll talk about what do we do with graphs that have a location associated with it. If the nodes have actual position in space, how can that be displayed to the user in an intuitive way?
So we’ll start off with describing what a graph is, and the way I like to treat it is to think that graphs are a model of your data. It’s not your data; it’s just a model of the data in which the connections between the data are just as important as the data elements themselves. So if I have just a list of, say, cars and the VIN numbers associated with those cars, that’s not a useful graph because it’s a one-to-one correspondence.
Every car has a unique VIN number, and every VIN number is connected to a single car. So there’s no inherent connection in that tabular data, and so it doesn’t make for a useful graph. Where a graph is important is where we have connections between the data. Now, that doesn’t mean that you have to have all kinds of different data elements associated with your data and you have lots of different properties associated with those connections.
But it is the case that the connections have to make some sense and be somewhat useful to you. Now, just two columns in a single table is enough to make a simple graph, so it doesn’t have to be a complex data structure in order to think of it and gain value out of it as a graph. So in the example I have below, we have a list of people, and we have a list of phone numbers.
Now, because each person can have multiple phones, especially over the course of their lives, and each phone number can be used by multiple different people as it gets recycled, then this can turn itself into an actual useful graph. That many-to-many relationship is the thing that creates a graph and makes it valuable and useful to look at.
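As a rough illustration of the two-column case, here is a minimal Python sketch that turns a person/phone table into nodes and links. The names, phone numbers, and the `build_graph` helper are all made up for this example; they are not from the talk or from any product.

```python
from collections import defaultdict

# Hypothetical two-column table: each row pairs a person with a phone number.
rows = [
    ("Alice", "555-0101"),
    ("Alice", "555-0102"),
    ("Bob", "555-0102"),   # a shared or recycled number
    ("Carol", "555-0103"),
]

def build_graph(rows):
    """Return (nodes, links, adjacency) for a person/phone table."""
    nodes = set()
    links = []
    adjacency = defaultdict(set)
    for person, phone in rows:
        p, n = ("person", person), ("phone", phone)
        nodes.update([p, n])
        links.append((p, n))
        adjacency[p].add(n)
        adjacency[n].add(p)
    return nodes, links, adjacency

nodes, links, adjacency = build_graph(rows)

# People who share a phone number with Alice (two hops away in the graph).
alice = ("person", "Alice")
shared = {
    other
    for phone in adjacency[alice]
    for other in adjacency[phone]
    if other != alice
}
print(shared)
```

The many-to-many relationship is what makes the two-hop question ("who shares a number with Alice?") meaningful; with a one-to-one table such as cars and VINs, that lookup would never return anything.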
So if we look at the example down here at the bottom, we do have that many-to-many relationship. Now we’re looking at multiple tables, and this is the traditional relational database way of storing connected data. So we have a list of students with a key associated with each student, and then we have a list of classes over there on the right with a key associated with each class.
And then we have a separate table, which is sometimes called a link table, which shows which students are enrolled in which classes. And each student presumably is in multiple classes, and each class presumably has more than one student in it. And it’s that many-to-many type relationship that makes for a useful graph that you can see up there on the upper right.
Now, the table itself is storing those links in this case, and I can have additional properties associated with that enrollment, say the quarter in which they were enrolled or something else useful about their membership in that class. But it doesn’t have to be stored this way. You could have, as you saw in the phone example earlier, that relationship be apparent by the fact that two items are in the same row in a single table as well.
Now, the actual data elements themselves are called nodes. So in this case, both the students and the classes are nodes, and you can have properties associated with those, such as, say, the sex of the student, the birth date, things like that. You can have properties of the nodes for the classes, such as where they’re held or something like that.
And then you can have properties with the links, the quarter in which they were enrolled, as I mentioned earlier. So all three in this case, the students, the classes, and the enrollments, which is to say both nodes and links, can have properties that can be useful to display in a graph visualization and useful to store in your graph. Now I want to get to a key point here, which is that a graph model of your data is not the same as using a graph database.
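To make the student/class example concrete without any graph database, the sketch below builds nodes and links from three plain tables. The table contents, key formats, and the `to_graph` helper are assumptions for illustration only.

```python
# Hypothetical relational tables, held here as plain Python structures.
students = {
    1: {"name": "Ana", "birth_date": "2001-04-12"},
    2: {"name": "Ben", "birth_date": "2000-11-30"},
}
classes = {
    10: {"title": "Biology", "room": "B12"},
    20: {"title": "Calculus", "room": "A3"},
}
# Link table: (student key, class key, quarter of enrollment).
enrollments = [(1, 10, "Fall"), (1, 20, "Fall"), (2, 20, "Winter")]

def to_graph(students, classes, enrollments):
    """Turn two entity tables and one link table into nodes and links."""
    nodes = {}
    for key, props in students.items():
        nodes[f"student:{key}"] = {"type": "student", **props}
    for key, props in classes.items():
        nodes[f"class:{key}"] = {"type": "class", **props}
    # Each row of the link table becomes one link; extra columns become
    # link properties.
    links = [
        {"source": f"student:{s}", "target": f"class:{c}", "quarter": q}
        for s, c, q in enrollments
    ]
    return nodes, links

nodes, links = to_graph(students, classes, enrollments)
```

Prefixing the keys with the table name is one simple way to keep student 10 and class 10 from colliding once both live in the same node collection.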
There are many benefits to using graph databases because it stores the data natively in many cases in that graph format. But it is not necessary to use a graph database in order to think of your data as a graph, to model your data as a graph, and to visualize it that way. So this is just a list of some of the graph databases that I’ve encountered over the course of doing this for several years.
I’d say that the most popular that we see more recently are Amazon Neptune, which is popular because it’s embedded inside of AWS and quite easy to take advantage of. Neo4j has been around for a long time but is incredibly popular because it is quite easy to use and very easy to get started with, and I find Cypher to be an intuitive graph query language.
ArangoDB is a relatively new entry into this market, but we’ve seen use and interest in them start to grow because it is a multi-model database which can store all kinds of different types of data in different models. Now, this landscape of the graph databases is totally changing. When we recorded this in 2017, some of these items were here already, but many of them are new and some of them have dropped off the list as they’ve lost popularity.
So the other thing is why do we visualize graphs? So we’ve talked about when it makes sense to model your data as a graph and use data in that format. But when does it make sense to actually present the model of a data in a visual format to an end user? And it doesn’t always, actually.
So sometimes it is very valuable to produce a graph model of your data, but the actual visualization of those nodes and links, the nodes represented as dots on the screen and the links drawn as arrows or lines between them, is not going to produce any value for the end user. The key example I like to use here is a recommendation engine.
So in a recommendation engine, you know, if I go onto a retail website and I order some products, it’s very likely that it’s using graph technology and graph queries to say, “Go find me other customers who have ordered similar sorts of things and find out what other things they have bought from our site,” and present those to this user to say, “These are the other things that you might be interested in.”
Now, as an end user, I might find that really valuable. It’s recommending products to me that I may not even know that I wanted. But I don’t necessarily want to see the visualization of who I’m connected to because I ordered the same things, and in fact, the merchant doesn’t even really want to show me that because that could be proprietary data about what other customers are buying.
I just want to see the answers. Here are the products that are recommended based on some graph algorithm that’s running behind the scenes. So in some cases, it does make sense to show your work, to show the end user, this is how the data’s modeled, this is how we got there. But in other cases, perhaps it doesn’t.
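A recommendation query of the kind described here can be sketched in a few lines. This is a simplified co-purchase score under made-up data, not how any particular retailer does it; the `orders` data and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical order history: customer -> set of products bought.
orders = {
    "me": {"tent", "stove"},
    "c1": {"tent", "stove", "lantern"},
    "c2": {"tent", "sleeping bag", "lantern"},
    "c3": {"novel"},
}

def recommend(orders, customer, top_n=3):
    """Rank products bought by customers who share a purchase with `customer`."""
    mine = orders[customer]
    scores = Counter()
    for other, theirs in orders.items():
        if other == customer:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue  # no shared purchase, so no path in the graph
        for product in theirs - mine:
            scores[product] += overlap
    return [product for product, _ in scores.most_common(top_n)]

print(recommend(orders, "me"))
```

The function returns only product names. The other customers who link the shopper to those products never leave the function, which matches the point that the end user sees the answer and not the graph behind it.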
But it is a very intuitive way of understanding that graph model. It’s hard to think about graphs without sort of visually imagining dots and lines connecting those dots, even if that’s not something that you’re going to present to the end user. So in the example on the lower left here, you can see what the most well-connected items are just by modeling this data as a graph.
Those ones that have a lot of links are very central to the network. It’s very clear to pick those out. It’s very intuitive. And on the example over there on the right, we have graphs embedded over the top of a map, but it’s very clear who the hubs are, where the busiest airports are in the Western United States in this example, by looking at the pattern of flights originating from each one.
And also, it’s intuitive to people who don’t necessarily need to know what a graph is or this whole modeling business that we just talked about. They can see the lines and the dots on the page or on the screen and make intuitive sense of it. So now what we’re going to do is we’re going to talk about some techniques.
So things that make it useful to look at the data in this format, things that you can do when you’re designing your visualization to help produce some value, help your users make sense of what it is that they’re seeing, and make business decisions based on it. So the first one is that really, unless you are producing your graph or visualization, say, in a newspaper, where you’re putting ink on a page, and then you’re distributing it out, and it can’t change after that, you really do want to make interactive graphs.
And the more interactive, the better. The more the user can drill into the data that they’re looking at, zoom in, pan around, find the bits, the data elements that are of interest to them, the more valuable it’s going to be. And you don’t actually have to display all of the data to your end user at once.
I’m going to keep coming back to this point because I think it’s a really important point. So you can show the basic structure and overview of the graph, and then give them additional detail about the nodes or about the links when they click or hover or that sort of thing. So what I’m going to do now is I’m going to exit out from the visuals or from the slides and show you some examples.
So the first one we talked about is not showing all of the data all at once. So we have this concept that we’ve put into our products called query and expand, which is where instead of taking my entire dataset and showing that all to the end user all at once, because that gets unwieldy, we take some subsets of that.
We query our database, and then we show the results of that query to the end user, and they can then expand from there based on what they find interesting in the results of that query. In this example, we’ve tied one of our visualization products to the Twitter API. So we’re going to go off and issue a query to Twitter to say, “Find me all of the hashtags which use the following,” or, “All of the tweets which use the following hashtags.”
Now, I’m sitting here in Lowell, Massachusetts, so I’m going to do a search for anybody using the hashtag Lowell. And I see over the last, I don’t know, however long hour or so that this query runs for, we see the tweeters, the individual Twitter users who are using that hashtag, and we’re bringing those back onto the chart.
Now, the hashtags are nodes in this case, although we could easily do a different data model, and the Twitter users themselves, the accounts, are nodes as well. The link between them is the actual tweet that references that hashtag. So I’ll look here, and I’ll see, you know, the National Weather Service has a tweet associated with a severe thunderstorm warning, where they use that hashtag.
Well, let’s expand on that. Go off and find out, you know, who else or what other hashtags that Twitter user is using. And I’ve got, you know, Illinois weather and Indiana weather, for example. So it looks like I’m picking up not Lowell, Massachusetts, but probably some other Lowell, but that’s okay. So now we’ll say, “Okay, well, let’s see who else is tweeting about Indiana weather.”
Well, the National Weather Service Indianapolis and so on. So what this is doing is it’s allowing the user to understand what they’re looking at, see the results of that query which we entered into a user interface, see the results as a graph, and then expand. Say, “Show me who else. Show me what other tweets has this user issued.
Show me what other Twitter users are using that hashtag,” and so on. And it creates this kind of exploratory type interface where you can really get some good value out of doing that. Let me go back to the slides here really quickly. The next thing I want to talk about is filtering and grouping.
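The query-and-expand pattern shown in the demo can be sketched as two small functions over a data source. Everything below is a stand-in: the `TWEETS` list replaces a real API call, and the handles and hashtags are placeholders, not real query results.

```python
# Hypothetical backing data: one (account, hashtag) pair per tweet.
TWEETS = [
    ("@NWS", "#lowell"), ("@NWS", "#inwx"), ("@NWS", "#ilwx"),
    ("@NWSIndianapolis", "#inwx"), ("@LocalNews", "#lowell"),
    ("@Other", "#boston"),
]

def query(hashtag):
    """Initial query: links for tweets that use `hashtag`."""
    return {(a, h) for a, h in TWEETS if h == hashtag}

def expand(node):
    """Expansion: every link touching `node`, whether account or hashtag."""
    return {(a, h) for a, h in TWEETS if node in (a, h)}

chart = query("#lowell")   # start with a small result set
chart |= expand("@NWS")    # the user expands one account
chart |= expand("#inwx")   # then one of the newly revealed hashtags
```

The chart only ever holds what the user asked for; the unrelated `#boston` tweet never reaches the screen because nothing the user clicked leads to it.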
Filtering is a really key point. So the query and expand is really helpful because it allows the user to decide what subset of the data they want to see, so we’re not overwhelming them with visual clutter. But even if we do that, sometimes it makes sense to allow the user themselves to filter down what they’re looking at so that they’re not seeing the whole data set all at once.
It’s kind of the opposite of the query and expand. You show them a broader set, and then they remove the bits that are not of interest to them. And one of the really interesting things about filtering is it allows you to play these kind of what-if games. So what if we remove these nodes from the chart?
How does the rest of the algorithm adapt, or how does the rest of the visualization adapt to the removal of those items? So in this example, we’re looking at a mafia family in Sicily, or several mafia families in Sicily. So we’ve got the families represented, the individuals represented by the nodes.
They’re color-coded by which family they are a member of. And then we’ve got some unaffiliated nodes here too, and the links are showing the relationships between those nodes, so the relationship between the people in those families. Now we can remove some of the nodes, we can filter them out from the chart and see what happens.
We’ll take the three largest nodes and pull them out, and you actually get to watch the chart adapt. So you can see, okay, well, what does this network look like, and how might the connections, how might the structure of it change if items are removed? Or we can remove not just individual nodes, but entire subsets of them.
So if I want to take out, for example, people who are not members of one of those families, then I can do that as well and just look at the resulting chart, which is very simplified, and make decisions based on that too. So the filtering can be a really powerful way of working through larger data sets.
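Both kinds of filtering described above, removing specific nodes and removing a whole category, reduce to the same operation: keep the nodes that pass a test and keep only the links between them. A minimal sketch, with made-up node names and a hypothetical `filter_graph` helper:

```python
# Hypothetical network: node -> family (None means unaffiliated), plus links.
family = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": None}
links = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("c", "d"), ("d", "e")]

def filter_graph(family, links, keep):
    """Keep nodes where keep(node) is true, and links between kept nodes."""
    kept = {n for n in family if keep(n)}
    kept_links = [(s, t) for s, t in links if s in kept and t in kept]
    return kept, kept_links

def degree(links):
    """Number of links attached to each node."""
    counts = {}
    for s, t in links:
        counts[s] = counts.get(s, 0) + 1
        counts[t] = counts.get(t, 0) + 1
    return counts

# What if the best-connected node is removed?
deg = degree(links)
hub = max(deg, key=deg.get)
without_hub = filter_graph(family, links, lambda n: n != hub)

# Or remove everyone who is not a member of a family.
members_only = filter_graph(family, links, lambda n: family[n] is not None)
```

In an interactive chart the filtered node and link sets would then be handed back to the layout so the user can watch the structure rearrange.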
And the next thing I want to look at is combinations. This is kind of the grouping capability that allows me to take subsets of the data, group them together under a single node, which represents that entire group, and then look at how the groups themselves are connected to one another. This can be a really powerful way of taking what can be a very busy chart and showing a much simplified view of it, but then allowing the user to drill down and see the detail where they need to.
So as an example of that, I’ve got a chart that’s got a number of members of Al-Qaeda, and the connections between them, the links between them, are showing when those people have communicated with one another. And then we’re using a couple of visual details here, so things like the flag of the country that that person lives in to show them on the chart.
So this group of people live in Afghanistan, this group of people are in Spain, and so on. That’s called a glyph in our terminology, and it’s just a way of ornamenting a node to show some additional property of that node right there on the surface of the chart. Now, the countries also give us a way to group nodes together. Grouping can be by any common property, and that is the way we most often encounter grouping and the way it’s the most useful.
So everybody who lives in a common country, in this case, we’re going to group them together and show how the countries themselves are connected to one another. This allows us to step back from the individuals and take more of a geopolitical view. How are the countries connected through their Al-Qaeda networks?
And we’ve also sized the node to show a rough idea of the number of individuals inside of that group. The key thing here is that this is a very simplified or summarized view of the data. We’ve taken what was originally two hundred and twenty nodes, boiled it down to about fifteen, and the end user can drill into any one of those countries and see the members of Al-Qaeda who live in that country.
So I’ve got a subset of individuals in France here. I’m looking at both how they’re connected within the organization, but then externally to other countries on the globe. So that can be a really useful way of helping end users understand what they’re looking at and then get down to the detail of what it is that they want to see.
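Grouping by a shared property can be sketched as collapsing each node into its group and counting the links that cross group boundaries. The people, country codes, and the `combine` function below are illustrative assumptions, not the product's API.

```python
from collections import Counter

# Hypothetical people with a country property, and who-contacted-whom links.
country = {"p1": "FR", "p2": "FR", "p3": "ES", "p4": "ES", "p5": "DE"}
links = [("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p4", "p5")]

def combine(group_of, links):
    """Collapse nodes into groups; return group sizes and cross-group link counts."""
    sizes = Counter(group_of.values())
    group_links = Counter()
    for s, t in links:
        gs, gt = group_of[s], group_of[t]
        if gs != gt:
            # Links inside a group stay hidden until the user drills in.
            group_links[tuple(sorted((gs, gt)))] += 1
    return sizes, group_links

sizes, group_links = combine(country, links)
```

Here `sizes` could drive the size of each group node, and the counts in `group_links` could drive the width of the links between groups.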
Now, it is possible to take these groupings as many levels deep as you want really, but I found that once you go beyond, say, two or maybe three levels, the user kind of loses their place and can’t really understand at what level am I looking at this? A key example of that is, say, an IP network, where you’ve got the layer-one physical connections between actual devices which have cables strung between them all the way up to the application layer, which applications are talking to one another independent of the underlying network architecture or infrastructure that’s in place.
And you can zoom in or out to look at the network or the structure of the items at whatever layer is appropriate. So in this case, maybe just to showcase how it might make sense, we can take those country nodes themselves and group them together by continent. So now we’re looking at the continents and how they’re connected.
We’ve taken that really busy chart that we looked at at the beginning, boiled it down to just seven nodes, but we can drill down like we did before. So Europe will give us all of the countries in Europe, and say, Germany will give us the individual members of Al-Qaeda in Germany and where else they’re connected to.
So the next thing we want to do is talk about how do I take visual properties of my graph, make them useful to the end user without overwhelming them with too much detail or too much visual clutter. So we’ve talked about what do I do when I have large amounts of data to make it easier for the user to understand what they’re looking at.
But the next thing is how do I make what they’re actually looking at, the representation of those nodes and links, easier and more useful? And there’s a lot of different visual properties which are helpful to take advantage of there. One of the key ones is link width. So let me show you that. I’ve got an example here where I’m looking at internal emails within an organization.
So each person is represented by a node, and the links between them are showing that those two people communicated. They emailed with one another in our dataset. But in this case right here, we’re not actually using any visual properties. Every link looks the same; it’s that same gray line.
Every node looks the same; it’s that same orange circle. One of the key things that you can do that really is helpful is to use some of those visual properties, bind them to properties back in your data model. So in this case, I talked about link width. I want to use the link width to show the strength of the relationship.
This is a really common thing. Thicker links are showing me two things which are more heavily connected. In the sort of visual intuitive model, you think of links as pipes, and the thicker the pipe, the more that can flow across that pipe, the more strongly those two nodes are connected to one another.
So in this case, we’re looking at the emails between those people, but right now, the link is identical regardless of whether they emailed once or ten or a hundred thousand times. And we really want to expose that to the end user. So what I might do here is just take the width of the link and bind that to the number of emails between those people.
So now we see thinner links are showing us relatively infrequent communication, and the thicker links are showing us heavy communication. So as a key example of that here, Kevin Presto is sending a lot of email to Rogers Herndon and receiving very little back in the opposite direction. Whereas this group over here is very tightly connected.
Everybody’s talking to everybody else, and the links are really thick. So not only are they communicating with everybody else in this subsection of the chart, but they’re sending lots of email back and forth. So there we were just using one example of the ways that we can take those visual properties, link size in this case, and arrowheads to show directionality to indicate something about the underlying source data.
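Binding link width to message volume might look like the sketch below. The message counts are invented, and the logarithmic scaling is a design choice made for this example (so that one very busy link does not flatten all the others); the talk does not say which scale the demo uses.

```python
import math
from collections import Counter

# Hypothetical email log: one (sender, recipient) pair per message.
emails = (
    [("kevin", "rogers")] * 40
    + [("rogers", "kevin")] * 2
    + [("tana", "kevin")]
)

def link_widths(emails, min_width=1.0, max_width=10.0):
    """Map each directed sender->recipient link to a width by message count."""
    counts = Counter(emails)
    top = max(counts.values())
    widths = {}
    for pair, n in counts.items():
        # Assumed log scale: compresses large counts relative to small ones.
        scale = math.log1p(n) / math.log1p(top)
        widths[pair] = min_width + (max_width - min_width) * scale
    return widths

widths = link_widths(emails)
```

Keeping the two directions as separate links preserves the asymmetry described above, where one person sends a lot and receives very little back.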
We’ll do a similar thing with the nodes. So maybe we want to take the size and the color of the node and bind that to a property in our data. So in this case, we have a score associated with each node for an algorithm called closeness. Closeness is just a way of calculating how central a node is or how well connected it is to the network itself.
And we’ve taken the size and the color of that node and assigned it to the score. So now you see the bigger nodes and the redder nodes are the ones that are more well connected in this case. So the end user doesn’t really have to know what closeness is or how it’s calculated or how it’s stored in my data.
They just see, for example, that Tana Jones here is a very central node and maybe is somebody worthy of paying closer attention to.
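For readers who do want to see what a closeness score involves, here is a small sketch using one common definition: the number of reachable nodes divided by the sum of shortest-path distances to them. The graph is made up, and the product in the demo may compute or normalize the score differently.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency mapping.
graph = {
    "tana": {"a", "b", "c", "e"},
    "a": {"tana"},
    "b": {"tana", "c"},
    "c": {"tana", "b", "d"},
    "d": {"c"},
    "e": {"tana"},
}

def closeness(graph, start):
    """(reachable nodes) / (sum of shortest-path lengths) from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search gives shortest hop counts
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

scores = {node: closeness(graph, node) for node in graph}

def node_size(score, smallest=10, largest=40):
    """Bind a closeness score to a node radius between smallest and largest."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return smallest
    return smallest + (largest - smallest) * (score - low) / (high - low)
```

The same normalized value could be mapped to a color ramp as well, so that size and color reinforce one another as in the demo.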
It is key to make sure that you don’t overwhelm your users. So, you know, in this example, I’ve got many different glyphs representing me on the chart. I’ve got the flag to show perhaps a country. I’ve got other glyphs with text in them. I’ve got a bubble, which is showing me even more narrative detail associated with it.
I’ve got glyphs and a label on the link, and then the item that I’ve linked to also has glyphs with text in them, and it’s just too much. My suggestion is to take two or maybe three key visual properties and bind those to the underlying source data, but not more than that. If you do more than that, it just gets confusing.
Now, one thing that you can do, as I talked about earlier, is provide additional detail about a node when the user clicks on it or hovers over it or things like that. Or you can allow, as you saw in the example I just showed, the user to decide which properties are useful to bind to the data, and that way they know what they’re looking at because they’ve clicked on the button, for example, that makes that determination.
But do be careful about overwhelming them with too much visual detail. And the other thing is that I would avoid labels to the extent that you can, especially labels on links. I’ve seen way too many charts with labels on links where every link has the same label or the majority of links show the type of relationship.
I haven’t found that to be very useful. I like to use link color in that case as opposed to labels. The user doesn’t want to have to read that, and when all the text is the same anyway, it just adds to the clutter, and it’s not that useful. So the next thing I want to do is I want to talk about temporal data.
Now, a lot of people will think that they don’t have time data in their graph, and that’s almost always not actually the case. So in many cases, there’s actually time data in the graph. It is literally a property of the link. In the example over here on the right, we’re looking at IP communications between various devices on a network, and obviously, each packet has a timestamp associated with it.
So we can look at exactly down to the millisecond when those links happened. But in other cases, you may not have a timestamp as a property of the graph data, but you still do know when this item was added to my database, when it was created in the graph, or when I learned about it. And it can be useful to track that sort of thing over time because then when you’re watching your graph evolve, you can see, what did my graph look like a month ago versus what does it look like now?
So even if that’s the only example of time or date data that you have, it is still useful to think about, how is my graph changing? How do I want to represent the change in my data over time? So there’s a couple of different things that we’re going to look at here. We’re going to look at how to represent graph data over time.
Sometimes that’s fairly straightforward to do, and sometimes I need an entirely new visualization to do it, depending on what it is that I want to show. So let me show you an example of that.
So in this example, we’re looking at money flow, and if we just focus for now on the node-link visualization over here on the right, we are strongly implying that cash went from these cash deposits through a casino account into an individual’s checking account, and then so on through some other people, and then down to, say, a property manager.
However, we don’t actually know that by looking at the graph, because the sequence of those money transfers matters quite a bit. If the transfer from Woody’s checking account to Ned’s checking account actually happened ten years ago, then it’s irrelevant for this money flow scenario that we’re describing here, because it didn’t take place at all within the same timeframe of all of the other events here, and we can’t use that information.
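The ordering requirement can be stated as a simple check: along a chain of transfers, each one must happen no earlier than the one before it. The dates below are invented, and `flows_in_order` is a hypothetical helper, not part of any product.

```python
from datetime import date

# Hypothetical transfers along one path: (from account, to account, date).
transfers = [
    ("cash", "casino", date(2021, 3, 1)),
    ("casino", "woody", date(2021, 3, 4)),
    ("woody", "ned", date(2011, 6, 9)),        # ten years too early
    ("ned", "property_mgr", date(2021, 3, 9)),
]

def flows_in_order(path_links):
    """True if each transfer happens on or after the previous one."""
    dates = [when for _, _, when in path_links]
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))

print(flows_in_order(transfers))      # the full chain is broken by the old transfer
print(flows_in_order(transfers[:2]))  # the first two hops are in sequence
```

A static node-link chart draws the same four links either way, which is why the sequence has to come from somewhere other than the layout.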
Now, there’s not a good way to display that in a traditional node-link visualization. The sequence really can make a big difference in terms of what can we learn from graphs and what can we learn from displaying a graph. So we’ve got a couple of different techniques that I think can be useful for showing that.
So the first thing I’m gonna do is look at what happens when there is a date/time component to the actual data itself, and what can we learn from that. So in this example, I’ve got a set of text messages. Each node is a phone number, and the link between them is showing that those two phones at some point sent a text message between them.
So you get a node-link chart which is showing us that structure of the data, who’s communicating with whom else. And like we did in the email scenario I showed you earlier, the width of the link is bound to the strength of the connection, the number of text messages between those two items. Now, what we’re going to do is we’re going to look at when those text messages happened.
Obviously, each text message has a date/time stamp, which is something that we’re going to take advantage of. It’s a property of that link. Now, we can see down here at the bottom we have a histogram which is showing us when those messages occurred. They spiked in September, dropped off in October, and then leveled off to a relatively steady level after that.
Our data set happens to cover five months in 2010 and early 2011. But it can be really useful for an end user to focus on a specific window of time. This is allowing the user to filter the graph. Similar to what we did with the mafia scenario, it shows only the data which occurs during the window that I have selected.
So I’m going to drill down into that single week in October, filter the graph—the node-link representation—to show me only the items which fall during that week. And I’ve got a much easier to digest and easier to understand graph visualization. Now, I can zoom in as far as is useful.
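The histogram and the time-window filter are both simple functions of the link timestamps. A minimal sketch with made-up messages; the function names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical text messages: (from number, to number, timestamp).
messages = [
    ("555-0101", "555-0102", datetime(2010, 9, 14, 9, 30)),
    ("555-0102", "555-0101", datetime(2010, 9, 14, 9, 42)),
    ("555-0101", "555-0103", datetime(2010, 10, 5, 18, 5)),
    ("555-0103", "555-0104", datetime(2010, 10, 6, 23, 50)),
    ("555-0101", "555-0102", datetime(2010, 11, 20, 12, 0)),
]

def histogram_by_month(messages):
    """Count messages per (year, month) for the bar chart under the graph."""
    return Counter((t.year, t.month) for _, _, t in messages)

def in_window(messages, start, end):
    """Links whose timestamp falls in the half-open window [start, end)."""
    return [m for m in messages if start <= m[2] < end]

# Drill down to a single week in October and rebuild the visible graph.
week = in_window(messages, datetime(2010, 10, 4), datetime(2010, 10, 11))
visible_nodes = {n for a, b, _ in week for n in (a, b)}
```

Playing the data back as an animation amounts to sliding `start` and `end` forward in small steps and redrawing with each new result.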
I’ve gone down to just a few hours of the day now, and I’m looking at the graph, the phone calls or the text messages that occurred during that window. Another thing that can be really helpful here is the animation, so the eye can pick up on changes in the data over time. If I play this back, I can actually watch the data change over time.
I’m watching new items get added to the chart as people send those messages, but I’m watching things disappear as well because I’m getting off toward the middle of the night when people are using their phone less than they are during the day. So you see the chart itself start to show fewer items on it.
Another thing that can be really useful with graph data that has a time component is to look not just at the sum total of everything in my visualization, which is what we’re doing here, but to look at subsets of that data too. This can be especially useful in sort of an anti-fraud or a security type scenario where I want to see does this pattern of the graph data match the pattern that I would expect from everything else in my data, or is it aberrant for some reason?
Does it stand out? So in this case, maybe I want to focus on just this individual. So I’ve selected him in green on the link chart, and I’m looking at his text messages superimposed on the histograms as well. So I can see, for example, that he was using his phone somewhat in September, dropped off to, in fact, zero in early October, and then picked up usage significantly after that.
Maybe I want to compare that to, say, this person in orange who has two distinct spikes in activity, one in September and one toward the end of the year, but not much outside of that range. So that can be a really useful way of saying, “Should I look at this more closely? Is this a reason that I need to investigate this pattern of activity and drill down a bit more?”
And you can do that from right within the visualization too. So, for example, here, you know, with this person that I selected in red whose activity is exclusively in the month of September, maybe I want to zoom in and see which specific days in September he was active. And you’ll notice here that I’m doing something a little bit differently with the filtering too.
As I drill down into an individual subset of the time range, in this case, a week in September, instead of removing everything else off of the chart, I’m just fading it out. This can also be a really useful thing to show when you want to maintain that context of everything else on the chart, but highlight certain items, in this case, the items that are in the window that I’m selecting.
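Fading instead of removing is a small change to the same window test: items outside the window get a low opacity and stay on the chart. The `link_opacity` function and the 0.15 value are assumptions for this sketch.

```python
from datetime import datetime

def link_opacity(timestamp, start, end, faded=0.15):
    """Full opacity inside the selected window, a faint value outside it."""
    return 1.0 if start <= timestamp < end else faded

# A selected week in September (dates are made up).
start, end = datetime(2010, 9, 13), datetime(2010, 9, 20)
inside = link_opacity(datetime(2010, 9, 15, 8, 0), start, end)
outside = link_opacity(datetime(2010, 10, 2, 8, 0), start, end)
```

Because nothing is removed, the layout does not need to change, so the highlighted items keep their position relative to everything else.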
So both of those can make a lot of sense. But the other thing is that perhaps we don’t necessarily want to display a link chart at all. If we’re looking more exclusively at the sequence of events, or the time component is front and center to how we want to understand and visualize this data, then maybe we don’t necessarily want to show a link chart.
What’s linked to what else is less important than the when. As an example of that, I’ve got a timeline here which is showing us 40 years of terror activity around the globe. Every single terror event is represented here. And if we were to show this as a graph, it would just be overwhelming.
There would be way too many events associated with it. We wouldn’t have any sense of time when the events happened. Things that were separated by decades could be shown right next to one another in the graph. So instead, we’re taking a tack where we’re looking at this as a timeline. It is still graph data.
We are representing the country where the event took place as a node, and we’re representing the group that was responsible as a node, and the link between them is the actual terror event that took place. But because each terror event takes place at a specific point in time, we’re showing a timeline to represent that.
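As a rough sketch of the model just described, countries and groups become nodes and each event becomes a timestamped link between them. The class names, the id prefixes, and the placeholder rows below are assumptions for illustration; they are not taken from the dataset in the demo.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # "country" or "group"


@dataclass(frozen=True)
class Event:
    """A link from a group node to a country node, stamped with a date."""
    group_id: str
    country_id: str
    when: date


def build_graph(rows):
    """Turn (group, country, date) rows into nodes and timestamped links."""
    nodes = {}
    events = []
    for group, country, when in rows:
        group_id = f"group:{group}"        # prefixes keep the two id spaces apart
        country_id = f"country:{country}"
        nodes.setdefault(group_id, Node(group_id, "group"))
        nodes.setdefault(country_id, Node(country_id, "country"))
        events.append(Event(group_id, country_id, when))
    return nodes, events


# Placeholder rows; the names and dates are made up.
rows = [
    ("Group A", "Country X", date(1995, 1, 10)),
    ("Group A", "Country X", date(1996, 3, 2)),
    ("Group B", "Country X", date(2000, 7, 21)),
]
nodes, events = build_graph(rows)
print(len(nodes), "nodes,", len(events), "events")
```

Because the date lives on the link, the same data can drive either a link chart or a timeline.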
Now, what we’re doing here is we’re showing a heat map, so the times that were more active with terror activity are denser on the heat map. So it gives us a very quick insight into when certain areas were active. So for example, in Africa, we can see that there was a distinct spike in the 1993 to ’95 range and then another spike between 2005 and 2010, but relatively less activity outside of that range.
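A heat map like this boils down to counting events per row and per time bucket. The sketch below assumes a five-year bucket and placeholder records; the region labels and dates are illustrative, not figures from the dataset shown.

```python
from collections import Counter
from datetime import date


def heatmap_counts(events, bucket_years=5):
    """Count events per (region, bucket start year).

    A denser cell on the heat map corresponds to a higher count here.
    """
    counts = Counter()
    for region, when in events:
        bucket_start = when.year - when.year % bucket_years
        counts[(region, bucket_start)] += 1
    return counts


# Placeholder (region, date) records; made up for illustration.
events = [
    ("Region 1", date(1993, 5, 1)),
    ("Region 1", date(1994, 8, 9)),
    ("Region 1", date(2007, 2, 3)),
    ("Region 2", date(2003, 11, 30)),
]

counts = heatmap_counts(events)
for cell in sorted(counts):
    print(cell, counts[cell])
```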
Whereas in the Americas, we can see that it really grew after about 2000 or so. So maybe I want to drill down from here. So this is helpful at this level, but let’s focus on—we’ll just pick the first country in the list, Algeria. So I’ve decided to focus on Algeria, look at only terror activity that took place in Algeria, display that on the timeline.
So now I’m looking at the links between terror groups and who was responsible for individual events over the course of time in Algeria. And I can drill down and zoom in on a window of time here to show, for example, you know, this Frenchman who was shot in Algiers in January of 1995. So what we’ve done is we’ve started from 40 years of terror activity, and we’ve condensed that down or drilled down to show an individual event that happened in January of 1995.
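The drill-down described here combines two filters: one on a node (the country) and one on a window of time. A minimal sketch, assuming events are plain dictionaries with hypothetical `country`, `group`, and `when` keys and made-up dates:

```python
from datetime import date


def drill_down(events, country=None, start=None, end=None):
    """Keep events matching the country and falling inside [start, end].

    Any filter left as None is simply not applied.
    """
    kept = []
    for event in events:
        if country is not None and event["country"] != country:
            continue
        if start is not None and event["when"] < start:
            continue
        if end is not None and event["when"] > end:
            continue
        kept.append(event)
    return kept


# Placeholder events; groups and dates are invented for illustration.
events = [
    {"country": "Algeria", "group": "Group A", "when": date(1995, 1, 15)},
    {"country": "Algeria", "group": "Group B", "when": date(1997, 6, 2)},
    {"country": "Colombia", "group": "Group C", "when": date(2000, 4, 9)},
]

in_algeria = drill_down(events, country="Algeria")
january_1995 = drill_down(
    events, country="Algeria", start=date(1995, 1, 1), end=date(1995, 1, 31)
)
print(len(in_algeria), len(january_1995))
```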
So there we were starting with filtering based on only events that occurred in a certain country. But the other filtering we can do is based on time. So if I wanted to, say, look at the year 2000. Once I get down to a level where it actually makes sense to visualize my links, my events as individual events instead of heat maps, then it can make a lot of sense to show that.
So an example here, you know, we have activity in Colombia. The Revolutionary Armed Forces of Colombia, the FARC, had some activity in the year 2000 that was taking place in Colombia, unsurprisingly. So that can be a really useful way if you narrow down on time to show the individual events once it makes sense.
You’re not overwhelming the screen with clutter. And as we saw in this example, sometimes it makes sense to take the two and combine them together. Which nodes are associated with which other nodes is shown in the node-link representation over here on the right, and over here on the left, I’ve got the actual transfers themselves and when they happened, the transfers from checking accounts to property management accounts, and so on.
Let me go back to the slides here.
What we’re going to wrap up with here is a talk about what to do with graphs that have locations. And I think it can be really useful to take that node-link visualization and superimpose it on top of a map. But you do have to be careful when you do this. There are a couple of caveats associated with that.
So for example, in the screenshot over here on the right, I’m looking at airline flights on Southwest Airlines and the routes that they have. So the airports are the nodes, the links between them show the flights between those two cities. And even just with this small zoomed-in area of the Northeast United States, we can see that it’s starting to get really cluttered.
It’s hard to trace out where some of those links go because we lose that layout flexibility. We can’t organize our nodes and links to show it in a way that makes it easy to read. The nodes have to go over the actual physical location of that airport on the map. So superimposing nodes and links on top of a map can be really helpful, but you should do so sparingly and only with a smaller subset or a smaller level of detail and data than you would in a traditional node-link visualization.
So let me show you an example of this. We were talking about airlines and flights earlier. Here I’ve got just the hundred busiest US airline routes. The airports are the nodes. The link between them is color-coded by the airline that flies between those cities. Now, you can see some very clear patterns here just by looking at this in our traditional representation.
So the hubs of the various airlines stand out. Atlanta is very clearly a Delta hub, Dallas/Fort Worth is very clearly an American Airlines hub, and so on. But unless we knew our airport codes, we wouldn’t necessarily know where those airports actually were. So taking this and putting it over the top of a map can be really helpful in this scenario.
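The hub pattern that stands out visually can also be approximated with a simple count: for each airline, which airport appears on the most of its routes. The routes below are a handful of illustrative entries, not the hundred-route dataset from the demo, and `busiest_airport_per_airline` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Illustrative (origin, destination, airline) routes; not the demo dataset.
routes = [
    ("ATL", "MCO", "Delta"),
    ("ATL", "LGA", "Delta"),
    ("ATL", "FLL", "Delta"),
    ("LAX", "JFK", "Delta"),
    ("DFW", "LAX", "American"),
    ("DFW", "ORD", "American"),
]


def busiest_airport_per_airline(routes):
    """For each airline, return the airport that appears on most of its routes."""
    per_airline = defaultdict(Counter)
    for origin, destination, airline in routes:
        per_airline[airline][origin] += 1
        per_airline[airline][destination] += 1
    return {
        airline: counts.most_common(1)[0][0]
        for airline, counts in per_airline.items()
    }


hubs = busiest_airport_per_airline(routes)
print(hubs)
```

This is a crude degree count per link color; a real analysis might weight routes by passenger volume.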
But notice that we’ve got a lot of wasted space. There’s not, in any one of the hundred busiest airline flights, much in the north-central United States. There’s some activity over here in California, and then quite a bit on the East Coast, but things are really clustered together, and it doesn’t make as much sense, and it makes it really hard to see where these links are going.
I can zoom way in to see, you know, Boston and my New York airports, but even still, it’s difficult to make sense of what it is I’m looking at. So what I have found to be the most useful is to allow end users to switch back and forth between their graph view and their map view on the same pane so that they can better understand what it is that they’re looking at.
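Switching between the two views amounts to swapping where node positions come from: fixed coordinates on the map, or a layout that is free to arrange things legibly. In the sketch below, the equirectangular projection and the circular layout are stand-ins chosen for brevity, the function names are hypothetical, and the airport coordinates are approximate.

```python
import math


def map_positions(nodes, width=1000, height=500):
    """Equirectangular projection: longitude maps to x, latitude to y."""
    return {
        node_id: ((lon + 180) / 360 * width, (90 - lat) / 180 * height)
        for node_id, (lat, lon) in nodes.items()
    }


def circle_positions(nodes, radius=200):
    """Stand-in for a real graph layout: space nodes evenly on a circle."""
    ids = sorted(nodes)
    step = 2 * math.pi / len(ids)
    return {
        node_id: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, node_id in enumerate(ids)
    }


def positions_for(nodes, view):
    """Return positions for the same nodes in either the map or graph view."""
    if view == "map":
        return map_positions(nodes)
    if view == "graph":
        return circle_positions(nodes)
    raise ValueError(f"unknown view: {view!r}")


# Approximate (latitude, longitude) values for three airports.
airports = {
    "ATL": (33.64, -84.43),
    "DFW": (32.90, -97.04),
    "BOS": (42.36, -71.01),
}

map_view = positions_for(airports, "map")
graph_view = positions_for(airports, "graph")
print(map_view["ATL"], graph_view["ATL"])
```

The node set is identical in both views; only the positions change, which is what lets a user flip back and forth without losing track of what they are looking at.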
Now, an unanswered question is, what do we do with nodes that have no geographic representation but are linked to ones that do? So in this case, our airports, every single airport has a location, and airports are the only things we’re showing on the nodes. But if we’re showing, say, for example, passengers that are taking various flights around the country, then the passengers themselves don’t have a location.
They are linked to airports which do. But how do we show that link on a map? Where do we put the people without falsely implying that they have some sort of location that they don’t? Can we make the map part of the screen and have some section outside of the map for those items which don’t have a geographic representation?
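One of the options raised here, a separate section outside the map, would start by splitting nodes into those that have coordinates and those that do not. This is only a sketch of that first step under assumed field names (`lat`, `lon`); it does not settle where the unlocated nodes should be drawn.

```python
def split_by_location(nodes):
    """Separate nodes with coordinates from nodes without any."""
    located = {}
    unlocated = []
    for node_id, attrs in nodes.items():
        if "lat" in attrs and "lon" in attrs:
            located[node_id] = (attrs["lat"], attrs["lon"])
        else:
            unlocated.append(node_id)
    return located, unlocated


# Hypothetical nodes: airports carry approximate coordinates, passengers do not.
nodes = {
    "ATL": {"lat": 33.64, "lon": -84.43},
    "BOS": {"lat": 42.36, "lon": -71.01},
    "passenger-1": {},
    "passenger-2": {},
}

located, unlocated = split_by_location(nodes)
print(sorted(located), sorted(unlocated))
```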
It’s an unsolved problem. I don’t have a good answer for that. I’ve looked at some various techniques for doing it. Some might make sense in some scenarios, and some might not, and it’s difficult to really understand exactly how we should treat a scenario like that, so we’re still working on it. All right, let’s take a trip back to the slides here and wrap things up.
So I’ve talked about what graphs are, when we do want to visualize them, and some techniques that you might use for visualizing graphs. And there are various domains where people have been using graph visualization for quite a while, and I’ve found that they tend to cluster in a few different areas, although, of course, I’m always seeing new scenarios in which graph visualization makes a lot of sense or is starting to get picked up.
Defense and intelligence and law enforcement are some of the key areas, and you can imagine that. You saw in some of the example datasets I used, you know, terror activity around the globe or connections between gangs or criminal groups or looking at communications data and what can I learn about a network based on the pattern of communications that maybe I intercepted or something like that.
And that translates over to anti-fraud pretty well, if I’m trying to better understand claims data or policy data to understand which claims might be fraudulent or worthy of further investigation and which ones are just the normal course of doing business. Supply chain, you can imagine, has become particularly important in the last year or so, where we’re trying to understand who are the vendors that supply me with goods, who are the vendors that supply them with goods, and how can I make that supply chain more resilient?
Well, in order to do so, I need to understand my data as a graph and map that out. Or looking at cybersecurity. That’s been a very popular and growing domain for graphs, looking at actual patterns of IP communications across a network or the propagation of malware over a network or things like that.
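The supply chain question above (who supplies me, and who supplies them) is a graph traversal. A minimal sketch, assuming a hypothetical `suppliers_of` mapping from each company to its direct suppliers and made-up vendor names:

```python
from collections import deque


def supplier_tiers(suppliers_of, company):
    """Breadth-first search upstream from `company`.

    Tier 1 supplies the company directly, tier 2 supplies tier 1, and so on.
    Each supplier is recorded at the shallowest tier where it appears.
    """
    tiers = {}
    seen = {company}
    queue = deque([(company, 0)])
    while queue:
        current, depth = queue.popleft()
        for supplier in suppliers_of.get(current, []):
            if supplier not in seen:
                seen.add(supplier)
                tiers[supplier] = depth + 1
                queue.append((supplier, depth + 1))
    return tiers


# Hypothetical supply relationships, including a cycle back to "me".
suppliers_of = {
    "me": ["vendor1", "vendor2"],
    "vendor1": ["vendor3"],
    "vendor2": ["vendor3", "vendor4"],
    "vendor3": ["me"],
}

tiers = supplier_tiers(suppliers_of, "me")
print(tiers)
```

A vendor that many tier-1 suppliers depend on, like `vendor3` here, is the kind of shared dependency that affects resilience.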
And as I said, there are always new ones that are coming up.
So over the course of this talk, I’ve used our products at Cambridge Intelligence (KeyLines, ReGraph, and KronoGraph) to show you some examples of things that you might want to do in your graph visualizations. There’s certainly no requirement to use our products to do these. You could build your own. You could use some of the open source tools like D3, but obviously I feel like our tools give you a significant head start in producing those valuable graph visualizations.
What we do is we make graph visualizations for embedding inside our customers’ applications. So they’re toolkits as opposed to end user products themselves. And I’ve shown you over the course of the talk three different ones here. We have KeyLines for JavaScript developers, ReGraph for React developers, and KronoGraph, which is supported in JavaScript or React for building timelines.
We do have free trials of those, so do feel free to reach out if you want to give our tools a shot, and do keep these techniques in mind. Thank you very much for attending this video, and if you have any questions or anything you want to run by me, feel free to reach out to me directly. My email’s right there on the screen.
