Visualizing a Knowledge Graph

8th September, 2015

Once you start working with graphs, it does not take long before you begin to see them all around you.

This blog post is about one of our recent experiments, looking at the graph structures in Wikipedia articles (via DBpedia) to understand the evolution of music through time.

If you feel inspired, why not try it for yourself? Register for a free KeyLines trial.

What is DBpedia?

DBpedia can be thought of as a machine-readable version of Wikipedia. DBpedia is a huge database built upon structured information found in Wikipedia articles.

DBpedia has a robot that will parse Wikipedia articles and store them in a ‘Semantic Web’ format.

This is great for querying relationships between things. And, of course, data with relationships is often great to visualize in KeyLines.

Notice how the right hand panel is filled with machine-parseable structured information.
The DBpedia version of the article, shown here as an HTML table but also available as a JSON object.

Defining SPARQL and RDF Triples

SPARQL is a query language for the Resource Description Framework (RDF) – a data model that describes information as triples of Subjects, Predicates and Objects:

  • A subject is the resource being described in our triple
  • A predicate defines the relationship within the triple
  • An object is something related to the subject, via the predicate

The terminology subject-predicate-object is also used in spoken languages (to describe the three components required to form a sentence) which makes RDF triples a logical format for describing a resource:

  • Subject: A band
  • Predicate: Has
  • Object: A genre
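
To make the triple idea concrete, here is a minimal Python sketch. The Triple tuple and the match helper are names invented for this illustration (they do not come from any RDF library), and the data is hand-written rather than pulled from DBpedia. A pattern uses None as a wildcard, which plays roughly the role a variable plays in a SPARQL query.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hand-written sample triples, for illustration only
triples = [
    Triple("The_Clash", "genre", "Punk_rock"),
    Triple("Ramones", "genre", "Punk_rock"),
    Triple("The_Clash", "hometown", "London"),
]

# "Which subjects have the genre Punk_rock?"
punk_bands = [t.subject for t in match(triples, predicate="genre", obj="Punk_rock")]
print(punk_bands)
```

Leaving the subject unspecified is what turns a single fact into a question: every triple that fits the remaining two parts is an answer.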

Introducing Ontologies

Another concept of SPARQL (and the semantic web) we need to understand is an ‘ontology’.

An ontology can be thought of as a dictionary of descriptive terms we can use to link things. For example, if we look at the DBpedia resource for ‘The Clash’ (http://dbpedia.org/page/The_Clash), we can see that they have a genre defined as:

[Screenshot: the genre property listed on the DBpedia page for The Clash]

The machine representation of this information that DBpedia stores is as follows:

<http://dbpedia.org/resource/The_Clash>
<http://dbpedia.org/property/genre>
<http://dbpedia.org/resource/Punk_rock>

Here we are using two ontologies: dbpedia.org/resource and dbpedia.org/property. Ontologies are great because they let us define commonalities between information. We can say that data is linked to other data if they share a subject, predicate or object drawn from the same ontology.
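
Writing out full URIs quickly gets tedious, which is why SPARQL queries usually declare short prefixes for them. The sketch below shows the idea in Python. The expand helper and the PREFIXES table are hypothetical names made up for this example; the two URIs are the ones from the triple above.

```python
# Prefix table covering the two namespaces used in the triple above
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "dbp": "http://dbpedia.org/property/",
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'dbr:The_Clash' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


triple = tuple(expand(name) for name in ("dbr:The_Clash", "dbp:genre", "dbr:Punk_rock"))
for uri in triple:
    print("<" + uri + ">")
```

Two resources that expand to the same URI are, by definition, the same thing, and that is what lets separate facts link up into a graph.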

How to write a SPARQL query for DBpedia

With this knowledge, let’s try writing our first SPARQL query to run on the live DBpedia SPARQL endpoint.

This is a great place to test out your SPARQL skills: http://live.dbpedia.org/sparql.

Let’s try the following SPARQL query:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?label ?band
WHERE {
  ?band dbo:genre dbr:Punk_rock .
  ?band foaf:name ?label .
  FILTER (LANG(?label) = 'en')
}

We’ve got four components to this query:

  1. The PREFIX at the top – defining the list of ontologies we use in the query.
  2. The SELECT statement – defining the variables we want to select (these can be any node in the RDF dataset).
  3. The WHERE clause – which in this case defines a band as something whose genre is dbr:Punk_rock. At this stage, we also say that the label is the name of the band.
  4. Finally, we apply a filter to show only labels in the English language.

When we click ‘Run Query’, we will get back a huge table of every punk rock band found on Wikipedia:

A list of all the punk rock bands on Wikipedia
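
If we ask the endpoint for JSON rather than an HTML table, each row comes back as a ‘binding’ keyed by the variable names from the SELECT. Here is a minimal sketch of reading such a response, assuming the standard SPARQL JSON results layout. The sample response is hand-written for illustration, not real endpoint output, and rows is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (not real endpoint output)
sample = json.loads("""
{
  "head": {"vars": ["label", "band"]},
  "results": {"bindings": [
    {"label": {"type": "literal", "xml:lang": "en", "value": "The Clash"},
     "band": {"type": "uri", "value": "http://dbpedia.org/resource/The_Clash"}}
  ]}
}
""")


def rows(response):
    """Flatten each binding into a plain {variable: value} dict."""
    variables = response["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in response["results"]["bindings"]
    ]


for row in rows(sample):
    print(row["label"], row["band"])
```

The check for whether a variable is present in a binding matters because unbound variables are simply left out of a row.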

Now, DBpedia and SPARQL can be great fun to play around with, but there’s one thing missing from these huge tables of results: a nice visualization!

Time to build a KeyLines visualization!

Visualizing DBpedia in KeyLines

For this demo, I have something in mind. If you look back at the earlier DBpedia representation of Reggae (Figure 2), you will see that it has the properties ‘derivative’ and ‘stylisticOrigin’.

For a music genre, the derivatives are other genres that were inspired by, or branched from, the original genre. Conversely, the stylistic origins are the genres that influenced the genre in question.

So for every music genre we will have its parents and children – a perfect graph structure!

The first thing to do is to write our SPARQL query:

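PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>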

SELECT ?label ?genre ?decade ?origins ?derivatives
WHERE {
  ?genre rdf:type dbo:MusicGenre .
  ?genre dbp:culturalOrigins ?decade .
  ?genre rdfs:label ?label .
  ?genre dbp:stylisticOrigins ?origins .
  ?genre dbp:derivatives ?derivatives .
  FILTER (LANG(?label) = 'en')
}
ORDER BY ?label

It is then easy to write a script that sends this SPARQL query to an endpoint URL and creates a JSON file from the JSON returned.

The URL I hit was as follows:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&format=application%2Fsparql-results%2Bjson&timeout=30000&query=

After ‘query=’ in the URL, I added the URI-encoded SPARQL query. The results come back with the label repeated across multiple rows, so I wrote a small piece of code to parse the response and group each parameter by its label (the genre name).
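
The original script is not shown here, so the following is only a sketch of those two steps in Python. It assumes the response uses the standard SPARQL JSON results layout; build_url and group_by_label are hypothetical helper names, and the sample rows are hand-written, not real DBpedia output.

```python
from collections import defaultdict
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"


def build_url(sparql):
    """Build the request URL, with the URI-encoded query as the last parameter."""
    params = {
        "default-graph-uri": "http://dbpedia.org",
        "format": "application/sparql-results+json",
        "timeout": "30000",
        "query": sparql,
    }
    return ENDPOINT + "?" + urlencode(params)


def group_by_label(bindings):
    """Collapse the repeated rows into one record per genre label."""
    genres = defaultdict(lambda: {"decade": None, "origins": set(), "derivatives": set()})
    for b in bindings:
        record = genres[b["label"]["value"]]
        record["decade"] = b["decade"]["value"]
        record["origins"].add(b["origins"]["value"])
        record["derivatives"].add(b["derivatives"]["value"])
    # Sorted lists are easier to save as JSON than sets
    return {
        label: {
            "decade": r["decade"],
            "origins": sorted(r["origins"]),
            "derivatives": sorted(r["derivatives"]),
        }
        for label, r in genres.items()
    }


# Hand-written sample rows: one genre repeated once per derivative
sample_bindings = [
    {"label": {"value": "Punk rock"}, "decade": {"value": "1970s"},
     "origins": {"value": "Garage rock"}, "derivatives": {"value": "Post-punk"}},
    {"label": {"value": "Punk rock"}, "decade": {"value": "1970s"},
     "origins": {"value": "Garage rock"}, "derivatives": {"value": "Hardcore punk"}},
]

print(build_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"))
print(group_by_label(sample_bindings))
```

The rows repeat because each combination of origin and derivative produces its own result row, so collecting the values into sets removes the duplicates.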

It is best to save the data you want from DBpedia. That way, we don’t have to keep hitting the DBpedia endpoint, which would be slow for our users and not very nice for the DBpedia service, which is kindly hosted by someone else for our benefit.
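
One simple way to do this is to cache the result in a local file and only call the endpoint when that file is missing. The sketch below assumes a hypothetical load_or_fetch helper. The fetch function is a stand-in that returns placeholder data, so the example runs without touching DBpedia.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached data from path, calling fetch only on a cache miss."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data


calls = []


def fake_fetch():
    # Stand-in for the real request to the DBpedia endpoint (placeholder data)
    calls.append(1)
    return {"Punk rock": {"decade": "1970s"}}


cache_path = os.path.join(tempfile.mkdtemp(), "genres.json")
first = load_or_fetch(cache_path, fake_fetch)   # cache miss: calls fake_fetch
second = load_or_fetch(cache_path, fake_fetch)  # cache hit: reads the file
print(len(calls))
```

Deleting the file is then all it takes to force a fresh download when the data needs updating.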

Presenting the data in KeyLines

Now we have our cleansed JSON file containing all the DBpedia data we need – every music genre found on Wikipedia, listed with the decade it emerged, and its parent/child genres.
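
Before charting, the per-genre records need to become nodes and links. Here is a minimal sketch of that conversion. The record layout (decade, origins, derivatives) and the output field names (id, label, source, target) are assumptions made for this example, not the KeyLines item format, so check the KeyLines documentation for the exact schema it expects.

```python
def to_graph(genres):
    """Build node and link lists from {label: {decade, origins, derivatives}}."""
    nodes = {}
    links = set()

    def add_node(name, decade=None):
        node = nodes.setdefault(name, {"id": name, "label": name, "decade": None})
        if decade is not None:
            node["decade"] = decade

    for name, info in genres.items():
        add_node(name, info.get("decade"))
        for parent in info.get("origins", []):
            add_node(parent)
            links.add((parent, name))   # the parent influenced this genre
        for child in info.get("derivatives", []):
            add_node(child)
            links.add((name, child))    # this genre influenced the child
    link_list = [{"source": s, "target": t} for s, t in sorted(links)]
    return list(nodes.values()), link_list


# Hand-written sample records, for illustration only
genres = {
    "Punk rock": {"decade": "1970s", "origins": ["Garage rock"], "derivatives": ["Post-punk"]},
    "Post-punk": {"decade": "1970s", "origins": ["Punk rock"], "derivatives": []},
}

nodes, links = to_graph(genres)
print(len(nodes), "nodes and", len(links), "links")
```

Storing links as a set of (parent, child) pairs means a relationship recorded from both ends, once as a derivative and once as an origin, yields only one link.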

Here is what the data looks like when it is first loaded into KeyLines:

A chaotic graph of the connections between every music genre

Yikes. This graph is a bit chaotic.

I decided to color nodes based on the decade in which they emerged, when the data was available, and size the nodes depending on the genre’s overall influence.
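
The post does not say exactly how influence was measured, so the sketch below makes a loud assumption: a genre’s influence is the number of links leaving it, i.e. how many genres it directly inspired. The palette, the field names and the style_nodes helper are all hypothetical.

```python
from collections import Counter

# Hypothetical palette; decades missing from it fall back to grey
DECADE_COLOURS = {"1960s": "#1f77b4", "1970s": "#ff7f0e", "1980s": "#2ca02c"}
DEFAULT_COLOUR = "#999999"


def style_nodes(nodes, links, base_size=1.0, step=0.5):
    """Attach a colour (by decade) and a size (by outgoing links) to each node."""
    # Assumption: influence = number of genres directly derived from this one
    influence = Counter(link["source"] for link in links)
    return [
        {
            **node,
            "colour": DECADE_COLOURS.get(node.get("decade"), DEFAULT_COLOUR),
            "size": base_size + step * influence[node["id"]],
        }
        for node in nodes
    ]


# Hand-written sample data, for illustration only
nodes = [
    {"id": "Punk rock", "decade": "1970s"},
    {"id": "Post-punk", "decade": None},
]
links = [{"source": "Punk rock", "target": "Post-punk"}]

styled = style_nodes(nodes, links)
for node in styled:
    print(node["id"], node["colour"], node["size"])
```

Nodes without a decade get the fallback colour, which matches the ‘when the data was available’ caveat above.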

Even with this styling, each node can have a huge number of parents and children, which is what causes the density we see here.

Fortunately, KeyLines makes it really easy for us to add some controls to help in scenarios like these. Let’s try some searches and filtering.
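
The logic behind a decade filter is simple enough to sketch outside the chart. This is not the KeyLines filtering API, just the underlying idea in plain Python. It assumes each node carries a decade field normalised to strings such as ‘1970s’, and filter_by_decade is a hypothetical helper name.

```python
def filter_by_decade(nodes, links, decade):
    """Keep the nodes from one decade and the links that run between them."""
    kept = [n for n in nodes if n.get("decade") == decade]
    kept_ids = {n["id"] for n in kept}
    kept_links = [
        link for link in links
        if link["source"] in kept_ids and link["target"] in kept_ids
    ]
    return kept, kept_links


# Placeholder genres A to D, for illustration only
nodes = [
    {"id": "A", "decade": "1960s"},
    {"id": "B", "decade": "1970s"},
    {"id": "C", "decade": "1970s"},
    {"id": "D", "decade": "1970s"},
]
links = [
    {"source": "A", "target": "B"},
    {"source": "B", "target": "C"},
]

seventies, seventies_links = filter_by_decade(nodes, links, "1970s")
print([n["id"] for n in seventies], seventies_links)
```

Genre D survives the filter with no links at all. A genre like that shows up as a singleton, sitting apart from the main graph.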

Let’s have a look at all the music genres which were created in the 1970s:

A network of 1970s music genres

It’s no surprise that the 1970s were a very creative time for music: much of today’s music derives from genres created during that period, such as hip hop, punk rock and post-punk.

In this view, we can clearly see the influence of post-punk in the 1970s, which influenced, or drew influence from, a network of other rock genres.

We can also see less mainstream genres sitting apart from the main graph as singletons: psychedelic folk, doom metal, cadence-lypso and others.

Using a hierarchy layout, we can track the influence of genres through the decades. Let’s click on Acid Rock:

The descendants of acid rock

The nodes on the first level were directly influenced by acid rock. Further down, we can see genres influenced by acid rock’s children. Clicking any of these nodes lets us explore further, working through a world of music!
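
KeyLines works out the hierarchy layout for us, but the idea of ‘levels’ is easy to sketch by hand with a breadth-first walk. The code below only illustrates that idea and makes no claim about how KeyLines implements its layout. The descendants_by_level helper and the placeholder genre names A to D are invented for the example.

```python
from collections import deque


def descendants_by_level(links, root):
    """Breadth-first walk returning {genre: level}, with the root at level 0."""
    children = {}
    for link in links:
        children.setdefault(link["source"], []).append(link["target"])

    levels = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in levels:  # genres can influence each other in cycles
                levels[child] = levels[current] + 1
                queue.append(child)
    return levels


# Placeholder genres: A influences B and C, B influences D, D loops back to A
links = [
    {"source": "A", "target": "B"},
    {"source": "A", "target": "C"},
    {"source": "B", "target": "D"},
    {"source": "D", "target": "A"},
]

levels = descendants_by_level(links, "A")
print(levels)
```

Recording each genre the first time it is reached keeps the walk from looping forever when two genres list each other as influences.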

Try it for yourself

DBpedia is a gold mine of knowledge, available for you to explore – whether for fun, or to derive some more meaningful information.

KeyLines is the best way to navigate through the connections and relationships. Give it a try! Register for a free trial:

Try KeyLines!
