You may think that, unlike NoSQL databases, relational databases aren’t suited to graph analysis and visualization.
Whilst it’s true that the tabular data model and schema restrictions sometimes make relational databases an inefficient option for visualizing connected data, with the right data modeling it is possible. In fact, sometimes a tabular view may be the most efficient storage and query mechanism.
This page introduces the topic of relational databases, and how they can be used for network-graph visualization.
What is a relational database?
A relational database, as the name suggests, is one storing related data, represented in two-dimensional tables of columns and rows. Each data entity is represented with its attributes in the columns, and data instances in the rows:
|AB123123||123 Road Street||19UYA1234L000213||Coupe||2012|
|ZN 987987||42 Mill Crescent||1005||Estate||2007|
|JP 456456||12 Sample Street||19UYA1234L000213||Coupe||2012|
Creating a network with SQL Joins
These tables, known as relations, can rapidly grow to millions, or billions of rows. It’s possible to get data from multiple tables. For example, suppose the data shown above was stored in two separate tables:
|1||AB123123||123 Road Street||1001|
|2||ZN987987||42 Mill Crescent||1005|
|3||JP456456||12 Sample Street||1001|
As long as there is a column that appears in both tables (in this case, the vehicle ID), we can cross-reference using an SQL operator called “JOIN”. In the vehicle table, this column is known as the Primary Key; in the policy table it is the foreign key.
If the primary key is deleted or altered, the database may trigger an action on cascade to keep the consistency of the references.
Using the JOIN operator, we can start to build up a complex connected data structure, resembling a network (or graph) that KeyLines can work with. Joins, however, can be expensive in terms of performance.
Optimizing SQL joins for data visualization
Once we understand how a join works, it’s possible to fine-tune the logic to speed up system performance. For example:
FROM Policy JOIN Vehicle
ON Policy.Vehicle_ID = Vehicle.ID
The SQL statement above will search all the Vehicle IDs in the Policy table to find a match in the Vehicles table. If the table contains 1,000,000 rows, the user will have to wait for the system to finish scanning all of them.
Database administrators can overcome this by:
- Designing data schemas to reducing reliance on joins
- Using indexes to improve lookup speed
Taking these steps can help ensure database performance doesn’t become too much of a bottleneck.
Converting relational data structure to a network
How relational data is modeled into a graph depends entirely on the data being visualized and the questions that need to be answered.
Taking the above vehicle / policy example, there would be four generic steps required:
- Understand the data – know which entity attributes and record formats are present in the data, and how they relate to one another.
- Combine required data relations – where possible, move the data into a single table to reduce the reliance on joins.
- Identify key relationships – understand which relationships users need to understand to answer the questions they are asking.
- Model data around key relationships – For example, an insurance fraud analyst may need to identify vehicles with more than one insurance policy, in which case the Vehicle Identification Numbers and Policy Numbers would be mapped to nodes, with the remaining data entities included as properties:
When mapping tabular data to a connected graph format, it’s important to consider how to manage IDs – important because KeyLines will automatically merge new data into an existing node if the ID already exists on the chart.
For example, in the case above we could set the vehicle make as the node ID field, so all Nissan vehicles appear as a single node. This could be useful for understanding patterns on a brand level.
Alternatively, the car number plate could be set as the node ID. This way, each unique number plate would appear as a unique node, which is potentially more useful to an analyst. However, it would still be necessary to carefully plan this structure and consider potential anomalies – stolen, shared or transferred number plates, for example.
This blog post explains the basics of modeling tabular data as a graph.
To learn more about the KeyLines network visualization toolkit and how to visualize your complex connected data, download a copy of our KeyLines white paper.