In this article we explore a database technology that has some intriguing possibilities in the world of finance and banking. While the concept of graph databases has been around since the 1960s, the technology really began to hit its stride in the 2000s. By the end of this article, you’ll know what a graph database is, and we hope it sparks a discussion about applications where it could serve your business.
Relational Database Primer
If you work in finance, chances are that you’re working with data that’s stored in a relational database management system (RDBMS). Like an Excel spreadsheet, data is organized into rows and columns within unique tables, where each table only contains specific types of information, like a “transactions” table, a separate “accounts” table, a separate “positions” table, and so on.
Storing data like this is a logical way to stay organized, but it isn’t useful unless you join tables together. Normalized data uses “primary keys” to identify unique rows of information in a table, and these same keys appear as “foreign keys” in the other tables that share a relationship with it. For example, suppose there is a unique account called “Delv Global” in the Accounts table, which has an ID of 221. In the Transactions table, there will be a column called “AccountID”, and the number 221 will appear once for each Delv Global transaction.
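To make the key relationship concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table layouts, the second account, and the transaction amounts are assumptions made up for illustration; only the “Delv Global” account and its ID of 221 come from the example above.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id   INTEGER PRIMARY KEY,          -- primary key
        name TEXT NOT NULL
    );
    CREATE TABLE transactions (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),  -- foreign key
        amount     REAL NOT NULL
    );
    -- Illustrative rows only; the amounts are invented.
    INSERT INTO accounts VALUES (221, 'Delv Global'), (222, 'Other Co');
    INSERT INTO transactions VALUES
        (1, 221, 1000.0), (2, 221, -250.0), (3, 222, 75.0);
""")

# Join the two tables on the shared key to list Delv Global's transactions.
rows = conn.execute("""
    SELECT a.name, t.id, t.amount
    FROM transactions AS t
    JOIN accounts AS a ON a.id = t.account_id
    WHERE a.id = 221
    ORDER BY t.id
""").fetchall()

for name, txn_id, amount in rows:
    print(name, txn_id, amount)
```

The value 221 appears once in the accounts table and once per matching row in the transactions table, and the join is what reconnects the two.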
Of course, the real banking world you’re operating in has a schema with many, many tables. Depending on what you need to know, someone must write SQL statements that join those tables to extract the information. And for topics like liquidity reporting, GL reconciliations, and derivative P&L attribution, the answer probably requires heavy SQL processing that typically runs in batch.
When many tables have to be joined together, things can get out of control quickly, especially if there’s a need for recursive queries (joining a table to itself). Joins are computationally expensive. Indexes (which tell a database where to begin looking for information) help, but the RDBMS still has to look up or scan the matching rows in every table involved, and that work grows with the size of the tables and the number of joins. And when the business expands or changes, the schema often has to change too, along with all the processes that rely on it.
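As a rough illustration of the recursive case, the sketch below stores a small hierarchy in a single table that references itself, then walks it with a recursive common table expression in SQLite. The table name, column names, and entities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical self-referencing table: parent_id points back into entities.
    CREATE TABLE entities (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES entities(id)
    );
    INSERT INTO entities VALUES
        (1, 'Parent Corp',    NULL),
        (2, 'Division A',     1),
        (3, 'Division B',     1),
        (4, 'Subdivision A1', 2);
""")

# Each pass of the recursive step joins the table to the rows found so far.
descendants = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM entities WHERE id = 1
        UNION ALL
        SELECT e.id, e.name, tree.depth + 1
        FROM entities AS e
        JOIN tree ON e.parent_id = tree.id
    )
    SELECT name, depth FROM tree ORDER BY depth, name
""").fetchall()

for name, depth in descendants:
    print("  " * depth + name)
```

Every additional level in the hierarchy means another round of joining, which is where the cost starts to add up on large tables.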
Graph databases make these challenges a lot easier to manage.
Introducing the Graph Database
A graph database is a map of connections. Each entry in the map has three parts: a subject, a relationship (or “predicate”), and an object. Subjects and objects are the nouns, and an object can become the subject of other relationships. Nouns can have properties, like the adjectives that describe them. Consider: Bill is friends with Jane. “Bill” is the subject, “is friends with” is the relationship, and “Jane” is the object. Bill and Jane each have their own properties, such as a phone number and a date of birth. One may also have properties that the other doesn’t. For example, Bill has a pet. Jane does not. Therefore, even though Bill and Jane are both “people”, Jane doesn’t need a “pet” property with a zero stored in it.
Relationships can have properties too. In addition to friends, Bill also has relatives, neighbors, and coworkers. There’s no limit to the number of properties that a subject, object, or relationship can have. And most importantly, there’s no limit to the number of relationships that a subject or object can have.
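The subject, relationship, and object structure can be sketched in a few lines of plain Python. This is a toy model, not how any particular graph database stores data. The class names, the property values, and the third person (“Sam”) are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A noun in the graph, with whatever properties apply to it."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A relationship from a subject to an object, with its own properties."""
    subject: Node
    predicate: str
    obj: Node
    properties: dict = field(default_factory=dict)


# Hypothetical property values. Jane simply has no "pet" key at all.
bill = Node("Bill", {"phone": "555-0100", "pet": "dog"})
jane = Node("Jane", {"phone": "555-0101"})
sam = Node("Sam")  # hypothetical coworker, not in the original example

edges = [
    Edge(bill, "is friends with", jane, {"since": 2015}),
    Edge(bill, "is coworker of", sam),
]

# Everything Bill is connected to, regardless of relationship type.
bills_connections = [(e.predicate, e.obj.name) for e in edges if e.subject is bill]
print(bills_connections)
```

Adding a new kind of relationship or a new property means adding another edge or dictionary key, with no schema change.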
If you draw the relationship of Bill and Jane, you get something resembling a barbell, where Bill and Jane are the heavy ends, known as “nodes.” The line connecting them is the relationship, also known as the “edge.” By the time you’re finished drawing all the relationships, you end up with lots of nodes and connectors, like a social network graph. A relational database description of Bill and Jane’s friendship would include separate tables for people, pets, property types, and relationship types, each requiring unique IDs, with primary and foreign keys to connect them. The graph database just sticks to a straightforward subject, relationship, and object.
Connecting the Possibilities
In finance, this approach might come in very handy for modeling relationships in, say, master account management, where a parent counterparty has multiple divisions and subdivisions, each with its own unique accounts and subaccounts. A graph representation of this information means never having to join these data sets together, since the graph query language is specifically designed to traverse these relationships. The performance gain can be substantial: in a native graph database, each node points directly to its neighbors, so following a relationship does not require a table scan or an index lookup at every hop.
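The idea of traversing rather than joining can be sketched with a plain Python dictionary in which each node holds direct references to its outgoing relationships. This is only an illustration of the principle, not the storage format or query language of any graph product. The entity names, account numbers, and relationship labels are hypothetical.

```python
from collections import deque

# Hypothetical hierarchy: each node maps directly to its outgoing edges.
graph = {
    "Parent Corp": [("HAS_DIVISION", "Division A"), ("HAS_DIVISION", "Division B")],
    "Division A": [("HAS_SUBDIVISION", "Subdivision A1"), ("HAS_ACCOUNT", "ACC-100")],
    "Division B": [("HAS_ACCOUNT", "ACC-200")],
    "Subdivision A1": [("HAS_ACCOUNT", "ACC-101")],
    "ACC-100": [("HAS_SUBACCOUNT", "ACC-100-1")],
}


def accounts_under(root):
    """Breadth-first walk collecting every account or subaccount below root."""
    found = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Follow each outgoing edge directly; no table is searched.
        for relationship, target in graph.get(node, []):
            if target in seen:
                continue
            seen.add(target)
            if relationship in ("HAS_ACCOUNT", "HAS_SUBACCOUNT"):
                found.append(target)
            queue.append(target)
    return found


print(accounts_under("Parent Corp"))
```

The walk only ever touches nodes that are actually connected to the starting point, so its cost depends on the size of that neighborhood rather than on the total amount of data stored.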
Now imagine if the graph could be expanded to define the relationships between accounts, positions, transactions, products, and journal entries. Information that typically depends on overnight batch processes to produce could potentially be available T+0, updated in real time, regardless of any changes to object properties. Storing information as connections might also unveil some interesting relationships in your data, which has intriguing implications for compliance and AML. Whereas AI and Big Data are typically coupled to search for “signals in the noise,” “intelligence at the edge” is a fascinating prospect because the AI is examining connections between objects.
It might seem counterintuitive, but this type of analysis is extremely difficult to perform in a relational database. Think about it: Your brain is a graph of connected neurons. It uses just 20 watts of electricity, yet it can make associations and connections between objects faster than any machine.
“Business users want to connect the dots across ever-increasing available data sets,” said Vince Scafaria, CEO and founder of DotAlign, which builds productivity and relationship insights software. “The RDF data format underlying graphs is ideal for this.”
RDF, which stands for Resource Description Framework, is a model for data publishing and interchange on the web. “RDF is automatically connectable because it describes itself relative to other well-known concepts,” Vince explained. “The data travels with information expressing its equivalence to other well-known concepts, highlighting machine-discoverable relationships across data sets. Needless to say, we share our clients’ excitement about this technology.”
We’ve encountered instances where certain aspects of banking enterprise models could benefit from a complementary graph database representation of their data. This is certainly true for regulatory reporting, which generally requires combining product data from various upstream systems at the account level and making extensive use of “reference tables” to bridge the connections. The graph database technology that seems to have the most momentum is Neo4j. You can start experimenting with its Sandbox, a free interface to learn the technology, which includes sample datasets for popular use cases.