Event-Driven Knowledge Graphs


Event-driven architectures have been a thing for quite a while now – architectural patterns such as event sourcing and CQRS are starting to appear in a wide variety of sectors. Event sourcing provides a near-realtime method for distributing data and compute load. This video from Confluent provides an excellent explainer of the approach.

Knowledge graphs are appearing everywhere you look too. A knowledge graph is a data representation of key concepts and their relationships. They tackle problems around the discovery and understanding of highly connected information, and a whole genre of databases has developed to support various use cases – labelled property graphs (LPGs) and RDF triplestores. LPGs tend to be used for analytical tasks, usually around network analysis, and are optimised for navigating chains of links. Triplestores are focussed more on knowledge representation – and therefore tend to be the choice for knowledge graphs. The boundary is blurring, though, as hybrid technologies start to emerge.
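At its simplest, the triplestore view of a knowledge graph is just a set of (subject, predicate, object) statements. A minimal sketch in Python – the names here are illustrative, not from any real ontology:

```python
# A knowledge graph as a set of RDF-style triples:
# (subject, predicate, object). All names are made up for illustration.
triples = {
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "london"),
    ("alice", "knows", "bob"),
}

def objects(graph, subject, predicate):
    """Return every object linked from `subject` via `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

print(objects(triples, "alice", "knows"))  # {'bob'}
```

Navigating "chains of links" then becomes repeated application of this kind of lookup – the operation LPGs and triplestores each optimise in their own way.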

Knowledge graphs are particularly useful for dealing with high-value enterprise data – the kind of data you keep for a long time and whose accuracy and quality affect the operation of the business. Typical applications are master data management, forensics, healthcare records, biomedical data, financial research, intelligence systems and indexing of unstructured data (graphs of meta-data enabling document discovery). In other words: complex, high-value data. The problem in most large organisations that have been working with data for a long time is that the high-value data tends to be scattered (and often replicated) across multiple systems and organisational silos. It’s buried in heaps of low-value operational data or, worse still, slowly aging in a stagnant data lake. An event sourcing approach provides a way to extract key data from multiple systems in near real time. Once extracted, it can be resolved (de-duped) and linked. Hey presto, an event-driven knowledge graph.
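The "resolve and link" step is essentially entity resolution: merging records from different silos that describe the same real-world thing. A hedged sketch, assuming two source systems that happen to share a natural key (the field names and sample data are hypothetical – real entity resolution usually needs fuzzier matching than this):

```python
# Resolving (de-duping) records from two source systems by a shared
# key before linking them into the graph. Fields from later sources
# win on conflict and fill in gaps.
crm = [{"email": "a@example.com", "name": "A. Smith"}]
hr  = [{"email": "a@example.com", "name": "Alice Smith", "dept": "Ops"}]

def resolve(*sources, key="email"):
    """Merge records that share the same value for `key`."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())

resolved = resolve(crm, hr)
# one linked record combining both systems' fields
```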

We built the open-source CORE platform to enable just such an architecture. We use Apache Kafka, a web-scale, battle-hardened, open-source event log. It also has the advantage of having been widely adopted by the sysadmin community, so you tend to find a well-provisioned Kafka cluster in most enterprises these days, which makes integrating CORE a breeze. Any given Kafka instance can support a number of event logs (known as topics), and we leverage this to maximum effect. Each data source gets its own topic, and we use these to buffer incoming data events. Once on the log, our event-driven architecture triggers data manipulation events (cleansing, ETL, linking, etc.) and the data is promoted up to a master topic (the “knowledge” topic). Again, being event-driven, our graph database (Apache Jena) reacts to events on the knowledge topic and stores the data. After that, we expose the data through secure APIs and query endpoints.
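The topic-per-source, promote-to-knowledge pattern can be sketched without any Kafka machinery at all. In the snippet below, plain Python lists stand in for Kafka topics, and a callable stands in for a cleansing/ETL step; in CORE the real transport is Kafka and the transforms are platform components:

```python
# Minimal in-memory sketch of the topic-per-source pattern: events
# land on a per-source topic, a transform step promotes them to the
# master "knowledge" topic. Lists stand in for Kafka topics here.
from collections import defaultdict

topics = defaultdict(list)  # topic name -> ordered event log

def publish(topic, event):
    topics[topic].append(event)

def promote(source_topic, transform):
    """Apply a cleansing/ETL step and re-publish to 'knowledge'."""
    for event in topics[source_topic]:
        publish("knowledge", transform(event))

publish("source.crm", {"name": "  Alice  "})
promote("source.crm", lambda e: {"name": e["name"].strip()})
# topics["knowledge"] now holds the cleaned event
```

The key property this preserves is ordering: each topic is an append-only sequential log, so every downstream consumer sees the same events in the same order.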


The CORE platform standardises your data events in the RDF format ready for ingestion into the knowledge graph, and, if you have an ontology, it can run transformers to conform your data to that ontology. It turns out that having a standardised, sequential log of all your data events has a number of uses. For example, database replication (for load balancing) is a breeze when every replica can sync from the same event log. It also opens up opportunities for storage models other than graph – for example, in the enterprise version of CORE, we can sync the data events to a search index and a geospatial store. We’re also working on adding a vector store to support machine learning operations. We then provide a coherent API across the stores using GraphQL.
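Standardising a source record as RDF amounts to minting a subject URI and turning each field into a triple. A minimal sketch – the base URI, field names and mapping convention are all illustrative, not CORE's actual ontology:

```python
# Hedged sketch: flattening a source record into RDF-style triples.
# The base URI and predicate names are made up for illustration.
BASE = "http://example.org/"

def to_triples(record, subject_key="id"):
    """Mint a subject URI from `subject_key`; one triple per field."""
    subject = BASE + str(record[subject_key])
    return [
        (subject, BASE + field, value)
        for field, value in record.items()
        if field != subject_key
    ]

event = {"id": "p1", "name": "Alice", "role": "analyst"}
print(to_triples(event))
# [('http://example.org/p1', 'http://example.org/name', 'Alice'),
#  ('http://example.org/p1', 'http://example.org/role', 'analyst')]
```

An ontology-aware transformer would go further – mapping source field names onto the ontology's predicates and classes rather than echoing them verbatim – but the shape of the output is the same.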

Having all your key data in one place is great, but there are always compliance and security issues to consider. The systems that provide the data all have their own access control models, and it is rare to find a unified approach across the systems in most organisations. To ensure the data is handled the way it was intended, we label all incoming data with access control meta-data. These labels stay with the data as it goes through processing, and CORE’s APIs and databases redact data based on authenticated user credentials. This allows us to deal with the very complex security and privacy scenarios you find in government, healthcare and finance.
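In outline this is label-based redaction: each datum carries an access label, and the API filters results against the caller's clearances. A sketch under stated assumptions – the label names and flat-list data shape are hypothetical, and real policies are typically richer than simple set membership:

```python
# Sketch of label-based redaction: every item carries an access-control
# label; the API returns only what the caller is cleared to see.
# Labels and data are illustrative.
data = [
    {"value": "public report",  "label": "public"},
    {"value": "patient record", "label": "medical"},
]

def redact(items, clearances):
    """Return only items whose label is in the caller's clearances."""
    return [item["value"] for item in items if item["label"] in clearances]

print(redact(data, {"public"}))             # ['public report']
print(redact(data, {"public", "medical"}))  # both items
```

Because the label travels with the datum from ingest onwards, the same check can be applied consistently at every API and storage layer rather than re-implemented per source system.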

So, in summary, we’ve built a secure, event-driven data platform for standardising and linking your most important data. Get in touch to find out more.