Getting Started with the Graph Query Language (GQL)

History of database query languages

Before electronic databases, data management was manual. Records were maintained in physical forms such as ledgers, filing cabinets, and card catalogs. Until the mid-20th century, this was the primary approach to data management. This method, while systematic, was labor-intensive and prone to human errors.

Discussions on database technology often begin with the 1950s and 1960s, particularly with the introduction of magnetic tapes and disks. These developments paved the way for navigational data models and, eventually, relational models. While these discussions are valuable, they sometimes overlook deeper historical perspectives.

Before magnetic tapes, punched cards were widely used, particularly for the 1890 U.S. Census. The company behind these tabulating systems later evolved into IBM Corporation, one of the first major technological conglomerates. I vividly recall my father attending college courses on modern computing, where key experiments involved operating IBM punch-card computers—decades before personal computers emerged in the 1980s.

Examining punched card systems reveals a connection to the operation of looms, one of humanity’s earliest sophisticated machines. Looms, which possibly originated in China and spread globally, have been found in various forms, including in remote African and South American villages.

Across the 3,000 to 5,000 years of recorded history, there have been many inventions for aiding memory, sending messages, scheduling, and recording data, ranging from tally sticks to the quipu (khipu). While tally sticks were once thought to be a European invention, Marco Polo, after his extensive travels in China, reported that they were widely used there to track daily transactions.

On the other hand, when quipu was first discovered by Spanish colonists, it was believed to be an Inca invention. However, if the colonists had paid more attention to the pronunciation of khipu, they would have noticed that it means recording book in ancient Chinese. This suggests that quipu was a popular method for recording data and information long before written languages were developed.

Why focus on these pre-database inventions? Understanding these historical innovations through a graph-thinking lens helps illustrate how interconnected these concepts are and underscores the importance of recognizing these connections. Embracing this perspective allows us to better understand and master modern technologies, such as graph databases and graph query languages.

Early computer-based data management

The advent of electronic computers marked the beginning of computerized data storage. World Wars I and II drove major advancements in computing technology, notably the German Enigma machine and the Polish and Allied forces' deciphering of its encrypted messages, which contained top-secret information from Nazi Germany. When mechanical machines proved inadequate for the required computing power—such as in brute-force decryption—electronic and far more powerful alternatives were invented. Consequently, the earliest electronic computers were developed before the end of World War II.

Early computers such as the ENIAC (1946) and UNIVAC (1951) were used for calculations and data processing. The Bureau of the Census and military and defense departments quickly adopted them to optimize troop deployment and arrange the most cost-effective logistics. These efforts laid the foundation for modern global supply chains, network analytics, and social behavior network studies.

The concept of systematic data management, or databases, became feasible with the rapid advancement of electronic computers and storage media, such as magnetic disks. Initially, most of these computers operated in isolation; computer networking emerged more than a century after telecommunication networks did.

The development of database technologies is centered around how data modeling is conducted, and the general perception is that there have been three phases so far:

  • Phase 1: Navigational data modeling
  • Phase 2: Relational (or SQL) data modeling
  • Phase 3: Not-only-SQL (or post-relational, or GQL) data modeling

Let’s briefly examine these three development phases so that we have a clear understanding of why GQL, and the graph-based way of data modeling and processing, was invented.

Navigational data modeling

Before navigational data modeling (or navigational databases), data on punched cards or magnetic tapes could only be accessed sequentially, which was highly unproductive. To improve speed, systems introduced references, similar to pointers, that allowed users to navigate data more efficiently. This led to the development of two data navigation models:

  • Hierarchical model (or tree-like model)
  • Network model

The hierarchical model was first developed by IBM in the 1960s on top of their mainframe computers, while the network model, though conceptually more comprehensive, was never widely adopted beyond the mainframe era. Both models were quickly displaced by the relational model in the 1970s.

One key reason for this shift was that navigational database programming is intrinsically procedural, focusing on instructing the computer systems with steps on how to access the desired data record. This approach had two major drawbacks:

  • Strong data dependency
  • Low usability due to programming complexity

Relational data modeling

The relational model, unlike the navigational model, is intrinsically declarative: the user tells the system what data to retrieve rather than how to retrieve it, which brings better data independence and program usability.
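As a minimal illustration of this declarative style (the table and column names are only illustrative), a SQL query states which rows are wanted and leaves the access strategy entirely to the database engine:

```sql
-- Declarative: describe the desired result, not the steps to reach it.
-- The query optimizer decides whether to scan the table, use an index, and so on.
SELECT FirstName, LastName
FROM   Employees
WHERE  Department = 'Marketing';
```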

Another key reason for the shift from navigational databases/models was their limited search capabilities, as data records were stored using linked lists. This limitation led Edgar F. Codd, while working at IBM’s San Jose Research Laboratory in California, to invent tables as a replacement for linked lists. His groundbreaking work culminated in the highly influential 1970 paper titled A Relational Model of Data for Large Shared Data Banks. This seminal paper inspired a host of relational databases, including IBM’s System R (1974), UC Berkeley’s INGRES (1974, which spawned several well-known products such as PostgreSQL, Sybase, and Microsoft SQL Server), and Larry Ellison’s Oracle (1977).

Not-only-SQL data modeling

Today, there are approximately 500 known and active database management systems (DBMS) worldwide (as shown in Figure 1.1). While over one-third are relational DBMS, the past two decades have seen a rise in hundreds of non-relational (NoSQL) databases. This growth is driven by increasing data volumes, which have given rise to many big data processing frameworks that utilize both data modeling and processing techniques beyond the relational model. Additionally, evolving business demands have led to more sophisticated architectural designs, requiring more streamlined data processing.

The entry of major players into the database market has further propelled this transformation, with large technology companies spearheading the development of new database systems tailored to handle diverse and increasingly complex data structures. These companies have helped define and redefine database paradigms, providing a foundation for a variety of solutions in different industries.

As the landscape has continued to evolve, OpenAI, among other cutting-edge companies, has contributed to this revolution with diverse database systems to optimize data processing in machine learning models. In OpenAI’s system architecture, a variety of databases (both commercial and open source) are used, including PostgreSQL (RDBMS), Redis (key-value), Elasticsearch (full-text), MongoDB (document), and possibly Rockset (built on the popular key-value library RocksDB and well suited to real-time data analytics). This heterogeneous approach is typical in large-scale, especially highly distributed, data processing environments. Often, multiple types of databases are leveraged to meet diverse data processing needs, reflecting the difficulty—if not impossibility—of a single database type performing all functions optimally.

Figure 1.1: Changes in database popularity per category (August 2024, DB-Engines)

Despite the wide range of database genres, large language models still struggle with questions requiring “deep knowledge.” Figure 1.2 illustrates how a large language model encounters challenges with queries necessitating extensive traversal.

Figure 1.2: Hallucination with LLM

The question in Figure 1.2 involves finding causal paths (simply the shortest path) between different entities. While large language models are trained on extensive datasets, including Wikipedia, they may struggle to calculate and retrieve hidden paths between entities if they are not directly connected.

Figure 1.3 demonstrates how Wikipedia articles—represented as nodes (titles or hyperlinks) with their relationships as predicates—can be ingested into the Ultipa graph database. A real-time, six-hop-deep shortest path query then yields causal paths that are self-explanatory:

  1. Genghis Khan launched the Mongol invasions of West Asia and Europe.
  2. These invasions triggered the spread of the Black Death.
  3. The last major outbreak of the Black Death was the Great Plague of London.
  4. Isaac Newton fled the plague while attending Trinity College.
Figure 1.3: The shortest paths between entities using a graph database

The key takeaway from this section is the importance of looking beyond the surface when addressing complex issues or scenarios. The ability to connect the dots and delve deeper into the underlying details allows for identifying root causes, which in turn fosters better decision-making and a more comprehensive understanding of the world’s intricate dynamics.

The rise of SQL

The introduction of the relational model revolutionized database management by offering a more structured and flexible way to organize and retrieve data. With the relational model as its foundation, SQL emerged as the standard query language, enabling users to interact with relational databases in a more efficient and intuitive manner. This section will explore how SQL’s development, built upon the relational model, became central to modern database systems and continues to influence their evolution today.

The rise of the relational model

Edgar F. Codd’s 1970 paper, A Relational Model of Data for Large Shared Data Banks (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) laid the foundation for the relational model. Codd proposed a table-based structure for organizing data, introducing the key concepts of relations (tables), columns (attributes), rows (tuples), primary keys, and foreign keys. When compared to the navigational model (hierarchical model), such a structure provided a more intuitive and flexible way to handle data.

While we describe the relational model as more intuitive and flexible, that judgment belongs to the data processing scenarios of the 1970s to 1990s. Things have changed gradually but constantly, and the relational model has faced growing challenges and criticism with the rise of NoSQL and, eventually, the standardization of GQL. We will expand on the limitations of SQL and the promise of GQL in the last section of this chapter.

The key concepts of the relational model are rather simple: just five components, namely table, schema, key, relationship, and transaction. Let’s break them down one by one.

Table

A table, or relation, in the relational model is a structured collection of related data organized into rows and columns. Each table is defined by its schema, which specifies the structure and constraints of the table. Here’s a breakdown of its components:

  • Table Name: Each table has a unique name that describes its purpose. For instance, a table named Employees would likely store employee-related information.
  • Columns (Attributes): Each table consists of a set of columns, also known as attributes or fields. Columns represent the specific characteristics or properties of the entities described by the table. Each column has a name and a data type, which defines the kind of data it can hold. For example, in an Employees table, columns might include EmployeeID, FirstName, LastName, HireDate, and Department. The data type for each column could be integer, varchar (variable character), date, etc.
  • Rows (Tuples): Rows, or tuples, represent individual records within a table. Each row contains a set of values corresponding to the columns defined in the table. For example, a row in the Employees table might include 101, John, Doe, 2023-06-15, Marketing. Each row is a unique instance of the data described by the table.
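As a minimal sketch of the ideas above (the DDL is illustrative rather than taken from the book, and exact data types vary slightly across RDBMSs), the Employees table and one of its rows could be expressed in SQL as follows:

```sql
-- Illustrative definition of the Employees table described above.
CREATE TABLE Employees (
    EmployeeID INT,          -- integer attribute
    FirstName  VARCHAR(50),  -- variable-character attributes
    LastName   VARCHAR(50),
    HireDate   DATE,
    Department VARCHAR(50)
);

-- One row (tuple) matching the sample values in the text.
INSERT INTO Employees (EmployeeID, FirstName, LastName, HireDate, Department)
VALUES (101, 'John', 'Doe', DATE '2023-06-15', 'Marketing');
```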

Schema

Another key concept tied to tables (sometimes even tied to the entire RDBMS) is the schema. The schema of a table is a blueprint that outlines the table’s structure. It includes the following:

  • Column Definitions: For each column, the schema specifies its name, data type, and any constraints. Constraints might include NOT NULL (indicating that a column cannot have null values), UNIQUE (ensuring all values in a column are distinct), or DEFAULT (providing a default value if none is specified).
  • Primary Key: A primary key is a column or a set of columns that uniquely identifies each row in the table. It ensures that no two rows can have the same value for the primary key columns. This uniqueness constraint is crucial for maintaining data integrity and enabling efficient data retrieval. For example, EmployeeID in the Employees table could serve as the primary key.
  • Foreign Keys: Foreign keys are columns that create relationships between tables. They refer to the primary key of another table, establishing a link between the two tables. This mechanism supports referential integrity, ensuring that relationships between data in different tables are consistent.

Here we need to talk about normalization, which is a process applied to table design to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, related tables and defining relationships between them. The goal is to minimize duplicate data and ensure that each piece of information is stored in only one place.

For example, rather than storing employee department information repeatedly in the Employees table, a separate Departments table can be created, with a foreign key in the Employees table linking to it.
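A minimal sketch of that normalization step, using illustrative names and the constraints discussed above (NOT NULL, UNIQUE, primary and foreign keys), might look like this; it is a normalized variant of the earlier Employees sketch, not the book’s own schema:

```sql
-- Department details are stored exactly once, in their own table.
CREATE TABLE Departments (
    DepartmentID   INT PRIMARY KEY,
    DepartmentName VARCHAR(50) NOT NULL UNIQUE
);

-- Employees reference their department instead of repeating its details.
CREATE TABLE Employees (
    EmployeeID   INT PRIMARY KEY,
    FirstName    VARCHAR(50) NOT NULL,
    LastName     VARCHAR(50) NOT NULL,
    HireDate     DATE,
    DepartmentID INT REFERENCES Departments (DepartmentID)  -- foreign key
);
```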

The normalization concept sounds wonderful, but only on the surface. In many large data warehouses, it has led to an enormous proliferation of tables. What was once seen as intuitive and flexible in the relational model can become a huge limitation and burden from a data governance perspective.

Relationships

Entity Relationship (ER) modeling is a foundational technique for designing databases using the relational model. Developed by Peter Chen in 1976, ER modeling provides a graphical framework for representing data and its relationships. It is crucial for understanding and organizing the data within a relational database. The keyword of ER modeling is graphical. The core concepts include entities, relationships, and attributes:

  • Entities: In ER modeling, an entity represents a distinct object or concept within the database. For example, in a university database, entities might include Student, Course, and Professor. Each entity is represented as a table in the relational model.
  • Attributes: Attributes describe the properties of entities. For instance, the Student entity might have attributes such as Student_ID, Name, Date_Of_Birth, and Major. Attributes become columns within the corresponding table.
  • Relationships: Relationships in ER modeling illustrate how entities are associated with one another. Relationships represent the connections between entities and are essential for understanding how data is interrelated. For example, a Student might be enrolled in a Course, creating a relationship between these two entities.

The caveat about relationships is that there are many types of relationships:

  • One-to-One: In this type of relationship, each instance of entity A is associated with exactly one instance of entity B, and vice versa. For example, each Student might have one Student_ID, and each Student_ID corresponds to exactly one student.
  • One-to-Many: This relationship type occurs when a single instance of entity A is associated with multiple instances of entity B, but each instance of entity B is associated with only one instance of entity A. For example, a Professor might teach multiple Courses, but each Course is taught by only one Professor. If we pause here, we can immediately sense the problem with enforcing such a rigid relationship: if a course is to be taught by two or three professors (a rare scenario, but it does happen), the schema and table design would need to change. And the more exceptions you can think of, the more redesigns you would experience.
  • Many-to-Many: This relationship occurs when multiple instances of entity A can be associated with multiple instances of entity B. For example, a Student can enroll in multiple Courses, and each Course can have multiple Students enrolled. To model many-to-many relationships, a junction table (or associative entity) is used, which holds foreign keys referencing both entities, as sketched after this list.
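A minimal sketch of such a junction table, using hypothetical Students, Courses, and Enrollments tables, could look like this:

```sql
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    Name      VARCHAR(100) NOT NULL
);

CREATE TABLE Courses (
    CourseID INT PRIMARY KEY,
    Title    VARCHAR(100) NOT NULL
);

-- Junction (associative) table: each row links one student to one course,
-- so many students can enroll in many courses and vice versa.
CREATE TABLE Enrollments (
    StudentID INT REFERENCES Students (StudentID),
    CourseID  INT REFERENCES Courses (CourseID),
    PRIMARY KEY (StudentID, CourseID)  -- also prevents duplicate enrollments
);
```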

ER diagrams offer a clear and structured way to represent entities, their attributes, and the relationships between them:

  • Entities are represented by rectangles
  • Attributes are shown as ovals, each connected to its corresponding entity
  • Relationships are illustrated as diamonds, linking the relevant entities

This visual framework provides a comprehensive way to design database schemas and better understand how different data elements interact within a system.

The ER diagram is essentially the graph data model we will be discussing throughout this book. The only difference between SQL and GQL in terms of ER diagrams is that GQL and graph databases natively organize and represent entities and their relationships, while SQL and RDBMSs use ER diagrams only as metadata, with the real data records stored in lower-dimensional tables. It is tempting to believe that the prevalence of the relational model reflects the limited computing power available at the time it was invented; exponentially greater computing power would eventually demand something more advanced, and more intuitive and flexible as well.

Transactions

Transactions are a crucial aspect of relational databases, ensuring that operations are performed reliably and consistently. To better understand how these principles work in practice, let’s explore ACID properties.

The ACID properties – Atomicity, Consistency, Isolation, and Durability – define the key attributes of a transaction. Let’s explore them in detail:

  • Atomicity: Atomicity ensures that a transaction is treated as a single, indivisible unit of work. This is crucial for maintaining data integrity, especially in scenarios where multiple operations are performed as part of a single transaction. This means that either all operations within the transaction are completed successfully, or none are applied. If any operation fails, the entire transaction is rolled back, leaving the database in its previous state. It prevents partial updates that could lead to inconsistent data states.
  • Consistency: Consistency ensures that a transaction takes the database from one valid state to another valid state, preserving the integrity constraints defined in the schema. All business rules, data constraints, and relationships must be maintained throughout the transaction. Consistency guarantees that database rules are enforced and that the database remains in a valid state before and after the transaction.
  • Isolation: Isolation ensures that the operations of a transaction are isolated from other concurrent transactions. Even if multiple transactions are executed simultaneously, each transaction operates as if it were the only one interacting with the database. Isolation prevents interference between transactions, avoiding issues such as dirty reads, non-repeatable reads, and phantom reads. It ensures that each transaction’s operations are independent and not affected by others.
  • Durability: Durability guarantees that once a transaction is committed, its changes are permanent and persist even in the event of a system failure or crash. The committed data is stored in non-volatile memory, ensuring its longevity. Durability ensures that completed transactions are preserved and that changes are not lost due to unforeseen failures. This property provides reliability and trustworthiness in the database system.

These attributes are best illustrated by linking them to a real-world system and application ecosystem. Consider a financial institution’s transaction processing system in which a transaction transfers funds from one account to another: the transaction must ensure that both the debit and credit operations complete successfully (atomicity), the account balances remain consistent (consistency), other transactions do not see intermediate states (isolation), and the changes persist even if the system fails (durability). These properties are essential for the accuracy and reliability of financial transactions.
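A minimal sketch of such a transfer in SQL (the Accounts table, its columns, and the amounts are illustrative; a real system would also check for sufficient funds and handle errors):

```sql
START TRANSACTION;  -- both updates succeed together or not at all (atomicity)

UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';  -- debit
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B';  -- credit

COMMIT;      -- make both changes permanent (durability)
-- ROLLBACK; -- on any failure, undo everything instead
```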

The ACID properties were introduced in 1976 by Jim Gray and laid the foundation for reliable database transaction management. These properties were gradually incorporated into SQL beginning with the SQL-86 standard and have since remained integral to relational database systems. For nearly fifty years, the principles of ACID have been continuously adopted and refined by most relational database vendors, ensuring robust transaction management and data integrity. When comparing relational database management systems (RDBMS) with NoSQL and graph databases, the needs and implementation priorities of ACID properties vary, influencing how these systems handle transaction management and consistency.

Modern RDBMS include robust transaction management mechanisms to handle ACID properties. These systems use techniques such as logging, locking, and recovery to ensure transactions are executed correctly and data integrity is maintained. Managing concurrent transactions is essential for ensuring isolation and consistency. Techniques such as locking (both exclusive and shared) and multi-version concurrency control (MVCC) are used to handle concurrent access to data and prevent conflicts.
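As a brief, hedged illustration of those mechanisms (the syntax below is valid in PostgreSQL and several other systems, but it is not universal), a transaction can request a stricter isolation level and take an explicit row lock:

```sql
START TRANSACTION;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;  -- stricter than the usual READ COMMITTED default

-- Explicit exclusive row lock: concurrent writers on this row wait until we finish.
SELECT Balance FROM Accounts WHERE AccountID = 'A' FOR UPDATE;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';

COMMIT;
```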

Evolution of NoSQL and new query paradigms

Today, big data is ubiquitous, influencing nearly every industry across the globe. As data grows in complexity and scale, traditional relational databases show limitations in addressing these new challenges. Unlike the structured, table-based model of relational databases, the real world is rich, high-dimensional, and interconnected, requiring new approaches to data management. The evolution of big data and NoSQL technologies demonstrates how traditional models struggled to meet the needs of complex, multi-faceted datasets. In this context, graph databases have emerged as a powerful and flexible solution, capable of modeling and querying intricate relationships in ways that were previously difficult to achieve. As industries continue to generate and rely on interconnected data, graph databases are positioning themselves as a transformative force, offering significant advantages in managing and leveraging complex data relationships.

The emergence of NoSQL and big data

The advent of big data marked a significant turning point in data management and analytics. While we often date the onset of the big data era to around 2012, the groundwork for this revolution was laid much earlier. A key milestone was the release of Hadoop by Yahoo! in 2006, which was subsequently donated to the Apache Foundation. Hadoop’s design was heavily inspired by Google’s seminal papers on the Google File System (GFS) and MapReduce.

GFS, introduced in 2003, and MapReduce, which followed in 2004, provided a new way of handling vast amounts of data across distributed systems. These innovations stemmed from the need to process and analyze the enormous data generated by Google’s search engine. At the core of Google’s search engine technology was PageRank, a graph algorithm for ranking web pages based on their link structures, intentionally named as a pun on co-founder Larry Page’s surname. This historical context illustrates that big data technologies have deep roots in graph theory, evolving towards increasingly sophisticated and large-scale systems.

Figure 1.4: From data to big data to fast data and deep data

Examining the trajectory of data processing technologies over the past 50 years reveals a clear evolution through distinct stages:

  1. The Era of Relational Databases (1970s-present): This era is defined by the dominance of relational databases, which organize data into structured tables and use SQL for data manipulation and retrieval.
  2. The Era of Non-Relational Databases and Big Data Frameworks (2000s-present): The rise of NoSQL databases and big data frameworks marked a departure from traditional relational models. These technologies address the limitations of relational databases in handling unstructured data and massive data volumes.
  3. The Post-Relational Database Era (2020s and beyond): Emerging technologies signal a shift towards post-relational databases, including NewSQL and Graph Query Language (GQL). These advancements seek to overcome the constraints of previous models and offer enhanced capabilities for managing complex, interconnected data.

Each of these stages has been accompanied by the development of corresponding query languages:

  1. Relational Database—SQL: First standardized in 1986, SQL became the cornerstone of relational databases, providing a powerful and versatile language for managing structured data.
  2. Non-Relational Database—NoSQL: The NoSQL movement introduced alternative models for data storage and retrieval, focusing on scalability and flexibility. NoSQL databases extend beyond SQL’s capabilities but lack formal standardization.
  3. Post-Relational Database—NewSQL and GQL: NewSQL brings SQL-like functionality to scalable, distributed systems, while GQL, with the first edition released in 2024, is designed to address the needs of graph databases.

These stages reflect an evolution in data characteristics and processing capabilities:

  1. Relational Database—Data, Pre-Big Data Era: Focused on managing structured data with well-defined schemas.
  2. Non-Relational Database—Big Data, Fast Data Era: Emphasized handling large volumes of diverse and rapidly changing data, addressing the 4Vs—volume, variety, velocity, and veracity.
  3. Post-Relational Database—Deep Data, or Graph Data Era: Represented a shift towards understanding and leveraging complex relationships within data, enhancing depth and analysis beyond the 4Vs.

The 4Vs – volume, variety, velocity, and veracity – capture the essence of big data:

  • Volume: The sheer amount of data
  • Variety: The different types and sources of data
  • Velocity: The speed at which data is generated and processed
  • Veracity: The reliability and accuracy of data

As data complexity grows, an additional dimension, depth, becomes crucial. This deep data perspective focuses on uncovering hidden relationships and extracting maximum value from interconnected data.

Understanding deep relationships among data is essential for various business and technological challenges:

  • Business Dimension: Value is embedded within networks of relationships, making it critical to analyze these connections to derive actionable insights.
  • Technology Dimension: Traditional databases struggle with network-based value extraction due to their tabular structure, which limits their ability to quickly identify deep associations between entities.

From 2004 to 2006, as Yahoo! developed Hadoop, other data processing projects emerged from different teams within the company. Yahoo!’s vast server clusters had to process massive volumes of web logs, which posed significant processing challenges. Hadoop was designed to utilize low-cost, low-configuration machines in a distributed manner, but it suffered from inefficiencies in data processing speed and analytical depth. Although it excelled at handling large volumes and a wide variety of data types, these limitations contributed to its donation to the Apache Foundation.

The introduction of Apache Spark in 2014 brought a major shift. Developed at the University of California, Berkeley, Spark addressed many of Hadoop’s performance issues. Its in-memory processing capabilities allowed it to process data up to 100x faster than Hadoop. With components such as GraphX, Spark enabled graph analytics, including algorithms such as PageRank and connected components. However, Spark’s focus remained on batch processing rather than real-time, dynamic data processing, leaving gaps in real-time, deep data analysis.

The ability to process deep data involves extracting insights from multi-layered, multi-dimensional data quickly. Graph databases and GQL, the focus of this book, are designed to address these challenges. By applying graph theory principles, they enable advanced network analysis, offering unique advantages over traditional NoSQL databases and big data frameworks. Their ability to perform real-time, dynamic analysis of complex data relationships makes them well-suited to the evolving demands of data management and analysis.

This book will guide readers through the historical development, current state, and future trends of graph databases, emphasizing their relevance to market needs and technological implementation.
