Graph Databases in Action: Examples in Gremlin
By Josh Perryman and Dave Bechberger
About this ebook
Summary
Relationships in data often look far more like a web than an orderly set of rows and columns. Graph databases shine when it comes to revealing valuable insights within complex, interconnected data such as demographics, financial records, or computer networks. In Graph Databases in Action, experts Dave Bechberger and Josh Perryman illuminate the design and implementation of graph databases in real-world applications. You'll learn how to choose the right database solutions for your tasks, and how to use your new knowledge to build agile, flexible, and high-performing graph-powered applications!
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Isolated data is a thing of the past! Now, data is connected, and graph databases—like Amazon Neptune, Microsoft Cosmos DB, and Neo4j—are the essential tools of this new reality. Graph databases represent relationships naturally, speeding the discovery of insights and driving business value.
About the book
Graph Databases in Action introduces you to graph database concepts by comparing them with relational database constructs. You'll learn just enough theory to get started, then progress to hands-on development. Discover use cases involving social networking, recommendation engines, and personalization.
What's inside
Graph databases vs. relational databases
Systematic graph data modeling
Querying and navigating a graph
Graph patterns
Pitfalls and antipatterns
About the reader
For software developers. No experience with graph databases required.
About the authors
Dave Bechberger and Josh Perryman have decades of experience building complex data-driven systems and have worked with graph databases since 2014.
Table of Contents
PART 1 - GETTING STARTED WITH GRAPH DATABASES
1 Introduction to graphs
2 Graph data modeling
3 Running basic and recursive traversals
4 Pathfinding traversals and mutating graphs
5 Formatting results
6 Developing an application
PART 2 - BUILDING ON GRAPH DATABASES
7 Advanced data modeling techniques
8 Building traversals using known walks
9 Working with subgraphs
PART 3 - MOVING BEYOND THE BASICS
10 Performance, pitfalls, and anti-patterns
11 What's next: Graph analytics, machine learning, and resources
Josh Perryman
Josh Perryman is a technologist with over two decades of diverse experience building and maintaining complex systems, including high performance computing (HPC) environments. Since 2014 he has focused on graph databases, especially in distributed or big data environments, and he regularly blogs and speaks at conferences about graph databases.
Graph Databases in Action
Examples in Gremlin
Dave Bechberger and Josh Perryman
Foreword by Ted Wilmes
To comment go to liveBook
Manning
Shelter Island
For more information on this and other Manning titles go to
manning.com
Copyright
For online information and ordering of these and other Manning books, please visit manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]
©2020 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
ISBN: 9781617296376
contents
foreword
preface
acknowledgments
about this book
about the authors
about the cover illustration
Part 1. Getting started with graph databases
1 Introduction to graphs
1.1 What is a graph?
What is a graph database?
Comparison with other types of databases
Why can’t I use SQL?
1.2 Is my problem a graph problem?
Explore the questions
I’m still confused. . . . Is this a graph problem?
2 Graph data modeling
2.1 The data modeling process
Data modeling terms
Four-step process for data modeling
2.2 Understand the problem
Domain and scope questions
Business entity questions
Functionality questions
2.3 Developing the whiteboard model
Identifying and grouping entities
Identifying relationships between entities
2.4 Constructing the logical data model
Translating entities to vertices
Translating relationships to edges
Finding and assigning properties
2.5 Checking our model
3 Running basic and recursive traversals
3.1 Setting up your environment
Starting the Gremlin Server
Starting the Gremlin Console, connecting to the Gremlin Server, and loading the data
3.2 Traversing a graph
Using a logical data model (schema) to plan traversals
Planning the steps through the graph data
Fundamental concepts of traversing a graph
Writing traversals in Gremlin
Retrieving properties with values steps
3.3 Recursive traversals
Using recursive logic
Writing recursive traversals in Gremlin
4 Pathfinding traversals and mutating graphs
4.1 Mutating a graph
Creating vertices and edges
Removing data from our graph
Updating a graph
Extending our graph
4.2 Paths
Cycles in graphs
Finding the simple path
4.3 Traversing and filtering edges
Introducing the E and V steps for traversing edges
Filtering with edge properties
Include edges in path results
Performant edge counts and denormalization
5 Formatting results
5.1 Review of values steps
5.2 Constructing our result payload
Applying aliases in Gremlin
Projecting results instead of aliasing
5.3 Organizing our results
Ordering results returned from a graph traversal
Grouping results returned from a graph traversal
Limiting results
5.4 Combining steps into complex traversals
6 Developing an application
6.1 Starting the project
Selecting our tools
Setting up the project
Obtaining a driver
Preparing the database server instance
6.2 Connecting to our database
Building the cluster configuration
Setting up the GraphTraversalSource
6.3 Retrieving data
Retrieving a vertex
Using Gremlin language variants (GLVs)
Adding terminal steps
Creating the Java method in our application
6.4 Adding, modifying, and deleting data
Adding vertices
Adding edges
Updating properties
Deleting elements
6.5 Translating our list and path traversals
Getting a list of results
Implementing recursive traversals
Implementing paths
Part 2. Building on Graph Databases
7 Advanced data modeling techniques
7.1 Reviewing our current data models
7.2 Extending our logical data model
7.3 Translating entities to vertices
Using generic labels
Denormalizing graph data
Translating relationships to edges
Finding and assigning properties
Moving properties to edges
Checking our model
7.4 Extending our data model for personalization
7.5 Comparing the results
8 Building traversals using known walks
8.1 Preparing to develop our traversals
Identifying the required elements
Selecting a starting place
Setting up test data
8.2 Writing our first traversal
Designing our traversal
Developing the traversal code
8.3 Pagination and graph databases
8.4 Recommending the highest-rated restaurants
Designing our traversal
Developing the traversal code
8.5 Writing the last recommendation engine traversal
Designing our traversal
Adding this traversal to our application
9 Working with subgraphs
9.1 Working with subgraphs
Extracting a subgraph
Traversing a subgraph
9.2 Building a subgraph for personalization
9.3 Building the traversal
Reversing the traversing direction
Evaluating the individualized results of the subgraph
9.4 Implementing a subgraph with a remote connection
Connecting with TinkerPop’s Client class
Adding this traversal to our application
Part 3. Moving Beyond the Basics
10 Performance, pitfalls, and anti-patterns
10.1 Slow-performing traversals
Explaining our traversal
Profiling our traversal
Indexes
10.2 Dealing with supernodes
It’s about instance data
It’s about the database
What makes a supernode?
Monitoring for supernodes
What to do if you have a supernode
10.3 Application anti-patterns
Using graphs for non-graph use cases
Dirty data
Lack of adequate testing
10.4 Traversal anti-patterns
Not using parameterized traversals
Using unlabeled filtering steps
11 What’s next: Graph analytics, machine learning, and resources
11.1 Graph analytics
Pathfinding
Centrality
Community detection
Graphs and machine learning
Additional resources
11.2 Final thoughts
appendix. Apache TinkerPop installation and overview
index
front matter
foreword
At the dawn of a new decade, developers are confronted with a myriad of database options when beginning a new project. The stalwart relational database still rules the roost, maintaining popularity in both legacy and greenfield projects. This is for good reason; flexibility and forty-plus years of cumulative engineering history are hard to argue with. Despite the success of relational databases, the last decade saw an explosion of new commercial and open-source database systems that were designed around alternative models and query languages. Some tackle traditional RDBMS workloads with a new twist, perhaps focusing on horizontal scale-out or on high performance via the embrace of in-memory optimizations that have become available due to decreases in RAM prices. Many other systems diverged from the relational model altogether. Out of this set, we find a variety of focus areas and modeling paradigms. This book focuses on one of the more expressive and powerful developments, the graph model, and the property graph in particular.
Graph databases aren’t a new thing. Hierarchical and navigational databases have existed since the 60s, but these have recently experienced an increase in developer popularity. I think this is largely due to the intuitiveness of the property graph data model. People are already wired to think in graphs. If you draw a graph on a whiteboard, technical and non-technical folks get it. Consequently, after you overlay the graph model onto your software tasks at hand, everything starts to look like a graph problem.
With all that said, we’re still dealing with technology, and the available property graph databases are newer technology at that, so there isn’t any magic. This is where Dave and Josh come in. I can’t imagine a better pair to help lay out the signposts and guide you on the journey to graph understanding. Both are accomplished graph architects and developers who have been involved in this junior space since before its recent uptick in popularity. Having worked in graph-based product development and consulting, they’ve racked up years of real-world experience.
This experience has influenced their pragmatic approach to the problems of graph application development, and though both are proponents of graphs, they’re proponents with a healthy dose of skepticism and are not overly fascinated with the technology. After all, as mentioned, one of the first and most important questions new developers have is, “Is this a graph problem?”
As you make your way through this book, you’ll hone an intuition for translating real-world problems into graph data models and build up your chops in Gremlin, a popular and powerful property graph query language. The rubber meets the road in chapter 6, where you use this knowledge to build your first graph application. By the time you’ve finished, you’ll have the knowledge to evaluate whether a graph database is a good fit for your next project and, if so, to execute on that vision, having already built an example graph database application.
Ted Wilmes
Data Architect & JanusGraph Technical Steering Committee Member
Expero Inc.
preface
Two complementary trends started in the mid to late 2000s. First, companies began using and collecting more data on their customers, competition, and users than ever before. Second, the information companies wanted from this data became more complex, often containing hidden connections. These two trends drove the need for an easier exploration of expansive, yet highly connected data. Graph databases met that need.
Both authors have gotten an up-close and personal view of this market as the technology, usage, and adoption of graph technology have matured. We both started using graph databases in the mid-2010s while working for a niche software consulting company. Independently, we each worked on projects that used graph databases to solve specific types of complex data problems. At that time, graph databases were new and very rough. Despite the challenges of working with new technologies, we both recognized the power of this tool and were hooked.
Since then, we have spent countless hours banging our heads against a proverbial wall to understand all the intricacies and nuances of building graph-backed applications. This book is the distillation of those countless hours of struggle. It is our hope that the hands-on nature of this book will provide a solid, foundational understanding of the skills needed to build graph-backed applications and, in the process, help you to avoid some of the pitfalls that we encountered.
acknowledgments
This book has been a labor of love, and sometimes frustration, so we first and foremost need to thank our wives (Melody and Meredith), and then acknowledge family and friends for their endless patience and for indulging us as we shared our latest esoteric discoveries while working with graph databases. Without their support we never could have made it through the countless hours it took to create this book.
A big thank you goes out to Dr. Denise Gosnell, Kelly Mondor, Ted Wilmes, and Daniel Farrell for all the specific insights, interviews, and support you provided, which helped us immensely in creating this book.
We would also like to thank the team at Manning Publications for allowing us the time and opportunity to publish this book. We would like to thank the entire Manning staff and specifically our publishers Marjan Bace and Michael Stephens, as well as our editors Frances Lefkowitz, Nick Watts, Alex Ott, Lori Weidert, and Frances Buran for all the amazing feedback and endless patience you have shown. Our appreciation also goes out to all the reviewers whose comments and reviews were invaluable in solidifying the organization and in clarifying the focus of this book: Scott Bartram, Andrew Blair, Alain Couniot, Douglas Duncan, Mike Erickson, John Guthrie, Mike Haller, Milorad Imbra, Ramaninder Singh Jhajj, Mike Jensen, Nicholas Robert Keers, Mladen Knežić, Miguel Montalvo, Luis Moux, Nick Rakochy, Ron Sher, Deshuang Tang, Richard Vaughan, and Matthew Welke.
We would also like to thank the team at Expero Inc., without whom Josh and Dave would never have met, nor would have ever started their exploration of graph databases. Our many years of working side by side with the exceptionally talented Experonauts were a fruitful starting point that eventually led to writing this book.
about this book
This book is written for anyone building applications using graph databases. It is designed to provide a foundational understanding of graphs and graph databases, as well as to provide a framework for building applications using common graph database patterns. To teach this framework, this book follows the development lifecycle of a fictitious application called DiningByFriends. We use this application throughout the book to provide a realistic grounding of graph principles and examples of the concepts and content we teach. In many areas throughout this book, we compare and contrast the differences between building a graph-backed application and using the more traditional relational database model. By the end of this book, you will not only have the skills needed to build your own graph-backed application, but you will have built your first application, DiningByFriends.
Who should read this book
This book is for application developers, data engineers, and database developers who want to use graph databases as the backing data store for their applications. Throughout this book, we do not expect the reader to have any prior experience using graph databases, but you should be familiar with data modeling concepts, specifically with relational database development, as these are used heavily throughout as a common point of reference. Although all the application code is written in Java, any developer with object-oriented application development experience should be able to follow along with the concepts and content.
How this book is organized: A roadmap
This book is organized into 3 parts comprising 11 chapters. In part 1, “Getting started with graph databases,” we establish the foundation for our DiningByFriends application:
Chapter 1 begins with an introduction to graphs and graph terminology. We discuss how graph databases differ from relational databases and how you can use graph databases to solve highly connected data problems. We finish this chapter by discussing what makes a problem a good candidate for using a graph database.
Chapter 2 is where we hit the ground running by building an initial data model for our DiningByFriends application. We start with the types of information needed to begin the data modeling process. We then show how to turn this information into a conceptual data model. Finally, we walk through a framework for taking our business needs and our conceptual data model and turning them into our initial data model using the elements of a graph database: vertices, edges, and properties.
Chapter 3 begins a set of three chapters focused on learning the process of querying a graph database, known as traversing. We begin by teaching you how to retrieve and filter data from our graph. We follow this with learning how to navigate the structure of our graph and how that differs from working with a relational database. Then we finish up this chapter by demonstrating the ease with which you can recursively traverse through a graph to retrieve complex, interconnected data.
Chapter 4 continues our exploration of graph traversals with data mutation use cases. We then show how you can traverse the graph to find the entities and relationships that connect two items, known as the path. Finally, we look at how to leverage properties on relationships to filter the traversals and increase their performance.
Chapter 5 finishes our initial focus on graph traversals with a discussion of ways to format the results of our traversal into a desired output. Additionally, you learn how to perform common operations such as sorting, filtering, and limiting the results returned.
Chapter 6 begins the process of building our DiningByFriends application by taking the traversals we developed in chapters 3, 4, and 5 and walking through incorporating these into a Java application. Then we’ll process the results to complete this first part.
In part 2, “Building an application with graph databases,” we extend the concepts introduced in part 1:
Chapter 7 uses the foundations of data modeling from chapter 2, as well as what you learned about traversing a graph, to extend the data model for more complex use cases, such as recommendation engines and personalization.
Chapter 8 leverages a recommendation engine use case to demonstrate the power of using a known-walk pattern to create a robust recommendation application pattern.
Chapter 9 uses our personalization use case to demonstrate how to use a subgraph access pattern within a graph-backed application.
In part 3, “Beyond the basics,” we move past the DiningByFriends application to discuss our next steps in the application development process:
Chapter 10 discusses how to debug and troubleshoot common performance problems with traversals. We then investigate exactly what supernodes are and why they cause issues in graph-backed applications. We follow up these common performance problems with common application and traversal pitfalls and anti-patterns, as well as how to recognize and avoid them.
Chapter 11 takes a forward-looking view and discusses some of the next steps you might want to take with your graph-backed application. We also discuss some of the most common graph analytics algorithms and how you can apply these to solve a specific problem. Finally, we wrap up this chapter with a brief overview of how to leverage graphs in machine learning (ML) applications.
About the code
This book contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.
In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page size in the book. In rare cases, even this was not enough and code listings include line-continuation markers (➥). Additionally, code annotations accompany many of the listings, highlighting important concepts.
The code for the examples in this book is available for download from the Manning website at https://www.manning.com/books/graph-databases-in-action, and from GitHub at https://github.com/bechbd/graph-databases-in-action.
About the technologies
Our goal throughout this book is to equip the reader with the conceptual knowledge needed to build graph-backed applications. However, in order to provide practical examples of these concepts, we had to make decisions regarding the technologies used for demonstration.
Our first decision was to pick the type of database. We decided to use a labeled property graph database, instead of, for example, an RDF store or triplestore database. Labeled property graph databases are the most common type we have seen in production use and seem to be the ones with the most momentum behind them. Additionally, these are the closest to the familiar concepts of relational databases, so labeled property graph databases are quite effective for comparisons.
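To make the labeled property graph idea a little more concrete, here is a minimal plain-Java sketch of the three building blocks: vertices, edges, and properties. It is only an illustration under stated assumptions: the class name PropertyGraphSketch, its methods, and the sample people and restaurant are all hypothetical and invented for this sketch. None of it is TinkerPop API or code from the book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy in-memory labeled property graph. Every name here is hypothetical;
// none of this is TinkerPop or any graph database's API.
public class PropertyGraphSketch {

    // A vertex carries a label (loosely comparable to a table name)
    // and key-value properties (loosely comparable to column values).
    static final class Vertex {
        final String label;
        final Map<String, Object> properties;

        Vertex(String label, Map<String, Object> properties) {
            this.label = label;
            this.properties = properties;
        }
    }

    // An edge is a labeled, directed connection that can hold its own properties.
    static final class Edge {
        final String label;
        final Vertex from;
        final Vertex to;
        final Map<String, Object> properties;

        Edge(String label, Vertex from, Vertex to, Map<String, Object> properties) {
            this.label = label;
            this.from = from;
            this.to = to;
            this.properties = properties;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    Vertex addVertex(String label, Map<String, Object> properties) {
        Vertex v = new Vertex(label, properties);
        vertices.add(v);
        return v;
    }

    Edge addEdge(String label, Vertex from, Vertex to, Map<String, Object> properties) {
        Edge e = new Edge(label, from, to, properties);
        edges.add(e);
        return e;
    }

    // Sample data invented for illustration: two people and one restaurant.
    static PropertyGraphSketch sample() {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex ann = g.addVertex("person", Map.of("first_name", "Ann"));
        Vertex bob = g.addVertex("person", Map.of("first_name", "Bob"));
        Vertex cafe = g.addVertex("restaurant", Map.of("name", "Corner Cafe"));
        g.addEdge("friends", ann, bob, Map.of());
        g.addEdge("reviewed", bob, cafe, Map.of("rating", 5));
        return g;
    }

    public static void main(String[] args) {
        // Print each relationship as: fromLabel -[edgeLabel properties]-> toLabel
        for (Edge e : sample().edges) {
            System.out.println(e.from.label + " -[" + e.label + " " + e.properties + "]-> " + e.to.label);
        }
    }
}
```

Note that the relationship itself (the edge) is a first-class element with its own label and properties, rather than a foreign key or join table as it would typically be in a relational model.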
This led us to our next decision: the traversal language to use, openCypher or Gremlin.
While there’s a strong case for using openCypher, the goal of this book is to remain as vendor-agnostic as possible. It is important to us that these concepts and techniques are easily transferable to many popular databases when you start to build your applications. In the end, we decided to use the Apache TinkerPop version 3.4.x framework because it currently has the most database vendors with compatible implementations.
We have been questioned multiple times during the proposal and review processes as to why we chose this stack over a Neo4j/Cypher stack. Given the popularity of the Neo4j ecosystem this is a fair question which deserves fuller comment. There are three reasons we chose TinkerPop’s Gremlin for the illustrations throughout this book:
Gremlin is a better tool for teaching how a traversal works.
Gremlin is a common language of choice for enterprise applications.
Gremlin is the most portable language between property graph databases.
As for the first reason, we believe that the imperative design of Gremlin provides a better teaching tool for learning how a graph traversal works compared to the declarative approach of Cypher/openCypher. The syntax of Gremlin requires that we think about how we are moving through our graph in order to determine where we will move next. While we do appreciate the simplicity of Cypher/openCypher, it can also obfuscate critical technical matters, especially when dealing with issues of performance or scale. So while Cypher/openCypher is a great starting point for learning how to work with connected data, we feel that Gremlin is better suited for building high performing, scalable data applications.
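The “thinking in steps” idea behind the first reason can be sketched in plain Java. The example below is a toy under stated assumptions: the class StepwiseTraversalSketch, its has, out, and values helper methods, and the sample friends data are hypothetical stand-ins written for this illustration. They only mimic the shape of a step chain and are not the TinkerPop implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of moving through a graph one step at a time.
// Nothing here is TinkerPop API; all class, method, and data names are hypothetical.
public class StepwiseTraversalSketch {

    static final class Vertex {
        final String label;
        final Map<String, String> properties;

        Vertex(String label, Map<String, String> properties) {
            this.label = label;
            this.properties = properties;
        }
    }

    static final class Edge {
        final String label;
        final Vertex from;
        final Vertex to;

        Edge(String label, Vertex from, Vertex to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    // Step 1: choose where to start, filtering by label and property value.
    List<Vertex> has(String label, String key, String value) {
        List<Vertex> found = new ArrayList<>();
        for (Vertex v : vertices) {
            if (v.label.equals(label) && value.equals(v.properties.get(key))) {
                found.add(v);
            }
        }
        return found;
    }

    // Step 2: from the current vertices, follow outgoing edges with the given label.
    List<Vertex> out(List<Vertex> current, String edgeLabel) {
        List<Vertex> next = new ArrayList<>();
        for (Vertex v : current) {
            for (Edge e : edges) {
                if (e.from == v && e.label.equals(edgeLabel)) {
                    next.add(e.to);
                }
            }
        }
        return next;
    }

    // Step 3: read one property from each vertex we ended on.
    List<String> values(List<Vertex> current, String key) {
        List<String> result = new ArrayList<>();
        for (Vertex v : current) {
            String value = v.properties.get(key);
            if (value != null) {
                result.add(value);
            }
        }
        return result;
    }

    Vertex person(String firstName) {
        Vertex v = new Vertex("person", Map.of("first_name", firstName));
        vertices.add(v);
        return v;
    }

    // A Gremlin traversal of comparable shape would read roughly:
    //   g.V().has('person', 'first_name', firstName).out('friends').values('first_name')
    // while a declarative Cypher query would describe the pattern instead:
    //   MATCH (p:person {first_name: $firstName})-[:friends]->(f) RETURN f.first_name
    static List<String> friendsOf(String firstName) {
        StepwiseTraversalSketch g = new StepwiseTraversalSketch();
        Vertex ann = g.person("Ann");
        Vertex bob = g.person("Bob");
        Vertex cara = g.person("Cara");
        Vertex dan = g.person("Dan");
        g.edges.add(new Edge("friends", ann, bob));
        g.edges.add(new Edge("friends", ann, cara));
        g.edges.add(new Edge("friends", bob, dan));

        List<Vertex> start = g.has("person", "first_name", firstName);
        List<Vertex> friends = g.out(start, "friends");
        return g.values(friends, "first_name");
    }

    public static void main(String[] args) {
        System.out.println(friendsOf("Ann"));
    }
}
```

Each helper takes the current position in the graph and returns the next one, so the order of the calls mirrors the order in which the graph is walked. That explicit sequencing is the property the paragraph above attributes to Gremlin's imperative style.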
As for the second reason, many enterprise applications were built using TinkerPop-enabled databases, which makes Gremlin the query language of choice for those applications. Some organizations have both Cypher/openCypher and Gremlin applications. But in our experience, the bigger, more complex enterprise-level projects seem to have chosen one of the many TinkerPop-enabled databases or cloud services.
As for our third reason, at this time, it is easy to say that Gremlin is the most widely available query language across graph database engines. Nearly all of the major cloud vendors (Amazon Web Services, Microsoft Azure, IBM, Huawei, and so forth) offer graph databases or services compatible with Gremlin. The lone exception is the Google Cloud Platform, which offers Neo4j as a service.
Our goal is not to advocate for one database or language over another. We seek to provide you with a solid foundation for how to use a graph database when building applications with highly connected data and to illustrate how graph databases work under the covers. We think that Gremlin provides the best path to accomplish this.
With the decision to use TinkerPop’s Gremlin made, we had to pick a specific TinkerPop-enabled database to use. In the spirit of remaining vendor agnostic, we’ve decided to use TinkerGraph for the examples. TinkerGraph is the graph implementation used in the Gremlin Server and Gremlin Console, the reference software provided as part of the Apache Software Foundation’s TinkerPop project.
Finally, we had to decide on an application programming language to build our example application, DiningByFriends. As Java is the most common language we have used with graph databases, we chose that as our application language. We should note that it is possible to build the same application with other languages such as C#, JavaScript and Python. Not only is it possible, we have done so ourselves. But all the traversals provided in this book are written in Gremlin and any application code is written in Java.
While almost all the concepts presented throughout this book are not specific to TinkerPop-enabled databases, there are a few we discuss that are unique to TinkerPop. When this is the case, we'll note where a TinkerPop-specific feature is used so that you’re aware that a particular feature might not be available in your graph database of choice. If no such note is given, it is safe to assume that the concept we discuss is applicable to other labeled property graph databases as well.
liveBook discussion forum
Purchase of Graph Databases in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum, go to https://livebook.manning.com/#!/book/graph-databases-in-action/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
about the authors
Dave Bechberger is a data architect and developer with over two decades of experience. He uses his extensive knowledge of graph and other big data technologies to build highly performant and scalable data platforms in complex data domains such as bioinformatics, oil and gas, and supply chain management. Since the mid-2010s, Dave has worked with graph databases as a consultant, consumer, and vendor. He is an active member of the graph community and has presented on a wide range of graph-related topics at national and international conferences.
Josh Perryman also has over two decades of experience building and maintaining complex systems. Since 2014, he has focused on graph databases, especially in distributed or big data environments, and he regularly blogs and speaks at conferences about graph databases. Josh has worked with a variety of industries, including enterprise software, financial services, consumer products, and government intelligence agencies. In addition to consulting and product work, he has designed Gremlin training courses that have been delivered all over the world.
about the cover illustration
The figure on the cover of Graph Databases in Action is captioned “Femme de la Forêt Noire,” or a woman from the Black Forest, in Southwest Germany. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757-1810), titled Costumes civils actuels de tous les peuples connus, published in France in 1788. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.
The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life--certainly for a more varied and fast-paced technological life.
At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.
Part 1. Getting started with graph databases
Journeys into new technologies take work, and in this book, our journey will extend your current knowledge of building relational database applications to demonstrate how you can solve complex data problems by building graph databases and graph-backed applications. In this first part, we ease into your journey by establishing concepts, terms, and processes, while highlighting the critical differences required when approaching a problem with a graph mindset.
Chapter 1 introduces the core concepts of graphs and discusses the types of problems that are well suited for these models. In chapter 2, we establish a data modeling methodology and build a simple data model for a social network that we’ll use in our example application, DiningByFriends. The next three chapters introduce the most common operations that you’ll use to find and manipulate data in graph databases. We approach these operations in three stages, starting with the basics of moving around a graph in chapter 3. Chapter 4 then covers how to perform basic CRUD (Create/Read/Update/Delete) operations before extending the work we did in chapter 3 to perform more complex recursive and pathfinding traversals. In chapter 5, we close our introduction by using simple graph operations to examine ways to organize your results. Chapter 6 completes this part by synthesizing the work from chapters 2 through 5 into our working Java application, DiningByFriends.
1 Introduction to graphs
This chapter covers
An introduction to graphs and graph terminology
How graph databases help solve highly connected data problems
The advantages of graph databases over relational databases
Identifying problems that make good candidates for using a graph database
Modern applications are built on data--data that is ever increasing in both size and complexity. Even as the complexity of our data grows, so do our expectations of what insight our applications can derive from that data. If you are old enough, you likely remember when applications took a long time to load data and had limited features. Today’s reality is different; applications provide powerful, flexible, and immediate insight into data. But for every 100 questions modern applications answer, the most common data tool they use (namely, a relational database) handles only about 88 of those questions well. That leaves 12 types of questions where relational databases struggle. These remaining questions deal with the links and connections within the data, those aspects of the data that can generate powerful and unique insights. This puts us at a crossroads: we can use the relational database hammer to pound away at those questions and make it work well enough, or we can take a step back and look at what other tools can answer these questions better, faster, and with less effort.
By reading this book, you decided to take a step back from your relational database hammer and investigate a road less traveled: graph databases. This book is written for developers, engineers, and architects who are interested in other ways to solve problems specific to working with highly connected data. We assume you are already familiar with relational databases but are interested in learning when, where, and how graph databases are a better tool.
Our goal with this book is to equip you with the techniques needed to add graph databases as another tool in your toolbelt. We like to think of this book as the guide that we wish we had when we started building graph-backed applications. Throughout this book, we’ll demonstrate common graph patterns that highlight how graph databases enable navigation and exploration of data in ways not easily accomplished with a traditional relational database.
Our primary approach is through an example of building a fictitious restaurant review and recommendation application we call DiningByFriends. As we move through the software development life cycle from planning, to analysis, to design, and on to implementation, this application demonstrates how to think about and work with graph data. Each chapter builds on the previous chapter, and by the end of this book, we’ll have created a functioning application on a graph database. We believe that putting the concepts immediately to work by solving a realistic set of problems, even if they are somewhat simplistic, is the best way to get comfortable using a new technology. Let’s begin our journey with an introduction to what graphs and graph databases are and how they compare with traditional tools such as relational databases.
1.1 What is a graph?
When you look at a road map, examine an organizational chart, or use social networks such as Facebook, LinkedIn, or Twitter, you use a graph. Graphs are a nearly ubiquitous way to think about real-world scenarios because they abstract out the items and the relationships being represented. This abstraction allows for quick and efficient processing of the connections within the data.
Let’s demonstrate with a common task: going to the supermarket. Take out a piece of paper and draw out a plan for getting from your house to your supermarket. Chances are it looks something like figure 1.1.
Figure 1.1 A graph representing directions to the supermarket
Figure 1.1 shows a graph where the key items and relationships are represented by abstractions. First, we abstracted key locations, like intersections, and represented these as circles. We then designated the connections between these key intersections as lines, showing how the key intersections are related. This is just one example of how we naturally represent real-world problems as graphs.
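To make the abstraction concrete, here is a minimal sketch in plain Java of how a supermarket plan like this one might be held in memory: each circle becomes a key in an adjacency list, and each line becomes a pair of entries connecting two keys. The location names, the SupermarketGraph class, and its methods are hypothetical and are not taken from the figure or from any graph database API. The route method uses a simple breadth-first search, which is just one way to process the connections.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: location names are made up, not read from figure 1.1.
class SupermarketGraph {
    // Adjacency list: each location (circle) maps to the locations it connects to (lines).
    private final Map<String, Set<String>> adjacency = new LinkedHashMap<>();

    // Adds an undirected connection (a stretch of road) between two locations.
    void connect(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    // Breadth-first search: returns a route with the fewest connections,
    // or an empty list when the goal cannot be reached.
    List<String> route(String start, String goal) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, null); // the start has no predecessor
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk the predecessor links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String at = goal; at != null; at = cameFrom.get(at)) {
                    path.add(at);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : adjacency.getOrDefault(current, Set.of())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    // Builds a small hypothetical neighborhood.
    static SupermarketGraph sample() {
        SupermarketGraph g = new SupermarketGraph();
        g.connect("House", "Oak St & 1st Ave");
        g.connect("Oak St & 1st Ave", "Oak St & 2nd Ave");
        g.connect("Oak St & 1st Ave", "Elm St & 1st Ave");
        g.connect("Oak St & 2nd Ave", "Supermarket");
        g.connect("Elm St & 1st Ave", "Park");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(sample().route("House", "Supermarket"));
    }
}
```

The sketch only covers the circles-and-lines idea. A graph database adds storage, a query language such as Gremlin, and much more on top of the same basic structure.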
It is human nature to abstract real-world entities and their relationships, and the mathematical name for this abstract construct is a graph. When thinking about a set of data that contains a vast array of highly interconnected items, we might also describe this data set as a web of interconnected things, which is just another way of saying a graph.
On maps, cities are frequently represented by circles, and the roads that connect these are represented by lines. On an organizational chart (org chart), a circle usually represents a person, normally with an associated title, and lines that connect these