These materials are © 2021 John Wiley & Sons, Inc.

Any dissemination, distribution, or unauthorized use is strictly prohibited.

Graph Data Science (GDS)
Neo4j Special Edition

by Amy Hodler and Mark Needham

Graph Data Science (GDS) For Dummies®, Neo4j Special Edition

Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2021 by John Wiley & Sons, Inc.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the
prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com,
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. Neo4j and the Neo4j logo are registered trademarks of
Neo4j. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc.,
is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO
REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF
THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING
WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY
MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE
AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS
WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN
RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL
ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE
SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING
HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK
AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN
THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION
OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS
SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR
DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.

For general information on our other products and services, or how to create a custom For Dummies
book for your business or organization, please contact our Business Development Department in
the U.S. at 877-409-4177, contact [email protected], or visit www.wiley.com/go/custompub. For
information about licensing the For Dummies brand for products or services, contact Branded
Rights&[email protected].
ISBN: 978-1-119-74604-1 (pbk); ISBN: 978-1-119-74605-8 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

Publisher’s Acknowledgments

Some of the people who helped bring this book to market include the
following:
Project Manager: Carrie Burchfield-Leighton
Sr. Managing Editor: Rev Mengle
Acquisitions Editor: Ashley Coffey
Production Editor: Siddique Shaik
Business Development Representative: Molly Daugherty

Table of Contents
INTRODUCTION
About This Book
Icons Used in This Book
Beyond the Book

CHAPTER 1: Understanding Graphs and Graph Data Science
Explaining What a Graph Is
Defining Graph Analytics and Graph Data Science
Looking at the Types of Questions for GDS

CHAPTER 2: Using Graph Data Science in the Real World
Looking at Graphs in Healthcare
Discovering more efficient drugs
Improving the patient journey
Recommendations and Personalized Marketing
Fraud Detection

CHAPTER 3: Evolving Your Application of GDS Technology
Knowledge Graphs
Graph Analytics
Graph Feature Engineering
Graph Embedding
Graph Networks

CHAPTER 4: Using Neo4j as a Graph Data Science Platform
Neo4j GDS Library
Neo4j Graph Database Management System
Neo4j Desktop and Browser
Neo4j Bloom

CHAPTER 5: Detecting Fraud with Graph Data Science
Finding a Good Fraud Dataset
Removing Outliers
Finding Suspicious Clusters
Visually Exploring a Suspicious Cluster
Predicting Fraudsters Using Graph Features

CHAPTER 6: Ten Tips with Resources for Successful Graph Data Science

APPENDIX

Introduction
Connectivity is the single most pervasive characteristic of
today’s networks and systems. From protein interactions to
social networks, from communication systems to power
grids, and from retail experiences to supply chains, networks with
even a modest degree of complexity aren’t random, which means
connections are neither evenly distributed nor static. Simple
statistical analysis alone fails to sufficiently describe, let alone
predict, behaviors within connected systems.

As the world becomes increasingly interconnected and systems
increasingly complex, using technologies built to leverage rela-
tionships and their dynamic characteristics is imperative. Not
surprisingly, interest in graph data science (GDS) and graph ana-
lytics has exploded because they were explicitly developed to gain
insights from connected data. GDS and graph analytics reveal the
workings of intricate systems and networks at massive scale.

About This Book

We are passionate about the utility and importance of GDS and
graph analytics, so we wrote this book to help organizations bet-
ter leverage graphs so they can make new discoveries and develop
intelligent solutions faster.

In this book, we focus on the commercial applications of graph
analysis and graph-enhanced machine learning (ML), which
takes the form of GDS. We also use the Neo4j graph technology
to illustrate a GDS platform. You take a quick look at GDS and its
uses before covering the journey of GDS adoption. You also review
Neo4j technology as a GDS platform and walk through a fraud
detection example.

Icons Used in This Book
The following icons are used in this book:

Information here can be filed away for later use.

This information may not be critical to most people, but if you like
the extra techie tidbits, you’ll enjoy the insight here. Otherwise,
just skip over it!

Are you interested in saving time or effort on your projects? Check
out these tips to help you do just that.

Beyond the Book

This book is focused on GDS and relies on graph theory, graph
analytics, and graph databases. If you want resources beyond what
we can offer you in this short book, we recommend the following:

»» neo4j.com/graph-algorithms-book: For hands-on graph
algorithm examples, this book provides usable code and
explanations for getting started.
»» neo4j.com/graph-databases-book: Additional detail on
the Neo4j graph database and its property graph model can
be found here.
»» neo4j.com/graph-databases-for-dummies: If you’re new
to graph databases, this book is a great place to start your
journey because it assumes no previous experience and
walks you through modeling, querying, and importing graph
data, all the way through your first production system.

IN THIS CHAPTER
»» Defining a graph

»» Understanding graph analytics and GDS

»» Using questions to explore GDS

Chapter 1
Understanding Graphs
and Graph Data Science

Graph approaches to data are exploding in the commercial
world to better reveal meaning in data as well as forecast
behavior of complex systems. This burst is due to the
increasing connectedness of data, breakthroughs in scaling graph
technology to enterprise-sized problems, excellent results when
integrated with machine learning (ML) and artificial intelligence
(AI) solutions, and more accessible tools for general analytics and
data science teams.

In this chapter, you discover how we define a graph and the rela-
tionship of graphs to analytics and data science. You also get a
foundation in how graphs are used to answer tough questions
about complex systems.

Explaining What a Graph Is

Networks are a representation, a tool to understand complex sys-
tems and the complex connections inherent in today’s data. For
example, you can represent how a social system works by thinking
about interactions between pairs of people. By analyzing the
structure of this representation, you can answer questions and
make predictions about how the system works or how individ-
uals behave within it. In this sense, network science is a set of
technical tools applicable to nearly any domain, and graphs are
the mathematical models used to perform analysis. Simply put,
graphs are a mathematical representation of complex systems.

Graphs have a history dating back to 1736. The origins of graph
theory hail from the city of Königsberg, which included two large
islands connected to each other and the two mainland portions of
the city by seven bridges. The puzzle was to create a walk through
the city, crossing each bridge once and only once. Leonhard Euler
solved that puzzle by asking whether it was possible to visit all
four areas of a city connected by seven bridges, while only crossing
each bridge once. It wasn’t.

With the insight that only the connections themselves were
relevant to solving this kind of problem, Euler established the
groundwork for graph theory and its mathematics. Using one of
Euler’s original sketches, Figure 1-1 depicts Euler’s progression:

»» Walking the bridges of Königsberg: Four main areas of
Königsberg with seven bridges. Can you cross each bridge
only once and return to your starting point?
»» Euler’s insight: The only relevant data is the main areas and
the bridges connecting them.
»» Origins of graph theory: Euler abstracted the problem and
created generalized rules based on nodes and relationships
that apply to any connected system.

FIGURE 1-1: The origins of graph theory.
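
Euler's abstraction is small enough to check in a few lines of code. The following is a minimal Python sketch, not part of any graph platform: the area names (island_1, island_2, north, south) and the function names are our own illustrative labels. It relies on the standard result that a connected graph has a walk that uses every relationship exactly once only when zero or two nodes have an odd number of relationships, and a round trip only when none do.

```python
from collections import Counter

# The seven bridges as relationships between the four land areas.
# Area names are illustrative labels, not historical place names.
BRIDGES = [
    ("island_1", "north"), ("island_1", "north"),
    ("island_1", "south"), ("island_1", "south"),
    ("island_1", "island_2"),
    ("island_2", "north"),
    ("island_2", "south"),
]


def node_degrees(relationships):
    """Count how many relationships touch each node."""
    degrees = Counter()
    for start, end in relationships:
        degrees[start] += 1
        degrees[end] += 1
    return degrees


def count_odd_nodes(relationships):
    """Number of nodes with an odd number of relationships."""
    return sum(1 for d in node_degrees(relationships).values() if d % 2 == 1)


def has_walk_crossing_each_once(relationships):
    """True if some walk uses every relationship exactly once.

    Assumes the graph is connected, as Königsberg is.
    """
    return count_odd_nodes(relationships) in (0, 2)


def has_round_trip_crossing_each_once(relationships):
    """True if such a walk can also end where it started (connected graph assumed)."""
    return count_odd_nodes(relationships) == 0


print(dict(node_degrees(BRIDGES)))
print(has_walk_crossing_each_once(BRIDGES))
```

All four areas touch an odd number of bridges, so the check fails, which matches Euler's answer. Notice that the code never looks at distances or geography; only the nodes and relationships matter.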

While graphs originated in mathematics, they are also a prag-
matic and faithful representation of data for modeling and anal-
ysis. A graph is a representation of a network, often illustrated
with circles to represent entities, also called nodes or vertices, and
lines between them. Those lines are known as relationships, links,
or edges. Think of nodes as the nouns in sentences, and relation-
ships as verbs that give context to the nodes. To avoid any con-
fusion, the graphs we talk about in this book have nothing to do
with graphing equations or charts. Take a look at the differences
in Figure 1-2.

FIGURE 1-2: A graph is a representation of a network.

The bottom graph on the left in Figure 1-2 is a person graph.
When looking at that graph, you can construct several sentences
to describe it. For example, person A lives with person B who owns
a car, and person A drives a car that person B owns. This model-
ing approach maps easily to the real world and is whiteboard-
friendly, which helps align data modeling and analysis.
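
To make the nouns-and-verbs idea concrete, here is a minimal Python sketch of the person graph described above. The node identifiers, relationship type names (LIVES_WITH, OWNS, DRIVES), and the describe function are assumptions we chose for illustration; they aren't taken from the figure or from any graph database.

```python
# Nodes are the nouns; each has an identifier and a label.
nodes = {
    "person A": {"label": "Person"},
    "person B": {"label": "Person"},
    "car": {"label": "Car"},
}

# Relationships are the verbs: (start node, relationship type, end node).
relationships = [
    ("person A", "LIVES_WITH", "person B"),
    ("person B", "OWNS", "car"),
    ("person A", "DRIVES", "car"),
]


def describe(relationships):
    """Turn each relationship into a short sentence."""
    sentences = []
    for start, rel_type, end in relationships:
        verb = rel_type.replace("_", " ").lower()
        sentences.append(f"{start} {verb} {end}")
    return sentences


for sentence in describe(relationships):
    print(sentence)
```

Each tuple reads as a sentence, which is why a sketch on a whiteboard translates so directly into a data model.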

We often use the phrase “whiteboard-friendly” for anything
that’s easy to describe with simple drawings that you could illus-
trate on a whiteboard.

Defining Graph Analytics and Graph
Data Science
Modeling graphs is only half of the story. You may also want to
analyze them to reveal insight that isn’t immediately obvious. So
in this section, we explain the domain of graph data science (GDS)
and graph analytics.

GDS is a science-driven approach to gain knowledge from the
relationships and structures in data, typically to power predic-
tions. It uses multi-disciplinary workflows that may include que-
ries, statistics, algorithms, and ML.

GDS can typically be broken down into three areas:

»» Graph statistics provides basic measures about a graph,
such as the number of nodes and distribution of relation-
ships. These insights may influence how you configure and
execute more complex analysis as well as interpret results.
»» Graph analytics builds on graph statistics by answering
specific questions and gaining insights from connections in
existing or historical data. Graph queries and algorithms are
typically applied together in “recipes” during graph analytics,
and the results are used directly for analysis.
»» Graph-enhanced ML and AI is the application of graph data
and analytics results to train ML models or support probabi-
listic decisions within an AI system.

Graph statistics and analytics are often used in conjunction to
answer certain types of questions about complex systems, and the
subsequent insights are then applied to improve ML.
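
As a rough illustration of the graph statistics described above, the following Python sketch counts nodes and relationships and summarizes how relationships are distributed across nodes. The function name and the tiny sample graph are hypothetical, and the sketch only sees nodes that appear in at least one relationship.

```python
from collections import Counter


def graph_statistics(relationships):
    """Basic measures for an undirected graph given as (start, end) pairs.

    Nodes without any relationship are not visible to this sketch.
    """
    degree = Counter()
    for start, end in relationships:
        degree[start] += 1
        degree[end] += 1

    # How many nodes have 1 relationship, 2 relationships, and so on.
    distribution = Counter(degree.values())
    node_count = len(degree)
    relationship_count = len(relationships)
    average = 2 * relationship_count / node_count if node_count else 0.0

    return {
        "node_count": node_count,
        "relationship_count": relationship_count,
        "degree_distribution": dict(sorted(distribution.items())),
        "average_degree": average,
    }


sample = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(graph_statistics(sample))
```

Measures like these are usually the first thing you look at because a skewed distribution of relationships can change which algorithms you run next and how you read their results.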

Looking at the Types of Questions for GDS
Data scientists try to tackle many types of questions when using
GDS to evaluate interdependencies, infer meaning, and predict
behavior. At the most abstract level, these questions fall into a
few broad areas: movement, influence, groups and interactions,
and patterns, as shown in Figure 1-3.

FIGURE 1-3: GDS questions fall into four different areas.

The areas in Figure 1-3 answer the following questions:

»» How do things travel (move) through a network?
Understanding how things move through a network involves
deep path analysis to find propagation pathways, such as
the route of diseases or network failures. It can also be used
to optimize for the best possible route or for flow con-
straints. We cover these classic uses for pathing algorithms
more in Chapter 3.
»» What are the most influential points? Identifying influenc-
ers involves uncovering the structurally well-placed nodes
that represent the control points in a network. These
influencers can act as fast dissemination points, bridges
between less connected groups, or bottlenecks. Influencers
can accelerate or slow the flow of items through networks
from finances to opinions. The concept of highly connected
and influential nodes in a graph is referred to as centrality.
Centrality algorithms are essential for understanding
influence in a network.
»» What are the groups and interactions? Detecting commu-
nities requires grouping and partitioning nodes based on the
number and strength of interactions. This method is the
primary way to presume group affinity, although neighbor
likeness can also be a factor. Link prediction is about
inferring future (or unseen) connections based on network
structure. Heuristic Link Prediction algorithms are often used
to predict behavior. In addition to community detection
algorithms, similarity algorithms are also used to understand
groupings.
»» What patterns are significant? Uncovering network
patterns reveals similarities and can also be used for general
exploration.
For example, you may look for a known relationship pattern
between a few nodes or compare attributes of all your
nodes to find similarities. Or perhaps you want to evaluate
the entire structure of a network, with its intricate hierar-
chies, to correlate patterns to certain social behavior to
investigate. Aggregating related but ambiguous information
in large datasets is a common activity that relies on finding
similar and related information. Finding patterns may
employ simple queries or various types of algorithms found
in Chapter 3.
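
The question areas above can be sketched with a few small functions. This Python example is a simplified illustration under our own assumptions, not the algorithms a production graph platform would run: it uses breadth-first search for an unweighted shortest path (movement), degree centrality as the simplest centrality measure (influence), and a common-neighbors count as one heuristic for link prediction (groups and interactions). The names and the sample network are hypothetical.

```python
from collections import deque


def build_adjacency(relationships):
    """Undirected adjacency sets from (start, end) pairs."""
    adjacency = {}
    for start, end in relationships:
        adjacency.setdefault(start, set()).add(end)
        adjacency.setdefault(end, set()).add(start)
    return adjacency


def shortest_path(adjacency, source, target):
    """Movement: fewest-hop path found with breadth-first search."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no route between the two nodes


def degree_centrality(adjacency):
    """Influence: share of the other nodes each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def common_neighbors(adjacency, a, b):
    """Link prediction heuristic: shared neighbors hint at a likely future link."""
    return adjacency.get(a, set()) & adjacency.get(b, set())


network = build_adjacency([
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("cat", "dan"), ("dan", "eve"),
])

print(shortest_path(network, "ann", "eve"))
centrality = degree_centrality(network)
print(max(centrality, key=centrality.get))
print(common_neighbors(network, "ann", "dan"))
```

In this sample, cat sits on every route between the two ends of the network and has the highest degree centrality, which is the kind of structurally well-placed node the influence question is looking for.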

Multiple types of graph queries and algorithms are usually
applied in a recipe fashion as part of a GDS workflow. For exam-
ple, a query to understand the density of relationships in a graph
may help determine the appropriate community detection algo-
rithm for the most relevant results. Tactically, graph queries and
algorithms are the tools for understanding the overall nature of
a connected system and for using relationships in various data
science pipelines.
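
Here is a minimal Python sketch of the recipe idea: one step measures the density of relationships, and a second step groups nodes. We use connected components as a deliberately simple stand-in for a community detection algorithm, and every function name and data value is hypothetical. How you act on the density value depends on your data, so the sketch only reports it.

```python
def build_adjacency(relationships):
    """Undirected adjacency sets from (start, end) pairs."""
    adjacency = {}
    for start, end in relationships:
        adjacency.setdefault(start, set()).add(end)
        adjacency.setdefault(end, set()).add(start)
    return adjacency


def density(node_count, relationship_count):
    """Step 1: fraction of possible undirected relationships that exist."""
    if node_count < 2:
        return 0.0
    return 2 * relationship_count / (node_count * (node_count - 1))


def connected_components(adjacency):
    """Step 2 (stand-in): group nodes that can reach each other."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


relationships = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
graph = build_adjacency(relationships)

print(density(len(graph), len(relationships)))
print(connected_components(graph))
```

The point is the sequencing rather than the specific functions: a cheap query tells you something about the shape of the graph before you commit to a heavier algorithm.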

THE RISE OF GRAPH DATA SCIENCE

The rise of graph data science (GDS) is the result of more accessible
technologies, increased ability to compute over massive graph datas-
ets, and an awareness of the power of graphs to infer meaning and
improve forecasts. Researchers play an essential role in developing
awareness and advocating for the best techniques. As data scientists
see the potency of structural information, they’re increasingly incor-
porating graphs into their statistics, analytics, and ML practices. In
fact, according to the Dimensions Knowledge system for research
publications, the use of graph technology in AI research is accelerat-
ing. In the last ten years, the number of AI research papers that fea-
ture graph technology has increased over 700 percent.

IN THIS CHAPTER
»» Seeing how graphs help the healthcare
industry

»» Using graphs in marketing

»» Putting graphs to work to prevent fraud

Chapter 2
Using Graph Data
Science in the Real World

Today’s most pressing data challenges center around con-
nections, not just tabulating discrete data. The ability for
graph data science (GDS) to uncover and leverage network
structure drives a range of use cases from fraud prevention and
targeted recommendations to personalized experiences and drug
repurposing.

We can’t overstate the impact of improved graph techniques, such
as new algorithms, or the efforts of applied network scientists, such
as those in computational biology. We don’t want you to overlook
societal projects that use graphs, either. However, we believe that
the recent explosion of graphs in the business world represents a
shift in accessibility and opportunity to drive a democratization of
graphs for everyone.

Graph technologies help organizations with many practical use
cases across industries and domains. In the past, many busi-
nesses began exploring graph technology to create a 360-degree
view of their customers or to unify master data, including cus-
tomer, product, supplier, and logistics information. They may use
this kind of tracking to improve customer experience or to meet
compliance regulations of recent privacy acts such as the EU’s
General Data Protection Regulation (GDPR) and the California
Consumer Privacy Act (CCPA). This same kind of complete view
and data lineage in graphs is also now used to understand and
track data used in machine learning (ML) for more responsible
artificial intelligence (AI) applications.

Today, businesses are just as likely to look at using graphs
specifically for data science as they recognize the predictive power
of relationships, the ability to use network structures to improve
their ML, and their own need to innovate. The sections in this
chapter highlight a few GDS use cases in areas of accelerating
growth and significant commercial interest.

Looking at Graphs in Healthcare

It’s easy to see how any industry with biological roots would nat-
urally comprehend the importance of interconnected systems.
You can see this relationship in computational biology as well as
healthcare and life sciences in how they view challenges as part
of larger processes. Two examples stand out for serving health
and commercial interests: more efficient drug discovery and bet-
ter patient outcomes.

Discovering more efficient drugs

Safety, speed, and costs are paramount in making new drug solu-
tions accessible. Graphs can help tackle the complexity of inter-
twined relationships between diseases, genes, drugs, side effects,
and demographics — to name just a few considerations.

One impressive knowledge graph in the life sciences industry
integrates over 50 years of biomedical data that includes genes,
compounds, diseases, and other information such as symptoms
and side effects. One of the projects built on the graph uses its
topology to predict new uses for currently approved drugs by
evaluating relationships, network structures, and similarities.
Drug repurposing significantly reduces costs and time to market compared to
developing and testing new drugs — not to mention the benefit of
having more real-world information available about side effects
and unexpected results when a drug is already in use.

Improving the patient journey
Another area of emerging interest is the use of graphs for map-
ping, evaluating, and improving patient journeys. When a patient
doesn’t feel well, many factors are in play that may have evolved
over a period of time. Likewise, treatments are rarely a single
event, especially for chronic or serious illnesses. The tree of possible
symptoms, visits, tests, caregivers, treatment plans, outcomes,
and then secondary tests and treatments and so on can
branch out into an immense number of possible paths. Imagine
the patient treatment options that can be mapped with a graph
to better see the sequence alternatives and path splits after each
and every test result or visit. In fact, researchers and healthcare
providers already employ graphs to better understand what influences
patient journeys so they can improve individual outcomes
as well as create optimal paths and compare journeys against them.

Recommendations and
Personalized Marketing
Making relevant product and service recommendations requires
correlating product, customer information, historic behavior,
inventory, supplier, logistics, and even social sentiment data.
Graph-powered recommendations and targeted marketing help
companies provide more appropriate services and experiences
to a wider range of users. For example, graph community detec-
tion algorithms are used to group customers with interactions or
similar behavior for more relevant recommendations. Research
shows that graph-enhanced ML can predict customer churn, for
example, for uses such as targeted prevention or marketing.

Graph analytics are also used to help target offers to online users
that are anonymous in name and demographics but not in site
behavior. Insights from analysis performed offline are typically
rolled into decision models used in production for real-time rec-
ommendations, which can include recommendations for products
that ship faster based on shifting stock levels or instantly incor-
porating data from the customer’s current visit.

Fraud Detection
The amount of money lost to fraud each year is growing, despite
increased use of AI and ML to detect and prevent it. To uncover
more fraud while avoiding costly false positives, organizations
look beyond individual data points to the connections and pat-
terns that link them. Organizations use the network structure to
augment existing ML pipelines as a practical approach to increase
the amount of fraud detected and recovered.

Graph feature engineering allows businesses to extract predictive
elements based on graph queries or algorithms and use that
information to train ML models. Improving the predictive accuracy
in fraud detection by even a few percentage points can result in
tens of millions of dollars saved in just a few months. GDS enables
companies to stay ahead of the ever-shifting patterns of fraud as
well as recover more losses.
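
To give a feel for graph feature engineering, the following Python sketch turns connections into per-account numbers that could sit alongside other columns in an ML training table. It's a simplified illustration under our own assumptions: the account names, the set of flagged accounts, and the three feature names are all hypothetical, and a real pipeline would typically compute richer features with graph queries or algorithms.

```python
def build_adjacency(relationships):
    """Undirected adjacency sets from (start, end) pairs."""
    adjacency = {}
    for start, end in relationships:
        adjacency.setdefault(start, set()).add(end)
        adjacency.setdefault(end, set()).add(start)
    return adjacency


def graph_features(adjacency, flagged):
    """One row of connection-based features per account."""
    rows = {}
    for node, neighbors in adjacency.items():
        flagged_neighbors = len(neighbors & flagged)
        rows[node] = {
            "degree": len(neighbors),
            "flagged_neighbors": flagged_neighbors,
            "flagged_neighbor_ratio": (
                flagged_neighbors / len(neighbors) if neighbors else 0.0
            ),
        }
    return rows


# Hypothetical accounts linked by shared details such as a device or address.
links = [
    ("acct_1", "acct_2"), ("acct_1", "acct_3"),
    ("acct_2", "acct_3"), ("acct_3", "acct_4"),
]
flagged_accounts = {"acct_3"}

features = graph_features(build_adjacency(links), flagged_accounts)
for account in sorted(features):
    print(account, features[account])
```

None of these values exist on an individual account record; they only appear once you look at how the account is connected, which is what lets them add signal to an existing model.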

Head to Chapter 5 where we give you a detailed example of detect-
ing fraud with GDS.

IN THIS CHAPTER
»» Bringing together diverse information

»» Using graph analytics to understand your network

»» Finding, combining, and extracting predictive elements

»» Simplifying graphs with embedding

»» Taking a new approach with graph networks

Chapter 3
Evolving Your
Application of GDS
Technology

Today, graph data science (GDS) is usually applied in busi-
ness with one or more major aims in mind: better decisions,
increased quality of predictions, and creating new ways to
innovate and learn. These goals are increasingly tied to tangible
benefits, such as reduced financial loss, faster time to results,
increased customer satisfaction, and predictive lift. You may be
trying to improve or automate decision-making by people and
domain experts that need additional context. Or perhaps your goal
is to improve predictive accuracy by using relationships and net-
work structure in analytics and machine learning (ML).

Graphs provide a unique structure for learning that helps evolve
ML techniques through better abstraction and interpretability.
These business goals strongly map to how organizations integrate
graph technology into their data science practices. Figure 3-1 diagrams
the major phases of a typical GDS journey. We cover each
of these phases in this chapter. The first three phases of the GDS
journey are most prevalent in the commercial world today, and
the last two are emerging phases on your GDS journey.

FIGURE 3-1: The GDS journey.

Your organization can use practical steps to gain immediate value
and then layer more sophisticated techniques in a way that con-
tinually increases your return on effort.

Knowledge Graphs
Knowledge graphs are the foundation of GDS and offer a way to
streamline workflows, automate responses, and scale intelligent
decisions. At a high level, knowledge graphs are interlinked sets of
data points and describe real-world entities, facts, or things and
their relationships with each other in a human-understandable
form. Unlike a simple knowledge base with flat structures and
static content, a knowledge graph acquires and integrates adjacent
information by using data relationships to derive new knowledge.

As the first phase in GDS, knowledge graphs are often implemented
to bring together diverse information to help domain
experts find related content as well as explore the connections in
their data. Knowledge graphs can also add context to applications,
such as those in artificial intelligence (AI) systems, so they can
make better and faster approximate decisions. This approach is
used in AI systems, such as chatbots, that use a knowledge graph,
for example, to better route a request for a “bat for my husband’s
birthday.” In this case, the graph grasps that the requestor
most likely isn’t looking for a flying mammal but
instead for higher-quality sporting goods for a special occasion.
The chatbot can also take into account what’s in stock, shipping
times, and specialty products, combining the context of not only
the requestor but also of supply and other logistics.

Graph Analytics
After implementing a knowledge graph (see the preceding sec-
tion), businesses often start using graph analytics to understand
their networks better and answer specific questions based on
relationships and topology. You’re often trying to infer mean-
ing based on the network structure: finding clusters, identifying
influential nodes, evaluating different pathways. Graph analytics
usually refers to the use of global queries and algorithms that look
at entire graphs for offline analysis of historical data. This process
is in contrast to small, real-time transactions and local queries
that focus on small areas around a few nodes.

Graph queries are used when you know exactly what you’re look-
ing for, such as asking a question like “How many relationships
does Mia have?” or “How many fraudsters or flagged accounts are
four hops away?” (A hop is a level or a layer of relationship.) These
kinds of queries seem simple because we can imagine standing
up and looking at things that are close to us. However, solutions
that don’t store relationships alongside their data must per-
form extra processes to look up and join this related information.
Graphs store relationships together with data so following the
path of relationships is simple and fast. Native graph databases
are particularly good at multiple hop queries because they avoid
expensive index lookups and data joins by storing and processing
related information adjacently and treating relationships as first-class
citizens.
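
To make the idea of a targeted, multi-hop query concrete, here is a minimal Python sketch. It isn't how Neo4j executes a query; it simply walks a small in-memory adjacency list breadth-first. The graph, the names, the flagged set, and the helper `flagged_within_hops` are all made up for illustration.

```python
from collections import deque

def flagged_within_hops(adjacency, start, flagged, max_hops=4):
    """Return the flagged nodes reachable from start in at most max_hops hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = set()
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                if neighbor in flagged:
                    found.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found

# Hypothetical toy graph: each key lists the nodes it is directly related to.
adjacency = {
    "Mia": ["Ann", "Bob"],
    "Ann": ["Mia", "Cy"],
    "Bob": ["Mia"],
    "Cy": ["Ann", "Dee"],
    "Dee": ["Cy", "Eve"],
    "Eve": ["Dee", "Fay"],
    "Fay": ["Eve"],
}
flagged = {"Dee", "Fay"}  # assumed fraud flags, for illustration only

print("Relationships Mia has:", len(adjacency["Mia"]))
print("Flagged within four hops:", sorted(flagged_within_hops(adjacency, "Mia", flagged)))
```

Each hop only follows relationships that are already attached to the current node, which is the same intuition behind storing relationships together with the data.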

Graph algorithms are a subset of data science algorithms that
originated from network science to enable reasoning about structure
in a more unsupervised fashion. They’re used when you know
the pattern or indicator you’re looking for but not exactly what
you’ll find. For example, you may be looking for unusually tight
communities where nodes have more relationships between each
other than you’d expect in a random or normal distribution. To
find these communities, you could use the graph algorithm called
Louvain Modularity to uncover clusters whose members interact
more densely with each other than with nodes outside the group.

The graph algorithms most prevalent in commercial applications
fall into roughly six categories:

»» Pathfinding and search: These algorithms are foundational
to graph analytics and explore paths between nodes. They
evaluate routes for uses such as physical logistics and
least-cost call or Internet protocol (IP) routing.
»» Centrality (importance): Centrality algorithms help you
uncover the roles of individual nodes and their impact. They
identify influential nodes based on their position in the
network. These algorithms infer group dynamics, such as
credibility, rippling vulnerability, and bridges between groups.
»» Community detection: These algorithms find communities
where members have more significant interactions. These
connections reveal tight clusters, isolated groups, and
structures. This information helps predict similar behavior or
preferences, estimate resilience, find duplicate entities, or
simply prepare data for other analyses.
»» Similarity: These algorithms employ set comparisons to
look at how alike individual nodes are. The properties and
attributes of nodes are used to score the likeness between
nodes. This approach is used in applications such as
personalized recommendations as well as developing
categorical hierarchies.
»» Heuristic link prediction: These algorithms consider the
proximity of nodes in the network as well as structural ele-
ments, such as possible triangles between nodes, to estimate
the likelihood of a new relationship forming or that undocu-
mented connections exist. This class of algorithms has many
applications from drug repurposing to criminal investigations.
»» Graph embedding: These algorithms translate the topology
and attributes of a graph into a unique numerical represen-
tation that can be used for feature engineering (see the next
section “Graph Feature Engineering” for more info), similarity
calculations, or visualizations. Unlike traditional graph
algorithms that use pre-calculated formulas, embeddings
learn the representation from your graph based on neural
network models (deep learning) or linear algebra. See the
later section “Graph Embedding” in this chapter for more on
graph embedding.

In graph analytics, you’re either asking a targeted question or
looking at the graph as a whole to infer meaning or make predic-
tions about future behavior.
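
As a rough illustration of the set comparisons mentioned under the similarity category, the following Python sketch scores node pairs with Jaccard similarity (shared items divided by all items). The customer names and purchase sets are hypothetical, and Jaccard is only one of several set-based measures you could pick.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: the products each customer node is connected to.
purchases = {
    "alice": {"bat", "glove", "helmet"},
    "bob": {"bat", "glove", "cleats"},
    "carol": {"tent"},
}

names = sorted(purchases)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = jaccard(purchases[first], purchases[second])
        print(first, second, round(score, 2))
```

A recommendation feature could then suggest to one customer the items bought by the most similar customers.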

Graph Feature Engineering

Graph feature engineering is the process of finding, combining,
and extracting predictive elements from raw graph data to be
used in ML tasks. More information generally makes ML mod-
els more accurate, but data scientists rarely have as much data
as they’d like. Because relationships are extremely predictive of
behavior and they inherently exist inside current data, you can
employ graph feature engineering to improve predictions and
increase ML model accuracy — with the data you already have.

Graph feature engineering uses relationships and network structures
to create new, more meaningful features. It’s the next step
to apply what you learn from graph analytics to ML. For example,
you could score nodes based on a query that computes how many
fraudsters are four hops out, or a centrality algorithm to measure
importance. You could also label nodes based on their community
ID. (The community ID is assigned by the community detection
algorithm.) These scores and labels can then be extracted to a list
or table of numbers and identifiers (also called a feature vector) for
training ML models. The graph features and resulting ML metrics
are often written back to the graph database for persistence and
future use.
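
The following Python sketch shows one way the scores and labels described above could be collected into feature vectors. The three dictionaries stand in for results you'd normally read back from the graph; their names and values are invented for illustration.

```python
import csv
import io

# Hypothetical per-node results, as if read back from the graph database.
fraudsters_within_four_hops = {"c1": 2, "c2": 0, "c3": 1}
centrality_score = {"c1": 0.9, "c2": 0.1, "c3": 0.4}
community_id = {"c1": 106, "c2": 7, "c3": 106}

def build_feature_rows(node_ids):
    """Return one feature vector per node: [id, hop count, centrality, community]."""
    return [
        [
            node,
            fraudsters_within_four_hops.get(node, 0),
            centrality_score.get(node, 0.0),
            community_id.get(node, -1),  # -1 marks "no community assigned"
        ]
        for node in node_ids
    ]

rows = build_feature_rows(sorted(centrality_score))

# Write the table as CSV text so an ML tool can load it.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["node", "fraudsters_4_hops", "centrality", "community"])
writer.writerows(rows)
print(buffer.getvalue())
```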

Figure 3-2 shows how the use of graph features to enhance ML
is part of a larger workflow with some example technologies for
illustration.

For graph-enhanced ML, you would typically aggregate, explore,
and cleanse data and then use graph queries or algorithms for feature
engineering. Then you’d prepare the data for ML and split
it into training and testing datasets. Although this process isn’t
completely linear, after you’ve trained a model and are happy with
the results, the model can then be used in production. Although
the model may feed a real-time transaction in production, such as
approving credit applications online, the graph feature engineer-
ing and ML are done offline and periodically updated in a cyclical
process. Graph feature engineering offers organizations attainable
model improvements without needing to change their ML pipelines.

FIGURE 3-2: Graph feature engineering is part of a larger ML workflow.

Graph Embedding
Graph embedding simplifies graphs or subsets of graphs into a
feature vector, or set of vectors, that are in a lower dimensional
form, such as a list of numbers. The goal is to create easily consumable
data for tasks like ML that still describes more intricate
topology, connectivity, or node attributes. For example, you can
represent an entire graph or a path as an embedding and then
learn based on the graph or paths themselves. There are three
types of graph embeddings:

»» Node embeddings describe connectivity of each node.
»» Path embeddings encompass the traversals across a graph.
»» Graph embeddings encode an entire graph into a single
vector.

Graph embedding is often used for more advanced feature engineering
that incorporates more complex information, which is
why this phase typically comes later in the GDS journey. Embeddings
can also be useful for data exploration, computing similarity
between entities, and reducing dimensionality to aid in statistical
analysis. Graph embedding offers the ability to more widely use
the rich structures that make up graphs in various data science
tasks and learn based on nuanced information.
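
Once nodes are represented as lists of numbers, comparing them becomes ordinary vector math. The sketch below assumes you already have embeddings (the three short vectors are invented, not produced by any embedding algorithm) and uses cosine similarity, a common choice for comparing vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical node embeddings, made up for illustration.
embeddings = {
    "node_a": [1.0, 0.0, 1.0],
    "node_b": [2.0, 0.0, 2.0],
    "node_c": [0.0, 1.0, 0.0],
}

print(cosine_similarity(embeddings["node_a"], embeddings["node_b"]))
print(cosine_similarity(embeddings["node_a"], embeddings["node_c"]))
```

Vectors that point in the same direction score close to 1, and vectors with nothing in common score 0.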

Graph Networks
Graph networks are an exciting area of research that represents
a new approach to ML that may drastically improve results with
less data, make predictions more explainable, and lead to new
types of learning itself. Graph network and graph native learning are
terms coined by Peter Battaglia and a group of researchers. They
concluded that using graphs for ML was the next major advance-
ment in ML itself because of the graph’s ability to abstract topol-
ogy. Their thinking follows this approach:

1. Native graph learning takes a graph as an input, performs
learning computations while preserving transient
states, and then returns a graph.
2. This native graph learning process allows the domain
expert to review and validate the learning path that
leads to more explainable predictions.
3. With this process comes richer and more accurate
predictions that use less data and fewer training cycles.

Graph native learning enables whole-graph learning and multi-
task predictions that reduce data requirements and automate the
identification of relevant features. Today, the valuable time of data
scientists and domain experts is frequently employed to tediously
select and test potentially predictive data and collect those fea-
tures into optimal models. Improving the model accuracy while
streamlining the process positively impacts ML processes and
results across all applications. We’re excited by early progress and
look forward to seeing ML evolve to be extremely efficient
and flexible as well as more accurate and transparent.

IN THIS CHAPTER
»» Running algorithms with the Neo4j GDS Library
»» Supporting multiple databases with the Neo4j DBMS
»» Looking at the Neo4j Desktop and Neo4j Browser
»» Finding patterns with Neo4j Bloom

Chapter 4
Using Neo4j as a Graph
Data Science Platform

If you’re going to use graph data science (GDS), you should run
it on a platform. In this chapter, we show you what platform
pieces Neo4j offers to help you. Neo4j is a graph technology
company that provides an enterprise-grade GDS platform that
includes four components.

Neo4j supports transactional processing and analytical processing
of graph data as well as visualization. It also includes graph
storage and compute with data management and analytics tool-
ing. The set of integrated tools includes a common protocol, API,
and query language (Cypher) to provide effective access for dif-
ferent uses. In this chapter, we cover each of the four areas of the
Neo4j platform in a bit more detail to help you see how your GDS
solution fits together.

Neo4j GDS Library

The Neo4j GDS Library offers an enterprise-ready approach to
running sophisticated graph algorithms on connected data at
scale. Graph analytics and feature engineering add highly pre-
dictive relationships to your machine learning (ML) for better
results. Algorithms are executed in an analytics workspace that
scales computations to handle graphs that contain tens of billions
of nodes and relationships. For examples, training, and details on
how to use the Neo4j GDS Library, visit
neo4j.com/developer/graph-algorithms. You can also go directly to
the Neo4j GDS Library at neo4j.com/graph-data-science-library.

Neo4j Graph Database Management System
The Neo4j Database Management System (DBMS) supports mul-
tiple databases that can be run in standalone or clustered instal-
lations and supports sharding and federated access to databases.
Neo4j graph databases are designed to treat the relationships
between data as being as important as the data itself. Neo4j is
considered a native graph database because the data is stored together
with how each individual entity connects with or is related to others.
You can find more information about the Neo4j Graph DBMS at
neo4j.com/developer/graph-database.

To discover more about the property graph model that’s used by the
DBMS and other tools, check out Graph Databases For Dummies, Neo4j
Special Edition, at neo4j.com/graph-databases-for-dummies.

CYPHER DECLARATIVE QUERY LANGUAGE
Cypher is the most widely adopted, fully defined, and open query
language for property graph databases. It is a declarative, SQL-
inspired language for describing visual patterns in graphs by using
ASCII-Art syntax. You can state what you want to select, insert, update,
or delete from your graph data without describing how to do it.
Cypher is intended to be readable. For example, the phrase “Jennifer
likes graph technology” would be written as

(p:Person {name: "Jennifer"})-[rel:LIKES]->(g:Technology {type: "Graphs"})

Cypher basics and learning resources can be found on the Cypher
page for Neo4j developers at
neo4j.com/developer/cypher-query-language.

Neo4j Desktop and Browser
Neo4j Desktop is a user interface for operating local databases.
Neo4j Browser is a general purpose user interface for working with
the Neo4j database and is a core component of Neo4j Desktop.
Developers and data scientists can use this tool to query, visualize,
administer, and monitor their databases. The diagram in Figure 4-1
shows the Neo4j Browser being used against a fraud graph.

FIGURE 4-1: Neo4j Browser is an interface for developers to administer
and interact with Neo4j databases.

Neo4j Bloom
Neo4j Bloom is a graph visualization and exploration tool that
allows you to find patterns in a Neo4j graph by using a codeless
search paradigm. It uses an interactive point-and-click interface
to expand and refine results, find interesting paths, and share
insights with others.

Bloom is intended for ad-hoc, visual explorations, and fast prototyping
with type-ahead search suggestions and direct editing
of nodes and relationships. The visual presentation has flexible
color, size, and icon schemes to help differentiate influential
items with styling that can be based on the results of running
algorithms from the GDS Library (see the earlier section in this
chapter titled “Neo4j GDS Library”).

Figure 4-2 shows the Bloom interface for an example of restaurant
reviews that can be exported and shared.

FIGURE 4-2: A visual exploration of Neo4j Bloom graphs with code-free
searching.

IN THIS CHAPTER
»» Preparing a good dataset
»» Exploring suspicious groups
»» Predicting fraudsters

Chapter 5
Detecting Fraud with
Graph Data Science

In this chapter, we walk you through an example of applying
graph data science (GDS) techniques to investigate and predict
financial fraud. After we familiarize you with a sample financial
transaction dataset, you remove the outlier information that
may skew your results and identify suspicious clusters of clients.
After that, you visually explore one of the clusters for graph-based
indicators of fraud and look at how graph-based features
can help predict fraudulent behavior in the larger dataset.

Finding a Good Fraud Dataset

To simulate a good fraud dataset, you want to create realistic,
synthetic data to describe fraudulent transactions, so in this sec-
tion, we give you a model of a finance network, where users make
transactions with merchants and each other via mobile devices.
This has similar patterns to traditional credit card networks more
common in the United States, Canada, and Europe. Figure 5-1 is a
graph example that uses a subset of available nodes and relation-
ships from data that we modified with additional identifiers.

FIGURE 5-1: The fraud dataset.

This example uses the following node labels:

»» Clients: People who have personally identifiable information
(PII) such as Social Security numbers (SSNs), phone numbers,
and email addresses
»» Mules: Clients who are known to have fraudulently transferred
money
»» Clients’ PII:
• SSN: Social Security numbers
• Phone: Phone numbers
• Email: Email addresses

These nodes are connected by the following relationship types:

»» (Client)-[:HAS_PHONE]->(Phone)
»» (Client)-[:HAS_SSN]->(SSN)
»» (Client)-[:HAS_EMAIL]->(Email)

The analysis performed here is focused on the above information,
but the dataset also contains additional information, such as
transactions with banks, merchants, and other clients.

Removing Outliers
An important first step when performing fraud analysis is to check
the quality of the data. In fraud datasets, you may have outliers
that aren’t relevant for your analysis. Outliers are rare events or
items that raise suspicions by being significantly different to the
majority of data. Outliers in a graph are based on their connectiv-
ity and the topology of the graph instead of a property value.

The Degree Centrality algorithm measures the number of relationships
that a node has. Running the Degree Centrality algorithm
is therefore a good way of finding potential outliers.

Because you should expect different types of nodes to have different
connectivity, you need to project a graph that consists of a
single node type so you can look for outliers relevant to one type
of node at a time. For example, each bank should receive many
deposits (so it would have a high degree centrality score), but an
individual account holder should receive far fewer. You want to
compare banks against banks and accounts against accounts.
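
Before running the algorithm on the real data, it can help to see the idea on a handful of rows. The Python sketch below is a simplified stand-in, not the GDS procedure: it counts how many clients point at each identifier and keeps the counts separate per relationship type, so emails are only compared against emails and phones against phones. The sample triples are made up.

```python
from collections import Counter

# Hypothetical (client, relationship type, identifier) triples.
relationships = [
    ("c1", "HAS_EMAIL", "fake@example.com"),
    ("c2", "HAS_EMAIL", "fake@example.com"),
    ("c3", "HAS_EMAIL", "fake@example.com"),
    ("c4", "HAS_EMAIL", "dana@example.com"),
    ("c1", "HAS_PHONE", "000-000-0000"),
    ("c2", "HAS_PHONE", "000-000-0000"),
    ("c3", "HAS_PHONE", "555-0100"),
]

def degree_by_type(triples):
    """Count incoming relationships per identifier, grouped by relationship type."""
    degrees = {}
    for _client, rel_type, identifier in triples:
        degrees.setdefault(rel_type, Counter())[identifier] += 1
    return degrees

degrees = degree_by_type(relationships)
for rel_type, counts in sorted(degrees.items()):
    # The highest-degree identifier of each type is the first outlier candidate.
    print(rel_type, counts.most_common(1))
```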

To find potential outliers, you can run the Degree Centrality algo-
rithm against the fraud dataset with the following query:

CALL gds.alpha.degree.stream({
  nodeQuery:'MATCH (n) WHERE n:Phone OR n:Email OR n:SSN
    RETURN id(n) as id',
  relationshipQuery:'MATCH
    (n1)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c:Client),
    (n2)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->(c)
    RETURN id(n1) as source, id(n2) as target'
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, nodeId, score
RETURN labels(node) as label, nodeId, score,
  node.email, node.phoneNumber, node.ssn
ORDER BY score DESC
LIMIT 10

When you execute this query, you get many connections to fake
identifiers. The output is shown in Figure 5-2. See the appendix
for a full-featured view of this figure.

Four big outliers present high scores (column three) for the num-
ber of connections. These outliers represent nodes that have fake
email accounts, SSNs, and phone numbers.

FIGURE 5-2: The results of the Degree Centrality algorithm.

Exclude these fake nodes from your analysis because, more
than likely, they represent people who chose not to fill in the form
accurately rather than fraudulent activity. If you don’t exclude
them, you’ll find many false positives based on people sharing
common bogus filler information such as an email of “fake@fake.com.”

Next, update the labels on these nodes so they’ll be easier to
exclude from future analysis. The following queries remove the
original labels while adding the new “Bad” labels:

MATCH (n:Email)
WHERE n.email='[email protected]' or n.email='no@gmail.com'
SET n:BadEmail REMOVE n:Email;

MATCH (n:SSN)
WHERE n.ssn='000-00-0000'
SET n:BadSSN REMOVE n:SSN;

MATCH (n:Phone)
WHERE n.phoneNumber='000-000-0000'
SET n:BadPhone REMOVE n:Phone;

Finding Suspicious Clusters

Want to find some actual fraudsters? Now is your time! In first-party
fraud, fake accounts are created with no intention of repayment
of loans or debt. A common way of finding these fakesters is
to look for accounts that share identifiers, like SSNs, phone num-
bers, and email addresses.

Islands of interacting nodes that have little connection to the
larger graph aren’t representative of typical financial behavior.
You can use this information and the Weakly Connected Com-
ponents algorithm to find disjointed subgraphs that suspiciously
share common identifiers.

The Weakly Connected Components algorithm is a community
detection algorithm that finds sets of connected nodes in an undirected
graph where each node is reachable from any other node
in the same set.

The following query runs the Weakly Connected Components algorithm
over a projected graph of clients:

CALL gds.wcc.write({
  nodeQuery:'MATCH (c:Client) RETURN id(c) as id',
  relationshipQuery:'MATCH
    (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->(intermediate)
      <-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE not(intermediate:BadSSN)
    AND not(intermediate:BadEmail)
    AND not(intermediate:BadPhone)
    RETURN id(c1) as source, id(c2) as target',
  writeProperty:'componentId'
});

In the “Removing Outliers” section earlier in this chapter, you
assumed that many non-fraudsters use similar bogus form information,
so now when you’re building that projected graph, you
need to exclude the bad SSNs, email addresses, and phone numbers
identified in the preceding section. Otherwise, you’ll end up
with just one large cluster due to the commonality of bogus form
data.

This Weakly Connected Components algorithm matches clients
that share an email, phone number, or SSN, and writes a component
ID to the componentId property of each client node. Nodes that have
the same componentId value are considered to be in the same
cluster.
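
If you want to see the grouping logic without a database, the following Python sketch mimics it on a toy dataset. It uses a union-find structure, which is one common way to compute connected components and not necessarily what the GDS Library does internally. The clients, identifiers, and the `excluded` set are hypothetical.

```python
from collections import defaultdict

def weakly_connected_components(client_identifiers, excluded=frozenset()):
    """Map each client to a component ID; clients sharing an identifier are merged."""
    parent = {client: client for client in client_identifiers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group clients by identifier, skipping the known-bad filler values.
    owners = defaultdict(list)
    for client, identifiers in client_identifiers.items():
        for identifier in identifiers:
            if identifier not in excluded:
                owners[identifier].append(client)

    for clients in owners.values():
        for other in clients[1:]:
            union(clients[0], other)

    return {client: find(client) for client in client_identifiers}

# Hypothetical clients and their identifiers.
clients = {
    "c1": {"a@example.com", "000-00-0000"},
    "c2": {"a@example.com"},
    "c3": {"000-00-0000"},
    "c4": {"b@example.com"},
}

with_filter = weakly_connected_components(clients, excluded={"000-00-0000"})
without_filter = weakly_connected_components(clients)
print(len(set(with_filter.values())), "clusters when bogus values are excluded")
print(len(set(without_filter.values())), "clusters when they are kept")
```

Keeping the bogus SSN merges an otherwise unrelated client into the first cluster, which is the effect the exclusion is meant to prevent.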

After that, you can use the following query to see a distribution of
the cluster sizes returned by this algorithm:

MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS size
WITH size, count(*) AS count
RETURN CASE WHEN 1 <= size <= 2 THEN "1-2"
            WHEN 3 <= size <= 5 THEN "3-5"
            WHEN 6 <= size <= 9 THEN "6-9"
            ELSE ">= 10" END AS size,
       sum(count)
ORDER BY size

The result of this query shows that most clients are in small
clusters with only 1 or 2 clients and is illustrated in Figure 5-3. See
the appendix for a full-featured view of this figure.

FIGURE 5-3: Most clients are in small clusters.
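
The bucketing in that query can be sketched in a few lines of Python. The example assumes you already have a client-to-component mapping (the one below is invented) and counts how many clusters fall into each size range.

```python
from collections import Counter

def bucket(size):
    """Place a cluster size into the same ranges the query uses."""
    if size <= 2:
        return "1-2"
    if size <= 5:
        return "3-5"
    if size <= 9:
        return "6-9"
    return ">= 10"

def cluster_size_distribution(component_ids):
    """Count clusters per size range from a client -> component ID mapping."""
    sizes = Counter(component_ids.values())  # component ID -> number of clients
    return Counter(bucket(size) for size in sizes.values())

# Hypothetical component assignments for six clients.
component_ids = {"c1": 1, "c2": 1, "c3": 2, "c4": 3, "c5": 3, "c6": 3}

print(cluster_size_distribution(component_ids))
```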

Most clients are in clusters of 1 or 2 clients, fewer are in
clusters of 3 to 5 or 6 to 9 clients, and very few are part
of clusters of ten or more clients. You might expect to see this;
most people don’t share personal identifiers, but there are nine
clusters with ten or more clients sharing at least one identifier.
To zoom in on just those clusters with ten or more clients and
explore them further, you can use the following query:

MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS numberOfClients,
     collect(c) AS clients
WHERE numberOfClients >= 10
WITH componentId, numberOfClients,
     // Find all the identifiers of clients in a cluster
     apoc.coll.toSet(apoc.coll.flatten(
       [client in clients |
         [(client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->(id) | id]])) AS ids,
     clients
RETURN componentId, numberOfClients,
       // Find out how many of those identifiers are shared
       // Only return identifiers shared by > 1 Client in the cluster
       size([record in [id in ids | {
         id: id,
         sharedClients: size([(id)<--(client:Client)
                              WHERE client in clients | client])
       }] WHERE record.sharedClients > 1 | record]) AS sharedIdentifiers
ORDER BY numberOfClients DESC

FIGURE 5-4: The results of clusters with ten or more clients.

These query results are shown in Figure 5-4. See the appendix for
a full-featured view of this figure.

Cluster 106 looks like an interesting one to explore further because
it has a large number of clients and five shared identifiers between
them. In the next section, you visually investigate this cluster.

Visually Exploring a Suspicious Cluster
Exploring cluster 106 in a tool like Neo4j Bloom, which we cover
in Chapter 4, can help you further understand this group. You
can visualize the relationships between the clients in that clus-
ter with a Bloom search phrase. A Bloom search phrase is a way
that you can define a natural language construct that executes a
query against the database for you. The search phrase “explore
cluster 106” finds the relationships in Figure 5-5 between clients
in cluster 106.

FIGURE 5-5: The resulting graph of the search phrase “explore cluster 106.”

The nodes with horseshoe icons represent mules, the ones with
people icons are clients, the ones with mail icons are email
addresses, and the others are SSNs.

In this cluster, you have four mules, and you can also see three
email addresses that are shared by 13 clients. At this point, you
probably want to send a list of the people in this cluster to a
domain expert to explore further.

From this visualization, you can see that most of the clients in
this cluster are sharing just three email accounts. We can imagine
a couple of people sharing an email address but having more than
that may be something to explore further.

Within the cluster, some of these nodes seem more important,
acting as local bridges between clients in different areas of
Figure 5-5. You can then use the Betweenness Centrality algorithm
to confirm your suspicions.

The Betweenness Centrality algorithm calculates the shortest paths
between every pair of nodes, and then each node receives a score
based on the number of those shortest paths that pass through the
node. Nodes that most frequently lie on these shortest paths
have a higher betweenness centrality score.
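
For readers who like to see the mechanics, here is a compact Python sketch of betweenness centrality for a small unweighted, undirected graph. It follows Brandes' well-known approach (a breadth-first search from every node, then accumulating scores backward); it's a teaching aid under those assumptions, not the GDS implementation, and the four-node graph is invented.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []  # nodes in the order the search reaches them
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):  # farthest nodes first
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair was counted from both ends, so halve the totals.
    return {node: score / 2 for node, score in scores.items()}

# Hypothetical cluster: "b" bridges three clients that are otherwise unconnected.
adjacency = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b"],
}

print(betweenness_centrality(adjacency))
```

In this toy graph, every shortest path between two outer nodes passes through the bridge node, so it receives the only nonzero score.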

Run this algorithm by executing the following query:

CALL gds.betweenness.write({
  nodeQuery: 'MATCH (c:Client) WHERE c.componentId=106
    RETURN id(c) as id',
  relationshipQuery: 'MATCH
    (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->(intermediate)
      <-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE not(intermediate:BadSSN)
    AND not(intermediate:BadEmail)
    AND not(intermediate:BadPhone)
    RETURN id(c1) as source, id(c2) as target',
  writeProperty:'betweennessCentrality'
})

This query stores a betweenness centrality score in the
betweennessCentrality property on each client node for this cluster. After
that, you can then update the styling rules in Neo4j Bloom to
inspect the results in Figure 5-6.

FIGURE 5-6: The result of using the Betweenness Centrality score for node
sizing in Neo4j Bloom.

The largest nodes are the most influential nodes in the cluster.
These nodes represent mules that are known to commit fraud.
At this point, you’ve identified suspicious behaviors and clusters.
After your fraud analysts confirm this likely nefarious activity,
you can use this information to predict mules in the larger dataset.

Predicting Fraudsters Using Graph Features
In a real dataset, you wouldn’t actually know who the mules are,
but in the dataset we use, they’re identified. This identification
allows you to test your prediction that a higher betweenness cen-
trality score is predictive of fraud using the whole graph. A quick
check of your theory shows that clients with the mule label have
on average a 0.9685 betweenness centrality score, which is sig-
nificantly higher than non-mule scores as shown in Figure 5-7.
See the appendix for a full-featured view of this figure.

FIGURE 5-7: Comparing Betweenness Centrality scores for the entire graph.
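
A quick check like this one amounts to grouping scores by label and comparing summary statistics. The Python sketch below shows the idea with a few invented scores; the numbers are placeholders and are not taken from the dataset or from Figure 5-7.

```python
from statistics import mean, pstdev

# Hypothetical (label, betweenness score) pairs, for illustration only.
scored_clients = [
    ("mule", 0.9),
    ("mule", 1.1),
    ("client", 0.1),
    ("client", 0.3),
    ("client", 0.8),
]

def summarize(scored):
    """Return {label: (mean score, population standard deviation)}."""
    groups = {}
    for label, score in scored:
        groups.setdefault(label, []).append(score)
    return {label: (mean(values), pstdev(values)) for label, values in groups.items()}

summary = summarize(scored_clients)
for label, (average, spread) in sorted(summary.items()):
    print(label, round(average, 3), round(spread, 3))
```

Looking at the spread alongside the average matters because a high-scoring non-mule, like the last row here, is exactly the kind of overlap that produces false positives.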

Although this is a considerable indicator, the deviation and distribution
of scores mean there’s overlap that could lead to false
positives and negatives. In this situation, you’d want to combine
this betweenness centrality score with other predictive elements
and work with a data scientist to create an ML model.

One ML scenario that you could use is an approach that extracts
graph features for use in a binary classifier to predict mules.
Examples of graph features include

»» The betweenness centrality score
»» The number of clients sharing identifiers
»» The weighting of shared identifiers
»» The number of known mules within <n> hops
»» The size of clusters

Figure 5-8 shows several graph features we might extract for dif-
ferent people in the graph. See the appendix for a full-featured
view of this figure.

FIGURE 5-8: A matrix of graph-engineered features and mule classification.

These features can be extracted to a tabular format for training
an ML model.

After you’re happy with your fraud detection model, you can use
it in production to identify other mules as your graph evolves. As
new information is added to real-world graphs, it’s common to
iterate on this process and create new graph features and update
models.

IN THIS CHAPTER
»» Expanding your knowledge with Neo4j resources
»» Getting started with a spearhead team
»» Implementing a strategy to get your project approved

Chapter 6
Ten Tips with Resources
for Successful Graph
Data Science

If you’re wondering if your project is “graphy” and how to get
started with graph data science (GDS), this chapter can help.
We give you some Neo4j resources to guide you to more information,
and to help you explore your project’s opportunity and
successfully move forward from concepts to production, we
include these ten tips:

»» Investigate use cases and get comfortable with concepts.
Because graph technology is applied across industries
and in various use cases, it can be hard to know where to
start. To expand your knowledge and help you get comfortable
with GDS, review these resources:
• Review use cases. Get up to speed on the problems
graph technology can solve. Visit neo4j.com/use-cases
to read some use cases.
• Watch talks. Find out how people use GDS. Watch
presentations from Neo4j’s Connections for GDS digital
event: go.neo4j.com/connections-graph-data-science-lp.html.

• Expand your knowledge of key concepts. Review
material that sets GDS in a larger context. Visit
neo4j.com/whitepapers/artificial-intelligence-graph-technology
for how graphs enhance AI.
»» Identify and engage a spearhead team. Using graph
technology in production can be new to many people, so
don’t expect teams to understand how to evaluate or
compare graph options to other solutions. Assemble a small
team that can become your experts in translating business
needs into technical requirements and the application of
GDS. Make sure to have representation from key organizations,
including business, IT, and data science teams.
Provide your developers and data scientists with more
technical information. Your team will likely need time to
familiarize itself with the technology, so look for resources
that allow an easy start. Some examples include

• neo4j.com/graph-algorithms-book
• neo4j.com/graph-databases-book
• neo4j.com/graph-databases-for-dummies
• neo4j.com/sandbox
»» Evaluate your “graphy” problem. Graph technology is
useful anywhere you have a lot of connected, interdependent
information. But at some point you need to look into
what areas of your business to focus on and what kind of
project to start with.
Start with an intersection of ideas between users, business,
and technology. Consider hosting offsite or virtual innovation
sessions with your cross-functional team to define your
stakeholders’ needs, create connections-related questions,
storyboard possible solutions, and identify key challenges
and opportunities. This collaboration may naturally lead to a
prototype that you can share with executives for feedback,
but the goal is to uncover promising target use cases.
»» Assess the current state. After you have a target use case
in mind, start with documenting your current state. Consider
existing problems as well as how the various parts of your
organization will have different experiences and issues. Find
out how your business sponsors view this use case and any
problems or opportunities. Be as specific as you can. For
example, what is the impact per customer of improved
online profiles? What’s the revenue implication of a
half-percent increase in recovered fraud? Also remember to
consider external market factors such as customer or
transaction growth, competitive factors, emerging
opportunities such as new delivery platforms, or productization
opportunities.
»» Map the value of the proposed state. Although your first
graph project may spawn many new ideas and future
projects, map the features of the near-term graph project
clearly and directly to business value. Consider the
current state and pain points and how your graph target use
case can help with business concerns such as cost savings,
increased revenue, new market opportunities, time to
market, risk mitigation, and the like. For example, uncovering
similar customer journeys and using that information in a
machine learning (ML) model may increase the accuracy of
churn prediction so the business could take early preventative
action and reduce revenue loss.
»» Measure ROI. For each of your value areas, determine how
you plan to measure your return on investment (ROI) or
success. For example, will you use predictive accuracy or
reduced financial loss to estimate the impact of your end
state? Compare the soft and hard costs of maintaining
existing processes with those of your graph project. If you’re
unable to audit your existing state, be more conservative when
estimating incremental savings or revenue opportunities.
Likewise, it
may be difficult to measure the value of net-new capabilities,
such as answering previously intractable questions, so you
may need to get creative or add qualitative analysis.
»» Align stakeholders. Eventually, you need cross-functional
agreement on the goals and requirements of your graph
project. This process is iterative, not something you tackle at
one point in time. Different teams may have alternative
views on the project vision, key ROI, and even the role of
graph technology. Getting alignment on the goals of the
project and how success is measured is essential, and
you may want to consider a process for dealing with conflict
or dissenting opinions.
»» Get your project approved. Taking advantage of new
technologies like GDS requires your stakeholders and
approvers to be comfortable trying something unfamiliar,
so your work to target the right use case, map values, and
estimate ROI needs to come together in a concise story that
aligns with your company’s motivations.
Document stakeholder assumptions about business value.
For example, customer churn may be an issue, but is it a
priority and why? You may be asked about the competitive
landscape as well as alternatives and the costs or lost
opportunity if you don’t proceed. Clearly document the
interdependent system touchpoints that are part of current
processes and the impact of your graph solution.
»» Conduct a POC and plan for production. Larger projects,
especially if the technology is new to a team, often require a
proof of concept (POC) before approval and deployment. A
POC can prepare your team for production and identify any
gaps. This process may involve iterating on previous
prototypes before you move into data modeling and testing
specific workflows.
In GDS, your data model and algorithm choices are highly
dependent on the questions you’re trying to answer. Your
data scientists and subject matter experts should be
involved to ensure the right assumptions are made. Also
make sure that your IT teams are involved to raise any red
flags and that your end-users are on hand to evaluate any
usability concerns.
Vendors that provide POC services can help accelerate your
project with their graph experience. Visit
neo4j.com/professional-services for more info.
»» Get connected and continue your journey. Applying GDS
is a journey. You may start with one focused project and find
yourself answering questions you never knew you had. We
highly recommend your team connect and engage with the
graph community. Graph communities consist of active
groups of users that share new ideas and help with specific,
and sometimes unusual, questions. Getting involved in a rich,
active community with educational support and certifications
helps your team be successful with its first graph
project and expands the value of your graphs over time.
Visit the Neo4j community at community.neo4j.com, and
check out its resources at neo4j.com/graphacademy.

Appendix
In this appendix, we formatted some of the figures from
Chapter 5 into full-featured tables, so you can better see the
details in each image.

Figure 5-2
The Results of the Degree Centrality Algorithm

label         nodeid   score  node.email           node.phoneNumber  node.ssn
[“BadPhone”]  3360969  870.0  null                 “000-000-0000”    null
[“BadEmail”]  3360966  773.0  “[email protected]”  null              null
[“BadSSN”]    3419027  765.0  null                 null              “000-00-0000”
[“BadEmail”]  3360963  284.0  “[email protected]”  null              null
[“Email”]     3367379  21.0   “[email protected]”  null              null
[“Email”]     3364780  19.0   “[email protected]”  null              null
[“Email”]     3379485  19.0   “[email protected]”  null              null
[“Email”]     3380415  19.0   “[email protected]”  null              null
[“Email”]     3363100  18.0   “[email protected]”  null              null

Figure 5-3
Most Clients Are in Small Clusters

size    sum(count)
“1-2”   15505
“3-5”   1231
“6-9”   137
“>=10”  9

Figure 5-4
Results of Clusters with Ten or More Clients

componentid  numberOfClients  sharedIdentifiers
106          18               5
4932         14               8
1087         13               4
562          11               3
83           10               4
959          10               5
1396         10               3
5160         10               5
7865         10               3

Figure 5-7
Comparing Betweenness Centrality Scores for the Entire Graph

isMule  average                max    stdev                 p50  p75  p95  p99   p99.9  count
true    0.9685534591194972     192.0  7.038325008720976     0.0  0.0  4.0  24.0  120.0  1908
false   0.0000999999999999937  2.0    0.014142135623729861  0.0  0.0  0.0  0.0   0.0    20000

Figure 5-8
A Matrix of Graph-Engineered Features and Mule Classification

c.name              betweenness  sharedIdentities  clusterSize  mulesNearby  isMule
“Jacob Olsen”       0.0          1                 3            1            false
“Kayle Roach”       32.0         2                 18           4            false
“Mackenzie Burris”  0.0          0                 1            0            false
“Ellie Ballard”     0.0          1                 5            0            false
“Damian Clarke”     0.0          0                 1            1            false
“Kayla Knowles”     192.0        3                 18           4            true
“Nicholas Olsen”    0.0          1                 3            2            false
“Juan Tran”         0.0          1                 18           1            false
“Zoe Cobb”          0.0          1                 3            2            true

WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.
