Graph Data Science For Dummies Book
by Amy Hodler
and Mark Needham
These materials are © 2021 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Graph Data Science (GDS) For Dummies®, Neo4j Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2021 by John Wiley & Sons, Inc.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the
prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com,
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. Neo4j and the Neo4j logo are registered trademarks of
Neo4j. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc.,
is not associated with any product or vendor mentioned in this book.
For general information on our other products and services, or how to create a custom For Dummies
book for your business or organization, please contact our Business Development Department in
the U.S. at 877-409-4177, contact [email protected], or visit www.wiley.com/go/custompub. For
information about licensing the For Dummies brand for products or services, contact Branded
Rights&[email protected].
ISBN: 978-1-119-74604-1 (pbk); ISBN: 978-1-119-74605-8 (ebk)
10 9 8 7 6 5 4 3 2 1
Publisher’s Acknowledgments
Some of the people who helped bring this book to market include the
following:
Project Manager: Carrie Burchfield-Leighton
Sr. Managing Editor: Rev Mengle
Acquisitions Editor: Ashley Coffey
Production Editor: Siddique Shaik
Business Development Representative: Molly Daugherty
Table of Contents
INTRODUCTION
About This Book
Icons Used in This Book
Beyond the Book
CHAPTER 5: Detecting Fraud with Graph Data Science
Finding a Good Fraud Dataset
Removing Outliers
Finding Suspicious Clusters
Visually Exploring a Suspicious Cluster
Predicting Fraudsters Using Graph Features
APPENDIX
Introduction
Connectivity is the single most pervasive characteristic of
today’s networks and systems. From protein interactions to
social networks, from communication systems to power
grids, and from retail experiences to supply chains, networks with
even a modest degree of complexity aren’t random, which means
connections are neither evenly distributed nor static. Simple
statistical analysis alone fails to sufficiently describe, let alone
predict, behaviors within connected systems.
Icons Used in This Book
The following icons are used in this book:
This information may not be critical to most people, but if you like
the extra techie tidbits, you’ll enjoy the insight here. Otherwise,
just skip over it!
IN THIS CHAPTER
»» Defining a graph
Chapter 1
Understanding Graphs
and Graph Data Science
Graph approaches to data are exploding in the commercial
world to better reveal meaning in data as well as forecast
behavior of complex systems. This burst is due to the
increasing connectedness of data, breakthroughs in scaling graph
technology to enterprise-sized problems, excellent results when
integrated with machine learning (ML) and artificial intelligence
(AI) solutions, and more accessible tools for general analytics and
data science teams.
In this chapter, you discover how we define a graph and the rela-
tionship of graphs to analytics and data science. You also get a
foundation in how graphs are used to answer tough questions
about complex systems.
structure of this representation, you can answer questions and
make predictions about how the system works or how individ-
uals behave within it. In this sense, network science is a set of
technical tools applicable to nearly any domain, and graphs are
the mathematical models used to perform analysis. Simply put,
graphs are a mathematical representation of complex systems.
While graphs originated in mathematics, they are also a prag-
matic and faithful representation of data for modeling and anal-
ysis. A graph is a representation of a network, often illustrated
with circles to represent entities, also called nodes or vertices, and
lines between them. Those lines are known as relationships, links,
or edges. Think of nodes as the nouns in sentences, and relation-
ships as verbs that give context to the nodes. To avoid any con-
fusion, the graphs we talk about in this book have nothing to do
with graphing equations or charts. Take a look at the differences
in Figure 1-2.
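To make the nouns-and-verbs idea concrete, here is a minimal Python sketch of a graph as plain data. The people, the company, and the relationship names are hypothetical and only illustrate the idea; a graph database stores far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    source: str  # the node the relationship starts from (a "noun")
    kind: str    # the type of relationship (the "verb")
    target: str  # the node the relationship points to (another "noun")


# Hypothetical example data: three nodes and three relationships.
nodes = {"Mia", "Sam", "Acme"}
relationships = [
    Relationship("Mia", "KNOWS", "Sam"),
    Relationship("Mia", "WORKS_AT", "Acme"),
    Relationship("Sam", "WORKS_AT", "Acme"),
]


def describe(rel: Relationship) -> str:
    """Read a relationship as a small sentence: noun, verb, noun."""
    return f"{rel.source} {rel.kind} {rel.target}"


sentences = [describe(rel) for rel in relationships]
for sentence in sentences:
    print(sentence)
```

Each relationship reads like a short sentence, which is why the noun and verb analogy works well when you sketch a graph model.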
Defining Graph Analytics and Graph
Data Science
Modeling graphs is only half of the story. You may also want to
analyze them to reveal insight that isn’t immediately obvious. So
in this section, we explain the domain of graph data science (GDS)
and graph analytics.
FIGURE 1-3: GDS questions fall into four different areas.
For example, you may look for a known relationship pattern
between a few nodes or compare attributes of all your
nodes to find similarities. Or perhaps you want to evaluate
the entire structure of a network, with its intricate hierar-
chies, to correlate patterns to certain social behavior to
investigate. Aggregating related but ambiguous information
in large datasets is a common activity that relies on finding
similar and related information. Finding patterns may
employ simple queries or various types of algorithms found
in Chapter 3.
IN THIS CHAPTER
»» Seeing how graphs help the healthcare
industry
Chapter 2
Using Graph Data
Science in the Real World
Today’s most pressing data challenges center around con-
nections, not just tabulating discrete data. The ability for
graph data science (GDS) to uncover and leverage network
structure drives a range of use cases from fraud prevention and
targeted recommendations to personalized experiences and drug
repurposing.
General Data Protection Regulation (GDPR) and the California
Consumer Privacy Act (CCPA). This same kind of complete view
and data lineage in graphs is also now used to understand and
track data used in machine learning (ML) for more responsible
artificial intelligence (AI) applications.
Improving the patient journey
Another area of emerging interest is the use of graphs for map-
ping, evaluating, and improving patient journeys. When a patient
doesn’t feel well, many factors are in play that may have evolved
over a period of time. Likewise, treatments are rarely a single
event, especially for chronic or serious illnesses. The tree of
possible symptoms, visits, tests, caregivers, treatment plans,
outcomes, and then secondary tests and treatments and so on can
branch out into an immense number of possible paths. Imagine
the patient treatment options that can be mapped with a graph
to better see the sequence alternatives and path splits after each
and every test result or visit. In fact, researchers and healthcare
providers already employ graphs to better understand what influ-
ences patient journeys so they can improve individual outcomes
as well as create and compare to optimal paths.
Recommendations and
Personalized Marketing
Making relevant product and service recommendations requires
correlating product, customer information, historic behavior,
inventory, supplier, logistics, and even social sentiment data.
Graph-powered recommendations and targeted marketing help
companies provide more appropriate services and experiences
to a wider range of users. For example, graph community detec-
tion algorithms are used to group customers with interactions or
similar behavior for more relevant recommendations. Research
shows that graph-enhanced ML can predict customer churn, for
example, for uses such as targeted prevention or marketing.
Graph analytics are also used to help target offers to online users
that are anonymous in name and demographics but not in site
behavior. Insights from analysis performed offline are typically
rolled into decision models used in production for real-time rec-
ommendations, which can include recommendations for products
that ship faster based on shifting stock levels or instantly incor-
porating data from the customer’s current visit.
Fraud Detection
The amount of money lost to fraud each year is growing, despite
increased use of AI and ML to detect and prevent it. To uncover
more fraud while avoiding costly false positives, organizations
look beyond individual data points to the connections and pat-
terns that link them. Organizations use the network structure to
augment existing ML pipelines as a practical approach to increase
the amount of fraud detected and recovered.
IN THIS CHAPTER
»» Bringing together diverse information
Chapter 3
Evolving Your
Application of GDS
Technology
Today, graph data science (GDS) is usually applied in busi-
ness with one or more major aims in mind: better decisions,
increased quality of predictions, and creating new ways to
innovate and learn. These goals are increasingly tied to tangible
benefits, such as reduced financial loss, faster time to results,
increased customer satisfaction, and predictive lift. You may be
trying to improve or automate decision-making by people and
domain experts that need additional context. Or perhaps your goal
is to improve predictive accuracy by using relationships and net-
work structure in analytics and machine learning (ML).
of these phases in this chapter. The first three phases of the GDS
journey are most prevalent in the commercial world today, and
the last two are emerging phases on your GDS journey.
Knowledge Graphs
Knowledge graphs are the foundation of GDS and offer a way to
streamline workflows, automate responses, and scale intelligent
decisions. At a high level, knowledge graphs are interlinked sets of
data points and describe real-world entities, facts, or things and
their relationships with each other in a human-understandable
form. Unlike a simple knowledge base with flat structures and
static content, a knowledge graph acquires and integrates adjacent
information by using data relationships to derive new knowledge.
instead sporting goods of higher quality for a special occasion.
The chatbot can also take into account what’s in stock, shipping
times, and specialty products, combining the context of not only
the requestor but also of supply and other logistics.
Graph Analytics
After implementing a knowledge graph (see the preceding sec-
tion), businesses often start using graph analytics to understand
their networks better and answer specific questions based on
relationships and topology. You’re often trying to infer mean-
ing based on the network structure: finding clusters, identifying
influential nodes, evaluating different pathways. Graph analytics
usually refers to the use of global queries and algorithms that look
at entire graphs for offline analysis of historical data. This process
is in contrast to small, real-time transactions and local queries
that focus on small areas around a few nodes.
Graph queries are used when you know exactly what you’re look-
ing for, such as asking a question like “How many relationships
does Mia have?” or “How many fraudsters or flagged accounts are
four hops away?” (A hop is a level or a layer of relationship.) These
kinds of queries seem simple because we can imagine standing
up and looking at things that are close to us. However, solutions
that don’t store relationships alongside their data must per-
form extra processes to look up and join this related information.
Graphs store relationships together with data so following the
path of relationships is simple and fast. Native graph databases
are particularly good at multiple hop queries because they avoid
expensive index lookups and data joins by storing and processing
related information adjacently and treating relationships as first
class citizens.
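As a rough illustration of the two example questions, the Python sketch below answers them on a tiny in-memory graph. The names and connections are hypothetical, and the adjacency-list dictionary only mimics the idea of keeping each node's relationships next to the node; it isn't how a native graph database is implemented.

```python
from collections import deque

# Hypothetical undirected graph: each node keeps its own list of neighbors,
# so following a relationship needs no separate lookup table or join.
graph = {
    "Mia": ["Sam", "Lee"],
    "Sam": ["Mia", "Ana"],
    "Lee": ["Mia"],
    "Ana": ["Sam", "Raj"],
    "Raj": ["Ana"],
}


def relationship_count(graph, node):
    """How many relationships does this node have?"""
    return len(graph[node])


def within_hops(graph, start, max_hops):
    """Return {node: hop distance} for every node at most max_hops away."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    del distance[start]  # the start node isn't one of its own neighbors
    return distance


print(relationship_count(graph, "Mia"))
print(within_hops(graph, "Mia", 2))
```

The breadth-first search visits one layer of relationships per hop, which is why a multi-hop question stays cheap when neighbors are stored alongside each node.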
densities inside, among group members when compared to inter-
actions outside of the group.
In graph analytics, you’re either asking a targeted question or
looking at the graph as a whole to infer meaning or make predic-
tions about future behavior.
FIGURE 3-2: Graph feature engineering is part of a larger ML workflow.
Graph Embedding
Graph embedding simplifies a graph, or a subset of a graph, into a
feature vector or set of vectors in a lower-dimensional form, such
as a list of numbers. The goal is to create easily consumable data
for tasks like ML that still describes the more intricate topology,
connectivity, or node attributes. For example, you can represent an
entire graph or a path as an embedding and then learn based on the
graph or paths themselves. There are three types of graph embeddings:
Graph Networks
Graph networks are an exciting area of research that represents
a new approach to ML that may drastically improve results with
less data, make predictions more explainable, and lead to new
types of learning itself. Graph network and graph native learning are
terms coined by Peter Battaglia and a group of researchers. They
concluded that using graphs for ML was the next major advance-
ment in ML itself because of the graph’s ability to abstract topol-
ogy. Their thinking follows this approach:
Graph native learning enables whole-graph learning and multi-
task predictions that reduce data requirements and automate the
identification of relevant features. Today, the valuable time of data
scientists and domain experts is frequently employed to tediously
select and test potentially predictive data and collect those fea-
tures into optimal models. Improving the model accuracy while
streamlining the process positively impacts ML processes and
results across all applications. We’re excited by early progress and
look forward to seeing ML evolve to be extremely efficient
and flexible as well as more accurate and transparent.
IN THIS CHAPTER
»» Running algorithms with the Neo4j GDS
Library
Chapter 4
Using Neo4j as a Graph
Data Science Platform
If you’re going to use graph data science (GDS), you should run
it on a platform. In this chapter, we show you what platform
pieces Neo4j offers to help you. Neo4j is a graph technology
company that provides an enterprise-grade GDS platform that
includes four components.
results. Algorithms are executed in an analytics workspace that
scales computations to handle graphs that contain tens of billions
of nodes and relationships. For examples, training, and details on
how to use the Neo4j GDS Library, visit neo4j.com/developer/
graph-algorithms. You can also go directly to the Neo4j GDS
Library at neo4j.com/graph-data-science-library.
To discover more about the property graph model that’s used by the
DBMS and other tools, check out Graph Databases For Dummies, Neo4j
Special Edition, at neo4j.com/graph-databases-for-dummies.
Neo4j Desktop and Browser
Neo4j Desktop is a user interface for operating local databases.
Neo4j Browser is a general purpose user interface for working with
the Neo4j database and is a core component of Neo4j Desktop.
Developers and data scientists can use this tool to query, visualize,
administer, and monitor their databases. The diagram in Figure 4-1
shows the Neo4j Browser being used against a fraud graph.
Neo4j Bloom
Neo4j Bloom is a graph visualization and exploration tool that
allows you to find patterns in a Neo4j graph by using a codeless
search paradigm. It uses an interactive point-and-click interface
to expand and refine results, find interesting paths, and share
insights with others.
IN THIS CHAPTER
»» Preparing a good dataset
»» Predicting fraudsters
Chapter 5
Detecting Fraud with
Graph Data Science
In this chapter, we walk you through an example of applying
graph data science (GDS) techniques to investigate and predict
financial fraud. After we familiarize you with a sample financial
transaction dataset, we then remove the outlier information that
may skew your results and identify suspicious clusters of clients.
After that, you visually explore one of the clusters for graph-
based indicators of fraud and look at how graph-based features
can help predict fraudulent behavior in the larger dataset.
FIGURE 5-1: The fraud dataset.
»» (Client)-[:HAS_PHONE]->(Phone)
»» (Client)-[:HAS_SSN]->(SSN)
»» (Client)-[:HAS_EMAIL]->(Email)
The analysis performed here is focused on the above informa-
tion, but the dataset also contains additional information, such as
transactions performed to banks, merchants, and clients.
Removing Outliers
An important first step when performing fraud analysis is to check
the quality of the data. In fraud datasets, you may have outliers
that aren’t relevant for your analysis. Outliers are rare events or
items that raise suspicions by being significantly different from the
majority of data. Outliers in a graph are based on their connectivity
and the topology of the graph instead of a property value.
To find potential outliers, you can run the Degree Centrality algo-
rithm against the fraud dataset with the following query:
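CALL gds.alpha.degree.stream({
  // The nodeQuery and the MATCH clause of the relationshipQuery are
  // missing from this copy; both are reconstructed here as an assumption.
  nodeQuery: 'MATCH (n) RETURN id(n) AS id',
  relationshipQuery: 'MATCH (n1)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(n2)
    RETURN id(n1) AS source, id(n2) AS target'
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, nodeId, score
RETURN labels(node) AS label, nodeId, score,
       node.email, node.phoneNumber, node.ssn
ORDER BY score DESC
LIMIT 10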
When you execute this query, you get many connections to fake
identifiers. The output is shown in Figure 5-2. See the appendix
for a full-featured view of this figure.
Four big outliers present high scores (column three) for the num-
ber of connections. These outliers represent nodes that have fake
email accounts, SSNs, and phone numbers.
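If you want to see the idea behind the degree score without a database, the following Python sketch counts connections per node on a handful of hypothetical client-to-identifier links. The client names and most identifier values are made up for illustration, and the sketch is a plain count rather than the GDS Library's implementation.

```python
from collections import Counter

# Hypothetical (client, identifier) links standing in for HAS_EMAIL,
# HAS_SSN, and HAS_PHONE relationships.
links = [
    ("client1", "email:fake@fake.com"),
    ("client2", "email:fake@fake.com"),
    ("client3", "email:fake@fake.com"),
    ("client4", "email:fake@fake.com"),
    ("client1", "ssn:000-00-0000"),
    ("client2", "ssn:000-00-0000"),
    ("client3", "ssn:000-00-0000"),
    ("client4", "phone:555-0100"),
    ("client5", "phone:555-0100"),
    ("client5", "email:ana@example.com"),
]


def degree_scores(links):
    """Count how many relationships touch each node, ignoring direction."""
    degree = Counter()
    for client, identifier in links:
        degree[client] += 1
        degree[identifier] += 1
    return degree


scores = degree_scores(links)
top_two = scores.most_common(2)
print(top_two)
```

Identifiers shared by an unusually large number of clients float to the top of the list, which is the pattern that points to filler values rather than real contact details.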
FIGURE 5-2: The results of the Degree Centrality algorithm.
Exclude these fake nodes from your analysis. More than likely, they
reflect people who chose not to fill in the form accurately rather
than fraudulent activity. If you don’t exclude them, you’ll find many
false positives based on people sharing common bogus filler
information, such as an email of “fake@fake.com.”
MATCH (n:Email)
WHERE n.email = '[email protected]' OR n.email = 'no@gmail.com'
SET n:BadEmail REMOVE n:Email;
MATCH (n:SSN)
WHERE n.ssn = '000-00-0000'
SET n:BadSSN REMOVE n:SSN;
MATCH (n:Phone)
WHERE n.phoneNumber = '000-000-0000'
SET n:BadPhone REMOVE n:Phone;
Islands of interacting nodes that have little connection to the
larger graph aren’t representative of typical financial behavior.
You can use this information and the Weakly Connected Com-
ponents algorithm to find disjointed subgraphs that suspiciously
share common identifiers.
CALL gds.wcc.write({
  nodeQuery: 'MATCH (c:Client) RETURN id(c) AS id',
  relationshipQuery: 'MATCH (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->
    (intermediate)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE NOT (intermediate:BadSSN)
      AND NOT (intermediate:BadEmail)
      AND NOT (intermediate:BadPhone)
    RETURN id(c1) AS source, id(c2) AS target',
  writeProperty: 'componentId'
});
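The idea behind this step can be sketched in a few lines of Python. The example below groups hypothetical clients that share an identifier, skips identifiers flagged as bad, and uses a union-find structure to label each connected group. The data and function names are assumptions for illustration, not the GDS Library's implementation.

```python
from collections import defaultdict


def find(parent, node):
    """Follow parent links to the group's representative, shortening the path."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def connected_components(clients, has_identifier, bad_identifiers):
    """Map each client to a component label based on shared identifiers."""
    parent = {client: client for client in clients}
    sharers_of = defaultdict(list)
    for client, identifier in has_identifier:
        if identifier not in bad_identifiers:  # mirrors the Bad* label filter
            sharers_of[identifier].append(client)
    for sharers in sharers_of.values():
        root = find(parent, sharers[0])
        for other in sharers[1:]:
            parent[find(parent, other)] = root  # merge the two groups
    return {client: find(parent, client) for client in clients}


# Hypothetical data: a, b, and c are linked through an email and a phone;
# d and e share only a filler SSN, so they stay separate.
clients = ["a", "b", "c", "d", "e"]
has_identifier = [
    ("a", "email:x@example.com"),
    ("b", "email:x@example.com"),
    ("b", "phone:555-0100"),
    ("c", "phone:555-0100"),
    ("d", "ssn:000-00-0000"),
    ("e", "ssn:000-00-0000"),
]
components = connected_components(clients, has_identifier, {"ssn:000-00-0000"})
print(components)
```

Notice that clients a and c end up in the same group even though they share nothing directly; client b connects them, which is exactly the kind of indirect link that's hard to spot in a table.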
After that, you can use the following query to see a distribution of
the cluster sizes returned by this algorithm:
MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS size
WITH size, count(*) AS count
RETURN CASE WHEN 1 <= size <= 2 THEN "1-2"
            WHEN 3 <= size <= 5 THEN "3-5"
            WHEN 6 <= size <= 9 THEN "6-9"
            ELSE ">= 10" END AS size,
       sum(count)
ORDER BY size
The result of this query shows that most clients are in small
clusters with only 1 or 2 clients and is illustrated in Figure 5-3. See
the appendix for a full-featured view of this figure.
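The bucketing logic in the distribution query can be mirrored in plain Python. This sketch assumes a hypothetical mapping of client to component ID and counts how many clusters land in each size range, using the same ranges as the query.

```python
from collections import Counter


def size_bucket(size):
    """Place a cluster size into the same ranges the query uses."""
    if size <= 2:
        return "1-2"
    if size <= 5:
        return "3-5"
    if size <= 9:
        return "6-9"
    return ">= 10"


def cluster_size_distribution(component_of):
    """Count clusters per size bucket from a {client: componentId} mapping."""
    cluster_sizes = Counter(component_of.values())  # componentId -> clients
    distribution = Counter()
    for size in cluster_sizes.values():
        distribution[size_bucket(size)] += 1
    return dict(distribution)


# Hypothetical component assignments for seven clients.
component_of = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 3, "g": 4}
distribution = cluster_size_distribution(component_of)
print(distribution)
```

As in the query, the result counts clusters rather than clients, so one large cluster adds just one to its bucket.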
MATCH (c:Client)
WITH c.componentId AS componentId, count(*) AS numberOfClients,
     collect(c) AS clients
WHERE numberOfClients >= 10
WITH componentId, numberOfClients,
     // Find all the identifiers of clients in a cluster
     apoc.coll.toSet(apoc.coll.flatten(
       [client IN clients |
         [(client)-[:HAS_SSN|HAS_EMAIL|HAS_PHONE]->(id) | id]])) AS ids,
     clients
RETURN componentId, numberOfClients,
       // Find out how many of those identifiers are shared.
       // Only return identifiers shared by > 1 Client in the cluster
       size([record IN [id IN ids | {
         id: id,
         sharedClients: size([(id)<--(client:Client)
                              WHERE client IN clients | client])
       }] WHERE record.sharedClients > 1 | record]) AS sharedIdentifiers
ORDER BY numberOfClients DESC
These query results are shown in Figure 5-4. See the appendix for
a full-featured view of this figure.
Visually Exploring a Suspicious Cluster
Exploring cluster 106 in a tool like Neo4j Bloom, which we cover
in Chapter 4, can help you further understand this group. You
can visualize the relationships between the clients in that clus-
ter with a Bloom search phrase. A Bloom search phrase is a way
that you can define a natural language construct that executes a
query against the database for you. The search phrase “explore
cluster 106” finds the relationships in Figure 5-5 between clients
in cluster 106.
FIGURE 5-5: The resulting graph of the search phrase “explore cluster 106.”
The nodes with horseshoe icons represent mules, the ones with
people icons are clients, the ones with mail icons are email
addresses, and the others are SSNs.
In this cluster, you have four mules, and you can also see three
email addresses that are shared by 13 clients. At this point, you
probably want to send a list of the people in this cluster to a
domain expert to explore further.
From this visualization, you can see that most of the clients in
this cluster are sharing just three email accounts. We can imagine
a couple of people sharing an email address, but more than that
may be worth exploring further.
CALL gds.betweenness.write({
  nodeQuery: 'MATCH (c:Client) WHERE c.componentId = 106
    RETURN id(c) AS id',
  relationshipQuery: 'MATCH (c1:Client)-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]->
    (intermediate)<-[:HAS_PHONE|HAS_EMAIL|HAS_SSN]-(c2:Client)
    WHERE NOT (intermediate:BadSSN)
      AND NOT (intermediate:BadEmail)
      AND NOT (intermediate:BadPhone)
    RETURN id(c1) AS source, id(c2) AS target',
  writeProperty: 'betweennessCentrality'
})
FIGURE 5-6: The result of using the Betweenness Centrality score for node
sizing in Neo4j Bloom.
The largest nodes are the most influential nodes in the cluster.
These nodes represent mules that are known to commit fraud.
At this point, you’ve identified suspicious behaviors and clusters.
After your fraud analysts confirm this likely nefarious activity,
you can use this information to predict mules in the larger dataset.
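To get a feel for what a Betweenness Centrality score measures, here is a small Python sketch of Brandes' algorithm for an unweighted, undirected graph. The graph is hypothetical, with one node sitting between four others, and the scores are unnormalized, so they may be scaled differently from the values the GDS Library writes.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm: count the shortest paths that pass through each node."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        visit_order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        path_count[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            current = queue.popleft()
            visit_order.append(current)
            for neighbor in graph[current]:
                if neighbor not in distance:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[current] + 1:
                    path_count[neighbor] += path_count[current]
                    predecessors[neighbor].append(current)
        dependency = {node: 0.0 for node in graph}
        while visit_order:  # walk back from the farthest nodes
            node = visit_order.pop()
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each pair of nodes is seen from both ends in an undirected graph.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical cluster: every client connects only through one middle node.
graph = {
    "middle": ["c1", "c2", "c3", "c4"],
    "c1": ["middle"],
    "c2": ["middle"],
    "c3": ["middle"],
    "c4": ["middle"],
}
scores = betweenness_centrality(graph)
print(scores)
```

The middle node lies on every shortest path between the other four, so it gets the highest score, while the outer nodes score zero. That bridging role is what the larger nodes in the Bloom visualization reflect.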
Predicting Fraudsters Using Graph
Features
In a real dataset, you wouldn’t actually know who the mules are,
but in the dataset we use, they’re identified. This identification
allows you to test your prediction that a higher betweenness cen-
trality score is predictive of fraud using the whole graph. A quick
check of your theory shows that clients with the mule label have
on average a 0.9685 betweenness centrality score, which is sig-
nificantly higher than non-mule scores as shown in Figure 5-7.
See the appendix for a full-featured view of this figure.
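A quick check like this one boils down to comparing two averages. The Python sketch below does that on a few hypothetical client records; the field names and scores are invented for illustration and aren't taken from the dataset.

```python
from statistics import mean

# Hypothetical client records with a known mule flag and a centrality score.
clients = [
    {"id": "c1", "is_mule": True, "betweenness": 1.0},
    {"id": "c2", "is_mule": True, "betweenness": 0.5},
    {"id": "c3", "is_mule": False, "betweenness": 0.0},
    {"id": "c4", "is_mule": False, "betweenness": 0.25},
    {"id": "c5", "is_mule": False, "betweenness": 0.125},
]


def mean_score_by_label(clients):
    """Return (mean score of mules, mean score of everyone else)."""
    mule_scores = [c["betweenness"] for c in clients if c["is_mule"]]
    other_scores = [c["betweenness"] for c in clients if not c["is_mule"]]
    return mean(mule_scores), mean(other_scores)


mule_mean, other_mean = mean_score_by_label(clients)
print(mule_mean, other_mean)
```

A gap between the two averages only suggests that the score is worth testing as a feature in a model. It isn't a replacement for evaluating that model properly.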
Figure 5-8 shows several graph features we might extract for dif-
ferent people in the graph. See the appendix for a full-featured
view of this figure.
After you’re happy with your fraud detection model, you can use
it in production to identify other mules as your graph evolves. As
new information is added to real-world graphs, it’s common to
iterate on this process and create new graph features and update
models.
IN THIS CHAPTER
»» Expanding your knowledge with Neo4j
resources
Chapter 6
Ten Tips with Resources
for Successful Graph
Data Science
If you’re wondering if your project is “graphy” and how to get
started with graph data science (GDS), this chapter can help.
We give you some Neo4j resources to guide you to more infor-
mation, and to help you explore your project’s opportunity and
successfully move forward from concepts to production, we
include these ten tips:
• Expand your knowledge of key concepts. Review material that
sets GDS in a larger context. Visit
neo4j.com/whitepapers/artificial-intelligence-graph-technology
for how graphs enhance AI.
»» Identify and engage a spearhead team. Using graph
technology in production can be new to many people, so
don’t expect teams to understand how to evaluate or
compare graph options to other solutions. Assemble a small
team that can become your experts in translating business
needs into technical requirements and the application of
GDS. Make sure to have representation from key organiza-
tions, including business, IT, and data science teams.
Provide your developers and data scientists with more
technical information. Your team will likely need time to
familiarize itself with the technology so look for resources
that allow an easy start. Some examples include
• neo4j.com/graph-algorithms-book
• neo4j.com/graph-databases-book
• neo4j.com/graph-databases-for-dummies
• neo4j.com/sandbox
»» Evaluate your “graphy” problem. Graph technology is
useful anywhere you have a lot of connected, interdepen-
dent information. But at some point you need to look into
what areas of your business to focus on and what kind of
project to start with.
Start with an intersection of ideas between users, business,
and technology. Consider hosting offsite or virtual innova-
tion sessions with your cross-functional team to define your
stakeholders’ needs, create connections-related questions,
story-board possible solutions, and identify key challenges
and opportunities. This collaboration may naturally lead to a
prototype that you can share with executives for feedback,
but the goal is to uncover promising target use cases.
»» Assess the current state. After you have a target use case
in mind, start with documenting your current state. Consider
existing problems as well as how the various parts of your
organization will have different experiences and issues. Find
out how your business sponsors view this use case and any
problems or opportunities. Be as specific as you can. For
example, what is the impact per customer of improved
online profiles? What’s the revenue implication of a half
percentage increase in recovered fraud? Also remember to
consider external market factors such as customer or
transaction growth, competitive factors, emerging opportunities
such as new delivery platforms, or productization
opportunities.
»» Map the value of the proposed state. Although your first
graph project may spawn many new ideas and future
projects, map the features of the near-term graph project
clearly and directly to business value. Consider the
current state and pain points and how your graph target use
case can help with business concerns such as cost savings,
increased revenue, new market opportunities, time to
market, risk mitigation, and the like. For example, uncovering
similar customer journeys and using that information in a
machine learning (ML) model may increase the accuracy of
churn prediction so the business can take early preventative
action and reduce revenue loss.
»» Measure ROI. For each of your value areas, determine how
you plan to measure your return on investment (ROI) or
success. For example, will you use predictive accuracy or
reduced financial loss to estimate the impact of your end
state? Compare the soft and hard costs of maintaining
existing processes to your graph project. If you’re unable to
audit your existing state, be more conservative when estimating
incremental savings or revenue opportunities. Likewise, it
may be difficult to measure the value of net-new capabilities,
such as answering previously intractable questions, so you
may need to get creative or add qualitative analysis.
»» Align stakeholders. Eventually, you need cross-functional
agreement on the goals and requirements of your graph
project. This process is iterative, not something you tackle at
one point in time. Different teams may have alternative
views on the project vision, key ROI, and even the role of
graph technology. Getting alignment on the goals of the
project and on how success is measured is essential, and
you may want to consider a process for dealing with conflict
or dissenting opinions.
»» Get your project approved. Taking advantage of new
technologies like GDS requires your stakeholders and
approvers to be comfortable trying something unfamiliar,
so your work to target the right use case, map values, and
estimate ROI needs to come together in a concise story that
aligns with your company’s motivations.
Document stakeholder assumptions about business value.
For example, customer churn may be an issue, but is it a
priority and why? You may be asked about the competitive
landscape as well as alternatives and the costs or lost
opportunity if you don’t proceed. Clearly document the
interdependent system touchpoints that are part of current
processes and the impact of your graph solution.
»» Conduct a POC and plan for production. Larger projects,
especially if the technology is new to a team, often require a
proof of concept (POC) before approval and deployment. A
POC can prepare your team for production and identify any
gaps. This process may involve iterating on previous
prototypes before you move into data modeling and testing
specific workflows.
In GDS, your data model and algorithm choices are highly
dependent on the questions you’re trying to answer. Your
data scientists and subject matter experts should be
involved to ensure the right assumptions are made. Also
make sure that your IT teams are involved to raise any red
flags and that your end users are on hand to evaluate any
usability concerns.
Vendors that provide POC services can help accelerate your
project with their graph experience. Visit
neo4j.com/professional-services for more info.
»» Get connected and continue your journey. Applying GDS
is a journey. You may start with one focused project and find
yourself answering questions you never knew you had. We
highly recommend your team connect and engage with the
graph community. Graph communities consist of active
groups of users that share new ideas and help with specific,
and sometimes unusual, questions. Getting involved in a rich,
active community with educational support and certifications
helps your team be successful with its first graph
project and expands the value of your graphs over time.
Visit the Neo4j community at community.neo4j.com, and
check out its resources at neo4j.com/graphacademy.
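To make the "Measure ROI" tip above a little more concrete, here is a minimal sketch of the arithmetic. Every figure, the conservative discount factor, and the function name simple_roi are hypothetical placeholders chosen for illustration; they are not values or methods from this book. Substitute the numbers from your own current-state assessment.

```python
def simple_roi(total_benefit, total_cost):
    """Return ROI as a fraction: (benefit - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost


# Hypothetical annual figures (assumptions, not data from the book).
recovered_fraud_gain = 400_000.0   # extra fraud recovered with the graph project
maintenance_savings = 100_000.0    # hard and soft costs no longer spent on old processes
project_cost = 250_000.0           # licenses, infrastructure, and team time

# If you could not audit the existing state, discount the estimated benefit.
conservative_factor = 0.75

estimated_benefit = (recovered_fraud_gain + maintenance_savings) * conservative_factor
roi = simple_roi(estimated_benefit, project_cost)
print(f"Estimated ROI: {roi:.0%}")
```

Net-new capabilities, such as answering previously intractable questions, don't fit neatly into this kind of formula, so you may still need a qualitative write-up alongside the number.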
Appendix
In this appendix, we formatted some of the figures from
Chapter 5 into full-featured tables, so you can better see the
details in each image.
Figure 5-2
Figure 5-3
“1-2” 15505
“3-5” 1231
“6-9” 137
“>=10” 9
Figure 5-4
106 18 5
4932 14 8
1087 13 4
562 11 3
83 10 4
959 10 5
1396 10 3
5160 10 5
7865 10 3
Figure 5-7
Comparing Betweenness Centrality Scores for the Entire Graph
isMule  average                max    stdev                 p50  p75  p95  p99   p99.9  count
true    0.9685534591194972     192.0  7.038325008720976     0.0  0.0  4.0  24.0  120.0  1908
false   0.0000999999999999937  2.0    0.014142135623729861  0.0  0.0  0.0  0.0   0.0    20000
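If you want to see how summary columns like these are computed, the following sketch uses only the Python standard library. The input scores are a hypothetical reconstruction, not the book's data: a single account with a score of 2.0 among 19,999 accounts scoring 0.0 happens to be consistent with the "false" row (average, max, sample standard deviation, and count). The nearest-rank percentile helper is also an assumption; the tool that produced the figure may use a different percentile method.

```python
import math
import statistics


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0-100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(scores):
    """Return average, max, sample stdev, count, and selected percentiles."""
    ordered = sorted(scores)
    summary = {
        "average": statistics.mean(ordered),
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "count": len(ordered),
    }
    for pct in (50, 75, 95, 99, 99.9):
        summary[f"p{pct}"] = nearest_rank_percentile(ordered, pct)
    return summary


# Hypothetical scores: one non-mule account at 2.0, all others at 0.0.
non_mule_scores = [2.0] + [0.0] * 19999
non_mule_summary = summarize(non_mule_scores)
for column, value in non_mule_summary.items():
    print(column, value)
```

The point of the comparison in the figure is the contrast: almost every non-mule account sits at zero, while the mule accounts show a long tail of higher scores.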
Figure 5-8
A Matrix of Graph-Engineered Features and Mule Classification
c.name         betweenness  sharedIdentities  clusterSize  mulesNearby  isMule
“Jacob Olsen”  0.0          1                 3            1            false
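As a rough illustration of how a row of this matrix could feed a classifier, the sketch below turns the sample row into a numeric feature list and a 0/1 label. The class name ClientFeatures, the helper to_training_pair, and the choice of feature order are hypothetical; only the column names and the sample values come from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientFeatures:
    """One row of the feature matrix (field names mirror the figure's columns)."""
    name: str
    betweenness: float
    shared_identities: int
    cluster_size: int
    mules_nearby: int
    is_mule: bool


def to_training_pair(row):
    """Split a row into (numeric feature list, 0/1 label) for model training."""
    features = [
        row.betweenness,
        float(row.shared_identities),
        float(row.cluster_size),
        float(row.mules_nearby),
    ]
    label = 1 if row.is_mule else 0
    return features, label


# The sample row shown in the figure.
jacob = ClientFeatures("Jacob Olsen", 0.0, 1, 3, 1, False)
features, label = to_training_pair(jacob)
print(features, label)
```

The client name is left out of the feature list on purpose, because it identifies the row rather than describing its position in the graph.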
WILEY END USER LICENSE AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook EULA.