
Day 1 - Boot Camp Intro To Graph

Uploaded by

Đức Nguyễn
Boot Camp Intro to Graph
Presented By: The letter G

1 © DataStax, All Rights Reserved. Confidential


Day 1

© 2018 DataStax, All Rights Reserved.


Agenda
● DSE Graph Overview
● Configuration
● Initial hands-on
● DSE Graph fundamentals, terminology and architecture
● Graph data modeling



What is a Graph Database?

● A database for storing, managing and querying highly connected and complex
data.
● A graph database architecture makes it particularly well suited for unlocking the
value in the data’s relationships and finding commonalities and anomalies in
large data volumes.



What is DataStax Enterprise Graph?
● Part of DSE’s multi-model platform.
● A scale-out graph database for cloud applications that need to manage complex
and highly connected data in real-time.
● Supports a property graph model native inside the DataStax product, engineered
specifically for Cassandra.
● Stores and finds relationships in large graphs quickly and easily.
● Built-in support for real-time search, and analytic graph queries via tight
integration with DSE.
● Aligns to DSE’s CARDS differentiators.
● Is a Property Graph
○ More on that later



Integrated with Apache Cassandra
● DSE Graph inherits all of Cassandra’s key benefits including constant uptime,
write/read/active-everywhere functionality, linear scalability, predictable low-
latency response times, and operational maturity.
● To that foundation, DSE Graph adds other performance-enhancing capabilities that
include
○ adaptive query optimizer
○ locality-driven graph data partitioner
○ distributed query execution engine
○ various graph-specific index structures.



Inspired by Titan
● Titan is a TinkerPop-enabled, scale-out graph database that has a pluggable
storage backend option that allows it to persist data to a variety of databases
including Cassandra, HBase, and others.
● DSE Graph benefits from years of experience with real-world, scale-out graph
database use cases implemented against Titan.
● While DSE Graph uses Titan as a model, it integrates much more deeply into
Cassandra and the DSE platform to provide better consistency and performance as
well as additional features.
● DSE Graph is not an upgraded version of Titan.



Built with Apache TinkerPop
● Apache TinkerPop™ is an open source graph computing framework that enables
database and data analytic systems to offer graph computing capabilities to their
users.
○ De facto standard for graph (used by Amazon and Azure)
○ Another is Neo4J
● Gremlin is the standard language for graph. What SQL is to an RDBMS, Gremlin is
to a graph database.
○ Neo4J uses Cypher
○ There is no conversion between Cypher and Gremlin, but Neo4J has started to offer some TinkerPop
functionality
● DataStax is a heavy contributor to TinkerPop and applies that knowledge and
experience to developing DSE Graph
● Along with the Gremlin language and virtual machine, TinkerPop provides various
supporting tools such as Gremlin Server, data bulk loaders, graph algorithms,
visualization tool connectors, and more.
tinkerpop.apache.org


g.V().has("name","gremlin").
repeat(in("manages")).until(has("title","ceo")).
path().by("name")
>> The management chain from Gremlin to the CEO
9 © DataStax, All Rights Reserved. Confidential
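As a rough illustration of what the repeat/until traversal on the tinkerpop.apache.org slide does, here is a minimal Python sketch over an in-memory dictionary. The people (alice, bob), their titles, and the management_chain helper are made up for illustration; this is an analogue of the idea, not how TinkerPop or DSE Graph executes the query.

```python
# Hypothetical org chart: each manager maps to the people they manage,
# mirroring "manages" edges that point from manager to report.
manages = {
    "alice": ["bob"],
    "bob": ["gremlin"],
}
titles = {"alice": "ceo", "bob": "director", "gremlin": "engineer"}


def management_chain(start):
    """Walk incoming "manages" edges from start until a CEO is reached."""
    # Invert the edges once so each person maps to their manager.
    manager_of = {
        report: boss for boss, reports in manages.items() for report in reports
    }
    path = [start]
    while titles[path[-1]] != "ceo":
        path.append(manager_of[path[-1]])
    return path


chain = management_chain("gremlin")
print(chain)
```

The list plays the role of path().by("name"): it records every name visited on the way up.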
Enterprise-Readiness with DSE Extensions/Support

● Advanced security protection, encryption, and authentication.


● In-Memory capabilities to speed up transactional workloads.
● Tiered storage and multi-instance support for improved TCO and performance.
● Visual management, monitoring, and development tooling.
● Around-the-clock support from the graph experts at DataStax.
● Formal
○ End-of-life policies
○ Certified software updates
○ Hot-fixes
○ Bug escalation privileges



Graph Search with DSE Search
● DSE Graph has seamless support for DSE Search powered by Apache Solr
● Simple, schema-driven index management
● Search can be enabled for OLTP and/or OLAP workloads
● Allows
○ Case Insensitive Wildcard Search
○ Indexed as both text and string
○ Phrase Searching
○ Fuzzy Search
○ Tokenized Fuzzy Search
○ Geospatial in distance or degrees
● DSE’s Graph query optimizer automatically uses Solr behind the scenes



Graph Analytics with DSE Analytics
● DSE Graph has seamless support for DSE Analytics powered by Apache Spark.
● No need to learn Spark –
○ Gremlin language used both for OLTP and OLAP workloads
● If you prefer Spark, use DSEGraphFrames
○ Deep integration with GraphFrames
○ Enables combining C* and graph data
○ Enables Spark Streaming/SQL directly to graph
● OLAP workloads can be separated from OLTP workloads.



Developer support with DataStax Studio
● Web-based developer solution which helps developers visually explore, query, and
troubleshoot DSE Graph in one intuitive UI.
● Auto-completion, result set visualization, execution management, and much more



Data Loading Support with DSE Graph Loader
● Simplifies loading large amounts of enterprise data from various sources into DSE
Graph (CSV, JSON, GraphSON, GraphML, Gryo, JDBC, Titan, Neo4J, etc.)
● Inspects incoming data for schema compliance.
● Uses declarative data mappings and custom transformations to handle diverse
types of data.
[Diagram: JSON and RDBMS sources feed the Graph Loader, which applies data mappings and loads DSE Graph through batch loading and stream ingestion]



Operational support with DataStax OpsCenter
● Web-based operations solution which can launch, manage, monitor and
troubleshoot DSE clusters and deployments.
● Launch wizard, failure alerts, monitoring dashboards, and much more.



Conclusions

● DSE Graph
○ A scale-out graph database purpose-built for cloud applications that need to
act on complex and highly connected relationships.
○ Inherits all of the native power of Apache Cassandra as well as the enterprise
functionality of DSE, making it the first and best choice for today’s enterprise
systems that require graph support.



DSE Graph
Positioning, Qualification and
Selling
How do you know if you have a graph use case? This
is actually a difficult question, but we want to touch on it
briefly. Feel free to ask questions and bring forth your
ideas.



Eleven DataStax Use Cases

Customer Experience (CX)


● Customer 360
● Personalization & Recommendation
● Loyalty Program
● Fraud Detection
Enterprise Optimization (EO)
● eCommerce
● Inventory Management
● Asset Monitoring
● Supply Chain Management
● Logistics
● Security and Compliance
● Identity Management



Each a Possible Graph Use Case

While all may have a use for a graph


● Not all require it
● Most often CX use cases are great DSE Graph matches
● The rest all depends on the customer’s needs
● So you need to qualify for graph
○ Does it align with CARDS? (DSE qualification)
○ Does it need human intervention and real-time traversal of relationships?
(Walking through relationships, e.g. in C360)
○ There are many other qualifying questions
○ We will dive a little bit into it later
○ But qualification is not the focus of Graph in this boot camp
○ We will spend most of our time on the technical aspects



Obvious Graph Use Cases

● Customer 360
● Recommendation
● Personalization
● Fraud



C360
This is who I am



C360
Description: Aggregating all the data about an individual customer’s touch points across a time axis to create a view
into the customer’s journey.
Storyline: Example: A customer calls Billing at a hospital. Because the payer’s system unifies the
customer’s data into a C360 graph, the billing center person immediately knows the customer’s
most recent interactions with their system. The billing center person can see that this customer
has called 3 times in the past month (twice about the same bill and once to complain
about a canceled appointment), has recently gone to Pediatrics, and has taken his 10-year-old
daughter for an x-ray in Radiology.

The hospital’s data on similar complaint trends shows a 60% chance of churn for
customers immediately after the third billing issue.
Identifiers: Customer Journey, Customer Profile, Instant Customer Profile, Siloed Customer Information,
Customer Data Integration, Single Point of Reference, Single Source of Truth, Customer Data Consolidation
Key Risks: Entity Resolution: the foundation of a C360 graph relies heavily on analytical
processes for matching and merging customer data across disparate systems.



Personalization
This is what I care about



Real-Time Personalization
Description: The process of tailoring content to an individual customer’s characteristics or preferences.

Storyline: Example 1: Built responsive, real-time apps that put data at call center agents’
fingertips for a seamless, intelligent interaction with members, preventing negative experiences.

Example 2: Macquarie Bank personalizes their apps based on what that individual customer would
like to see: credit card, then checking account transactions, with their partner’s balance also
showing based on the latest transactions. As such, what shows up top is based on the
instantly updated last transaction.
Identifiers: Content Based Filtering, Customer Analytics, Localized Content, Individualized, Omnichannel
Key Risks: Confusion with C360: Personalization is based on the knowledge and relationships
between data from the customer’s C360 project.

Confusion with Recommendations: If the customer just browsed something and you present
them something else from the catalog, that is Recommendations, not Personalization.



Offer & Recommendation
This is what you might care about

● These used to be 2 separate use cases
● This year they have been rolled up into one DataStax use case due to the
synergy between the two



Offer
This is what you might care
about



Recommendation
This is what you might care
about



Recommendation & Offers
Description: Across one or more interaction channels, recommend other products or services to view/buy and/or offer
special deals or discounts to the customer.
Storyline: Recommendation: When a customer logs into their account, USAA wants to populate a banner with
cars that the customer may want to purchase. The set of recommended vehicles can be generated
from previous purchases of people in the same demographic, or from the customer’s vehicle
search/viewing history.

Offers: Walmart has a special offer: based on your location (say in Baltimore, MD) every time you
spend $50 on electronics you get a 20% discount that is valid only for another electronics purchase in
the next 24 hours.
Identifiers: Product Recommendation, A/B Testing, Ad Targeting, Conversion, Digital Marketing,
Geotargeting, Cross-Sell, Up Sell
Key Risks: Wide graph queries: recommending or offering products based on a population
set or demographic requires a very wide graph query. This type of query will not
be as performant as a recommendation that starts from a single vertex (like one
product).



Are you sure this is me?

Fraud
Is it legit



Fraud
Description: To identify patterns and anti-patterns to instantly find identity theft, scams, online wire fraud, and the
like. This requires monitoring and analyzing data transactions in real time to identify atypical patterns so
action can be taken immediately, such as denying a purchase or locking an account.
Storyline: The customer is a 68-year-old woman in Florida who travels around the world but has no younger family
members for whom she has made major purchases in the past 5 years. The credit card company notices a
purchase being made for a BMW motorcycle in Sydney using this customer’s credit card, although the
customer has not purchased any flights on the same card, and the company can also instantly map Retail
Bank partner data that shows an ATM transaction made in Boca Raton, FL a few seconds ago. The card
company instantly denies that purchase and freezes the account until it has a chance to confirm the
transaction with the customer.
Identifiers: Advanced Persistent Threat, Cyber Fraud, Intrusion, Intruder Detection, Malware Detection,
Phish Detection
Key Risks: Dynamic Graph Schema: We don’t know every single piece of data that will enter the system about
an individual user. In order to build out a graph database that is capable of monitoring fraud, we need
some level of flexibility in the graph schema over time so that we can augment it with known fraudulent
properties as the rule set grows.



What about the other use cases?

Enterprise Operations (EO)


● eCommerce
● Inventory Management
● Asset Monitoring
● Supply Chain Management
● Logistics
● Security and Compliance
● Identity Management



It becomes trickier

It is still being worked out what it means to qualify for graph


● Like sizing, it becomes an art, and experience helps
● You really have to qualify with CARDS
● Think: do these use cases drive them towards visualization of their data?
○ Is there a need for human intervention?
● Do they need to walk or explore relationships?



Potential Good Graph Use Cases

Examples of Graph Shaped Business Problems (IOT, Asset Management, Networking)


● How can I easily perform analysis on numerous relationships that form among data
elements and tend to be of much greater interest when examined collectively than
reviewed in isolation?
● A graph is also a good model for managing network assets (with their properties or
configurations) and how they relate to each other over time.



Entity Resolution
With DSE
Though this is not primarily a graph topic, once we start talking about
customers, C360, personalization, etc., we need
a strong strategy for identifying who a customer is. As we do
more with graph, this becomes more and more
important.



Ideal

C360
This is who I am



Reality

C360
Entity Resolution critical to
Customer 360



CX Solutions
[Diagram: Recommendation, Personalization, Fraud, and Customer 360, with Entity Resolution underneath]
Entity Resolution: the core problem we need to help our customers
solve in order to build any CX solution



DSE Solution
[Diagram: Recommendation, Personalization, Fraud, and Customer 360 over Entity Resolution, over DSE Search, DSE Analytics, and DSE Graph, over DSE Core]
Entity Resolution: critical beyond Graph



Solving Entity Resolution

[Diagram: Entity Resolution over DSE Search, DSE Analytics, and DSE Graph]
● Property Matching
● Global Confidence Scoring
● Relationship Inference



Property Matching - Field Matching
● 3 aspects
○ Strong identifiers (hard rules, exact match only)
○ Fuzzy matching (acceptable discrepancy on characters)
○ Field transposition (swap birth month and day)

Record 1             Record 2             Record 3             Record 4
Denise Gosnell       Denise Gosnel        Denise Koessler      Denise Gosnell
10/06/1986           10/06/1986           10/06/1986           06/10/1986
1 Main St.           1 Main St.           1 Main St.           1 Main St.
Daniel Island, SC    Charleston, SC       Charleston, SC       Daniel Island, SC
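The matching aspects above can be sketched in Python using the four example records. This is only an illustration: the 0.9 similarity threshold, the helper names, and the use of difflib are assumptions, not part of any DSE feature.

```python
from difflib import SequenceMatcher

# The four example records: (name, birth date, street, city).
records = [
    ("Denise Gosnell", "10/06/1986", "1 Main St.", "Daniel Island, SC"),
    ("Denise Gosnel", "10/06/1986", "1 Main St.", "Charleston, SC"),
    ("Denise Koessler", "10/06/1986", "1 Main St.", "Charleston, SC"),
    ("Denise Gosnell", "06/10/1986", "1 Main St.", "Daniel Island, SC"),
]


def fuzzy_match(a, b, threshold=0.9):
    """True when two strings differ by only a few characters (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_transposed(date_a, date_b):
    """True when month and day are swapped between two MM/DD/YYYY dates."""
    m1, d1, y1 = date_a.split("/")
    m2, d2, y2 = date_b.split("/")
    return y1 == y2 and m1 == d2 and d1 == m2 and m1 != d1


def compare(a, b):
    """Return (name matches fuzzily, how the birth dates relate)."""
    name_matches = fuzzy_match(a[0], b[0])
    if a[1] == b[1]:
        date_result = "exact"
    elif is_transposed(a[1], b[1]):
        date_result = "transposed"
    else:
        date_result = "different"
    return name_matches, date_result


# Compare the first record against each of the others.
results = [compare(records[0], other) for other in records[1:]]
for other, result in zip(records[1:], results):
    print(other[0], other[1], result)
```

In this sketch the misspelled "Gosnel" passes the fuzzy name check, "Koessler" does not, and the fourth record is flagged as a month/day transposition.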



Property Matching - Relationship Matching
● Identify based on entity relationship



Property Matching - Relationship Matching

Public Device?

Family Member?

Stolen Identity?



Global Confidence Scoring
● A system needs to be able to measure how well the matching rule set performs on
the data of interest
● Create a realistic, labeled data set to test and measure the system
● Have to minimize False Positives

                  Predicted P         Predicted N
Actual P          True Positives      False Negatives
Actual N          False Positives     True Negatives
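A minimal Python sketch of scoring a matching rule set against a labeled data set follows. The labels and predictions are invented for illustration, and the metric names (precision, recall, false positive rate) are standard measures added here, not terms from the slide.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a and not p:
            fn += 1
        elif not a and p:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labeled test set: True means "these two records are the same person".
actual = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, False, True, False, False, True]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)            # share of predicted matches that are real
recall = tp / (tp + fn)               # share of real matches that were found
false_positive_rate = fp / (fp + tn)  # share of non-matches wrongly merged

print(tp, fp, fn, tn, precision, recall, false_positive_rate)
```

Minimizing false positives corresponds to pushing false_positive_rate down (and precision up), since a false positive here means merging two different people.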



Relationship Inference
● Identify characteristics and use observation and logic on subject identification
[Diagram: two master identities above records a, b, c, d, e, f]



Relationship Inference

[Diagram: the two master identities with a question mark between them, above records a, b, c, d, e, f]



DSE Solution
● We are technologically positioned to solve Entity Resolution challenges
[Diagram: DSE Analytics (stream processing and batch processing), bulk data loading, DSE drivers, DSE Graph, DSE Search]



Configuring
DSE Graph

How you go about turning on graph and setting up DSE with Graph.



Graph Tables

Graph Tables are C* tables with some special characteristics (more on that later)
● Thus configuring and tuning graph tables is configuring Cassandra
○ Note: do not create or drop graph tables by hand; use either:
■ Gremlin-console
■ Studio
○ Both techniques, as well as doing it via code, will be discussed later
● So tune C*
○ gc_grace_seconds, default_time_to_live, dclocal_read_repair_chance, etc.
● Set Replication factor on graph tables
system.graph("KillrVideo").
option("graph.replication_config").
set("{'class' : 'NetworkTopologyStrategy','DC-East' : 3,'DC-West' : 5}").
option("graph.system_replication_config").
set("{'class' : 'NetworkTopologyStrategy','DC-East' : 3,'DC-West' : 3}").
ifNotExists().create()



DSE Yaml File

● As with many DSE features, much of the configuration for integration points
is done in the dse.yaml file.
● Configuration settings for DSE Graph are found at the bottom of the file
● Normally no setting should be changed unless a specific use case
requires it
● The most likely candidates are:
○ system_evaluation_timeout_in_seconds
○ analytic_evaluation_timeout_in_minutes
○ realtime_evaluation_timeout_in_seconds
○ schema_agreement_timeout_in_ms
○ max_query_queue

Note: In DSE 5.0 and 5.1 the best practice is to leave schema_mode set to Production
● In 6.0 the setting was removed from dse.yaml and the mode is set to Production

DSE Graph Startup - dse / dse.default

While working on our local systems with a tar install, we started DSE with
● dse cassandra -s -k -g
● where -g stands for starting with graph
In a real production environment you will normally have your system start automatically
When using the default init script to autostart graph, you flip a flag in /etc/default/dse
● A copy of this file can be found in a tar install in the resource/dse/conf directory
● Just change GRAPH_ENABLED=0 to 1 and graph starts when dse starts

Note: All nodes in a single DC should have graph either enabled or disabled; do not have a
mixed workload within a single DC
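Flipping the flag can be scripted. The Python sketch below works on the file's text rather than the file itself; the enable_graph helper and the two-line sample are hypothetical, and a real /etc/default/dse contains more settings than shown.

```python
import re


def enable_graph(config_text):
    """Return config_text with GRAPH_ENABLED set to 1 (other lines untouched)."""
    return re.sub(
        r"^GRAPH_ENABLED=\d+",
        "GRAPH_ENABLED=1",
        config_text,
        flags=re.MULTILINE,
    )


# Hypothetical excerpt of the defaults file, for illustration only.
sample = "SOLR_ENABLED=0\nGRAPH_ENABLED=0\n"
updated = enable_graph(sample)
print(updated)
```

Working on a string keeps the sketch safe to run anywhere; applying it to the real file would mean reading the file, calling the helper, and writing the result back on every node in the DC.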



That’s it

Not a lot of configuration to do on the graph side


● The real work/trick is how to model your graph system
○ It will be a graph model, not a Cassandra model
○ So again you will need to switch mental spaces
○ For some it is quite intuitive, as graph is what you whiteboard
○ For others it is alien, as that is not how they model data
○ Again, that will have to wait for later
● And how to access your data
○ We will use the Gremlin graph query language
○ Somewhat intuitive but a bit different than a SQL or CQL query
● Don’t worry, you will have some hands on practice



Initial Hands-on
DSE Graph

Once you have graph enabled, you will need to know
how to access it. This is what we are covering next.



Accessing Graph is Done in 3 ways

Gremlin Console
● dse gremlin-console
● Groovy based
● allows you to execute gremlin commands
● :help gives you info when in the shell
DataStax Studio
● The same tool used earlier to run CQL commands
● Allows you to execute Gremlin
● Runs queries in real time or against the analytics engine
● Great for initial development and exploring
Drivers to access via various languages
● Encourage customers to use the DSE drivers
● Open Source drivers cannot access graph



DSE Graph Basics Review
Vertex: Susan
Property Name        Property Value
Name                 Susan
Gender               Female
Age                  23

Edge: Works For (from Susan to DataStax)
Property Name        Property Value
Start Date           1/1/2017
Location             Santa Clara

Vertex: DataStax
Property Name        Property Value
Company Name         DataStax
Corporate Address    3975 Freedom Cir, Santa Clara, CA 95054

Schema: Person -[Works For]-> Company



Hands On: The Gremlin Console
● Follow along on your own machine as the
instructor walks through the Gremlin console.
● Pay attention to the steps, as we will do a very
similar exercise using Studio when done.



The Gremlin Console - Creating a Graph
We will get into the whats, hows, and whys of graph and graph modeling later.

But to get your hands dirty let's do a few things in the console

Start up the console


● dse gremlin-console
○ Note the plugins that are activated when you start
Create a graph
● system.graph('BootTest').create()
○ It doesn’t return anything but we can check if it was created
List the graphs in the database
● system.graphs()
○ You should now see the graph you created



The Gremlin Console - Config Alias

We could now start playing with the graph by typing BootTest.g… every time we want to do
something. This is cumbersome, and most examples you will see out in the wild
refer to ‘g’ as the graph object

So let's make an alias so we can do the same

● :remote config alias g BootTest.g
○ You should see a prompt showing that the object g now stands for the BootTest graph

Note that the console is not that friendly (no tab completion, etc.), but it allows you to get the
job done.

At this point you have made a graph in the Cassandra database, but there is nothing in it.
It is a concept only at this point.



The Gremlin Console - C* tables

If we check what was created in Cassandra by using cqlsh:


● Two new keyspaces were created
○ "BootTest_system"
■ table shared_data (1 row of data)
■ table shared_data_versions (1 row of data)
■ table dseg_variables (empty)
○ "BootTest"
■ table id_allocation (empty)
● Not much in them, but it does prove a point: a special structure and set of tables
is created along with a graph database in DSE.
○ Do not try to replicate this by hand using CQLSH. Graph databases have to
be created by graph tools, not CQLSH.

Note: in DSE 5.0 and 5.1 there was also a *_pvt keyspace



The Gremlin Console - Dev Mode

By default graph schemas are set to ‘Production’ mode.


● This restricts the Graph system from modifying the schema for you and disables
scanning by default.
But what if you didn’t want to worry as much about schema creation?
● You are in development or exploratory mode
● You want to just do things and see what happens
We can switch this system to development mode. In the console
● schema.config().option('graph.schema_mode').set('Development')
○ Now the Graph system will create the schema on the fly for us
○ Check schema_mode by schema.config().option('graph.schema_mode').get()
Confirm the graph is currently empty
● g.V().count()
○ should show 0



The Gremlin Console - Add Vertex

The graph is empty, which is useless, so we will add an object:

● matt = graph.addVertex(label,'Instructor', 'Name', 'Matt', 'Pet_Peeve', 'Timeliness')
○ You will get back something that shows an object was created
○ Details on what community_id and member_id are will come later; for now we are staying at a very
high level

A vertex is the object in the graph, usually thought of as a noun: the who/what of the graph
● More details later as we get into what a graph is and the syntax that describes it



The Gremlin Console - Add Vertex

Add another couple of objects


● bob = graph.addVertex(label,'Instructor', 'Name', 'Bob', 'Pet_Peeve', 'Matt')
● bootcamp = graph.addVertex(label, 'Course', 'Name', 'Boot Camp', 'Iteration', 18)
● sam = graph.addVertex(label, 'Student', 'Name', 'Sam', 'Role', 'SA')
● g.V().count()
○ Should now return 4, one for each object
● g.V().valueMap()
○ Will give you some details
○ Never do this in a real graph database as you would walk all your objects, ewww



Gremlin Console - Add Edge

We have the who, but graph is all about relationships, so we need to add some
relationships between these objects. These connectors are called Edges and are thought
of as actions. Again more later.

So if Matt ‘teaches’ a class we should show that


● matt.addEdge('teaches', bootcamp)
○ Note that 'teaches', the action, is the edge's label
○ All kinds of stuff comes back
Sam attends the boot camp 18 course
● sam.addEdge('attends', bootcamp)
Guess what Bob does
● bob.addEdge('tolerates', matt)



The Gremlin Console - Viewing Graph

To count the edges you added
● g.E().count()
Since we are in Development mode, the schema was created on the fly as we added objects
and actions. We can see what we created with
● schema.describe()
Enough playing around; let's delete the graph
First clear the alias
● :remote config alias reset
Now drop the graph
● system.graph('BootTest').drop()
○ Checking in CQLSH we see the keyspaces have been removed
● :quit
○ To exit the console
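To make the walkthrough concrete without a DSE cluster, here is an in-memory Python analogue of the BootTest graph. The add_vertex and add_edge helpers and the integer ids are hypothetical stand-ins; they only mimic what graph.addVertex and addEdge do conceptually.

```python
vertices = {}  # vertex id -> {"label": ..., property key: value, ...}
edges = []     # (out vertex id, edge label, in vertex id)


def add_vertex(label, **properties):
    """Store a labeled vertex with its properties and return its id."""
    vertex_id = len(vertices) + 1
    vertices[vertex_id] = {"label": label, **properties}
    return vertex_id


def add_edge(out_vertex, label, in_vertex):
    """Store a directed, labeled edge between two existing vertices."""
    edges.append((out_vertex, label, in_vertex))


matt = add_vertex("Instructor", Name="Matt", Pet_Peeve="Timeliness")
bob = add_vertex("Instructor", Name="Bob", Pet_Peeve="Matt")
bootcamp = add_vertex("Course", Name="Boot Camp", Iteration=18)
sam = add_vertex("Student", Name="Sam", Role="SA")

add_edge(matt, "teaches", bootcamp)
add_edge(sam, "attends", bootcamp)
add_edge(bob, "tolerates", matt)

vertex_count = len(vertices)  # analogue of g.V().count()
edge_count = len(edges)       # analogue of g.E().count()

# Who is connected to the course by an incoming edge?
linked_to_course = sorted(
    vertices[out_v]["Name"] for out_v, _, in_v in edges if in_v == bootcamp
)
print(vertex_count, edge_count, linked_to_course)
```

The sketch has no schema at all, which loosely resembles Development mode; a real DSE graph records vertex labels, edge labels, and property keys in its schema as you add data.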



DataStax Studio

So let’s do the exact same thing using Studio rather than the console.

Open up studio and create a new notebook

Follow the exercise instructions



Exercise: Creating Simple Graph with Studio
● Open up Studio and create a new notebook
● When prompted, create a new connection
○ BootTest2
● And a new graph
○ BootTest2
● Check in CQLSH to see if it was created
● Now add all of the objects and actions as per
the Gremlin Console walkthrough
○ Note you do not have to set up the alias
○ system.graphs()
○ schema.config().option('graph.schema_mode').set('Development')
○ g.V().count()
○ etc.
● Before you drop, click on the schema button
● And do a g.V(), playing with the buttons
○ Use {{{Name}}} on visual config



DataStax Studio

The nice thing about Studio is it saves all your commands


● Gremlin-console will keep a certain amount of history you can access with the up
arrow key
● Studio keeps it all unless you delete a notebook
○ And with Studio 6.0 you can export and import notebooks for long-term storage
● Studio has much better visualization of your graph
● Studio helps with autocomplete
○ ctrl+space
● In Studio, you can clear all your results and try again
○ After clearing the graph and the data in it, of course

Note: Studio is an individual tool right now. There’s no restricting access to components
of studio, i.e. a user who has Studio and is in dev mode can clear a schema. Also,
sharing is a bit of a challenge. Studio is heading in the multi user direction in future
releases.

Accessing Graph Programmatically

While gremlin-console exists and Studio can help you explore and develop,
real systems eventually need to be part of an application, not a development tool

DSE provides drivers to access DSE Graphs in the following languages


● Java
● Python
● C#
● C/C++
● Node.js
● Ruby
● PHP



DSE Graph
Fundamentals



Why Graph Database?
● Performance
○ Address relationship questions with improved performance
● Flexibility
○ Relationship definition changeable as target space is better understood
○ Add new entities and relationships without compromising existing data,
relationships and queries
● Agility
○ Flexibility in graph data model allows pushing initial model out, bypassing the
need to construct a complete, inflexible model in the beginning



What Types of Graph DBs Exist?

When we discuss Graphs in general we are discussing three possible Graph Types
● Property Graphs
● Resource Description Framework (RDF)
● Hyper Graphs

DSE Graph is a schema-based, labeled, typed property graph


● Schema
○ DSE graph requires a schema to be defined
● Labeled
○ Objects are identified by a label
● Property
○ Objects have properties defined as key/value pairs
● Typed
○ Properties are typed, i.e. a timestamp is not a string object
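The four traits can be pictured with a small Python sketch that checks a vertex against a schema. The schema dictionary, the "Joined" property, and the validate helper are assumptions made for illustration; DSE Graph enforces its schema internally and does not expose this API.

```python
from datetime import datetime

# Hypothetical schema: vertex label -> property key -> required Python type.
schema = {
    "Person": {"Name": str, "Age": int, "Joined": datetime},
}


def validate(label, properties):
    """Return a list of problems; an empty list means the vertex fits the schema."""
    if label not in schema:
        return [f"unknown label {label!r}"]
    problems = []
    for key, value in properties.items():
        expected = schema[label].get(key)
        if expected is None:
            problems.append(f"undefined property {key!r}")
        elif not isinstance(value, expected):
            problems.append(f"{key!r} should be {expected.__name__}")
    return problems


ok = validate("Person", {"Name": "Susan", "Age": 23})
bad = validate("Person", {"Name": "Susan", "Joined": "1/1/2017"})
print(ok, bad)
```

The second call fails because a date written as a string is not a timestamp, which is the point of the "Typed" bullet.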



DSE Graph has

Vertices
● Objects, Common Nouns, however you want to think of them
● With labels
● And properties
○ Key/Value pairs
○ Typed
○ Single or multiple
○ Can have meta properties (properties of properties)
Edges
● Actions, verbs, the what of a relationship
● With labels
● And Properties
○ Same rules as vertex, except
○ No meta properties allowed on edges



DSE Graph has

Vertex to Edge relationships are one way

● Can be accessed from the IN direction
● Can be accessed from the OUT direction
● Usually no need to define the edge both ways

OLTP (Online Transactional Processing)
● Data Center with Solr/Cassandra/Graph
● Not true transactions
OLAP (Online Analytical Processing)
● Data Center with Cassandra/Solr/Spark/Graph
● DSE 5.1+ integrates with graph frames for deep analytics



DSE Graph Terminology
● Adjacent Vertex - A vertex directly attached to another vertex via an edge
● Adjacency List - Describes the neighbors of a vertex (vertices and properties)
● Cardinality - Applies to an edge or property, which can be single or multiple
● Data Type - Applies to property keys; every property key must be assigned a data type
● Element - A vertex, edge or property
● Graph Traversal - Algorithmic walk across graph elements in search of target data
● Incident Edge - An edge, outgoing or incoming, that touches a vertex
● Meta-property - A property that describes some attribute of another property
● Relation - Relates a vertex to another vertex (edge) or a value (property)
● Vertex Degree - Number of edges incident to a vertex



DSE Graph Allows you

Select subgraphs


Traverse the graph (walk paths) in multiple ways
● Simple traversal
○ You never repeat an edge
● Cyclic traversal
○ Can repeat edges possibly walking in circles
Find the shortest path
● How to get from a to z with the fewest hops
Find the degree
● How many edges are attached to a vertex
○ In degree - how many edges point to the vertex
○ Out degree - how many edges point from the vertex
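
The degree and shortest-path ideas above can be sketched in plain Python. This is a toy in-memory model, not how DSE Graph executes traversals; the edge data and the function names (`in_degree`, `out_degree`, `shortest_path`) are made up for illustration.

```python
from collections import deque

# Directed edges as (out_vertex, in_vertex) pairs; the data is made up.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "z"), ("z", "a")]

out_adj = {}
in_adj = {}
for src, dst in edges:
    out_adj.setdefault(src, []).append(dst)
    in_adj.setdefault(dst, []).append(src)


def out_degree(v):
    """Number of edges pointing from v."""
    return len(out_adj.get(v, []))


def in_degree(v):
    """Number of edges pointing to v."""
    return len(in_adj.get(v, []))


def shortest_path(start, goal):
    """Breadth-first search: returns a path with the fewest hops, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in out_adj.get(path[-1], []):
            if nxt not in seen:  # never revisit a vertex, so the walk cannot loop
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Because the search expands one hop at a time and never revisits a vertex, the first path that reaches the goal has the fewest hops, even though this graph contains a cycle (z points back to a).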



DSE Graph Allows you

Cache Pieces of the Graph


● Very specific use cases. If there is a core part of the graph in your database, that is the part
you would cache.
● Only useful if the core piece is used often and is local to a server. It needs to be relatively
static and small enough to actually fit in the cache.
○ The way you do it: schema.vertexLabel(<value>).cache().properties().ttl(<number>)
○ Make sure you have a TTL on it, as that is the only way to invalidate the cache. You can't
invalidate it manually, and it does not update even if the underlying data changes
○ So again, it has to be small, core, static and used often
○ Once cached, reads do not go to disk until the TTL has expired, so consistency
guarantees are not in effect.
○ Off heap cache
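
The stale-read behaviour described above can be shown with a toy TTL cache. This is a sketch under stated assumptions: `TtlCache`, the fake clock and the sample data are invented for the example and say nothing about how the DSE off-heap cache is implemented.

```python
class TtlCache:
    """Toy read-through cache whose entries expire only by TTL."""

    def __init__(self, ttl_seconds, load, clock):
        self.ttl = ttl_seconds
        self.load = load      # function that reads the backing store
        self.clock = clock    # function returning the current time in seconds
        self.entries = {}     # key -> (value, expiry time)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]     # served from cache, possibly stale
        value = self.load(key)
        self.entries[key] = (value, now + self.ttl)
        return value


store = {"bob": "Austin"}
now = [0]                      # fake clock so the example is deterministic
cache = TtlCache(60, store.get, lambda: now[0])

first = cache.get("bob")       # loads "Austin" from the store
store["bob"] = "Denver"        # the underlying data changes
stale = cache.get("bob")       # still "Austin": nothing invalidates the entry
now[0] = 61                    # the TTL has expired
fresh = cache.get("bob")       # reloaded from the store
```

The middle read returns the old value even though the store has changed, which is why only small, static, frequently read data is a good fit.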



DSE Graph versus Apache TinkerPop

DSE Graph extends upon or modifies the TinkerPop API in the following ways:
● DSE Graph adds additional predicates (Geo, Text Search) to be used in
conditions.
● DSE Graph does not allow explicit transaction handling
● DSE Graph uses custom classes to represent element ids.
● DSE Graph has a schema.
● Adds some utility methods to be used by external software specifically built
for DSE Graph (e.g. graph loader)



Graph Data Location

DSE Graph stores graph data in the following keyspaces:


● [graph-name]: The data keyspace contains all primary graph data.
● [graph-name]_system: Contains the tables to maintain the graph schema and
facilitate id allocation. This is maintained in a separate keyspace to:
○ Separate system data from user data
○ Allow a separate replication configuration from the data keyspace
■ It must be ensured with near absolute certainty that this data cannot be lost,
otherwise the graph data becomes unreadable
■ Should never be modified directly



Graph Data Location

The data keyspace contains two tables for each vertex label. Both tables start with the
name of the vertex label and end in:
● [vertex-label]_p: Contains all of the properties incident on vertices of this label
● [vertex-label]_e: Contains all the edges incident on vertices of this label
○ The same data is also stored in the adjacent vertex’s [vertex-label]_e table
○ Allows for quick traversal in either direction

In other words, the adjacency list for the graph is partitioned by vertex label with edges
and vertex properties further separated into distinct tables.
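
As a rough illustration of why each edge is written under both endpoints, the Python sketch below keeps one dictionary per hypothetical `<label>_e` table. The row layout, the "OUT"/"IN" direction marker and the sample data are assumptions for the example, not the actual DSE table format.

```python
from collections import defaultdict

# One dict per hypothetical "<label>_e" table: vertex id -> list of edge rows.
edge_tables = defaultdict(lambda: defaultdict(list))


def add_edge(out_label, out_id, edge_label, in_label, in_id):
    """Write the edge once per endpoint so either side can read it locally."""
    edge_tables[out_label + "_e"][out_id].append((edge_label, "OUT", in_label, in_id))
    edge_tables[in_label + "_e"][in_id].append((edge_label, "IN", out_label, out_id))


add_edge("user", "bob", "rated", "movie", "m1")
add_edge("user", "nancy", "rated", "movie", "m1")

# Each lookup touches a single entry (partition) of a single table.
movies_bob_rated = [r[3] for r in edge_tables["user_e"]["bob"] if r[1] == "OUT"]
users_who_rated_m1 = [r[3] for r in edge_tables["movie_e"]["m1"] if r[1] == "IN"]
```

The cost is a doubled write; the benefit is that a traversal in either direction only reads the starting vertex's own rows.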



Data Representation in DSE Graph

DSE Graph stores the graph in adjacency list format.


● Stores the entire neighborhood of each vertex consecutively on disk
● For every vertex in the graph, all relations incident on that vertex (edges and properties)
are stored together
○ Within a single partition in Cassandra
○ Can run vertex-local queries (vertex queries)
■ Only require retrieving a single vertex’s adjacency list (or subset thereof)
○ Quickly answers incidence and neighborhood queries
■ All edges incident on vertex X satisfying some condition
■ All vertices connected to vertex X by edges satisfying some condition



<vertex-label>_p Table Example
● PRIMARY KEY (uid, "~~property_key_id", "~~property_id")) WITH
CLUSTERING ORDER BY ("~~property_key_id" ASC, "~~property_id"
ASC)
○ uid (custom partition key)
■ If auto-generated in 5.x (community_id, group_id, member_id)
○ ~~property_key_id - system generated to identify each property key
○ ~~property_id - system generated to identify each value
● Each property value has its own row



<vertex-label>_e Table Example
● PRIMARY KEY (uid, "~~edge_label_id", "~~adjacent_vertex_id", "~~adjacent_label_id",
"~~edge_id")) WITH CLUSTERING ORDER BY ("~~edge_label_id" ASC,
"~~adjacent_vertex_id" ASC, "~~adjacent_label_id" ASC, "~~edge_id" ASC)
○ ~~edge_label_id - identifies the edge label
○ ~~adjacent_vertex_id - identifies the vertex connected to this vertex by the edge
○ ~~adjacent_label_id - identifies the label of the vertex connected to this vertex by the edge
○ ~~edge_id - id for this particular edge
● All edge and edge property values can only be accessed under a vertex (uid)
○ Edges/edge properties are second-class citizens
● Table includes ALL edge property columns from the different edge labels connected to this vertex
● Each edge row contains all edge property values
● The same edge data is duplicated under both adjacent vertices



Auto Generated IDs

Default in DSE 5.x but deprecated starting with DSE 6.x (Don’t use it!)
● Does default partitioning in underlying C* database
● Thinks in terms of community
○ community_id —  vertex belongs to a community
○ member_id — vertex is uniquely identified in a community
● Guaranteed unique
○ Synthetic/surrogate
○ Small footprint
● More flexible as not tied to the domain
● Randomly assigned, by load order
○ Assumes things loaded next to each other are part of the same community, so they are partitioned
together for fast lookups
○ But if data is not loaded in that order, the community means nothing, so from an outside point of
view things are randomly partitioned



User Defined Vertex IDs

When should you use custom vertex IDs?


● Always (“officially” no other option starting DSE 6.0)
● You define the Partition Key

To define them you need to understand your data

● Can control partition sizes this way
● Can get very fast lookups
● Have to switch from a Graph mindset to a C* mindset when doing this



Creating a Graph Schema

When manually creating a graph schema there are 5½ steps that need to be done in a
specific order
● Schema design relies on earlier definitions to complete the definition of later items,
so the steps have to be done in order.
1. Design a Data Model (covered later)
2. Define the Property Keys
○ Can be reused in multiple vertices and edges (e.g. name)
3. Define the Vertices (by label)
○ Use Custom ID’s as partition key
4. Define the Edges (by label)
○ Cannot define key. System generated
5. Define your indexes
○ Many types depending on need
5.5. Define some Materialized Aggregates (optional)
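
Why the order matters can be sketched with a toy schema registry in Python. This is an illustrative model only: the `Schema` class and its methods are invented here and are not the DSE Graph schema API (the real Gremlin calls appear on the following slides).

```python
class Schema:
    """Toy schema registry that enforces definition order."""

    def __init__(self):
        self.property_keys = set()
        self.vertex_labels = {}
        self.edge_labels = {}

    def property_key(self, name):
        self.property_keys.add(name)

    def vertex_label(self, label, properties):
        # A vertex label may only use property keys that already exist.
        missing = [p for p in properties if p not in self.property_keys]
        if missing:
            raise ValueError(f"undefined property keys: {missing}")
        self.vertex_labels[label] = list(properties)

    def edge_label(self, label, out_label, in_label):
        # An edge label may only connect vertex labels that already exist.
        for v in (out_label, in_label):
            if v not in self.vertex_labels:
                raise ValueError(f"undefined vertex label: {v}")
        self.edge_labels[label] = (out_label, in_label)


schema = Schema()
try:
    schema.vertex_label("user", ["name"])   # step 3 attempted before step 2
    out_of_order_error = None
except ValueError as exc:
    out_of_order_error = str(exc)

schema.property_key("name")                  # step 2
schema.vertex_label("user", ["name"])        # step 3
schema.vertex_label("pet", ["name"])         # the same property key is reused
schema.edge_label("hasA", "user", "pet")     # step 4
```

The first attempt fails because the vertex label refers to a property key that has not been defined yet; the same calls succeed once they are issued in order.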



Defining Property Keys

Unique Key
● Only one property key of a given name within a graph
● Same property key can be reused on multiple vertices or Edges
Data Type
● DSE Graph is typed so all properties need to be assigned a type
Cardinality
● Multiple by default in 5.x
● Single by default in 6.0
● Recommended to always define it explicitly rather than accept the default, for code clarity
● Property key with a single integer value
schema.propertyKey("year").Int().single().create();
● Multiple cardinality is only allowed on vertex properties
schema.propertyKey("name").Text().multiple().create();



Defining Property Keys

Meta Properties
● Properties of properties
● Define the enclosed properties first
schema.propertyKey("source").Text().single().create();
schema.propertyKey("date").Timestamp().single().create();
● Then use them to define the meta property
schema.propertyKey("budget").Text().multiple()
.properties("source","date").create();
● Can only be defined for vertices
○ Not allowed for edges



Property Data Types

Category                Data Types
Numbers                 Smallint, Int, Bigint, Varint, Float, Double, Decimal
Boolean                 Boolean
Literals                Text
Universal Identifiers   UUID
Dates and Times         Date, Time, Timestamp, Duration
Geospatial              Point, Linestring, Polygon
Binary                  Blob
Network                 Inet



Defining Vertex Labels

Vertex Label Definition


● Unique Label
○ Defines the vertex type
○ Acts as a table with actual associated objects
■ Unique label “user” with vertices representing actual users: 'Bob', 'Nancy', 'Jorge', etc.

● Associated Property Keys
○ Property keys need to be pre-defined
○ Act as the columns within the unique label (table)
○ Individual property keys can be used in multiple vertex label definitions

schema.vertexLabel("user").properties("first","last","middle",...).create()



User Defined Vertex IDs

Custom Vertex IDs definition


● Partition key(s)
● Clustering key(s)
● Auto-generated IDs are used if the definition is omitted (Do not use!)
schema.vertexLabel("user")
.partitionKey("last")
.clusteringKey("first", "middle")
.properties(...).create()

● Query using keys

g.V().hasLabel('user').has('last', 'Atwater').has('first', 'Matt')



Defining Edges

Edge definition
● Unique Label
○ Like vertex labels, the edge label has to be unique
○ Action/verb/relationship/connection between vertices
■ e.g. User - has a -> Pet
● Single edge cardinality
■ One edge between 2 vertices
■ e.g. a user lives at exactly one address

schema.edgeLabel("livesAt").single().properties(...).connection("user",
"address").create()

● Multiple edge cardinality


■ Multiple edges between 2 vertices
■ e.g. a user can have multiple addresses

schema.edgeLabel("hasA").multiple().properties("type",...).connection("user",
"address").create()

Defining Edges

● Associated property keys


○ Cannot have multi valued properties
○ Cannot have meta properties
● Can create multiple connections between different vertices within one definition
schema.edgeLabel("knows").single().connection("user","user").connection("user",
"pet").properties(...).create()

● Edge IDs are composed of outgoing and incoming vertex IDs, edge label, and a
local edge identifier.
○ Always generated automatically and are not customizable.



Define your Indexes

● Created based on data access patterns (just like Core)


● For efficient graph querying and traversal
● Avoid full scans of the graph, which Production Mode does not allow
○ Too inefficient in large graphs: very slow, lots of timeouts
● Where to index
○ Vertex index - Index a vertex property key
schema.vertexLabel("movie").index("moviesById").materialized().by("movieId").add()

○ Property index - Index a meta-property key within a property


schema.vertexLabel("movie").index("movieBudgetBySource").property("budget").by("source").add()

○ Edge index - Index an edge property key


schema.vertexLabel("movie").index("toUsersByRating").inE("rated").by("rating").add();

Note: Property and edge indexes can only use the materialized view type, so there is no need to specify the type



Types of Vertex Indexes

Vertex Index Type - Description

Materialized view index - Most efficient index for high cardinality, high selectivity vertex properties and equality
predicates. This index is implemented via a materialized view in Apache Cassandra™.

Secondary index - Efficient index for low cardinality, low selectivity vertex properties and equality
predicates. This index is implemented via a secondary index in Apache Cassandra™.

Search index - Efficient and versatile index for vertex properties with a wide range of cardinalities and
selectivities. This index is implemented via an index in Apache Solr™. A search index
may support a variety of predicates depending on the index flavor:
● Full Text and String searches
● Fuzzy search
● Phrase Search
● Token fuzzy
● Spatial searches
● Other generic type searches



Materialized View Index

● On high cardinality properties


● All Materialized View rules apply
○ Fast as search term is part of Partition Key
○ Exact match only
○ Duplicates data on the disk
○ Watch for large partitions

schema.vertexLabel("user").index("userByLast").materialized().by("last").add()

● Example query leveraging index

g.V().hasLabel("user").has("last","Atwater")



Secondary Index

● For low cardinality properties


● All secondary index rules apply

schema.vertexLabel("user").index("userByLast").secondary().by("last").add()

Note: Any thoughts on using secondary index here?

● Example query leveraging index

g.V().hasLabel("user").has("last","Atwater")

Note: If you don’t explicitly set the vertex label you cannot take advantage of any index

e.g. g.V().has("last","Atwater") won’t leverage index



Full Text Search Index

● Full text search using DSE Search Engine


● A search index has to be named ‘search’
○ Creates a solr core when creating the index
○ Allows for geospatial searches as well (in degree and distance)

schema.vertexLabel("user").index("search").search().by("last").asText().add()

● Can have different search types even across the same field

schema.vertexLabel("user").index("search").search().by("last").asText().by("last").asString().add()

● Can have mixed and inferring data types

schema.vertexLabel("user").index("search").search().by("last").asText().by("last").asString()
.by("age").by("birthday").by("salary").add()

Note: This indexes a string, text, int, date and float; the last three types are inferred rather than specified (if the property keys are defined as such)

Geospatial Indexing

● Define property as Point

schema.propertyKey("point1").Point().withGeoBounds().create();

● Define property as Linestring

schema.propertyKey("point2").Linestring().withBounds(1.5, 2.3, 4.5, 65).create();

● Define property as Polygon

schema.propertyKey("point3").Polygon().withGeoBounds().create();

● Define index
schema.vertexLabel("human").index("search").search().by("point1").by("point2").by("point3")
.withError(2,3).add()



Token, Phrase and Fuzzy

● Define index
schema.vertexLabel("user").index("search").search().by("first").asText().add()
● Token searches
g.V().hasLabel("user").has("first", Search.tokenRegex("^Ma.*"))
○ Would find user ‘Matt’ (or any user whose first name starts with Ma)
○ Token flavors: token(), tokenPrefix(), tokenRegex() and tokenFuzzy()
● Phrase search
g.V().hasLabel("user").has("first", Search.phrase("Matt Lost", 2))
○ The 2 means there can be up to two words in between the words in the phrase, so the
above would find ‘Matt the Lost’
● Fuzzy search
g.V().hasLabel('user').has('first', Search.fuzzy('Mett', 1))
○ Allows an edit distance of one (one changed letter), so would find ‘Matt’
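
The fuzzy and phrase ideas above can be approximated in a few lines of Python. This is a simplified model for intuition only: it uses plain Levenshtein distance and a simple "extra words in between" count, which are assumptions of this sketch and not the exact matching rules of the search engine behind DSE Search. The function names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]


def fuzzy_match(term, value, max_edits):
    return edit_distance(term.lower(), value.lower()) <= max_edits


def phrase_match(phrase, text, slop):
    """True if the phrase words occur in order with at most `slop` extra words between them."""
    words = text.lower().split()
    terms = phrase.lower().split()
    for start, word in enumerate(words):
        if word != terms[0]:
            continue
        pos, gaps = start, 0
        for term in terms[1:]:
            try:
                nxt = words.index(term, pos + 1)
            except ValueError:
                break                      # a later phrase word is missing
            gaps += nxt - pos - 1          # words skipped between two phrase words
            pos = nxt
        else:
            if gaps <= slop:
                return True
    return False
```

With these definitions, 'Mett' matches 'Matt' at one edit, and "Matt Lost" with a slop of 2 matches "Matt the Lost" because only one extra word sits between the two phrase words.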



Materialized Aggregates

Materialize on frequently computed results or inferences


● Frequently computed results
○ Average, count, etc
○ Create a new property (table column) to store results
● Frequently traversed paths
○ e.g. frequently going through multiple hops going from vertex A to vertex B
○ Create a new edge directly from A to B
● Only works with existing data
● Update and populate new property or edge on a periodic basis, often via a batch job
(e.g. average rating)
○ Otherwise compute it as part of analytics
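
A materialized aggregate is essentially a batch job that writes a computed value back as a property. The Python sketch below shows the idea on in-memory data; the `avgRating` property name, the sample ratings and the function name are hypothetical.

```python
from statistics import mean

# Made-up "rated" edges: (user, movie, rating).
rated_edges = [("bob", "m1", 8), ("nancy", "m1", 6), ("jorge", "m2", 9)]
movies = {"m1": {"title": "First"}, "m2": {"title": "Second"}}


def refresh_average_ratings(movies, rated_edges):
    """Batch job: recompute the average once and store it on each movie."""
    by_movie = {}
    for _user, movie_id, rating in rated_edges:
        by_movie.setdefault(movie_id, []).append(rating)
    for movie_id, ratings in by_movie.items():
        movies[movie_id]["avgRating"] = mean(ratings)


refresh_average_ratings(movies, rated_edges)

# Reads now fetch a stored property instead of walking every "rated" edge.
m1_average = movies["m1"]["avgRating"]

# A rating added after the job ran is not reflected until the next run.
rated_edges.append(("ann", "m1", 1))
still_m1_average = movies["m1"]["avgRating"]
```

The last two lines show the trade-off: the stored value only covers data that existed when the job last ran, so the job has to be repeated on a schedule.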



DSE Graph
Architecture

The following may be a little dry, but it walks through
how Graph is integrated into DSE. Some of this is
changing as we update Graph, but some is basic and
will stay the same (or a close approximation).

This section is pretty dense and we will most likely walk
through it very quickly unless there are questions. But
do use it as reference material at a later
date if desired.



Deployment Overview

[Figure: deployment diagram. DC1 holds DSE Graph nodes, DC2 holds DSE Graph + Search nodes, and DC3 holds DSE Graph + Analytics nodes. Studio (via browser), OpsCenter, a client application (via driver) and Graph Loader connect to the cluster.]


Component Overview

[Figure: component diagram. Labels in the diagram: Graph Loader, DX Studio, Gremlin Console, OpsCenter, DX Driver, TinkerPop Driver, DSE Server, Gremlin Server, Gremlin Executor, DSE Event Framework, Graph Analytics (Spark Context, VertexInputRDD, SparkGraphComputer), Graph Globals, Per Graph, Per Transaction, Per Query, Event Handler, Thread Pools, Shared Data (Graph), Shared Data (System), ID Allocation, Configuration, Statement Messenger, Schema Migrations, Schema Model, Relation Containers, Tx Caches, Graph System, Data Store, Index Cache, Index Store, Adjacency Cache, Adjacency List Store, Traversal Rewriting, Query Optimizer, Query Builder, Query Processor, Traversal Processor, TinkerPop API, Graph API, Schema API, Graph System API, DSE-Search, Cassandra, PVT Read Processor, PVT Write, Index Router.]

DSE Graph
Data Modeling
Just like C*, when it comes to modeling you have to first
think about what you want to get out of your graph.
Queries first, always, for the best performance. Sure, we
say graph is flexible and easy to add to, but it comes at
a price.



It is all about Relationships

That is what you are doing with graphs


● Whiteboard friendly
● Data models have close affinity between logical and physical
● When modeling you are describing a relationship
○ Vertex is a common noun
■ The object that takes part in a relationship
○ Edge is a verb
■ Describes the relationship that takes place
○ Proper nouns
■ Become the instances of a vertex
Hint
When modeling a graph you are describing relationships. Whiteboarding these normally works very
well, but because we often use imprecise terms when talking, we forget some key components. Take the
example "Bob emailed Nancy". You might incorrectly model this as Users: Bob and Nancy, with emailed as the
relationship. But in doing so you missed an important vertex: the email itself. Think instead of "Bob sent an
email to Nancy". Then you capture the email and how it relates to Nancy and Bob.



Two Approaches to Graph Modeling

Data Modeling Framework


● Apply a well-defined methodology to design a graph schema
● Perform conceptual, logical and physical schema design
● Implement design
● Start inserting data

Data-Driven Approach
● Create a representative graph instance
● Do so by being in Development mode and inserting data directly
● This will create the schema for you
● Review schema and implement missing pieces



Either way you have to...

● Collect and analyze the data requirements


● Identify important entities and relationships
● Understand data access patterns (queries)
● Think of the specific way to organize and structure data
● Identify the properties that further define the entities and relationships
● Optimize the schema for performance

● The two approaches address the task list in different orders
● Conceptual design is still done with both approaches, less formally in the Data-Driven approach
● No logical design for the Data-Driven approach

See the academy course that breaks down each of these approaches in detail



We are not going to do it either way

Instead in this course we are going to do somewhat of a hybrid approach


● As a group we will whiteboard a graph design
● Discuss the pros and cons of each design decision as well as alternatives
○ This is our conceptual design
● We will skip creating a formal logical design document
○ Our whiteboard session will end up being both conceptual and logical
● Then individually you will create the schema on your local system

There is no one right answer in the design


● Since this is a live session and each bootcamp may make different choices there is
no reference design that says this is the right and only way
● That said it is likely that many elements will look the same as the design has to meet
the same requirements



Check List

Once we are complete with our design and have created our schema we will run through
the design checklist to make sure we have not missed anything

❏ Types of vertices match all needed objects


❏ How we will define the vertex IDs for the vertices
❏ Types of edges match all needed relationships
❏ Cardinality of edges is correct
❏ Domains and ranges of the edges
❏ Types of properties are correct
❏ Datatypes of property values are correct
❏ Cardinality of properties is correct



A bit about Production Mode

Globally the mode of your graph is set in the dse.yaml file


● Found with the schema_mode flag in 5.x
● By default this is set to Production
○ This is the best practice for production settings
○ You are unable to add data if the data does not match the schema
■ Development mode - this would create new objects in the schema on the fly
■ schema.config().option('graph.schema_mode').set('Development')
● Disable scans
■ Error if you run a traversal that would cause a large scan on your database
■ Forces you to create a proper index for the traversal
■ You can override with schema.config().option('graph.allow_scan').set(true)
■ Bad idea in production, but may be ok to test small local datasets
● Setting the mode to Development is how you model with the Data-Driven approach



Exercise
Instructor Led Data Modeling

As a group, whiteboard the schema design of the
Aurabute order transaction and customer service
aspects of the business


Instructor Led Data Modeling
● Design a graph data model based on the following requirements

Customer
● Full name
● Address(es)
● Phone(s)
● Email(s)
● Credit card(s)
● Device(s)
● Gender
● Birthday
● Marital status
● Login user name
● Loyalty account ID

Employee
● Full name
● Address(es)
● Phone(s)
● Email(s)
● Credit card(s)
● Device(s)
● Gender
● Birthday
● Marital status
● Login user name
● Employee ID

Order
● Order date
● Item(s)
● Shipping info
● Giftwrap info
● Discount
● Payment
● Total

Service Request
● Request date
● Customer info
● Support rep info
● Action log
● Refund info

Item
● Item name
● Color
● Size
● Unit price

Cart/Wishlist
● Item(s)



Questions to be Addressed
● Be able to retrieve the following data within your model

Q1: Individual customer information via name, email or phone number


Q2: Individual customer order history via name, email or phone number
Q3: All customers who are parents
Q4: Individual customer’s wishlist items not in an order
Q5: Items in an order or wishlist by name, size or color
Q6: Individual customer’s most frequently used credit card
Q7: All of an individual customer’s service requests within a time range
Q8: Individual customer’s most used service request contact method
Q9: Individual customer’s refund amount within a time range



Exercise
Build your Data Model

● Create the data model in a new keyspace
named “aurabute_360” using Studio
● When finalized, display your complete
schema
● Review the checklist to make sure it is
complete
● Fix anything that is missing
● No need to enter data yet



Homework

If you have not completed it, finish the data modeling exercise



End of Graph
Day One



Addendum



Configuration

Graph configuration draws from two stores:


● The graph section of dse.yaml
● Key-value pairs stored in Shared Data

Options that affect the entire DSE(G) process are in dse.yaml (such as the gremlin-server listen
socket port). Options that affect specific graphs (such as whether linear scan queries are allowed)
are stored in the Shared Data associated with that graph.

Options are declared mostly in graph’s ConfigurationDefinitions. They are represented as subtypes
of ConfigOption<V>, where the generic type parameter V describes values for that option. The API
for config option reads uses these ConfigOption instances rather than strings to represent particular
config keys.

The implementation of Shared-Data-backed config options includes an observer mechanism that
automatically propagates option changes across the cluster (or at least to the subset of the cluster on
which that graph has been opened and where the config options for that graph are in use).



Index Adjacency Store Cache

The graph level caches are implemented as an off heap map of queries to adjacency lists. The Off
Heap Cache project is used to give us this implementation.

The cache keys are structured as:


● Index store - The query (Index queries are global and have no context)
● Adjacency list store - The query, the originating vertex ID (Adjacency queries are in the
context of a vertex)



Index Adjacency Store Cache

Schema can be used to control caching on a vertex label basis:


● vl.cache().properties().ttl("1m").add() - This setting affects global index queries and
vertex centric property queries. For instance:

g.V().has("name")
g.V(2).properties("name")

● vl.cache().bothE("knows").ttl("1m").add() - This setting affects edge queries.

g.V(2).outE("knows")

Note that there is currently no eager eviction mechanism, so it is possible to get stale data from the
cache. By using caching you are accepting a trade-off between performance and data freshness.



Schema Model

The schema model is the internal schema representation used by DSE Graph.
It is a transaction level resource that describes:
● Property keys
● Edges
○ Property keys
● Vertex labels
○ Property keys
○ Indices
■ Vertex (secondary, materialized view, search)
■ Edge (materialized view)
○ Adjacencies
○ Partitions
○ Caching

[Figure: schema lifecycle. Labels in the diagram: Transaction, Rollback (discard any schemas that have been modified), Commit, Schema, Commit, Shared Data, Changed, Schema migrations.]



Schema Model

The user does not have direct access to this model but can modify it using the Schema API.
The schema is stored in shared data and the lifecycle is bound to the transaction. Committing a
transaction will also commit any changes to the schema which will be applied via schema migrations.

Schema has two modes:


● Production (Default) - Property keys, edges and adjacencies must be specified up front in the
schema before they are used in the graph.
● Development - Property keys, edges and adjacencies will be created lazily in response to
adding data to the graph.

The schema mode can be modified using the graph config.



Schema Migrations

Once a new schema has been committed to shared data the physical changes to the Cassandra
schema must be made.

Schema migration makes those changes by:

1. Comparing the old schema with the new schema
2. For each difference, executing the appropriate DDL statement and updating the old model to match
the new one.
3. Finally, once the new and old schema models are in sync, making the new schema live so that
other nodes will pick it up.

Differences are processed individually so that should a DDL statement fail any schema changes that
have already been processed can still be promoted to live.

All migrations use DDLQueryBuilder to create the actual C* query to execute allowing unit testing
without starting C*.
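
The compare-then-apply loop can be sketched in Python as follows. This is a toy model: a schema is reduced to a set of table names, `execute_ddl` is a stand-in callback, and continuing past a failed statement is an assumption of this sketch rather than documented DSE behaviour.

```python
def diff_schemas(old, new):
    """Return one (action, table) difference per table added or removed."""
    diffs = [("CREATE", t) for t in sorted(new - old)]
    diffs += [("DROP", t) for t in sorted(old - new)]
    return diffs


def migrate(old, new, execute_ddl):
    """Apply differences one at a time; `old` tracks what has been applied."""
    failed = []
    for action, table in diff_schemas(old, new):
        try:
            execute_ddl(action, table)
        except RuntimeError:
            failed.append((action, table))
            continue
        # Update the old model only after the statement succeeded.
        if action == "CREATE":
            old.add(table)
        else:
            old.discard(table)
    return failed


def fake_ddl(action, table):
    # Stand-in for the database; fails for one table to show partial progress.
    if table == "pet_e":
        raise RuntimeError("simulated DDL failure")


live = {"user_p", "user_e", "legacy_p"}
target = {"user_p", "user_e", "pet_p", "pet_e"}
failures = migrate(live, target, fake_ddl)
```

Because each difference is handled on its own, the changes that did succeed are already reflected in `live` even though one statement failed.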



Graph Analytics

Graph Analytics is a read-only Spark-based backend for Gremlin traversals and vertex programs.
The traversal and VP logic is implemented entirely in TinkerPop, centrally in SparkGraphComputer.
DSE’s code contribution is limited to providing SparkGraphComputer with input data. DSE
implements TinkerPop’s InputRDD interface by leaning heavily on the Spark Cassandra Connector.
This InputRDD implementation is responsible for reading graph data from DSEG’s C* tables and
converting those data into TinkerPop StarVertex ego-networks.

DSE also provides an OLAP snapshot feature. This uses TinkerPop’s BulkDumperVertexProgram to
read DSEG data through its InputRDD and into a SparkContext-attached cached RDD. Subsequent
queries hit the cached RDD instead of the original C* tables.

[Figure: Graph Analytics components: TraversalSourceManager, SparkSnapshotBuilderImpl, SparkGraphComputer, VertexInputRDD, Spark C* Connector. The legend distinguishes DSE components from non-DSE components.]

Index

DSE Graph uses index structures to speed up certain frequently occurring query patterns or index
paths. We distinguish between indexes that support vertex queries and those that support graph
queries.

Vertex Query Indexes


Finding the relations that match a highly selective vertex query condition requires retrieving part of the
adjacency list and filtering in-memory. For large adjacency lists that can be very costly, so DSE
Graph allows the user to add vertex-centric indexes to index relations by a particular property key on
a vertex label. Those indexes are defined in the schema.
● Edge Index: Indexes the edges of the vertex by a property
● Property Index: Indexes the properties of the vertex by a meta-property

A separate table is created for each vertex-centric index which is similar in structure to the base
representation (previous slides) with the only difference that the indexed property becomes part of the
clustering key and is inserted after the relation type id. This index table is maintained by a
materialized view on the base table.
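
The effect of moving the indexed property into the clustering key can be illustrated with sorted tuples in Python. This is only an analogy for the on-disk ordering: the row layout, the integer `rating` property and the function name are assumptions for the example.

```python
import bisect

# Base edge rows for one vertex, clustered by (edge_label, adjacent_vertex).
base_rows = sorted([
    ("rated", "u1", {"rating": 9}),
    ("rated", "u2", {"rating": 4}),
    ("rated", "u3", {"rating": 9}),
    ("tagged", "u4", {"rating": None}),
], key=lambda r: (r[0], r[1]))

# Index rows: the indexed property sits right after the edge label in the key.
index_rows = sorted((label, props["rating"], adj)
                    for label, adj, props in base_rows
                    if props["rating"] is not None)


def users_by_rating(label, rating):
    """Binary-search the contiguous slice for (label, rating)."""
    lo = bisect.bisect_left(index_rows, (label, rating))
    hi = bisect.bisect_left(index_rows, (label, rating + 1))
    return [adj for _, _, adj in index_rows[lo:hi]]
```

In the base ordering, finding all edges with a rating of 9 means reading every "rated" row and filtering; in the index ordering the matching rows are adjacent, so the lookup is a range read.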



Indexes

Graph Query Indexes


In order to efficiently find vertices matching a particular condition (other than by vertex id) we need
global index structures in order to avoid a full table scan. DSE Graph supports the following indexes:
● Materialized view indexes: Index vertices by a single property. Can only be used to answer
equality conditions. Uses C* materialized views. Best choice for highly selective properties and
equality queries.
● Secondary Index: Index vertices by a single property. Can only be used to answer equality
conditions. Uses C* secondary indexes. Best choice for medium selective properties and
equality queries.
● Search Index: Indexes all of the specified vertex properties in DSE Search (which is required
for this to be available). Can be used to answer arbitrary Lucene-style queries and complex
conditions. Best choice for low selectivity and arbitrary conditions.
The query builder determines which index structures (if any) are being used for a particular vertex
query or graph query.
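
The "best choice" guidance in the list above can be restated as a small decision helper. This is only a restatement of the slide's guidance for someone designing a schema; it is not how the DSE Graph query builder selects indexes, and the function name and the 'high' / 'medium' / 'low' labels are hypothetical.

```python
def suggest_index_type(selectivity, equality_only):
    """Return the index type that the guidance above points to.

    selectivity:   'high', 'medium' or 'low' (hypothetical labels for this sketch).
    equality_only: True when the queries only use equality conditions.
    """
    if selectivity not in ("high", "medium", "low"):
        raise ValueError("unknown selectivity: %r" % (selectivity,))
    # Non-equality conditions or low selectivity: only the search index applies.
    if not equality_only or selectivity == "low":
        return "search"
    # Equality on a highly selective property: materialized view index.
    if selectivity == "high":
        return "materialized"
    # Equality on a moderately selective property: secondary index.
    return "secondary"


print(suggest_index_type("high", True))
print(suggest_index_type("medium", True))
print(suggest_index_type("low", False))
```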



Architecture High Level



DataStax Drivers

DataStax Drivers communicate over the native Cassandra protocol. For graph-based requests they
adhere to an internally defined subprotocol that allows DSE Server to properly route those
requests to the Gremlin Executor.

The DataStax Drivers are the recommended mechanism for interaction with DSE Graph. They provide
an improved experience for interacting with DSE Graph compared to non-DataStax drivers. There are
two modules available for DSE Graph interaction:
● Gremlin Module - Fluent API
● Core Enterprise Module - Fluent API and/or String API

[Diagram: DataStax Drivers -> Cassandra Native Protocol (Graph Subprotocol) -> DSE Server ->
Gremlin Executor]



DSE Server

In terms of what is relevant to DSE Graph, DSE Server responds to Gremlin scripts and Gremlin
Bytecode passed on the native Cassandra protocol.
• A sub-protocol has been defined within the Cassandra protocol to support graph-based queries.
• The DseQueryHandler identifies custom payloads on the Cassandra protocol that have a
“graph-language” key, and those requests are funneled for “graph processing”.

DefaultGraphQueryHandler holds a reference to a Gremlin Executor instance, the same instance
configured for and held by Gremlin Server.

DataStax drivers with graph extensions can connect using the sub-protocol. Results from requests
are serialized to GraphSON for consumption by the drivers.

[Diagram: DataStax Drivers -> Cassandra Native Protocol (Graph Subprotocol) -> DSE Server ->
Gremlin Executor]
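
The payload check described above can be pictured with a minimal sketch. This is a toy model rather than the DseQueryHandler implementation: the function name, the 'graph' / 'cql' return labels and the example payload values are assumptions made for illustration. Only the “graph-language” key comes from the slide.

```python
GRAPH_LANGUAGE_KEY = "graph-language"  # key named on the slide


def route_request(custom_payload):
    """Decide where a request goes, based only on its custom payload.

    custom_payload: dict of str -> bytes, or None when the request has none.
    Returns 'graph' for requests funneled to graph processing, else 'cql'.
    """
    if custom_payload and GRAPH_LANGUAGE_KEY in custom_payload:
        return "graph"
    return "cql"


# The payload values below are illustrative assumptions only.
print(route_request({GRAPH_LANGUAGE_KEY: b"gremlin-groovy"}))
print(route_request({"some-other-key": b"x"}))
print(route_request(None))
```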



Gremlin Server
The Gremlin Server TinkerPop component is a host for the Gremlin Executor which provides service
endpoints that enable remote access to it. It is embedded into DSE itself and runs as a plugin.
Configuration for Gremlin Server is normally handled with an external YAML file, but the Gremlin
Server embedded in DSE takes its configuration from the dse.yaml file, where the entirety of the
Gremlin Server YAML configuration is held within a single key called “gremlin_server”.

The Gremlin Server instance embedded in DSE only supports the Websocket endpoint and does not
have an option for REST. Given that the Websocket implementation is exposed, it is possible for any
TinkerPop-enabled drivers to connect to it. This would include both those maintained by TinkerPop
as well as third-party implementations that are TinkerPop compliant.

[Diagram: TinkerPop-enabled Drivers (TinkerPop maintained: Java (gremlin-driver), Python
(gremlin-python); third-party maintained: Gremgo (Go), Python (goblin), others) -> Websockets ->
Gremlin Server -> Gremlin Executor]

