Day 1 - Boot Camp Intro To Graph
Graph
Presented By: The letter G
● A database for storing, managing and querying highly connected and complex
data.
● A graph database architecture makes it particularly well suited for unlocking the
value in the data’s relationships and finding commonalities and anomalies in
large data volumes.
tinkerpop.apache.org
g.V().has("name","gremlin").
repeat(in("manages")).until(has("title","ceo")).
path().by("name")
>> The management chain from Gremlin to the CEO
9 © DataStax, All Rights Reserved. Confidential
Enterprise-Readiness with DSE Extensions/Support
Graph
Loader
Stream Ingestion
● DSE Graph
○ A scale-out graph database purposely built for cloud applications that need to
act on complex and highly connected relationships.
○ Inherits all of the native power of Apache Cassandra as well as the enterprise
functionality of DSE, making it the first and best choice for today’s enterprise
systems that require graph support.
From similar complaint trends, the hospital's data shows a 60% chance of churn for customers immediately upon the third billing issue.
Identifiers: Customer Journey, Customer Profile, Instant Customer Profile, C360, Siloed Customer Information, Customer Data Integration, Single Point of Reference, Single Source of Truth, Customer Data Consolidation
Key Risks: Entity Resolution: the foundation of a C360 graph heavily relies on analytical processes for determining the process for matching and merging customer data across disparate systems.
Storyline Example 1: Built responsive, real-time apps that allow call center agents to access data at their fingertips for a seamless, intelligent interaction with members, preventing negative experiences.
Example 2: Macquarie Bank personalizes their apps based on what that individual customer would like to see: credit card, then checking account transactions, with their partner's balance also showing based on the latest transactions. As such, what shows up on top is based on the instantly updated last transaction.
Identifiers: Content Based Filtering, Customer Analytics, Localized Content, Individualized, Omnichannel
Key Risks: Confusion with C360: personalization is based on the knowledge and relationships between data from their C360 project.
Offers: Walmart has a special offer: based on your location (say in Baltimore, MD) every time you
spend $50 on electronics you get a 20% discount that is valid only for another electronics purchase in
the next 24 hours.
Identifiers: Product Recommendation, A/B Testing, Ad Targeting, Conversion, Digital Marketing, Geotargeting, Cross-Sell, Up-Sell
Key Risks: Wide graph queries: recommending or offering products based on a population set or demographic requires a very wide graph query. This type of query will not be as performant as a recommendation that starts from a single vertex (like one product).
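To make the risk above concrete, here is a hedged sketch in Gremlin; the "product" label, "purchased" edge, and property names are hypothetical, not part of any real schema in these notes:

```groovy
// Wide query (avoid): starts from every product in a category slice,
// touching many vertices and partitions before any filtering happens
g.V().hasLabel("product").has("category", "electronics").
  in("purchased").dedup().count()

// Narrow query (prefer): anchored on a single product vertex, so the
// traversal fans out from one starting point and stays performant
g.V().has("product", "productId", "p-123").
  in("purchased").out("purchased").dedup().limit(10)
```

The difference is where the traversal starts, not what it computes: anchoring on one vertex bounds the amount of graph the engine must walk.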
Fraud
Is it legit
C360
This is who I am
C360
Entity Resolution critical to
Customer 360
Customer 360
CX Solutions
Entity Resolution
Entity Resolution: Core problem
we need to help our customers
solve in order to build any CX
solutions
Customer 360
DSE Solution
Entity Resolution
Critical beyond Graph
DSE Core
Entity Resolution
Property Matching
Relationship Inference
Public Device?
Family Member?
Stolen Identity?
[Diagram: records a, b, c, d, e, f from disparate sources are matched and merged, via stream and batch processing, into a single master identity]
Graph Tables are C* tables with some special characteristics (more on that later)
● Thus configuring and tuning graph tables is configuring Cassandra
○ Note: do not create or drop graph tables by hand; use either:
■ Gremlin-console
■ Studio
○ Both techniques, as well as doing it via code, will be discussed later
● So tune C*
○ gc_grace_seconds, TTL, read repair settings, etc.
● Set Replication factor on graph tables
system.graph("KillrVideo").
option("graph.replication_config").
set("{'class' : 'NetworkTopologyStrategy','DC-East' : 3,'DC-West' : 5}").
option("graph.system_replication_config").
set("{'class' : 'NetworkTopologyStrategy','DC-East' : 3,'DC-West' : 3}").
ifNotExists().create()
● As with many DSE features, most of the integration-point configuration is done in the dse.yaml file.
● Configuration settings for DSE Graph are found at the bottom of the file
● Normally no setting should be changed unless a specific use case pushes you to change it
● Most likely candidates may be:
● system_evaluation_timeout_in_seconds
● analytic_evaluation_timeout_in_minutes
● realtime_evaluation_timeout_in_seconds
● schema_agreement_timeout_in_ms
● max_query_queue
Note: In DSE 5.0 and 5.1 the best practice is to leave schema_mode to Production
● In 6.0 the config was removed in the dse.yaml and mode is set to Production
DSE Graph Startup - dse / dse.default
While working on our local systems with a tar install, we started up with:
● dse cassandra -s -k -g
● where -g stands for starting with graph enabled
When in a real production environment you will normally have your system auto start
When using the default init script to autostart graph you flip a flag in /etc/default/dse
● A copy of this file can be found in a tar install in the resources/dse/conf directory
● Just change GRAPH_ENABLED=0 to 1 and graph starts when dse starts
Note: All nodes in a single DC should have graph enabled or disabled, do not have a
mixed workload within a single DC
Gremlin Console
● dse gremlin-console
● Groovy based
● allows you to execute gremlin commands
● :help gives you info when in the shell
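As a sketch of what a first console session might look like; the graph name "BootTest" is just an example (matching the one used later in these notes), and the alias step uses the console's :remote command:

```groovy
// Start the console with: dse gremlin-console

// Create a graph if it does not already exist
system.graph("BootTest").ifNotExists().create()

// Alias 'g' to the graph's traversal source so you don't have to
// type BootTest.g everywhere
:remote config alias g BootTest.g

// A newly created graph has no vertices yet
g.V().count()
```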
DataStax Studio
● Same one as used earlier to do CQLSH commands
● Allows you to execute gremlin
● Real time or against analytics engine
● Great for initial development and exploring
Drivers to access via various languages
● Encourage customers to use the DSE drivers
● Open Source drivers cannot access graph
[Diagram: property graph example. A "Susan" vertex (label: Person) is connected to a "DataStax" vertex (label: Company) by a "Works For" edge; the vertices and the edge each carry property name/value pairs, e.g. Location: Santa Clara on the edge]
But to get your hands dirty let's do a few things in the console
We can now start playing with the graph by going around typing BootTest.g… to do
something. This is cumbersome and most examples you will see out in the wild usually
talk about ‘g’ as the graph object
Note that the console is not that friendly (no tab completion, etc.), but it allows you to get the job done.
At this point you have made a graph in the Cassandra database, but there is nothing in it. It is a concept only at this point.
Note: in DSE 5.0 and 5.1 there was also a *_pvt keyspace
A vertex is the object in the graph, usually thought of as a noun. The who/what of graph
● More details later as we get into what is a graph and the syntax that describes it
We have the who, but graph is all about relationships, so we need to add some
relationships between these objects. These connectors are called Edges and are thought
of as actions. Again more later.
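A minimal sketch of what that looks like in the console, echoing the Susan/DataStax example from the earlier slide; the labels and property names are illustrative, and this assumes the graph's schema mode allows on-the-fly schema creation:

```groovy
// Add two vertices (the who/what) using the TinkerPop Structure API
susan = graph.addVertex(T.label, "person", "name", "Susan")
dstax = graph.addVertex(T.label, "company", "name", "DataStax")

// Connect them with an edge (the action/relationship),
// with a property on the edge itself
susan.addEdge("worksFor", dstax, "location", "Santa Clara")

// Traverse the new relationship
g.V().has("person", "name", "Susan").out("worksFor").values("name")
```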
So let’s do the exact same thing using Studio rather than the console.
Note: Studio is a single-user tool right now. There is no restricting access to components of Studio; i.e., a user who has Studio and is in dev mode can clear a schema. Also, sharing is a bit of a challenge. Studio is heading in the multi-user direction in future releases.
Accessing Graph Programmatically
When we discuss Graphs in general we are discussing three possible Graph Types
● Property Graphs
● Resource Description Framework (RDF)
● Hyper Graphs
Vertices
● Objects, Common Nouns, however you want to think of them
● With labels
● And properties
○ Key/Value pairs
○ Typed
○ Single or multiple
○ Can have meta properties (properties of properties)
Edges
● Actions, verbs, the what of a relationship
● With labels
● And Properties
○ Same rules as vertex, except
○ No meta properties allowed on edges
DSE Graph extends upon or modifies the TinkerPop API in the following ways:
● DSE Graph adds additional predicates (Geo, Text Search) to be used in
conditions.
● DSE Graph does not allow explicit transaction handling
● DSE Graph uses custom classes to represent element ids.
● DSE Graph has a schema.
● Adds some utility methods to be used by external software specifically built
for DSE Graph (e.g. graph loader)
The data keyspace contains two tables for each vertex label. Both tables start with the
name of the vertex label and end in:
● [vertex-label]_p: Contains all of the properties incident on vertices of this label
● [vertex-label]_e: Contains all the edges incident on vertices of this label
○ The same data is also stored in the adjacent vertex's [vertex-label]_e table
○ Allows for quick traversal in either direction
In other words, the adjacency list for the graph is partitioned by vertex label with edges
and vertex properties further separated into distinct tables.
Default for DSE 5.x but deprecated starting with DSE 6.x (don't use it!)
● Does default partitioning in underlying C* database
● Thinks in terms of community
○ community_id — vertex belongs to a community
○ member_id — vertex is uniquely identified in a community
● Guaranteed unique
○ Synthetic/surrogate
○ Small footprint
● More flexible as not tied to the domain
● Randomly assigned, by load order
○ Assumes things loaded next to each other are part of the same community, so it partitions them together for fast lookups
○ But if data is not loaded in that order, the community assignment may do nothing, so from an outside point of view things are randomly partitioned
When manually creating a graph schema there are 5½ steps that need to be done in a specific order
● Schema design relies on earlier definitions to complete the definition of later items, so you have to do them in order.
1. Design a Data Model (cover later)
2. Define the Property Keys
○ Can be reused in multiple vertices and edges (e.g. name)
3. Define the Vertices (by label)
○ Use Custom ID’s as partition key
4. Define the Edges (by label)
○ Cannot define key. System generated
5. Define your indexes
○ Many types depending on need
5.5. Define some Materialized Aggregates (optional)
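Putting steps 2 through 5 together, here is a hedged sketch of a small schema in the order above; the labels and property names are illustrative, using the DSE Graph schema API shown throughout these notes:

```groovy
// 2. Property keys (reusable across vertices and edges)
schema.propertyKey("userId").Text().single().create()
schema.propertyKey("name").Text().single().create()

// 3. Vertex labels, with a custom ID as the partition key
schema.vertexLabel("user").partitionKey("userId").properties("name").create()
schema.vertexLabel("pet").partitionKey("name").create()

// 4. Edge label (no key can be defined; the ID is system generated)
schema.edgeLabel("hasA").multiple().connection("user", "pet").create()

// 5. An index to support lookups by name
schema.vertexLabel("user").index("userByName").materialized().by("name").add()
```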
Unique Key
● Only one property key of a given name within a graph
● Same property key can be reused on multiple vertices or Edges
Data Type
● DSE Graph is typed so all properties need to be assigned a type
Cardinality
● Multiple by default 5.x
● Single by default 6.0
● Recommend always defining it rather than accepting the default, for code clarity
● Property key with a single integer value
schema.propertyKey("year").Int().single().create();
● Multiple can only be on vertex
schema.propertyKey("name").Text().multiple().create();
Meta Properties
● Properties of properties
● Needs to define the enclosed properties first
schema.propertyKey("source").Text().single().create();
schema.propertyKey("date").Timestamp().single().create();
● Then use them to define the meta property
schema.propertyKey("budget").Text().multiple()
.properties("source","date").create();
● Can only be defined for vertices
○ Not allowed for edges
Category and corresponding DSE Graph type:
● Boolean: Boolean
● Literals: Text
● Binary: Blob
● Network: Inet
schema.vertexLabel("user").properties("first","last","middle",...).create()
Edge definition
● Unique Label
○ Like vertices, the edge label has to be unique
○ Action/verb/relationship/connection between vertices
■ e.g. User - has a -> Pet
● Single edge cardinality
○ One edge between 2 vertices
○ e.g. a user can have one address he's living in
schema.edgeLabel("livesAt").single().properties(...).connection("user",
"address").create()
schema.edgeLabel("hasA").multiple().properties("type",...).connection("user",
"address").create()
Defining Edges
● Edge IDs are composed of outgoing and incoming vertex IDs, edge label, and a
local edge identifier.
○ Always generated automatically and are not customizable.
Note: Property and Edge Index can only leverage the Materialized View type so no need to specify type
Materialized view index Most efficient index for high cardinality, high selectivity vertex properties and equality
predicates. This index is implemented via a materialized view in Apache Cassandra™.
Secondary index Efficient index for low cardinality, low selectivity vertex properties and equality
predicates. This index is implemented via a secondary index in Apache Cassandra™.
Search index Efficient and versatile index for vertex properties with a wide range of cardinalities and
selectivities. This index is implemented via an index in Apache Solr™. A search index
may support a variety of predicates depending on the index flavor:
● Full Text and String searches
● Fuzzy search
● Phrase Search
● Token fuzzy
● Spatial searches
● Other generic type searches
schema.vertexLabel("user").index("userByLast").materialized().by("last").add()
g.V().hasLabel("user").has("last","Atwater")
schema.vertexLabel("user").index("userByLast").secondary().by("last").add()
g.V().hasLabel("user").has("last","Atwater")
Note: If you don’t explicitly set the vertex label you cannot take advantage of any index
schema.vertexLabel("user").index("search").search().by("last").asText().add()
● Can have different search types even across the same field
schema.vertexLabel("user").index("search").search().by("last").asText().by("last").asString().add()
schema.vertexLabel("user").index("search").search().by("last").asText().by("last").asString()
.by("age").by("birthday").by("salary").add()
Note: Indexes a string, text, int, date and float without specifying (if types are defined as such)
Geospatial Indexing
schema.propertyKey("point1").Point().withGeoBounds().create();
schema.propertyKey("point3").Polygon().withGeoBounds().create();
● Define index
schema.vertexLabel("human").index("search").search().by("point1").by("point2").by("point3")
.withError(2,3).add()
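A hedged sketch of querying against such an index using DSE's Geo search predicate; the "human" label and the coordinates are illustrative:

```groovy
// Find 'human' vertices whose point1 lies inside a circle of radius 10
// (in degrees) centered on the given longitude/latitude
g.V().hasLabel("human").
  has("point1", Geo.inside(Geo.point(-121.97, 37.35), 10))
```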
● Define index
schema.vertexLabel("user").index("search").search().by("first").asText().add()
● Token searches
g.V().hasLabel("user").has("first",Search.tokenRegex("^Ma.*"))
○ Would find user 'Matt' (or any user whose first name starts with Ma)
○ Token flavors: token(), tokenPrefix(), tokenRegex() and tokenFuzzy()
● Phrase search
g.V().hasLabel("user").has("first", Search.phrase("Matt Lost", 2))
○ The 2 means up to two words may appear in between the words of the phrase, so the above would find 'Matt the Lost'
● Fuzzy search
g.V().hasLabel('user').has('first', Search.fuzzy('Mett', 1))
○ Allows one letter to differ in the spelling, so would find 'Matt'
[Architecture diagram: clients and DSE applications connect to a DSE Graph DC (Graph + Search + Analytics) through the DataStax (DX) driver or a TinkerPop driver; Studio, OpsCenter and the Graph Loader sit alongside as tooling; internally, a PVT Read Processor and a Traversal Processor serve queries, DSE-Search backs search indexes, and VertexInputRDD feeds SparkGraphComputer for analytics]
Data-Driven Approach
● Create a representative graph instance
● Do so by being in developer mode and inserting data directly
● This will create the schema for you
● Review schema and implement missing pieces
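A sketch of that workflow in the console; the option name follows DSE Graph's schema config API, and the "movie" label and properties are illustrative:

```groovy
// Switch the graph to Development mode so inserts may create schema on the fly
schema.config().option("graph.schema_mode").set("Development")

// Insert representative data; property keys and labels are inferred
g.addV("movie").property("title", "The Matrix").property("year", 1999)

// Review what was generated, then fill in the missing pieces
// (indexes, cardinality, custom IDs)
schema.describe()
```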
See the academy course that breaks down each of these approaches in detail
Once we are complete with our design and have created our schema we will run through
the design checklist to make sure we have not missed anything
Exercise
Instructor Led Data Modeling
Options that affect the entire DSE(G) process are in dse.yaml (such as the gremlin-server listen
socket port). Options that affect specific graphs (such as whether linear scan queries are allowed)
are stored in the Shared Data associated with that graph.
Options are declared mostly in graph’s ConfigurationDefinitions. They are represented as subtypes
of ConfigOption<V>, where the generic type parameter V describes values for that option. The API
for config option reads uses these ConfigOption instances rather than strings to represent particular
config keys.
The graph level caches are implemented as an off heap map of queries to adjacency lists. The Off
Heap Cache project is used to give us this implementation.
g.V().has("name")
g.V(2).properties("name")
g.V(2).outE("knows")
Note that there is currently no eager eviction mechanism, so it is possible to get stale data from the
cache. By using caching you are accepting a trade-off between performance and data freshness.
The schema model is the internal schema representation used by DSE Graph.
It is a transaction level resource that describes:
● Property keys
● Edges
○ Property keys
● Vertex labels
○ Property keys
○ Indices
■ Vertex (secondary, materialized view, search)
■ Edge (materialized view)
○ Adjacencies
○ Partitions
○ Caching
[Diagram: schema lifecycle. A transaction rollback discards any schemas that have been modified; a commit promotes the changed schema to Shared Data and applies it via schema migrations]
The user does not have direct access to this model but can modify it using the Schema API.
The schema is stored in shared data and the lifecycle is bound to the transaction. Committing a
transaction will also commit any changes to the schema which will be applied via schema migrations.
Once a new schema has been committed to shared data the physical changes to the Cassandra
schema must be made.
Differences are processed individually, so that if a DDL statement fails, any schema changes that have already been processed can still be promoted to live.
All migrations use DDLQueryBuilder to create the actual C* query to execute allowing unit testing
without starting C*.
Graph Analytics is a read-only Spark-based backend for Gremlin traversals and vertex programs.
The traversal and VP logic is implemented entirely in TinkerPop, centrally in SparkGraphComputer.
DSE’s code contribution is limited to providing SparkGraphComputer with input data. DSE
implements TinkerPop’s InputRDD interface by leaning heavily on the Spark Cassandra Connector.
This InputRDD implementation is responsible for reading graph data from DSEG's C* tables and converting those data into TinkerPop StarVertex ego-networks.
DSE also provides an OLAP snapshot feature. This uses TinkerPop’s BulkDumperVertexProgram to
read DSEG data through its InputRDD and into a SparkContext-attached cached RDD. Subsequent
queries hit the cached RDD instead of the original C* tables.
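In practice the same Gremlin traversal can be pointed at the Spark backend by aliasing the graph's analytics traversal source. A hedged sketch; the graph name "BootTest" is illustrative, and the ".a" alias is DSE's convention for the OLAP source:

```groovy
// OLTP: real-time traversal source
:remote config alias g BootTest.g
g.V().count()   // served via the transactional read path

// OLAP: Spark-backed traversal source; same query, but executed by
// SparkGraphComputer reading through VertexInputRDD
:remote config alias g BootTest.a
g.V().count()
```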
[Diagram: OLAP read path. TraversalSourceManager and SparkSnapshotBuilderImpl drive SparkGraphComputer, which reads through VertexInputRDD (a DSE component) and the Spark C* Connector (a non-DSE component)]
Index
DSE Graph uses index structures to speed up certain frequently occurring query patterns or index
paths. We distinguish between indexes that support vertex queries and those that support graph
queries.
A separate table is created for each vertex-centric index which is similar in structure to the base
representation (previous slides) with the only difference that the indexed property becomes part of the
clustering key and is inserted after the relation type id. This index table is maintained by a
materialized view on the base table.
The DataStax Drivers are the recommended mechanism for interaction with DSE Graph. They provide an improved experience for interacting with DSE Graph compared to non-DataStax drivers. There are two modules available for DSE Graph interaction:
● Gremlin Module - Fluent API
● Core Enterprise Module - Fluent API and/or String API
DataStax drivers with graph extensions can connect using the sub-protocol. Results from requests are serialized to GraphSON for consumption by the drivers.
The Gremlin Server instance embedded in DSE only supports the Websocket endpoint and does not have an option for REST. Given that the Websocket implementation is exposed, it is possible for any TinkerPop-enabled drivers to connect to it. This would include both those maintained by TinkerPop as well as third-party implementations that are TinkerPop compliant.
[Diagram: DataStax and other TinkerPop drivers connect over Websockets to the embedded Gremlin Server, which hands requests to the Gremlin Executor]