ArangoDB PDF Submission Handling Billions of Edges in A Graph Database

Unlock the power of ArangoDB, the most complete graph database. Explore its scalability for multiple use cases including fraud detection, supply chain, network analysis, traceability, recommendations, and more. Trusted by global enterprises. Explore the advantage today! URL: https://arangodb.com/ Location: San Francisco, CA 94104-5401 United States

Uploaded by

arangodb448
Handling Billions Of Edges in a Graph Database

Copyright © ArangoDB GmbH / ArangoDB Inc, 2018


What are Graph Databases

‣ Schema-free Objects (Vertices)
‣ Relations between them (Edges)
‣ Edges have a direction
‣ Edges can be queried in both directions
‣ Easily query a range of edges (2 to 5)
‣ Undefined number of edges (1 to *)
‣ Shortest Path between two vertices

[Figure: user vertices { name: "alice", age: 32 } and { name: "bob", age: 35, size: 1,73m } linked by "hobby" edges to hobby vertices { name: "dancing" }, { name: "reading" }, and { name: "fishing" }]

Typical Graph Queries

‣ Give me all friends of Alice
‣ Give me all friends-of-friends of Alice
‣ What is the linking path between Alice and Eve?
‣ Which train stations can I reach if I am allowed to travel a distance of at most 6 stations on my ticket?

[Figure: friendship graph with the vertices Eve, Bob, Frank, Charly, Alice, and Dave; train network map with a "You are here" marker]

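
The queries above all reduce to a breadth-first search from a start vertex. The following is a minimal Python sketch, not ArangoDB code. The friendship edges are hypothetical, because the slide's figure does not survive extraction, and the function names are illustrative only.

```python
from collections import deque

# Hypothetical undirected friendship edges (assumed topology).
FRIENDSHIPS = [
    ("Alice", "Bob"), ("Alice", "Charly"), ("Alice", "Dave"),
    ("Bob", "Eve"), ("Dave", "Frank"),
]


def build_adjacency(edges):
    """Map every vertex to the set of its direct neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def depths_from(adjacency, start, max_depth):
    """BFS: map each vertex within max_depth edges of start to its depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand beyond the allowed distance
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[vertex] + 1
                queue.append(neighbor)
    return depth


def shortest_path(adjacency, start, goal):
    """BFS with parent pointers; returns a list of vertices or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parent[vertex]
            return path[::-1]
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in parent:
                parent[neighbor] = vertex
                queue.append(neighbor)
    return None


graph = build_adjacency(FRIENDSHIPS)
depths = depths_from(graph, "Alice", 2)
friends = sorted(v for v, d in depths.items() if d == 1)
friends_of_friends = sorted(v for v, d in depths.items() if d == 2)
path = shortest_path(graph, "Alice", "Eve")
```

The train station question has the same shape: calling `depths_from` with `max_depth=6` on a station graph would return every station reachable within six stops.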
Typical Graph Queries: Pattern Matching

‣ Give me all users that share two hobbies with Alice

[Figure: pattern in which Alice and Friend are both connected to Hobby1 and Hobby2]

‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought them and by the product's rating, but only 20 of them.

[Figure: pattern Alice -is_friend- Friend -has_bought- Product, where Alice and Friend each also have a has_bought edge to a second, shared Product]

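
The first pattern can be sketched as a set intersection over each user's hobby edges. This is a minimal Python illustration under stated assumptions: the hobby data is hypothetical, and "share two hobbies" is read as "at least two".

```python
# Hypothetical hobby edges, stored as user -> set of hobby vertices.
HOBBIES = {
    "alice": {"dancing", "reading", "fishing"},
    "bob": {"reading", "fishing"},
    "charly": {"dancing"},
    "dave": {"reading", "dancing", "chess"},
}


def users_sharing_hobbies(hobbies, user, minimum=2):
    """Return users that share at least `minimum` hobbies with `user`."""
    own = hobbies[user]
    return sorted(
        other
        for other, theirs in hobbies.items()
        if other != user and len(own & theirs) >= minimum
    )


matches = users_sharing_hobbies(HOBBIES, "alice")
```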
Non-Typical Graph Queries

‣ Give me all users that have an age attribute between 21 and 35
‣ Give me the age distribution of all users
‣ Group all users by their name

ArangoDB

‣ MULTI-MODEL database
‣ Stores Key Value, Documents, and Graphs
‣ All in one core
‣ Query language AQL
‣ Document Queries
‣ Graph Queries
‣ Joins
‣ All can be combined in the same statement
‣ ACID support including Multi Collection Transactions
AQL

FOR user IN users
  RETURN user

AQL

FOR user IN users
  FILTER user.name == "alice"
  RETURN user

[Figure: result vertex Alice]

AQL

FOR user IN users
  FILTER user.name == "alice"
  FOR product IN OUTBOUND user has_bought
    RETURN product

[Figure: Alice -has_bought- TV]

AQL

FOR user IN users
  FILTER user.name == "alice"
  FOR recommendation, action, path IN 3 ANY user has_bought
    FILTER path.vertices[2].age <= user.age + 5
      AND path.vertices[2].age >= user.age - 5
    FILTER recommendation.price < 25
    LIMIT 10
    RETURN recommendation

[Figure: chain Alice -has_bought- TV -has_bought- Bob -has_bought- Playstation, annotated with the conditions alice.age - 5 <= bob.age && bob.age <= alice.age + 5 and playstation.price < 25]

Traversal - Iterate down two edges with some filters

‣ We first pick a start vertex (S)
‣ We collect all edges on S
‣ We apply filters on edges
‣ We iterate down one of the new vertices (A)
‣ We apply filters on edges
‣ The next vertex (E) is in desired depth. Return the path S -> A -> E
‣ Go back to the next unfinished vertex (B)
‣ We iterate down on (B)
‣ We apply filters on edges
‣ The next vertex (F) is in desired depth. Return the path S -> B -> F

[Figure: graph with start vertex S and the vertices A, B, C, D, E, and F]

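
The steps above can be sketched as a depth-first traversal with an edge filter. This is a Python illustration, not ArangoDB's implementation. The edge list and the `type` attribute are hypothetical, chosen so that the result matches the two paths named on the slide.

```python
# Hypothetical graph: each edge is (source, target, attributes).
EDGES = [
    ("S", "A", {"type": "follows"}),
    ("S", "B", {"type": "follows"}),
    ("S", "C", {"type": "blocks"}),
    ("A", "E", {"type": "follows"}),
    ("A", "D", {"type": "blocks"}),
    ("B", "F", {"type": "follows"}),
]


def outbound_edges(edges, vertex):
    """Collect all edges leaving `vertex` (linear scan, for brevity only)."""
    return [edge for edge in edges if edge[0] == vertex]


def traverse(edges, start, depth, edge_filter):
    """Depth-first traversal returning every path of exactly `depth` edges."""
    results = []

    def visit(vertex, path):
        if len(path) - 1 == depth:
            results.append(path)  # desired depth reached: emit the path
            return
        for _source, target, attributes in outbound_edges(edges, vertex):
            if edge_filter(attributes):  # drop non-matching edges
                visit(target, path + [target])

    visit(start, [start])
    return results


paths = traverse(EDGES, "S", 2, lambda attrs: attrs["type"] == "follows")
```

The traversal finishes vertex A completely before it returns to the next unfinished vertex B, which is the order the slide describes.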
Traversal - Complexity

‣ Once:
  ‣ Find the start vertex. Depends on indexes: Hash: 1
‣ For every depth:
  ‣ Find all connected edges. Edge-Index or Index-Free: 1
  ‣ Filter non-matching edges. Linear in edges: n
  ‣ Find connected vertices. Depends on indexes: Hash: n * 1
  ‣ Filter non-matching vertices. Linear in vertices: n
‣ Only one pass: 3n, so O(3n)
Traversal - Complexity

‣ Linear sounds evil?


‣ NOT linear in All Edges O(E)
‣ Only Linear in relevant Edges n < E
‣ Traversals solely scale with their result size
‣ They are not affected at all by the total amount of data
‣ BUT: Every depth increases the exponent: O(3 * n^d)
‣ "7 degrees of separation": 3 * n^6 < E < 3 * n^7
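
To make the growth concrete, here is a small Python calculation of the bound. It assumes n is read as the number of relevant edges per visited vertex and d as the traversal depth; the value n = 10 is an arbitrary example, not a figure from the talk.

```python
def traversal_cost(n, depth):
    """Upper bound 3 * n^depth on the work of a traversal of that depth."""
    return 3 * n ** depth


# Cost for depths 1 to 3 with 10 relevant edges per vertex.
costs = {depth: traversal_cost(10, depth) for depth in range(1, 4)}
```

Each extra level multiplies the work by n, which is why reducing n pays off more than any constant-factor tuning.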
Challenge 1: Supernodes

‣ Many graphs have "celebrities"


‣ Vertices with many inbound and/or outbound edges
‣ Traversing over them is expensive (linear in number of Edges)
‣ Often you only need a subset of edges

First Boost - Vertex Centric Indices

‣ Remember Complexity? O(3 * n^d)


‣ Filtering of non-matching edges is linear for every depth

‣ Index all edges based on their vertices and arbitrary other attributes
‣ Find initial set of edges in identical time
‣ Less / No post-filtering required
‣ This decreases the n significantly
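
The idea can be sketched in Python by comparing a plain edge index with one keyed on the vertex plus an edge attribute. This is an illustration under assumptions, not ArangoDB's index implementation: the edge data, the `type` attribute, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical supernode: alice has 1000 "follows" edges and 2 "friend" edges.
# Each edge is (source, target, type).
EDGES = [("alice", f"fan{i}", "follows") for i in range(1000)] + [
    ("alice", "bob", "friend"),
    ("alice", "charly", "friend"),
]


def build_edge_index(edges):
    """Plain edge index: source vertex -> all outbound edges."""
    index = defaultdict(list)
    for edge in edges:
        index[edge[0]].append(edge)
    return index


def build_vertex_centric_index(edges):
    """Vertex-centric index: (source vertex, edge type) -> matching edges."""
    index = defaultdict(list)
    for edge in edges:
        index[(edge[0], edge[2])].append(edge)
    return index


edge_index = build_edge_index(EDGES)
vertex_centric_index = build_vertex_centric_index(EDGES)

# Without the vertex-centric index: fetch every edge, then post-filter.
scanned = edge_index["alice"]
friends_filtered = [edge[1] for edge in scanned if edge[2] == "friend"]

# With it: the lookup returns only the matching edges.
matching = vertex_centric_index[("alice", "friend")]
friends_indexed = [edge[1] for edge in matching]
```

Both lookups cost one hash access, but the second hands the traversal 2 edges instead of 1002, which is the reduction in n the slide refers to.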

Challenge 2: Big Data

‣ We have the rise of big data


‣ Store everything you can
‣ Dataset easily grows beyond one machine
‣ This includes graph data!
Scaling

‣ Distribute graph on several machines (sharding)

‣ How to query it now?


‣ No global view of the graph possible any more
‣ What about edges between servers?

‣ In a sharded environment the network is the bottleneck most of the time


‣ Reduce network hops
‣ Vertex-Centric Indexes again help with super-nodes
‣ But: Only on a local machine
Dangers of Sharding

‣ Only parts of the graph on every machine


‣ Neighboring vertices may be on different machines
‣ Even edges could be on other machines than their vertices

‣ Queries need to be executed in a distributed way


‣ Result needs to be merged locally
Random Distribution

‣ Advantages:
‣ every server takes an equal portion of data
‣ easy to realize
‣ no knowledge about data required
‣ always works
‣ Disadvantages:
‣ Neighbors on different machines
‣ Probably edges on other machines than their vertices
‣ A lot of network overhead is required for querying
Index-Free Adjacency

‣ Used by most other graph databases
‣ Every vertex maintains two lists of its edges (IN and OUT)
‣ Do not use an index to find edges
‣ How to shard this?

‣ ArangoDB uses a hash-based EdgeIndex (O(1) lookup)
‣ The vertex is independent of its edges
‣ It can be stored on a different machine
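
A minimal Python sketch of the contrast: the vertices carry no edge lists, and a separate hash-based index answers both inbound and outbound lookups. The class and method names are hypothetical and do not reflect ArangoDB internals.

```python
from collections import defaultdict


class EdgeIndex:
    """Hash-based lookup of edges by either endpoint (illustrative only)."""

    def __init__(self):
        self._by_source = defaultdict(list)
        self._by_target = defaultdict(list)

    def add(self, source, target):
        edge = (source, target)
        self._by_source[source].append(edge)
        self._by_target[target].append(edge)

    def outbound(self, vertex):
        """Edges leaving `vertex`; one hash lookup on average."""
        return list(self._by_source.get(vertex, []))

    def inbound(self, vertex):
        """Edges arriving at `vertex`; one hash lookup on average."""
        return list(self._by_target.get(vertex, []))


# Vertex documents hold no adjacency lists, so they could be stored
# on a different machine than the edges that reference them.
vertices = {"alice": {"age": 32}, "bob": {"age": 35}, "reading": {}}

index = EdgeIndex()
index.add("alice", "reading")
index.add("bob", "reading")

alice_out = index.outbound("alice")
reading_in = index.inbound("reading")
```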
Domain Based Distribution

‣ Many Graphs have a natural distribution


‣ By country/region for People
‣ By tags for Blogs
‣ By category for Products
‣ Most edges in same group
‣ Rare edges between groups

ArangoDB Enterprise Edition uses Domain Knowledge for short-cuts
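
The effect of domain-based placement can be sketched by counting edges that cross shard boundaries. This Python example is an illustration, not the SmartGraphs algorithm. The people, regions, and edges are hypothetical, and a round-robin placement stands in for random distribution so that the result is deterministic.

```python
# Hypothetical people with a region attribute.
REGION = {
    "alice": "eu", "bob": "eu", "charly": "eu",
    "dave": "us", "eve": "us", "frank": "us",
}
# Hypothetical mapping of each region to one shard.
REGION_TO_SHARD = {"eu": 0, "us": 1}

# Most edges stay inside a region; one edge links the two regions.
EDGES = [
    ("alice", "bob"), ("alice", "charly"), ("bob", "charly"),
    ("dave", "eve"), ("eve", "frank"),
    ("charly", "dave"),
]


def cross_shard_edges(edges, shard_of):
    """Count edges whose endpoints are placed on different shards."""
    return sum(1 for a, b in edges if shard_of[a] != shard_of[b])


# Data-oblivious placement: round-robin over sorted names across 2 shards.
oblivious = {name: i % 2 for i, name in enumerate(sorted(REGION))}

# Domain-based placement: every vertex goes to the shard of its region.
domain = {name: REGION_TO_SHARD[region] for name, region in REGION.items()}

oblivious_cross = cross_shard_edges(EDGES, oblivious)
domain_cross = cross_shard_edges(EDGES, domain)
```

In this toy data set the data-oblivious placement leaves five of six edges crossing the network, while the domain-based placement leaves only the single edge between regions.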
SmartGraphs - How it works

[Figure: cluster architecture with Foxx services running on two Coordinators, in front of DB Server 1, DB Server 2, and DB Server n]


Thank You
‣ Further questions?
‣ Follow us on twitter: @arangodb and @ArangoMatthew
‣ Join our slack: slack.arangodb.com
‣ https://www.arangodb.com/speakers/matthew-von-maszewski/
‣ https://github.com/arangodb/arangodb
