GraphFrames
DataFrame-based graphs for Apache® Spark™
Joseph K. Bradley
4/14/2016
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache
Spark PMC member working on MLlib at
Databricks. Previously, he was a postdoc at UC
Berkeley after receiving his Ph.D. in Machine
Learning from Carnegie Mellon U. in 2013. His
research included probabilistic graphical models,
parallel sparse regression, and aggregation
mechanisms for peer grading in MOOCs.
About the moderator: Denny Lee
Denny Lee is a Technology Evangelist with
Databricks; he is a hands-on data sciences engineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premises and cloud.
Prior to joining Databricks, Denny worked as a
Senior Director of Data Sciences Engineering at
Concur and was part of the incubation team that
built Hadoop on Windows and Azure (currently
known as HDInsight).
We are Databricks, the company behind Apache Spark
Founded by the creators of Apache Spark in 2013
Share of Spark code contributed by Databricks in 2014: 75%
Data Value
Created Databricks on top of Spark to make big data simple.
Apache Spark Engine
Spark Core
Standard libraries: Spark Streaming, Spark SQL, MLlib, GraphX
APIs: Python, Java, Scala, and R
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
Graphs
Example: airports & flights between them
Airports in the example graph: JFK, IAD, LAX, SFO, SEA, DFW
Example vertex:
id     City        State
"JFK"  "New York"  NY
Example edge:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
Apache Spark’s GraphX library
Overview
• General-purpose graph processing library
• Optimized for fast distributed computing
• Library of algorithms: PageRank, Connected Components, etc.
Challenges
• No Java or Python APIs
• Lower-level RDD-based API (vs. DataFrames)
• Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
Enter GraphFrames
Goal: DataFrame-based graphs on Apache Spark
• Simplify interactive queries
• Support motif-finding for structural pattern search
• Benefit from DataFrame optimizations
Collaboration between Databricks, UC Berkeley & MIT
+ Now with community contributors!
GraphFrames
“vertices” DataFrame
• 1 vertex per Row
• id: column with unique ID
“edges” DataFrame
• 1 edge per Row
• src, dst: columns using IDs from vertices.id
Extra columns store vertex or edge data (a.k.a. attributes or properties).
vertices:
id     City        State
"JFK"  "New York"  NY
"SEA"  "Seattle"   WA
edges:
src    dst    delay  tripID
"JFK"  "SEA"  45     1058923
"DFW"  "SFO"  -7     4100224
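The two-table layout above can be sketched without Spark. The following is a minimal plain-Python illustration, not the GraphFrames API: lists of dicts stand in for the two DataFrames, the rows for DFW and SFO are filled in for illustration, and `check_graph` is a hypothetical helper that tests the two rules stated on the slide (unique `id`, and `src`/`dst` drawn from `vertices.id`).

```python
# Plain-Python stand-ins for the "vertices" and "edges" DataFrames.
# The DFW and SFO vertex rows are illustrative; the slide shows only JFK and SEA.
vertices = [
    {"id": "JFK", "City": "New York", "State": "NY"},
    {"id": "SEA", "City": "Seattle", "State": "WA"},
    {"id": "DFW", "City": "Dallas", "State": "TX"},
    {"id": "SFO", "City": "San Francisco", "State": "CA"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45, "tripID": 1058923},
    {"src": "DFW", "dst": "SFO", "delay": -7, "tripID": 4100224},
]


def check_graph(vertices, edges):
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    ids = [v["id"] for v in vertices]
    # Rule 1: the id column must be unique.
    if len(ids) != len(set(ids)):
        problems.append("vertex ids are not unique")
    # Rule 2: every src and dst must refer to an existing vertex id.
    known = set(ids)
    for e in edges:
        for col in ("src", "dst"):
            if e[col] not in known:
                problems.append(f"edge {e['tripID']}: {col}={e[col]!r} is not a vertex id")
    return problems
```

Any extra keys in a row (City, State, delay, tripID) play the role of the attribute columns mentioned above.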
Demo:
Building a GraphFrame
Queries
Simple queries
Motif finding
Graph algorithms
Simple queries
SQL queries on vertices & edges
E.g., what trips are most likely to have significant delays?
Graph queries
• Vertex degrees
  • # edges per vertex (incoming, outgoing, total)
• Triplets
  • Join vertices and edges to get (src, edge, dst)
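The queries on this slide can be sketched in plain Python to show what each one computes. This is an illustration under assumptions, not the GraphFrames API: the rows, the helper names (`delayed_share`, `degrees`, `triplets`), and the delay threshold of 20 are all made up for the example.

```python
from collections import Counter

# Illustrative stand-ins for the vertices and edges DataFrames.
vertices = [
    {"id": "JFK", "City": "New York"},
    {"id": "SEA", "City": "Seattle"},
    {"id": "SFO", "City": "San Francisco"},
    {"id": "DFW", "City": "Dallas"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "JFK", "dst": "SEA", "delay": 5},
    {"src": "SEA", "dst": "SFO", "delay": 30},
    {"src": "DFW", "dst": "SFO", "delay": -7},
]


def delayed_share(edges, threshold=20):
    """Simple query: per route, the fraction of trips with delay above threshold."""
    total, late = Counter(), Counter()
    for e in edges:
        route = (e["src"], e["dst"])
        total[route] += 1
        if e["delay"] > threshold:
            late[route] += 1
    return {route: late[route] / total[route] for route in total}


def degrees(edges):
    """Vertex degrees: incoming, outgoing, and total edge counts per vertex id."""
    in_deg = Counter(e["dst"] for e in edges)
    out_deg = Counter(e["src"] for e in edges)
    return in_deg, out_deg, in_deg + out_deg


def triplets(vertices, edges):
    """Triplets: join each edge with its source and destination vertex rows."""
    by_id = {v["id"]: v for v in vertices}
    return [(by_id[e["src"]], e, by_id[e["dst"]]) for e in edges]
```

The first function is an ordinary group-by over the edges table; the other two are the graph-specific views, each of which reduces to counting or joining on the `src` and `dst` columns.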
Motif finding
Search for structural patterns within a graph.
val paths: DataFrame =
  g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
Then filter using vertex & edge data.
paths.filter("e1.delay > 20")
(Figure: the example airport graph with IAD, JFK, LAX, SFO, SEA, DFW; vertices (a), (b), (c) are highlighted step by step as the pattern is matched.)
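To make the semantics of the pattern above concrete, here is a minimal plain-Python sketch of what the motif asks for: an edge from a to b, an edge from b to c, and no edge from c back to a. It is not how GraphFrames evaluates motifs (elsewhere in this deck motif finding is described as a series of DataFrame joins); the edge rows and the function name `find_two_hop_without_return` are assumptions for illustration.

```python
# Illustrative edges; only src, dst and delay are needed for this pattern.
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "SEA", "dst": "SFO", "delay": 10},
    {"src": "SFO", "dst": "JFK", "delay": 30},
    {"src": "SEA", "dst": "LAX", "delay": 0},
]


def find_two_hop_without_return(edges):
    """Match (a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a) with a nested-loop join."""
    has_edge = {(e["src"], e["dst"]) for e in edges}
    paths = []
    for e1 in edges:
        for e2 in edges:
            # Join condition: e1 ends where e2 starts (the shared vertex b).
            if e1["dst"] != e2["src"]:
                continue
            a, b, c = e1["src"], e1["dst"], e2["dst"]
            # Negated term: drop the match if any edge goes from c back to a.
            if (c, a) in has_edge:
                continue
            paths.append({"a": a, "e1": e1, "b": b, "e2": e2, "c": c})
    return paths


paths = find_two_hop_without_return(edges)
# Equivalent of the follow-up filter on edge data: e1.delay > 20.
delayed = [p for p in paths if p["e1"]["delay"] > 20]
```

With these rows the only match is JFK → SEA → LAX: the JFK → SEA → SFO path is rejected because SFO has a flight back to JFK.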
Graph algorithms
Find important vertices
• PageRank
Find paths between sets of vertices
• Breadth-first search (BFS)
• Shortest paths
Find groups of vertices (components, communities)
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
Other
• Triangle counting
• SVDPlusPlus
Algorithm implementations
Mostly wrappers for GraphX
• PageRank
• Shortest paths
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
• SVDPlusPlus
Some algorithms implemented using DataFrames
• Breadth-first search
• Triangle counting
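As a rough picture of how breadth-first search can be expressed with table operations alone, the sketch below repeats two steps: filter the current paths for ones that reach a target, then join the path ends against `edges.src` to extend them by one hop. It is plain Python under assumptions (illustrative rows, a hypothetical `bfs` helper and `max_path_length` parameter) and is not the GraphFrames implementation.

```python
# Illustrative edges table.
edges = [
    {"src": "JFK", "dst": "SEA"},
    {"src": "SEA", "dst": "SFO"},
    {"src": "JFK", "dst": "DFW"},
    {"src": "DFW", "dst": "SFO"},
    {"src": "SFO", "dst": "LAX"},
]


def bfs(edges, sources, targets, max_path_length=10):
    """Return all shortest paths (lists of vertex ids) from any source to any target."""
    sources, targets = set(sources), set(targets)
    paths = [[s] for s in sorted(sources)]
    visited = set(sources)
    for _ in range(max_path_length + 1):
        # Filter step: keep paths whose last vertex is a target.
        hits = [p for p in paths if p[-1] in targets]
        if hits:
            return hits
        # Join step: match each path end with edges.src, skipping visited vertices.
        extended = [
            p + [e["dst"]]
            for p in paths
            for e in edges
            if e["src"] == p[-1] and e["dst"] not in visited
        ]
        if not extended:
            return []
        visited |= {p[-1] for p in extended}
        paths = extended
    return []
```

Each loop iteration corresponds to one join plus one filter, which is why the number of joins grows with the length of the path being searched for.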
Saving & loading graphs
Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
In the future...
• SQL data sources for graph formats
APIs: Scala, Java, Python
API available from all 3 languages
→ First time GraphX functionality has been available to Java & Python users
2 missing items (WIP)
• Java-friendliness is currently in alpha.
• Python does not have aggregateMessages (for implementing your own graph algorithms).
2 types of graph libraries
Graph algorithms
• Standard & custom algorithms
• Optimized for batch processing
Graph queries
• Motif finding
• Point queries & updates
GraphFrames: Both algorithms & queries (but not point updates)
GraphFrames vs. GraphX
                        GraphFrames                      GraphX
Built on                DataFrames                       RDDs
Languages               Scala, Java, Python              Scala
Use cases               Queries & algorithms             Algorithms
Vertex IDs              Any type (in Catalyst)           Long
Vertex/edge attributes  Any number of DataFrame columns  Any type (VD, ED)
Return types            GraphFrame or DataFrame          Graph[VD, ED], or RDD[Long, VD]
GraphX compatibility
Simple conversions between GraphFrames & GraphX.
val g: GraphFrame = ...
// Convert GraphFrame → GraphX
val gx: Graph[Row, Row] = g.toGraphX
// Convert GraphX → GraphFrame
val g2: GraphFrame = GraphFrame.fromGraphX(gx)
Vertex & edge attributes are Rows in order to handle non-Long IDs.
Wrapping existing GraphX code: See Belief Propagation example:
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala
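The point about Rows and non-Long IDs can be illustrated with a small plain-Python sketch. The idea shown here is only one plausible reading of the slide: give each vertex an integer index to serve as the Long ID and carry the full original row along as the attribute, so the original IDs survive a round trip. The helper names `to_long_ids` and `from_long_ids` are hypothetical, and this is not claimed to be how GraphFrames assigns IDs internally.

```python
# Illustrative tables with string vertex IDs.
vertices = [
    {"id": "JFK", "City": "New York"},
    {"id": "SEA", "City": "Seattle"},
]
edges = [{"src": "JFK", "dst": "SEA", "delay": 45}]


def to_long_ids(vertices, edges):
    """Map arbitrary vertex ids to integers, keeping each original row as the attribute."""
    index = {v["id"]: i for i, v in enumerate(vertices)}
    gx_vertices = [(index[v["id"]], v) for v in vertices]
    gx_edges = [(index[e["src"]], index[e["dst"]], e) for e in edges]
    return gx_vertices, gx_edges


def from_long_ids(gx_vertices, gx_edges):
    """Recover the original tables from the row-valued attributes."""
    return [row for _, row in gx_vertices], [row for _, _, row in gx_edges]


gx_vertices, gx_edges = to_long_ids(vertices, edges)
restored_vertices, restored_edges = from_long_ids(gx_vertices, gx_edges)
```

Because the attribute is the whole row, nothing about the original ID type has to be encoded in the integer itself.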
Scalability
Current status
• DataFrame-based parts benefit from DataFrame scalability + performance optimizations (Catalyst, Tungsten).
• GraphX wrappers are as fast as GraphX (+ conversion overhead).
WIP
• GraphX has optimizations which are not yet ported to GraphFrames.
• See next slide…
WIP optimizations
Join elimination
• GraphFrame algorithms require lots of joins.
• Not all joins are necessary.
Solution:
• Vertex IDs serve as unique keys.
• Tracking keys allows Catalyst to eliminate some joins.
Materialized views
• Data locality for common use cases
• Message-passing algorithms often need “triplet view” (src, edge, dst)
Solution:
• Materialize specific views
• Analogous to GraphX’s “replicated vertex view”
For more info & benchmark results, see Ankur Dave’s Spark Summit East (SSE) 2016 talk.
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/
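Why unique keys make some joins unnecessary can be shown with a small plain-Python sketch. It is an illustration of the reasoning only, not of Catalyst: the rows and function names are assumptions. If every `src` matches exactly one vertex row, an inner join of edges to vertices neither drops nor duplicates edges, so a query that reads only edge columns gets the same answer without the join.

```python
vertices = [{"id": "JFK"}, {"id": "SEA"}, {"id": "DFW"}, {"id": "SFO"}]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "DFW", "dst": "SFO", "delay": -7},
]


def join_edges_to_src(vertices, edges):
    """Inner join of edges with vertices on edges.src == vertices.id."""
    return [(e, v) for e in edges for v in vertices if v["id"] == e["src"]]


def delays_with_join(vertices, edges):
    """Query that joins first but then uses edge columns only."""
    return [e["delay"] for e, _v in join_edges_to_src(vertices, edges)]


def delays_without_join(edges):
    """The same query with the join eliminated."""
    return [e["delay"] for e in edges]
```

With unique vertex IDs the two queries agree. If an ID were duplicated, the join would repeat the matching edges and the shortcut would no longer be valid, which is why the optimizer has to know that the key is unique before it can drop the join.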
Implementing new algorithms
Method 1: DataFrame & GraphFrame operations
Motif finding
• Series of DataFrame joins
Triangle count
• DataFrame ops + motif finding
BFS
• DataFrame joins & filters
Method 2: Message passing
aggregateMessages
• Same primitive as GraphX
• Specify messages & aggregation using DataFrame expressions
• Belief propagation example code
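The message-passing primitive can be sketched in plain Python as follows. This only illustrates the idea (each edge may send a message to its source, its destination, or both, and messages arriving at the same vertex are combined); the function signature, the rows, and the use of Python callables in place of DataFrame expressions are all assumptions, not the GraphFrames `aggregateMessages` API.

```python
vertices_by_id = {
    "JFK": {"id": "JFK"},
    "SEA": {"id": "SEA"},
    "DFW": {"id": "DFW"},
    "SFO": {"id": "SFO"},
}
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},
    {"src": "DFW", "dst": "SFO", "delay": -7},
    {"src": "SEA", "dst": "SFO", "delay": 30},
]


def aggregate_messages(edges, vertices_by_id, send_to_dst=None, send_to_src=None, aggregate=sum):
    """Send one message per edge to dst and/or src, then aggregate per receiving vertex."""
    inbox = {}
    for e in edges:
        src, dst = vertices_by_id[e["src"]], vertices_by_id[e["dst"]]
        if send_to_dst is not None:
            inbox.setdefault(e["dst"], []).append(send_to_dst(src, e, dst))
        if send_to_src is not None:
            inbox.setdefault(e["src"], []).append(send_to_src(src, e, dst))
    # Vertices that received no message do not appear in the result.
    return {vid: aggregate(msgs) for vid, msgs in inbox.items()}


# Total delay of incoming flights per airport.
incoming_delay = aggregate_messages(
    edges, vertices_by_id, send_to_dst=lambda src, e, dst: e["delay"]
)

# Total degree per airport: every edge sends a 1 to both of its endpoints.
total_degree = aggregate_messages(
    edges,
    vertices_by_id,
    send_to_dst=lambda src, e, dst: 1,
    send_to_src=lambda src, e, dst: 1,
)
```

Each message function sees the full (src, edge, dst) triplet, which is the same view the materialized "triplet view" optimization is meant to serve.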
Current status
Published
• Open source (Apache 2.0) on GitHub
  https://github.com/graphframes/graphframes
• Spark package http://spark-packages.org/package/graphframes/graphframes
Compatible
• Spark 1.4, 1.5, 1.6
• Databricks Community Edition
Documented
• http://graphframes.github.io/
Roadmap
• Merge WIP speed optimizations
• Java API tests & examples
• Migrate more algorithms to DataFrame-based implementations for greater scalability
• Get community feedback!
Contribute
• Tracking issues on GitHub
• Thanks to those who have already sent pull requests!
Resources for learning more
User guide + API docs: http://graphframes.github.io/
• Quick-start
• Overview & examples for all algorithms
• Also available as executable notebooks:
  • Scala: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html
  • Python: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html
Blog posts
• Intro: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
• Flight delay analysis: https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-spark-graphframes.html
Thank you!
Thanks to
• Denny Lee & Bill Chambers (demo)
• Tim Hunter, Xiangrui Meng, Ankur Dave & others (GraphFrames development)
