
Big Data

Facebook Wall Data using Graph API


Presented by:
Prashant Patel-2556219
Jaykrushna Patel-2619715
Outline
• Data Source
• Possible tools for processing our data
• Big Data Processing System: MongoDB
• Data Transformation: MongoDB
• Data Mining Queries
• References
• Research Paper
Data Source
• Downloaded from the Facebook Graph API
• Approach: Python script
• FQL
• Facebook Query Language
• Access token
• Obtained by authorizing through our own application
Python Script
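The script itself appears in the slides only as a screenshot. A minimal sketch of such a fetch with the requests library, assuming a pre-obtained access token and the /me/feed Graph API endpoint (the project's actual endpoint, fields, and FQL usage may differ):

import json
import requests

# Hypothetical values: the token is obtained by authorizing our own application
ACCESS_TOKEN = "<access-token>"
url = "https://graph.facebook.com/me/feed"
params = {"access_token": ACCESS_TOKEN, "limit": 100}

posts = []
while url:
    page = requests.get(url, params=params).json()
    posts.extend(page.get("data", []))
    # The Graph API paginates results; follow the "next" link until exhausted
    url = page.get("paging", {}).get("next")
    params = {}  # the "next" URL already carries its query string

# Save the raw wall data for later import into MongoDB
with open("seed.json", "w") as f:
    json.dump(posts, f)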
Original Data
Sample data
Possible tools for processing our data
• Spark
• Hive
• MongoDB
Spark: Easy JSON data manipulation
• The entry point into all relational functionality in Spark is the SQLContext class, so we first create an SQLContext for issuing SQL commands.
• We then load a JSON file (one object per line), returning the result as a SchemaRDD.
Schema of the data
• Create a table in Spark: here we create the temporary table jt1 for storing the JSON data.
• SQL statements can be run using the sql method provided by sqlContext. Here we display only one row of data from the jt1 table.
Here, we display all data from the jt1 table.
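Putting those steps together, a minimal PySpark sketch of the sequence above, assuming the Spark 1.x SQL API the slides reference (registerAsTable in 1.0 became registerTempTable in 1.1; jsonFile was superseded in later releases):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="WallData")
# Entry point into all relational functionality in Spark 1.x
sqlContext = SQLContext(sc)

# Load a JSON file (one object per line) as a SchemaRDD
posts = sqlContext.jsonFile("seed.json")
posts.printSchema()  # schema of the data, inferred from the JSON

# Temporary table jt1 for storing the JSON data
posts.registerTempTable("jt1")

# Display one row, then all rows, from the jt1 table
print(sqlContext.sql("SELECT * FROM jt1 LIMIT 1").collect())
for row in sqlContext.sql("SELECT * FROM jt1").collect():
    print(row)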
Why not Spark
• It does not support rich data mining queries.
• No support for interactive data mining queries.
Why not Hive
• Hive cannot readily process complex JSON graph data.
• No support for large, dynamic schemas.
Big Data Processing System: MongoDB
• MongoDB is an open-source document database, and the leading
NoSQL database.
• Features:
• Document-Oriented Storage
• Full Index Support
• Auto-Sharding
• Scale horizontally without compromising functionality.
• Querying
• Rich, document-based queries.
• Map/Reduce
• Flexible aggregation and data processing.
• GridFS
• Store files of any size without complicating your stack.
• MongoDB Management Service
• Also supports cloud services.
• Rich GUI tools are provided:
• Edda
• Fluentd
• Fang of Mongo
• Umongo
• Mongo-vision
• MongoHub
• MongoVUE
• RockMongo
• Robomongo
• MongoDB is a document-oriented database in which data is organized as JSON or BSON documents stored in collections.
• A database holds a set of collections, and a collection holds a set of documents. Collections and documents are analogous to tables and rows in a relational database.
• A document is a set of key-value pairs. Documents have a dynamic schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
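For example, these two documents can coexist in one collection even though their fields differ; a sketch using the pymongo driver (database and field names are illustrative):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Dynamic schema: the two documents share a collection but not a structure,
# and the common field "likes" holds a different type in each.
db.graph.insert_many([
    {"post_id": "100_1", "message": "hello", "likes": {"count": 3}},
    {"post_id": "100_2", "comments": [{"text": "nice"}], "likes": 3},
])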
Data Transformation: MongoDB
• Start Mongo Server
• ./mongod
• Start Mongo Client
• mongo
• Server

• Client
• Database

• Collections
• Import Data

mongoimport --db <db-name> --collection <coll-name> --type json --file seed.json --jsonArray
Imported data
Data Transformation
• Issue: retrieving specific records
• All queries return the same record: the data array of all elements.
• Reason:
All the data sits in a single array, so it is one document with a single ObjectId (index).

• Solution
• Transform the array elements into separate documents,
• each of them having a different index.
• Approaches to transform the data into individual documents (see the sketch below):
• Script (Python, PHP)
• Use the batch-insert functionality of MongoDB
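A minimal sketch of the script approach with pymongo, assuming seed.json holds one JSON array of posts as downloaded (insert_many is the batch-insert call in current pymongo; older drivers exposed it through insert):

import json
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# seed.json contains a single JSON array of wall posts
with open("seed.json") as f:
    posts = json.load(f)

# Batch insert: each array element becomes its own document,
# receiving its own ObjectId, so queries can match individual posts.
db.graph.insert_many(posts)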
• Before applying the transformation strategy

• After applying the transformation strategy


Data Mining Queries
• Retrieve all records from the graph collection.
db.graph.find()
Retrieve records having more than 0 comments
db.graph.find({"comments.count" : {$gt : 0 }})
Return a few fields instead of the whole structure
Suppose we want only post_id and the number of comments on it
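The query screenshot is not preserved; a sketch of such a projection in pymongo (the field names post_id and comments.count are assumed from the sample data):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Project only post_id and the comment count, suppressing _id
for doc in db.graph.find({}, {"post_id": 1, "comments.count": 1, "_id": 0}):
    print(doc)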
Output
Return the 5 oldest messages liked by at least one user
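A hedged pymongo reconstruction, assuming likes.count and created_time fields in the post documents:

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Messages with at least one like, oldest first, limited to 5
cursor = (db.graph
          .find({"likes.count": {"$gt": 0}}, {"message": 1, "created_time": 1})
          .sort("created_time", 1)
          .limit(5))
for doc in cursor:
    print(doc)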
Output
Return actors whose total number of likes across all messages exceeds 100, in descending order of total likes
• Query

• Compare to SQL Query
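Neither screenshot survives; a hedged reconstruction with the aggregation framework, the rough SQL equivalent shown as a comment (the actor_id and likes.count field names are assumptions from the Graph API post format):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Roughly: SELECT actor_id, SUM(likes.count) AS totalLikes FROM graph
#          GROUP BY actor_id HAVING totalLikes > 100
#          ORDER BY totalLikes DESC
pipeline = [
    {"$group": {"_id": "$actor_id", "totalLikes": {"$sum": "$likes.count"}}},
    {"$match": {"totalLikes": {"$gt": 100}}},
    {"$sort": {"totalLikes": -1}},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)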


Output:
Query returning each message and its comment count; if a message has no comments, return "No Comments" instead
Query:
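The screenshot is lost; one way to express this in pymongo uses $project with $cond (a sketch, assuming a comments.count field):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Emit each message with its comment count, or "No Comments" when zero
pipeline = [
    {"$project": {
        "message": 1,
        "comments": {"$cond": [{"$gt": ["$comments.count", 0]},
                               "$comments.count",
                               "No Comments"]},
    }},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)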
Output
Return selected fields inside a nested structure
For each post, return the message and comments[count, comment_list[text]]

• Query
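A sketch of the nested projection in pymongo, using dotted paths into the comments structure (field names follow the slide's outline):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Select fields inside the nested comments structure
projection = {
    "post_id": 1,
    "message": 1,
    "comments.count": 1,
    "comments.comment_list.text": 1,
    "_id": 0,
}
for doc in db.graph.find({}, projection):
    print(doc)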
Output
Return data with the following structure
{
  "Biggestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "Smallestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "actor" (USER): actor_id (integer)
}
Query
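A hedged reconstruction: sorting by likes first lets $first and $last pick each actor's most- and least-liked message per group; a trailing $project could nest the output into the exact shape above (field names are assumptions):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Sort by likes descending, then take the first/last message per actor
pipeline = [
    {"$sort": {"likes.count": -1}},
    {"$group": {
        "_id": "$actor_id",
        "Biggestlikemessage": {"$first": "$message"},
        "BiggestTotalNoofLikes": {"$first": "$likes.count"},
        "Smallestlikemessage": {"$last": "$message"},
        "SmallestTotalNoofLikes": {"$last": "$likes.count"},
    }},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)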
Output
References
https://developers.facebook.com/
https://spark.apache.org/docs/1.0.2/quick-start.html
https://spark.apache.org/documentation.html
http://docs.mongodb.org/manual/
https://docs.python.org/2/tutorial/
http://www.tutorialspoint.com/mongodb/mongodb_quick_guide.htm
https://www.youtube.com/watch?v=o3u97Vry2bE
https://www.youtube.com/watch?v=bWorBGOFBWY&list=PL-x35fyliRwhKT-NpTKprPW1bkbdDcTTW
Research Paper:
Petabyte Scale Databases and Storage Systems
Deployed at Facebook
Dhruba Borthakur, Engineer at Facebook
Four major types of storage systems at Facebook
• Online Transaction Processing Databases (OLTP)
• The Facebook Social Graph
• Semi-online Light Transaction Processing Databases (SLTP)
• Facebook Messages and Facebook Time Series
• Immutable DataStore
• Photos, videos, etc.
• Analytics DataStore
• Data Warehouse, Logs storage
Size and Scale of Databases
Characteristics
Facebook Social Graph: TAO and MySQL
• An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
• Graph data Model
• Nodes and Edges: Objects and Associations
Data model
• Objects & Associations
• Object -> unique 64-bit ID plus a typed dictionary
 (id) -> (otype, (key -> value)* )
 ID 6815841748 -> {‘type’: page, ‘name’: “John”, … }
• Association -> typed directed edge between 2 IDs
 (id1, atype, id2) -> (time, (key -> value)* )
 (8636146, RSVP, 130855887032173) -> (1327719600, {‘response’: ‘YES’})
• Association lists
 (id1, atype) -> all associations with given id1, atype in desc order by time
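In Python terms, the model sketches out as plain dictionaries (purely illustrative; TAO's actual API is not shown in the slides):

# Object: unique 64-bit id -> (otype, {key: value})
objects = {
    6815841748: ("page", {"name": "John"}),
}

# Association: (id1, atype, id2) -> (time, {key: value})
assocs = {
    (8636146, "RSVP", 130855887032173): (1327719600, {"response": "YES"}),
}

# Association list: all (id1, atype, *) edges, newest first
def assoc_list(id1, atype):
    edges = [kv for kv in assocs.items() if kv[0][0] == id1 and kv[0][1] == atype]
    return sorted(edges, key=lambda kv: kv[1][0], reverse=True)

print(assoc_list(8636146, "RSVP"))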
Architecture: Cache & Storage
Messages & Time Series Database: an SLTP workload
• Facebook Messages:
Why HBase
• High write throughput
• Horizontal scalability
• Automatic Failover
• Strong consistency within a data center
• Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
• Why is this SLTP?
• Semi-online: Queries run even if part of the database is offline
• Lightweight Transactions: single row transactions
• Storage-capacity bound rather than IOPS or CPU bound
What we store in HBase
• Small messages
• Message metadata (thread/message indices)
• Search index
• Large attachments stored in Haystack (photo store)
Size and scale of Messages Database
• 6 Billion messages/day
• 74 Billion operations/day
• At peak: 1.5 million operations/sec
• 55% read, 45% write operations
• Average write operation inserts 16 records
• All data is LZO compressed
• Growing at 8 TB/day
Haystack: The Photo Store
• Facebook Photo DataStore
Hive Analytics Warehouse
• Life of a photo tag in Hadoop/Hive storage
Why Hive?
• Prospecting for gold in the Wild West…
 A platform for huge data-experiments
 A majority of queries are searching for a single gold nugget
 Great advantage in keeping all data in one queryable system
 No structure to data, specify structure at query time
• Crowd Sourcing for data discovery
 There are 50K tables in a single warehouse
 Users are DBAs themselves
 Questions about a table are directed to users of that table
 Automatic query lineage tools help here
Thank You.
