
Big Data

Facebook Wall Data using Graph API


Presented by:
Prashant Patel-2556219
Jaykrushna Patel-2619715
Outline
• Data Source
• Possible tools for processing our data
• Big Data Processing System: MongoDB
• Data Transformation: MongoDB
• Data Mining Queries
• References
• Research Paper
Data Source
• Downloaded from the Facebook Graph API
• Approach: Python script
• FQL
• Facebook Query Language
• Access token
• Obtained by authorizing through our own application
Python Script
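The script itself appears in the slides only as a screenshot. A minimal sketch of such a fetch with the requests library, assuming a pre-obtained access token and the /me/feed Graph API endpoint (the project's actual endpoint, fields, and FQL usage may differ):

import json
import requests

# Hypothetical values: the token is obtained by authorizing our own application
ACCESS_TOKEN = "<access-token>"
url = "https://graph.facebook.com/me/feed"
params = {"access_token": ACCESS_TOKEN, "limit": 100}

posts = []
while url:
    page = requests.get(url, params=params).json()
    posts.extend(page.get("data", []))
    # The Graph API paginates results; follow the "next" link until exhausted
    url = page.get("paging", {}).get("next")
    params = {}  # the "next" URL already carries its query string

# Save the raw wall data for later import into MongoDB
with open("seed.json", "w") as f:
    json.dump(posts, f)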
Original Data
Sample data
Possible tools for processing our data
• Spark
• Hive
• MongoDB
Spark: Easy JSON data manipulation
• The entry point into all relational functionality in Spark is the SQLContext class, so we first create an SQLContext for issuing SQL commands.
• We then load a JSON file (one object per line), returning the result as a SchemaRDD.
Schema of the data
• Create a table in Spark: here we create the temporary table jt1 for storing the JSON data.
• SQL statements can be run using the sql method provided by sqlContext. Here we display only one row of data from the jt1 table.
Here, we display all data from the jt1 table.
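Putting those steps together, a minimal PySpark sketch of the sequence above, assuming the Spark 1.x SQL API the slides reference (registerAsTable in 1.0 became registerTempTable in 1.1; jsonFile was superseded in later releases):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="WallData")
# Entry point into all relational functionality in Spark 1.x
sqlContext = SQLContext(sc)

# Load a JSON file (one object per line) as a SchemaRDD
posts = sqlContext.jsonFile("seed.json")
posts.printSchema()  # schema of the data, inferred from the JSON

# Temporary table jt1 for storing the JSON data
posts.registerTempTable("jt1")

# Display one row, then all rows, from the jt1 table
print(sqlContext.sql("SELECT * FROM jt1 LIMIT 1").collect())
for row in sqlContext.sql("SELECT * FROM jt1").collect():
    print(row)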
Why not Spark
• It does not support rich data mining queries.
• No support for interactive data mining queries.
Why not Hive
• Hive cannot readily process complex JSON graph data.
• No support for large, dynamic schemas.
Big Data Processing System: MongoDB
• MongoDB is an open-source document database, and the leading
NoSQL database.
• Features:
• Document-Oriented Storage
• Full Index Support
• Auto-Sharding
• Scale horizontally without compromising functionality.
• Querying
• Rich, document-based queries.
• Map/Reduce
• Flexible aggregation and data processing.
• GridFS
• Store files of any size without complicating your stack.
• MongoDB Management Service
• Also supports cloud services.
• Rich GUI tools are provided:
• Edda
• Fluentd
• Fang of Mongo
• Umongo
• Mongo-vision
• MongoHub
• MongoVUE
• RockMongo
• Robomongo
• MongoDB is a document-oriented database in which data is organized as JSON or BSON documents stored in collections.
• A database holds a set of collections, and a collection holds a set of documents. Collections and documents are analogous to tables and rows in a relational database.
• A document is a set of key-value pairs. Documents have a dynamic schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
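For example, these two documents can coexist in one collection even though their fields differ; a sketch using the pymongo driver (database and field names are illustrative):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Dynamic schema: the two documents share a collection but not a structure,
# and the common field "likes" holds a different type in each.
db.graph.insert_many([
    {"post_id": "100_1", "message": "hello", "likes": {"count": 3}},
    {"post_id": "100_2", "comments": [{"text": "nice"}], "likes": 3},
])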
Data Transformation: MongoDB
• Start Mongo Server
• ./mongod
• Start Mongo Client
• mongo
• Server

• Client
• Database

• Collections
• Import Data

mongoimport --db <db-name> --collection <coll-name> --type json --file seed.json --jsonArray
Imported data
Data Transformation
• Issue: retrieving specific records
• All queries return the same record: the data array of all elements.
• Reason:
All the data sits in a single array, so it is one document with a single ObjectId (index).

• Solution
• Transform the array elements into separate documents,
• each of them having a different index.
• Approaches to transform the data into individual documents (see the sketch below):
• Script (Python, PHP)
• Use the batch-insert functionality of MongoDB
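A minimal sketch of the script approach with pymongo, assuming seed.json holds one JSON array of posts as downloaded (insert_many is the batch-insert call in current pymongo; older drivers exposed it through insert):

import json
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# seed.json contains a single JSON array of wall posts
with open("seed.json") as f:
    posts = json.load(f)

# Batch insert: each array element becomes its own document,
# receiving its own ObjectId, so queries can match individual posts.
db.graph.insert_many(posts)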
• Before applying the transformation strategy

• After applying the transformation strategy


Data Mining Queries
• Retrieve all records from the graph collection.
db.graph.find()
Retrieve records having more than 0 comments
db.graph.find({"comments.count" : {$gt : 0 }})
Return a few fields instead of the whole structure
Suppose we want only post_id and the number of comments on it
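The query screenshot is not preserved; a sketch of such a projection in pymongo (the field names post_id and comments.count are assumed from the sample data):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Project only post_id and the comment count, suppressing _id
for doc in db.graph.find({}, {"post_id": 1, "comments.count": 1, "_id": 0}):
    print(doc)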
Output
Return the 5 oldest messages liked by at least one user
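A hedged pymongo reconstruction, assuming likes.count and created_time fields in the post documents:

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Messages with at least one like, oldest first, limited to 5
cursor = (db.graph
          .find({"likes.count": {"$gt": 0}}, {"message": 1, "created_time": 1})
          .sort("created_time", 1)
          .limit(5))
for doc in cursor:
    print(doc)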
Output
Return actors whose total number of likes across all messages exceeds 100, in descending order of total likes
• Query

• Compare to SQL Query
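Neither screenshot survives; a hedged reconstruction with the aggregation framework, the rough SQL equivalent shown as a comment (the actor_id and likes.count field names are assumptions from the Graph API post format):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Roughly: SELECT actor_id, SUM(likes.count) AS totalLikes FROM graph
#          GROUP BY actor_id HAVING totalLikes > 100
#          ORDER BY totalLikes DESC
pipeline = [
    {"$group": {"_id": "$actor_id", "totalLikes": {"$sum": "$likes.count"}}},
    {"$match": {"totalLikes": {"$gt": 100}}},
    {"$sort": {"totalLikes": -1}},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)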


Output:
Query returning each message and its comment count; if a message has no comments, return "No Comments" instead
Query:
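The screenshot is lost; one way to express this in pymongo uses $project with $cond (a sketch, assuming a comments.count field):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Emit each message with its comment count, or "No Comments" when zero
pipeline = [
    {"$project": {
        "message": 1,
        "comments": {"$cond": [{"$gt": ["$comments.count", 0]},
                               "$comments.count",
                               "No Comments"]},
    }},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)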
Output
Return selected fields inside a nested structure
For each post, return the message and comments[count, comment_list[text]]

• Query
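A sketch of the nested projection in pymongo, using dotted paths into the comments structure (field names follow the slide's outline):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Select fields inside the nested comments structure
projection = {
    "post_id": 1,
    "message": 1,
    "comments.count": 1,
    "comments.comment_list.text": 1,
    "_id": 0,
}
for doc in db.graph.find({}, projection):
    print(doc)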
Output
Return data with the following structure
{
  "Biggestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "Smallestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "actor" (USER): actor_id (integer)
}
Query
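A hedged reconstruction: sorting by likes first lets $first and $last pick each actor's most- and least-liked message per group; a trailing $project could nest the output into the exact shape above (field names are assumptions):

from pymongo import MongoClient

db = MongoClient("localhost", 27017)["facebook"]

# Sort by likes descending, then take the first/last message per actor
pipeline = [
    {"$sort": {"likes.count": -1}},
    {"$group": {
        "_id": "$actor_id",
        "Biggestlikemessage": {"$first": "$message"},
        "BiggestTotalNoofLikes": {"$first": "$likes.count"},
        "Smallestlikemessage": {"$last": "$message"},
        "SmallestTotalNoofLikes": {"$last": "$likes.count"},
    }},
]
for doc in db.graph.aggregate(pipeline):
    print(doc)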
Output
References
https://developers.facebook.com/
https://spark.apache.org/docs/1.0.2/quick-start.html
https://spark.apache.org/documentation.html
http://docs.mongodb.org/manual/
https://docs.python.org/2/tutorial/
http://www.tutorialspoint.com/mongodb/mongodb_quick_guide.htm
https://www.youtube.com/watch?v=o3u97Vry2bE
https://www.youtube.com/watch?v=bWorBGOFBWY&list=PL-x35fyliRwhKT-NpTKprPW1bkbdDcTTW
Research Paper:
Petabyte Scale Databases and Storage Systems
Deployed at Facebook
Dhruba Borthakur, Engineer at Facebook
Four major types of storage systems at Facebook
• Online Transaction Processing Databases (OLTP)
• The Facebook Social Graph
• Semi-online Light Transaction Processing Databases (SLTP)
• Facebook Messages and Facebook Time Series
• Immutable DataStore
• Photos, videos, etc.
• Analytics DataStore
• Data Warehouse, Logs storage
Size and Scale of Databases
Characteristics
Facebook Social Graph: TAO and MySQL
• An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
• Graph data Model
• Nodes and Edges: Objects and Associations
Data model
• Objects & Associations
• Object -> unique 64-bit ID plus a typed dictionary
 (id) -> (otype, (key -> value)* )
 ID 6815841748 -> {‘type’: page, ‘name’: “John”, … }
• Association -> typed directed edge between 2 IDs
 (id1, atype, id2) -> (time, (key -> value)* )
 (8636146, RSVP, 130855887032173) -> (1327719600, {‘response’: ‘YES’})
• Association lists
 (id1, atype) -> all associations with given id1, atype in desc order by time
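In Python terms, the model sketches out as plain dictionaries (purely illustrative; TAO's actual API is not shown in the slides):

# Object: unique 64-bit id -> (otype, {key: value})
objects = {
    6815841748: ("page", {"name": "John"}),
}

# Association: (id1, atype, id2) -> (time, {key: value})
assocs = {
    (8636146, "RSVP", 130855887032173): (1327719600, {"response": "YES"}),
}

# Association list: all (id1, atype, *) edges, newest first
def assoc_list(id1, atype):
    edges = [kv for kv in assocs.items() if kv[0][0] == id1 and kv[0][1] == atype]
    return sorted(edges, key=lambda kv: kv[1][0], reverse=True)

print(assoc_list(8636146, "RSVP"))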
Architecture: Cache & Storage
Messages & Time Series Database: an SLTP workload
• Facebook Messages:
Why HBase
• High write throughput
• Horizontal scalability
• Automatic Failover
• Strong consistency within a data center
• Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
• Why is this SLTP?
• Semi-online: Queries run even if part of the database is offline
• Lightweight Transactions: single row transactions
• Storage-capacity bound rather than IOPS or CPU bound
What we store in HBase
• Small messages
• Message metadata (thread/message indices)
• Search index
• Large attachments stored in Haystack (photo store)
Size and scale of Messages Database
• 6 Billion messages/day
• 74 Billion operations/day
• At peak: 1.5 million operations/sec
• 55% read, 45% write operations
• Average write operation inserts 16 records
• All data is LZO compressed
• Growing at 8 TB/day
Haystack: The Photo Store
• Facebook Photo DataStore
Hive Analytics Warehouse
• Life of a photo tag in Hadoop/Hive storage
Why Hive?
• Prospecting for gold in the Wild West…
 A platform for huge data-experiments
 A majority of queries are searching for a single gold nugget
 Great advantage in keeping all data in one queryable system
 No structure to data, specify structure at query time
• Crowd Sourcing for data discovery
 There are 50K tables in a single warehouse
 Users are DBAs themselves
 Questions about a table are directed to users of that table
 Automatic query lineage tools help here
Thank You.
