Facebook Wall Data Using Graph API
• Client
• Database
• Collections
• Import Data
• Solution
• Transform the array elements into separate documents
• Each document gets its own index
• Approaches to transform the data into individual documents
• Script (Python, PHP)
• Use the batch insert functionality of MongoDB
• Before applying the transformation strategy
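The script approach can be sketched in Python as follows. This is a minimal illustration, not the deck's actual script: it assumes the imported wall feed is one document whose posts sit in a "data" array, and the helper names `split_feed` and `batch_insert` as well as the `index` field are hypothetical. The `collection` argument is expected to be a pymongo collection, whose `insert_many` method performs the batch insert.

```python
def split_feed(feed):
    """Turn the "data" array of a wall-feed document into separate documents."""
    documents = []
    for index, post in enumerate(feed.get("data", [])):
        doc = dict(post)       # shallow copy so the input is left untouched
        doc["index"] = index   # each document gets its own index (assumed field name)
        documents.append(doc)
    return documents


def batch_insert(collection, documents, batch_size=1000):
    """Send documents to MongoDB in batches; returns how many were sent."""
    sent = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.insert_many(batch)  # one round trip per batch
        sent += len(batch)
    return sent
```

With pymongo installed, `collection` would typically come from something like `MongoClient()["some_db"]["some_collection"]`, where the database and collection names are placeholders.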
• Query
Output
Return data with following Structure
{
  "Biggestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "Smallestlikemessage": {
    "message": "test message",
    "TotalNoofLikes": NoofLikes (integer)
  },
  "actor" (USER): actor_id (integer)
}
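One way to build a result of this shape in a script, once the posts are individual documents, is sketched below. The input field names `actor_id`, `message` and `likes.count` are assumptions about how the wall posts are stored, and `like_extremes` is a hypothetical helper, not the query used in the deck.

```python
def like_extremes(posts, actor_id):
    """Return the most-liked and least-liked message for one actor."""
    own = [p for p in posts if p.get("actor_id") == actor_id]
    if not own:
        return None  # this actor has no posts

    def likes(post):
        # Posts without a likes sub-document count as zero likes.
        return post.get("likes", {}).get("count", 0)

    biggest = max(own, key=likes)
    smallest = min(own, key=likes)
    return {
        "Biggestlikemessage": {
            "message": biggest.get("message"),
            "TotalNoofLikes": likes(biggest),
        },
        "Smallestlikemessage": {
            "message": smallest.get("message"),
            "TotalNoofLikes": likes(smallest),
        },
        "actor": actor_id,
    }
```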
Query
Output
References
https://developers.facebook.com/
https://spark.apache.org/docs/1.0.2/quick-start.html
https://spark.apache.org/documentation.html
http://docs.mongodb.org/manual/
https://docs.python.org/2/tutorial/
http://www.tutorialspoint.com/mongodb/mongodb_quick_guide.htm
https://www.youtube.com/watch?v=o3u97Vry2bE
https://www.youtube.com/watch?v=bWorBGOFBWY&list=PL-x35fyliRwhKT-NpTKprPW1bkbdDcTTW
Research Paper:
Petabyte Scale Databases and Storage Systems Deployed at Facebook
Dhruba Borthakur, Engineer at Facebook
Four major types of storage systems in Facebook
• Online Transaction Processing Databases (OLTP)
• The Facebook Social Graph
• Semi-online Light Transaction Processing Databases (SLTP)
• Facebook Messages and Facebook Time Series
• Immutable DataStore
• Photos, videos, etc.
• Analytics DataStore
• Data Warehouse, Logs storage
Size and Scale of Databases
Characteristics
Facebook Social Graph: TAO and MySQL
• An OLTP workload:
▪ Uneven, read-heavy workload
▪ Huge working set with creation-time locality
▪ Highly interconnected data
▪ Constantly evolving
▪ As consistent as possible
• Graph data Model
• Nodes and Edges: Objects and Associations
Data model
• Objects & Associations
• Object -> unique 64-bit ID plus a typed dictionary
(id) -> (otype, (key -> value)* )
ID 6815841748 -> {‘type’: page, ‘name’: “John”, … }
• Association -> typed directed edge between 2 IDs
(id1, atype, id2) -> (time, (key -> value)* )
(8636146, RSVP, 130855887032173) -> (1327719600, {‘response’: ‘YES’})
• Association lists
(id1, atype) -> all associations with given id1, atype in desc order by time
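The three mappings above can be mirrored by a small in-memory sketch. This only illustrates the data model; the class `GraphStore` and its method names are hypothetical and say nothing about how TAO is actually implemented or cached.

```python
class GraphStore:
    """Toy in-memory version of the objects-and-associations model."""

    def __init__(self):
        self.objects = {}  # id -> (otype, {key: value})
        self.assocs = {}   # (id1, atype) -> list of (time, id2, {key: value})

    def add_object(self, oid, otype, data):
        self.objects[oid] = (otype, dict(data))

    def add_assoc(self, id1, atype, id2, time, data):
        # A typed, directed edge from id1 to id2.
        self.assocs.setdefault((id1, atype), []).append((time, id2, dict(data)))

    def assoc_list(self, id1, atype):
        # All associations for (id1, atype), newest first.
        edges = self.assocs.get((id1, atype), [])
        return sorted(edges, key=lambda edge: edge[0], reverse=True)


store = GraphStore()
store.add_object(6815841748, "page", {"name": "John"})
store.add_assoc(8636146, "RSVP", 130855887032173, 1327719600, {"response": "YES"})
```

Keeping the list keyed by `(id1, atype)` means an association-list lookup touches a single key, which matches the descending-by-time access pattern described above.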
Architecture: Cache & Storage
Messages & Time Series Database: SLTP workload
• Facebook Messages:
Why HBase
• High write throughput
• Horizontal scalability
• Automatic Failover
• Strong consistency within a data center
• Benefits of HDFS: fault tolerant, scalable, Map-Reduce toolset
• Why is this SLTP?
• Semi-online: Queries run even if part of the database is offline
• Lightweight Transactions: single row transactions
• Storage capacity bound rather than iops or cpu bound
What we store in HBase
• Small messages
• Message metadata (thread/message indices)
• Search index
• Large attachments stored in Haystack (photo store)
Size and scale of Messages Database
• 6 Billion messages/day
• 74 Billion operations/day
• At peak: 1.5 million operations/sec
• 55% read, 45% write operations
• Average write operation inserts 16 records
• All data is LZO-compressed
• Growing at 8 TB/day
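To put the daily totals above on a per-second footing, the arithmetic below derives an average rate and compares it with the stated peak. Only the input figures come from the slide; the derived values are back-of-the-envelope estimates, and the records-per-day figure assumes the 16-record average applies to every write.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

ops_per_day = 74_000_000_000              # 74 billion operations/day
peak_ops_per_sec = 1_500_000              # stated peak
write_percent = 45                        # 45% of operations are writes
records_per_write = 16                    # average records per write

avg_ops_per_sec = ops_per_day // SECONDS_PER_DAY
peak_to_avg = peak_ops_per_sec / avg_ops_per_sec
writes_per_day = ops_per_day * write_percent // 100
records_per_day = writes_per_day * records_per_write

print(avg_ops_per_sec)    # average operations per second
print(peak_to_avg)        # how far the peak sits above the average
print(records_per_day)    # estimated records inserted per day
```

The average works out to roughly 856 thousand operations per second, so the 1.5 million peak is a little under twice the average load.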
Haystack: The Photo Store
• Facebook Photo DataStore
Hive Analytics Warehouse
• Life of a photo tag in Hadoop/Hive storage
Why Hive?
• Prospecting for gold in the Wild West…
A platform for huge data-experiments
A majority of queries are searching for a single gold nugget
Great advantage in keeping all data in one queryable system
No structure to data, specify structure at query time
• Crowd Sourcing for data discovery
There are 50K tables in a single warehouse
Users are DBAs themselves
Questions about a table are directed to users of that table
Automatic query lineage tools help here
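The point about specifying structure at query time (often called schema-on-read) can be illustrated with a short sketch. This is plain Python rather than Hive syntax, and the tab-separated layout, the column names and the `read_with_schema` helper are all hypothetical.

```python
def read_with_schema(raw_lines, schema, delimiter="\t"):
    """Apply a schema to raw text lines only when they are read."""
    for line in raw_lines:
        values = line.rstrip("\n").split(delimiter)
        # Pair each raw value with a (column name, type) entry from the schema.
        yield {name: cast(value) for (name, cast), value in zip(schema, values)}


# The stored data carries no structure of its own: just delimited text.
raw_lines = [
    "8636146\t130855887032173\t1327719600\n",
    "8636147\t130855887032173\t1327719660\n",
]

# The same lines could be read later with a different schema.
tag_schema = [("user_id", int), ("photo_id", int), ("time", int)]
rows = list(read_with_schema(raw_lines, tag_schema))
```

Because the schema lives with the query rather than with the stored data, a different user can reinterpret the same lines without rewriting them.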
Thank You.