
PubSubHubbub for Developers

Brett Slatkin
Software Engineer
Google Inc.
September 28, 2009
Agenda
• Background
• Intro
• Motivation
• Scale
• Progress
Background
Why do real-time messaging?
• Syndication
o Creating a "flow"
o Simultaneous delivery of an event spurs immediate
conversation
o More participation enables more developed
conversations, better exchanging of ideas
o Cross-site allows promotion, linking, swarming around
sources, mash-ups, growth opportunity
Why do real-time messaging?
• Business, politics
o 1 minute of delay could cost a company millions, cause a
political scandal, be harmful to investors, etc
o Concrete example: SEC earnings requirements
Why do real-time messaging?
• Future applications (out of scope, but ...)
o Financial data
o Public scientific measurements (e.g., stream of weather
data, traffic status, polling, votes)
o Sensor networks
o Emergency information distribution
o Anything you can think of that's a stream of information!
Why do decentralized messaging?
• Web was built on decentralized protocols
• No single point of failure
• Interoperability is key to network effects and growth
• One API for application developers
Intro
What is PubSubHubbub?

• A simple publish/subscribe protocol


• Turns Atom and RSS feeds into real-time streams
• Web-scale, low-latency messaging
• Three participants: Publishers, Subscribers, and Hubs

Publisher → Hub → Subscriber

Design goals of PubSubHubbub

• Decentralized: No one company in control


• Scale to the size of the whole web
• Publishing and subscribing as easy as possible
• Complexity in the Hub
• Pragmatic (i.e., not theoretically perfect, but solve huge,
known use cases with minimal effort)
 
 
How-to for Publishers
1. Add a declaration in your feed with your Hub of choice
<link rel="hub"
href="https://pubsubhubbub.appspot.com/"/>
 
2. Add something to your feed!

3. Send a ping to the Hub with the feed URL


POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
...

hub.mode=publish&hub.url=<your feed>

4. 204 response = Success, 4xx = Bad request, 5xx = Try again


How-to for Subscribers
1. Detect the Hub declaration in a feed
 
2. Send a subscribe request to the feed's Hub
POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded
...

hub.mode=subscribe&hub.verify=sync&
hub.topic=<feed URL>&hub.callback=<callback URL>

3. Hub will send a request to verify the subscription


GET /callback?hub.challenge=<random> HTTP/1.1

HTTP/1.1 200
...
<echo random>
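The subscriber's side of this handshake is just an echo. A minimal sketch, with the HTTP serving machinery omitted and `handle_verification` an invented name:

```python
from urllib.parse import parse_qs, urlparse

def handle_verification(path_with_query):
    # e.g. "/callback?hub.mode=subscribe&hub.challenge=abc123"
    params = parse_qs(urlparse(path_with_query).query)
    challenge = params.get("hub.challenge", [None])[0]
    if challenge is None:
        return 404, ""       # not a verification request we recognize
    return 200, challenge    # body must echo the challenge verbatim
```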
How-to for Subscribers
Process new content from the Hub
POST /callback HTTP/1.1
Content-Type: application/atom+xml
...

<?xml version="1.0" encoding="utf-8"?>


<feed xmlns="http://www.w3.org/2005/Atom">
<title>Awesome feed</title>
<link rel="hub"
href="http://pubsubhubbub.appspot.com"/>
...
<entry>
...
</entry>
</feed>
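Extracting the pushed entries is ordinary Atom parsing; a sketch with the standard library, where `new_entries` is an invented helper returning entry titles:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(atom_bytes):
    # Each notification carries only the new/updated <entry> elements.
    feed = ET.fromstring(atom_bytes)
    return [e.findtext(ATOM + "title") for e in feed.iter(ATOM + "entry")]
```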
The role of the Hub
• Logical component
o Publishers may be their own Hub
o Combined Hub/Publisher has p2p speed-up

• Distinct functions
o Accept and verify subscriptions to new topics
o Receive pings from publishers, retrieve content
o Extract new/updated items from feed
o Send all subscribers the new content
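The "extract new/updated items" step is typically a content-hash comparison against what the Hub last saw. A sketch, assuming a plain dict stands in for persistent storage (a real hub would persist these records):

```python
import hashlib

def changed_entries(entries, seen):
    """entries: (entry_id, content) pairs from the freshly fetched feed;
    seen: dict mapping entry_id -> content hash from the previous fetch."""
    out = []
    for entry_id, content in entries:
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if seen.get(entry_id) != digest:   # new entry, or updated content
            seen[entry_id] = digest
            out.append((entry_id, content))
    return out
```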
The role of the Hub
• Scalability
o # of subscribers & feeds, update frequency
o Delegation of content distribution (= bandwidth)

• Reliability
o Retry fetch, delivery, idempotence
How the hub works

See my talk on building a hub using App Engine


 
http://tinyurl.com/building-a-hub
Security model
• Subscriber verification prevents DoS attacks
 
• Declaration of the Hub is a delegation of trust
o Subscribers may trust the Hub to deliver content on
publisher's behalf
o v0.2 supports shared-secret HMACs for subscribers to
verify that notifications came from the hub
 
• Privacy through HTTPS for hubs, feeds, and callbacks
o URLs and payloads can be sent via encrypted channel
o Subscribed topics are not discoverable
o Unguessable, capability URLs (e.g., from OAuth)
 
• Publishers can run their own hub!
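A subscriber-side sketch of the v0.2 shared-secret check, assuming the hub sends an HMAC-SHA1 of the request body in an `X-Hub-Signature` header of the form `sha1=<hexdigest>` (`signature_valid` is an invented name):

```python
import hashlib
import hmac

def signature_valid(secret, body, header_value):
    # secret and body are bytes; header_value is e.g. "sha1=0a1b2c..."
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # Constant-time comparison avoids leaking the digest via timing.
    return hmac.compare_digest(expected, header_value)
```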
Motivation
Push it to the limit
Why push content?

Learn from our forefathers.

TCP
(est. 1974)
Push it to the limit
What is magical about TCP? The Window.
Push it to the limit
Without the window, the tube can't be full.
Push it to the limit
TCP maximizes the throughput of a link
• Dump data in, it will be received
• The window means no waiting for acks!
• When acks are missed, the sender will retransmit
• Receivers reassemble the message in-order, de-dupe
• Good citizenship with congestion control
Push it to the limit
Where is such efficiency for application-level protocols?
• Exists, but often proprietary or an interoperability
nightmare

(cough SOAP cough)
Why another protocol?
• We want interoperable, web-scale messaging

• Almost every company already has an internal system


o TIBCO, WebSphere MQ, ActiveMQ, RabbitMQ, ...
o Proprietary message payloads, topics, networks  
 
• Existing attempts at a standard haven't caught on
o XMPP weirds people out; started in 1999, it still isn't
widely used for interop beyond IM
o These standards are too complex or not pragmatic (XEP-
0060, WS-*, AMQP, RestMS, new REST-*)
 
Why another protocol?
• Build the simplest interoperable messaging protocol that
can scale to the size of the web
• Make the base specification bare-bones, easy-to-use
• Target Atom/RSS initially as a payload format; everyone
uses them for time-based, idempotent streams
• In the future, add extensions for cool stuff
 
Why another protocol?
• Proof of simplicity is in the code
o Bret Taylor added PubSubHubbub subscription to
FriendFeed in a single evening
Scale
Goal
• World-wide RSS publishing currently
o ~X,000 updates per second
• Legitimate email currently
o ~X,000,000 per second
 
• Need to scale by at least 1000x; hopefully more
 
• Trying to enable new use-cases
Light pinging
• Protocols exist for faster Atom/RSS
o Ping-o-Matic, changes.xml, SUP, rssCloud
• All only indicate the feed URL that has changed
o Still need to go and fetch the content
o These protocols are just optimized polling
o Equivalent to killing the TCP window!
 
Light pinging
• Optimized polling is still worse
o Latency is high: 3 round trips
o Thundering herd as subscribers fetch published feeds
 Unpredictable, bursty load pattern
o More bandwidth, CPU, connection star-pattern
 
Light pinging at scale
What if you had to use light pinging at scale?
• Send out pings slowly to reduce the herd
 
• Herd causes all feeds to be fully regenerated
o Invalidates existing caches
 
• Bandwidth increases extremely fast
o (average updates per feed) * (# feeds) * (# subscribers) *
(average feed size)
o Often 99.5%+ more than you needed

• CPU costs increase for subscribers with update frequency
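Plugging hypothetical numbers into the bandwidth formula above shows where the 99.5%+ figure comes from (all quantities here are invented for illustration):

```python
# Hypothetical: 10,000 feeds, 100 subscribers each, a ~200KB full feed
# in which only one ~1KB entry actually changed.
updates_per_feed = 1
num_feeds = 10_000
num_subscribers = 100
feed_size = 200 * 1024        # the whole feed every subscriber re-fetches
entry_size = 1 * 1024         # the one new entry they actually needed

light_bytes = updates_per_feed * num_feeds * num_subscribers * feed_size
needed_bytes = updates_per_feed * num_feeds * num_subscribers * entry_size
waste = 1 - needed_bytes / light_bytes
print(f"{waste:.1%}")  # → 99.5%
```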


Light pinging at scale
Consider a single-master replication scheme
• After each update, wait for copying to all replicas
 
Fat pinging
Compared to light pings
• Latency: 1/3 as much
• Based on reasonable averages
o Bandwidth: ~20x less
o CPU: ~20x less
• Never wait for replication delays
Fat pinging at scale
What if you had to scale fat pinging? 
• Run your own hub
 
• Compute feed deltas at update time; no need to
regenerate a whole feed (or churn your caches)
 
• Send out new content at sustained network rate
 
• Bandwidth is the minimum possible per subscriber
o (update size) * (# feeds) * (# subscribers)

• CPU cost is the minimum possible per subscriber
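Comparing the two bandwidth formulas with hypothetical sizes (a ~5KB pushed update vs. a ~100KB full-feed re-fetch) reproduces the ~20x figure cited earlier:

```python
# Hypothetical sizes over the same 10,000 feeds with 100 subscribers each.
update_size = 5 * 1024
feed_size = 100 * 1024
num_feeds = 10_000
num_subscribers = 100

fat_bytes = update_size * num_feeds * num_subscribers
light_bytes = feed_size * num_feeds * num_subscribers
print(light_bytes // fat_bytes)  # → 20
```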


 
Advanced protocol pieces
• Connection reuse from HTTP/1.1
• Pipeline HTTP requests for feed fetching
• Use aggregated content delivery
o Many Atom feeds in a single <feed> XML doc
o Fewer connections
Progress
PubSubHubbub status
• Over 100 Million feeds are PubSubHubbub-enabled
• Companies: Google, FriendFeed (FB), livedoor, Six Apart,
LiveJournal, LazyFeed, Superfeedr, ...
• Google products: FeedBurner, Blogger, Reader shared
items, Google Alerts, ...
• Cool apps: Socnode, Reader2Twitter, chat gateways, ...
 
• More publishers, subscribers, hubs, apps on the way
• Publisher clients: Perl, PHP, Python, Ruby, Java, Haskell,
C#, MovableType, WordPress, Django, Zend
• Active mailing list with 240+ members
 
 
 
Getting involved
• Review the spec; recommend improvements
o Open process, will be licensed by Open Web Foundation
• Write some sample code for your favorite language or CMS
• Contribute to one of the open source Hub implementations
• Write on your blog about why we need push for the future
o Do it for the children
 
 
 
What Facebook can do right now
• Subscribe to feeds that are PubSubHubbub-enabled
o Put that great UI to work
o Maybe reuse the FriendFeed index pipeline?
o Call Bret and Ben
 
• Enable PubSubHubbub for activity streams
o Provide Facebook app developers with real-time updates
to users' home streams
o Speeds up surfacing Facebook in other apps
o Detecting new events could trigger the app to take
action in real-time (send an email, classify a photo,
initiate an action in a game, etc)
 
What Facebook can do next
• Figure out if private feeds will work with this model
o Run your own hub
o Use capability URLs (OAuth token in the query string)
 
• Give your developers more feeds to consume and syndicate
 
Rehash
• Push for the future! Scale to new use-cases
• Decentralized, open spec: no company owns it
• One API for all stream-based content
 
Rehash
• Project page: http://pubsubhubbub.googlecode.com
o Full Hub source code with tests
o Example publisher and subscriber apps
o Demo hub at http://pubsubhubbub.appspot.com
?
Hub storage space
• How much storage space does a Hub need?
o Manageable costs
 ~10 million feeds
 ~1 million subscribers
o Assume 1 billion events per day (~11,000/second)
 Thar be dragons!
Hub storage space
FeedEntryRecord
• Key name
o "FeedEntryRecord" + entry_id_hash + parent key
o 400 bytes, could be smaller
• Indexed properties
o Entry ID hash (again-- doh!): 160 bytes
o Entry content hash: 160 bytes
o Update time: 8 bytes
• Unindexed properties
o Entry ID: 2048 bytes maximum, 200 on average
 
Result
• ~1KB per entry 
• 27TB per month at ~11,000 req/sec -- no sweat!
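The arithmetic behind that estimate, using the slide's own figures (1KB per entry, ~11,000 entries/second, a 30-day month, binary terabytes):

```python
entry_bytes = 1024            # ~1KB per FeedEntryRecord
rate = 11_000                 # entries stored per second
seconds_per_month = 86_400 * 30
tib = entry_bytes * rate * seconds_per_month / 2**40
print(round(tib))  # → 27
```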
WebFinger
Unified discovery for email addresses
• Transform an email address into XRD
• XRD defines all the services that address has
• Helps provide social networking as a protocol
• E.g., Simple way to discover if an account has a Portable
Contacts interface