Event Platform/EventStreams HTTP Service

From Wikitech
{{Navigation Event Platform}}
[[File:EventStreams Example.png|thumb|500px|link=https://codepen.io/ottomata/pen/LYpPpxj?editors=1010|Example client at [https://codepen.io/ottomata/pen/LYpPpxj?editors=1010 codepen.io/ottomata/pen/LYpPpxj].]]
[[File:EventStreams Example2.jpg|thumb|500x500px|link=https://codepen.io/Krinkle/pen/BwEKgW?editors=1010|RecentChange stats tool, built with EventStreams – at [https://codepen.io/Krinkle/pen/BwEKgW?editors=1010 https://codepen.io/Krinkle/pen/BwEKgW].]]
'''EventStreams''' is a web service that exposes continuous streams of structured event data. It does so over HTTP using [https://en.wikipedia.org/wiki/Chunked_transfer_encoding chunked transfer encoding] following the [[:en:Server-sent_events|Server-Sent Events]] protocol (SSE). EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.


The service supersedes [[Obsolete:RCStream|RCStream]], and might in the future replace [[irc.wikimedia.org]]. EventStreams is internally backed by [https://kafka.apache.org/ Apache Kafka].

''Note: <code>SSE</code> and <code>EventSource</code> are often used interchangeably as the names of this web technology. This document refers to SSE as the server-side protocol, and EventSource as the client-side interface.''

== Streams ==
EventStreams provides access to several different data streams, most notably the <code>recentchange</code> stream which emits [[mw:Manual:RCFeed|MediaWiki Recent changes]] events.

For a complete list of available streams, refer to the documentation at https://stream.wikimedia.org/?doc#/streams.

The data format of each stream follows a schema. The schemas can be obtained via https://schema.wikimedia.org/#!/primary/jsonschema, for example [https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/recentchange/latest.yaml jsonschema/mediawiki/recentchange/latest.yaml].
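The schema repository URLs follow a predictable layout. As an illustration, a tiny hypothetical helper (not part of any client library) that builds the URL of a schema file:

```python
BASE = 'https://schema.wikimedia.org/repositories/primary/jsonschema'

def schema_url(schema_path, version='latest'):
    """Build the URL of a schema file in the primary jsonschema repository."""
    return f"{BASE}/{schema_path}/{version}.yaml"

print(schema_url('mediawiki/recentchange'))
# https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/recentchange/latest.yaml
```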

For the <code>recentchange</code> stream there is additional documentation at [[mw:Manual:RCFeed|Manual:RCFeed on mediawiki.org]].

== When not to use EventStreams ==

The public EventStreams service is intended for use by small-scale external tool developers. It should not be used to build production services within the Wikimedia Foundation. WMF production services that react to events should directly consume the underlying Kafka topic(s).


== Examples ==

=== Web browser ===
Use the built-in [https://developer.mozilla.org/en-US/docs/Web/API/EventSource EventSource API] in modern browsers:<syntaxhighlight lang="javascript">
const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const eventSource = new EventSource(url);

eventSource.onopen = () => {
    console.info('Opened connection.');
};
eventSource.onerror = (event) => {
    console.error('Encountered error', event);
};
eventSource.onmessage = (event) => {
    // event.data will be a JSON message
    const data = JSON.parse(event.data);
    // discard all canary events
    if (data.meta.domain === 'canary') {
        return;
    }
    // Edits from English Wikipedia
    if (data.server_name === 'en.wikipedia.org') {
        // Output the title of the edited page
        console.log(data.title);
    }
};
</syntaxhighlight>


=== JavaScript ===


==== Node.js ESM (with [https://www.npmjs.com/package/wikimedia-streams wikimedia-streams]) ====
<syntaxhighlight lang="javascript">
import WikimediaStream from 'wikimedia-streams';

// 'recentchange' can be replaced with another stream topic
const stream = new WikimediaStream('recentchange');

stream.on('open', () => {
    console.info('Opened connection.');
});
stream.on('error', (event) => {
    console.error('Encountered error', event);
});
stream
    .filter("mediawiki.recentchange")
    .all({ wiki: "enwiki" }) // Edits from English Wikipedia
    .on('recentchange', (data, event) => {
        // Output page title
        console.log(data.title);
    });
</syntaxhighlight>


==== Node.js (with [https://github.com/aslakhellesoy/eventsource eventsource]) ====
<syntaxhighlight lang="javascript">
const EventSource = require('eventsource');

const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const eventSource = new EventSource(url);

eventSource.onopen = () => {
    console.info('Opened connection.');
};
eventSource.onerror = (event) => {
    console.error('Encountered error', event);
};
eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // discard canary events
    if (data.meta.domain === 'canary') {
        return;
    }
    if (data.server_name === 'en.wikipedia.org') {
        // Output the page title
        console.log(data.title);
    }
};
</syntaxhighlight>


Server side filtering is not supported. You can filter client-side instead, for example to listen for changes to a specific wiki only:


<syntaxhighlight lang="javascript">
var wiki = 'commonswiki';
eventSource.onmessage = function(event) {
    // event.data will be a JSON string containing the message event.
    var change = JSON.parse(event.data);
    // discard canary events
    if (change.meta.domain === 'canary') {
        return;
    }
    if (change.wiki == wiki)
        console.log(`Got commons wiki change on page ${change.title}`);
};
</syntaxhighlight>


=== TypeScript ===

==== Node.js (with [https://www.npmjs.com/package/wikimedia-streams wikimedia-streams]) ====
<syntaxhighlight lang="typescript">
import WikimediaStream from "wikimedia-streams";
import MediaWikiRecentChangeEvent from 'wikimedia-streams/build/streams/MediaWikiRecentChangeEvent';

// "recentchange" can be replaced with any valid stream
const stream = new WikimediaStream("recentchange");

stream
    .filter("mediawiki.recentchange")
    .all({ wiki: "enwiki" }) // Edits from English Wikipedia
    .on('recentchange', (data /* MediaWikiRecentChangeEvent & { wiki: 'enwiki' } */, event) => {
        // Output page title
        console.log(data.title);
    });
</syntaxhighlight>

=== Python ===
Using [https://pypi.python.org/pypi/sseclient/ python-sseclient]. There is also a more asynchronous-friendly version called [https://github.com/ebraminio/aiosseclient aiosseclient] (requires Python 3.6+).


<syntaxhighlight lang="python">
import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            # discard canary events
            if change['meta']['domain'] == 'canary':
                continue
            print('{user} edited {title}'.format(**change))
</syntaxhighlight>


The standard SSE protocol defines ways to continue where you left off after a failure or other disconnect. EventStreams supports this as well. For example:<syntaxhighlight lang="python">
import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
for event in EventSource(url, last_id=None):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            # discard canary events
            if change['meta']['domain'] == 'canary':
                continue
            if change['user'] == 'Yourname':
                print(change)
                print(event.id)

# - Run this Python script.
# - Publish an edit to [[Sandbox]] on test.wikipedia.org, and observe it getting printed.
# - Quit the Python process.
# - Change last_id=None to last_id='[{"topic":"…"},{…}]', as taken from the last printed line.
# - Publish another edit, while the Python process remains off.
# - Run this Python script again, and notice it finding and printing the missed edit.
</syntaxhighlight>

Server-side filtering is not supported. To filter for something like a wiki domain, you'll need to do this on the consumer side. For example:
<syntaxhighlight lang="python">
wiki = 'commonswiki'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            continue
        # discard canary events
        if change['meta']['domain'] == 'canary':
            continue
        if change['wiki'] == wiki:
            print('{user} edited {title}'.format(**change))
</syntaxhighlight>


[[mw:Manual:Pywikibot|Pywikibot]] is another way to consume EventStreams in Python. It provides an abstraction that takes care of automatic reconnection, easy filtering, and combination of multiple topics into one stream. For example:<syntaxhighlight lang="python">
>>> from pywikibot.comms.eventstreams import EventStreams
>>> stream = EventStreams(streams=['recentchange', 'revision-create'],
                          since='20190111')
>>> stream.register_filter(server_name='fr.wikipedia.org', type='edit')
>>> change = next(iter(stream))
>>> print('{type} on page "{title}" by "{user}" at {meta[dt]}.'.format(**change))
edit on page "Véronique Le Guen" by "Speculos" at 2019-01-12T21:19:43+00:00.
</syntaxhighlight>


=== Command-line ===
With [https://linux.die.net/man/1/curl curl] and [https://stedolan.github.io/jq/manual/ jq]: set the <tt>Accept</tt> header and prettify the events with jq.

<syntaxhighlight lang="bash">
curl -s -H 'Accept: application/json' https://stream.wikimedia.org/v2/stream/recentchange | jq .
</syntaxhighlight>

Setting the <tt>Accept: application/json</tt> header will cause EventStreams to send you newline-delimited JSON objects, rather than data in the SSE format.
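Because each line of such a response is a complete JSON object, it can be parsed without an SSE client. A minimal sketch (the streaming part uses the third-party <code>requests</code> library and is shown as a comment, since it needs a network connection):

```python
import json

def parse_ndjson_lines(lines):
    """Parse an iterable of newline-delimited JSON lines, skipping blanks."""
    for line in lines:
        if line and line.strip():
            yield json.loads(line)

# Streaming usage with requests (third-party; network required):
#
#   import requests
#   url = 'https://stream.wikimedia.org/v2/stream/recentchange'
#   with requests.get(url, headers={'Accept': 'application/json'}, stream=True) as resp:
#       for event in parse_ndjson_lines(resp.iter_lines(decode_unicode=True)):
#           print(event.get('title'))
```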


== API ==
The list of available streams will change over time, so they are not documented here. To see the active list, visit the [https://stream.wikimedia.org/?doc swagger-ui documentation], or request the swagger spec directly from https://stream.wikimedia.org/?spec. The available stream URI paths all begin with <code>/v2/stream</code>. An excerpt from the spec:

<syntaxhighlight lang="json">
...
            "text/event-stream; charset=utf-8"
        ],
        "description": "Mediawiki RecentChanges feed. Schema: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/recentchange"
    }
},
...
            "text/event-stream; charset=utf-8"
        ],
        "description": "Mediawiki Revision Create feed. Schema: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/revision/create"
    }
}
...
</syntaxhighlight>
=== Stream selection ===
Streams are addressable either individually, e.g. <code>/v2/stream/revision-create</code>, or as a comma separated list of streams to compose, e.g. <code>/v2/stream/page-create,page-delete,page-undelete</code>.



See available streams: https://stream.wikimedia.org/?doc


=== Historical Consumption ===
Since 2018-06, EventStreams supports timestamp based historical consumption. This can be provided as individual assignment objects in the <code>Last-Event-ID</code> by setting a timestamp field instead of an offset field. Or, more simply, a <code>since</code> query parameter can be provided in the stream URL, e.g. <code>since=2018-06-14T00:00:00Z</code>. <code>since</code> can be given as anything parseable by JavaScript <code>Date.parse()</code>, e.g. a UTC [https://en.m.wikipedia.org/wiki/ISO_8601 ISO-8601] datetime string.

When given a timestamp, EventStreams will ask Kafka for the message offset in the stream(s) that most closely matches the timestamp. Kafka guarantees that all events after the returned message offset will be after the given timestamp. NOTE: The stream history is not kept indefinitely. Depending on the stream configuration, there will likely be between 7 and 31 days of history available. Please be kind when providing timestamps. There may be a lot of historical data available, and reading and sending it all can be compute resource intensive. Please only consume the minimum of data you need.
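For example, building a stream URL with a <code>since</code> parameter (a hypothetical helper for illustration; the composed-stream path style follows the Stream selection section above):

```python
from urllib.parse import urlencode

BASE = 'https://stream.wikimedia.org/v2/stream'

def stream_url(streams, since=None):
    """Build an EventStreams URL for one or more streams, optionally
    starting historical consumption at a 'since' timestamp."""
    url = BASE + '/' + ','.join(streams)
    if since is not None:
        url += '?' + urlencode({'since': since})
    return url

print(stream_url(['page-create', 'page-delete'], since='2018-06-14T00:00:00Z'))
```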


You may request that EventStreams begin streaming to you from different offsets by setting an array of topic, partition, offset|timestamp objects in the <tt>Last-Event-ID</tt> HTTP header.
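For instance, a sketch of serializing such an assignment array into a header value (the topic name here is a placeholder for illustration; real assignments are visible in the <code>id</code> fields of received events):

```python
import json

def last_event_id(assignments):
    """Serialize a list of {topic, partition, offset|timestamp} dicts
    into a Last-Event-ID header value."""
    return json.dumps(assignments, separators=(',', ':'))

# Placeholder topic name, for illustration only:
header = last_event_id([
    {'topic': 'example.mediawiki.recentchange', 'partition': 0, 'timestamp': 1528934400000},
])
print(header)
```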

=== Canary Events ===
WMF's [[Data Engineering|Data Engineering team]] [[Data Engineering/Systems/Hadoop Event Ingestion Lifecycle|produces artificial 'canary' events]] into each stream multiple times an hour. The presence of these canary events in a stream allows us to differentiate between a broken event stream and an empty one. If a stream has <code>canary_events_enabled=true</code>, then we should expect at least one event in the stream's Kafka topics every hour. If we get no events in an hour, then we can trigger an alert that the stream is broken.

These events are not filtered out in the streams available at [[stream.wikimedia.org]]. As a user of these streams, you should discard all canary events; i.e. all events where <code>meta.domain === 'canary'</code>.

{{Note
| type = warn
| text = <b>If you are not using canary events for alerting, discard them! </b> Discard all events where <pre>meta.domain === 'canary'</pre>
}}The content of most canary event fields is copied directly from the first example event in the event's schema. E.g. [https://github.com/wikimedia/schemas-event-primary/blob/master/jsonschema/mediawiki/recentchange/1.0.1.yaml#L159 mediawiki/recentchange example], [https://github.com/wikimedia/schemas-event-primary/blob/master/jsonschema/mediawiki/revision/create/2.0.0.yaml#L288 mediawiki/revision/create example]. These examples can also be seen in the [https://stream.wikimedia.org/?doc#/streams OpenAPI docs for the streams], e.g. [https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_page_move mediawiki.page-move example value]. The code that creates canary events can be found [[gerrit:plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master/eventutilities/src/main/java/org/wikimedia/eventutilities/monitoring/CanaryEventProducer.java#118|here]] (as of 2023-11).
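A minimal predicate for this check, matching the Python examples above (a sketch; assumes events are decoded into dicts carrying a <code>meta</code> object, as in the stream schemas):

```python
def is_canary(event):
    """True for artificial canary events, which consumers should discard."""
    return event.get('meta', {}).get('domain') == 'canary'

def without_canaries(events):
    """Drop canary events from an iterable of decoded events."""
    return (e for e in events if not is_canary(e))

sample = [
    {'meta': {'domain': 'canary'}, 'title': 'ignored'},
    {'meta': {'domain': 'en.wikipedia.org'}, 'title': 'kept'},
]
print([e['title'] for e in without_canaries(sample)])
# ['kept']
```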


=== Filtering ===


=== SSE vs. WebSockets/Socket.IO ===
The previous "RCStream" service was written for consumption via Socket.IO, so why did we change the protocol for its replacement?


The WebSocket protocol doesn't use HTTP, which makes it different from most other services we run at Wikimedia Foundation. WebSockets are powerful and can e.g. let clients and servers communicate asynchronously with a bi-directional pipe. EventStreams, on the other hand, is read-only and only needs to send events from the server to a client. By using only 100% standard HTTP, EventStreams can be consumed from any HTTP client out there, without the need for programming several RPC-like initialization steps.


We originally prototyped a Kafka -> Socket.io library ([https://github.com/wikimedia/kasocki Kasocki]). After doing so we decided that HTTP-SSE was a better fit, and developed [https://github.com/wikimedia/kafkasse KafkaSSE] instead.


=== KafkaSSE ===
WMF maintains several internal [[Kafka]] clusters, producing hundreds of thousands of messages per second. It has proved to be highly scalable and featureful, supporting multiple producers and consumers. Our internal events are already produced through Kafka, so using it as the EventStreams backend was a natural choice.


Kafka allows us to begin consuming from any message offset (that is still present on the backend Kafka cluster). This feature is what allows connected EventStreams clients to auto-resume (via EventSource) when they disconnect.


== Notes ==

=== Server side enforced timeout ===
WMF's HTTP connection termination layer enforces a connection timeout of 15 minutes. A good SSE / EventSource client should be able to automatically reconnect and begin consuming at the right location using the Last-Event-ID header.

See [https://phabricator.wikimedia.org/T242767#6202636 this Phabricator discussion] for more info.
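A sketch of such a reconnect loop (hypothetical helper; with python-sseclient, <code>connect</code> could be <code>lambda last_id: ((e.id, e.data) for e in EventSource(url, last_id=last_id))</code>):

```python
import time

def resume_loop(connect, handle, max_retries=None, retry_seconds=5):
    """Reconnect after disconnects, passing the last seen event id back so
    the server resumes where the previous connection was cut.

    connect(last_id) returns an iterable of (event_id, data) pairs and is
    expected to raise ConnectionError when the connection drops.
    """
    last_id = None
    retries = 0
    while max_retries is None or retries <= max_retries:
        try:
            for event_id, data in connect(last_id):
                handle(data)
                if event_id is not None:
                    last_id = event_id
            return last_id  # source ended cleanly
        except ConnectionError:
            retries += 1
            time.sleep(retry_seconds)
    return last_id
```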


== See also ==

* [[EventStreams/Administration]], for WMF administration.
* [[EventStreams/Powered By]], for a list of example tools that are built on EventStreams.
* [[Event*]], disambiguation for various event-related services at Wikimedia.
* [[Obsolete:RCStream|RCStream]], predecessor to EventStreams.
* [[mw:Manual:RCFeed|Manual:RCFeed]] on mediawiki.org, about the underlying format of the recent changes messages.
* [[mw:Manual:$wgRCFeeds|Manual:$wgRCFeeds]] on mediawiki.org, about setting up RCFeed (EventStreams uses the EventBusRCEngine from [[mw:Extension:EventBus|EventBus]]).
* [[mw:API:Recent_changes_stream|API:Recent changes stream]] on mediawiki.org.

== Further reading ==

* [[wmfblog:2017/03/20/eventstreams/|"Get live updates to Wikimedia projects with EventStreams"]], by Andrew Otto (2017).
* [[wmfblog:2018/08/08/eventstreams-updates/|"EventStreams updates: New events, composite streams, historical subscription"]], by Andrew Otto (2018).

* [https://germano.dev/sse-websockets/ "Server-Sent Events: the alternative to WebSockets you should be using"], by Germano Gabbianelli (2022).

== External links ==

* [[gitlab:repos/data-engineering/eventstreams|Source code of eventstreams service]] ([https://github.com/wikimedia/mediawiki-services-eventstreams GitHub mirror])
* [https://github.com/wikimedia/KafkaSSE Source code of the KafkaSSE library]
* [[phab:tag/wikimedia-stream|Issue tracker (Phabricator workboard)]]


[[Category:Current]]
[[Category:Services]]
[[Category:Event Platform]]

Latest revision as of 14:15, 14 June 2024

Example client at codepen.io/ottomata/pen/LYpPpxj.
RecentChange stats tool, built with EventStreams – at https://codepen.io/Krinkle/pen/BwEKgW.

EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP using chunked transfer encoding following the Server-Sent Events protocol (SSE). EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.

The service supersedes RCStream, and might in the future replace irc.wikimedia.org. EventStreams is internally backed by Apache Kafka.

Note: SSE and EventSource are often used interchangeably as the names of this web technology. This document refers to SSE as the server-side protocol, and EventSource as the client-side interface.

Streams

EventStreams provides access to several different data streams, most notably the recentchange stream which emits MediaWiki Recent changes events.

For a complete list of available streams, refer to the documentation at https://stream.wikimedia.org/?doc#/streams.

The data format of each stream follows a schema. The schemas can be obtained via https://schema.wikimedia.org/#!/primary/jsonschema, for example jsonschema/mediawiki/recentchange/latest.yaml.

For the recentchange stream there is additional documentation at Manual:RCFeed on mediawiki.org.

When not to use EventStreams

The public EventStreams service is intended for use by small scale external tool developers. It should not be used to build production services within Wikimedia Foundation. WMF production services that react to events should directly consume the underlying Kafka topic(s).

Examples

Web browser

Use the built-in EventSource API in modern browsers:

const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const eventSource = new EventSource(url);

eventSource.onopen = () => {
    console.info('Opened connection.');
};
eventSource.onerror = (event) => {
    console.error('Encountered error', event);
};
eventSource.onmessage = (event) => {
    // event.data will be a JSON message
    const data = JSON.parse(event.data);
    // discard all canary events
    if (meta.domain === 'canary') {
        return;
    }
    // Edits from English Wikipedia
    if (data.server_name === 'en.wikipedia.org') {
        // Output the title of the edited page
        console.log(data.title);
    }
};

JavaScript

Node.js ESM (with wikimedia-streams)

import WikimediaStream from 'wikimedia-streams';

// 'recentchange' can be replaced with another stream topic
const stream = new WikimediaStream('recentchange');

stream.on('open', () => {
    console.info('Opened connection.');
});
stream.on('error', (event) => {
    console.error('Encountered error', event);
});
stream
    .filter("mediawiki.recentchange")
    .all({ wiki: "enwiki" }) // Edits from English Wikipedia
    .on('recentchange', (data, event) => {
        // Output page title
        console.log(data.title);
    });

Node.js (with eventsource)

const EventSource = require('eventsource');

const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const eventSource = new EventSource(url);

eventSource.onopen = () => {
    console.info('Opened connection.');
};
eventSource.onerror = (event) => {
    console.error('Encountered error', event);
};
eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    // discard canary events
    if (meta.domain === 'canary') {
        return;
    }
    if (data.server_name === 'en.wikipedia.org') {
        // Output the page title
        console.log(data.title);
    }
};

Server side filtering is not supported. You can filter client-side instead, for example to listen for changes to a specific wiki only:

var wiki = 'commonswiki';
eventSource.onmessage = function(event) {
    // event.data will be a JSON string containing the message event.
    var change = JSON.parse(event.data);
    // discard canary events
    if (meta.domain === 'canary') {
        return;
    }    
    if (change.wiki == wiki)
        console.log(`Got commons wiki change on page ${change.title}`);
};

TypeScript

Node.js (with wikimedia-streams)

import WikimediaStream from "wikimedia-streams";
import MediaWikiRecentChangeEvent from 'wikimedia-streams/build/streams/MediaWikiRecentChangeEvent';

// "recentchange" can be replaced with any valid stream
const stream = new WikimediaStream("recentchange");

stream
    .filter("mediawiki.recentchange")
    .all({ wiki: "enwiki" }) // Edits from English Wikipedia
    .on('recentchange', (data /* MediaWikiRecentChangeEvent & { wiki: 'enwiki' } */, event) => {
        // Output page title
        console.log(data.title);
    });

Python

Using python-sseclient. There is also a more asynchronous-friendly version called aiosseclient (requires Python 3.6+).

import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            # discard canary events
            if change['meta']['domain'] == 'canary':
                continue            
            print('{user} edited {title}'.format(**change))

The standard SSE protocol defines ways to continue where you left after a failure or other disconnect. We support this in EventStreams as well. For example:

import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
for event in EventSource(url, last_id=None):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            # discard canary events
            if change['meta']['domain'] == 'canary':
                continue            
            if change.user == 'Yourname':
                print(change)
                print(event.id)

# - Run this Python script.
# - Publish an edit to [[Sandbox]] on test.wikipedia.org, and observe it getting printed.
# - Quit the Python process.
# - Change last_id=None to last_id='[{"topic":"…"},{…}]', as taken from the last printed line.
# - Publish another edit, while the Python process remains off.
# - Run this Python script again, and notice it finding and printing the missed edit.

Server-side filtering is not supported. To filter for something like a wiki domain, you'll need to do this on the consumer side. For example:

wiki = 'commonswiki'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            continue
        # discard canary events
        if change['meta']['domain'] == 'canary':
            continue
        if change['wiki'] == wiki:
            print('{user} edited {title}'.format(**change))

Pywikibot is another way to consume EventStreams in Python. It provides an abstraction that handles automatic reconnection, easy filtering, and combining multiple topics into one stream. For example:

>>> from pywikibot.comms.eventstreams import EventStreams
>>> stream = EventStreams(streams=['recentchange', 'revision-create'],
		          since='20190111')
>>> stream.register_filter(server_name='fr.wikipedia.org', type='edit')
>>> change = next(iter(stream))
>>> print('{type} on page "{title}" by "{user}" at {meta[dt]}.'.format(**change))
edit on page "Véronique Le Guen" by "Speculos" at 2019-01-12T21:19:43+00:00.

Command-line

With curl and jq: set the Accept header and prettify the events with jq.

curl -s -H 'Accept: application/json'  https://stream.wikimedia.org/v2/stream/recentchange | jq .

Setting the Accept header to application/json will cause EventStreams to send you newline-delimited JSON objects, rather than data in the SSE format.
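For consumers that don't speak SSE, this newline-delimited JSON mode is easy to handle by hand. A minimal sketch of the per-line handling (the sample lines below are illustrative; a real client would stream them from the HTTP response, e.g. with requests and iter_lines()):

```python
import json

# Illustrative sample of what the response body lines look like with
# 'Accept: application/json': one JSON object per line.
sample_lines = [
    b'{"wiki": "enwiki", "title": "Sandbox", "user": "Example"}',
    b'',  # blank keep-alive lines should be skipped
]

events = []
for line in sample_lines:
    if not line:
        continue
    events.append(json.loads(line))

print(events[0]['title'])
```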

API

The list of available streams will change over time, so they are not documented here. To see the current list, visit the swagger-ui documentation, or request the swagger spec directly from https://stream.wikimedia.org/?spec. The available stream URI paths all begin with /v2/stream, e.g.

"/v2/stream/recentchange": {
    "get": {
      "produces": [
        "text/event-stream; charset=utf-8"
      ],
      "description": "Mediawiki RecentChanges feed. Schema: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/recentchange"
    }
  },
"/v2/stream/revision-create": {
      "get": {
        "produces": [
          "text/event-stream; charset=utf-8"
        ],
        "description": "Mediawiki Revision Create feed. Schema: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/revision/create"
      }
    }

Stream selection

Streams are addressable either individually, e.g. /v2/stream/revision-create, or as a comma separated list of streams to compose, e.g. /v2/stream/page-create,page-delete,page-undelete.

See available streams: https://stream.wikimedia.org/?doc

Historical Consumption

Since 2018-06, EventStreams supports timestamp-based historical consumption. A timestamp can be provided in individual assignment objects in the Last-Event-ID header, by setting a timestamp field instead of an offset field. Or, more simply, a since query parameter can be provided in the stream URL, e.g. since=2018-06-14T00:00:00Z. since can be anything parseable by JavaScript Date.parse(), e.g. a UTC ISO-8601 datetime string.

When given a timestamp, EventStreams will ask Kafka for the message offset in the stream(s) that most closely matches the timestamp. Kafka guarantees that all events after the returned message offset occur after the given timestamp. NOTE: Stream history is not kept indefinitely. Depending on the stream configuration, there will likely be between 7 and 31 days of history available. Please be considerate when providing timestamps: there may be a lot of historical data available, and reading and sending it all can be compute intensive. Consume only the minimum of data you need.

Example URL: https://stream.wikimedia.org/v2/stream/revision-create?since=2018-06-14T00:00:00Z.

If you want to manually set which topics, partitions, and timestamps or offsets your client starts consuming from, you can set the Last-Event-ID HTTP request header to an array of objects that specify this. E.g.

[{"topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 1234567}, {"topic": "codfw.mediawiki.recentchange", "partition": 0, "timestamp": 1575906290000}]
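Both mechanisms can be prepared from client code before opening the connection. A sketch in Python (the timestamps and topics below are illustrative, not taken from a live stream):

```python
import json
from datetime import datetime, timezone

base = 'https://stream.wikimedia.org/v2/stream/recentchange'

# Option 1: a 'since' query parameter, as a UTC ISO-8601 datetime string.
since = datetime(2024, 1, 1, tzinfo=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
url = f'{base}?since={since}'

# Option 2: a Last-Event-ID header with explicit per-topic assignments.
# Use 'timestamp' (milliseconds) rather than 'offset' in WMF's multi-DC setup.
assignments = [
    {'topic': 'eqiad.mediawiki.recentchange', 'partition': 0, 'timestamp': 1704067200000},
    {'topic': 'codfw.mediawiki.recentchange', 'partition': 0, 'timestamp': 1704067200000},
]
headers = {'Last-Event-ID': json.dumps(assignments)}

print(url)
```

Either the URL or the header (or both) can then be passed to whatever HTTP/SSE client you use.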

Response Format

All examples here consume recent changes from https://stream.wikimedia.org/v2/stream/recentchange. This section describes the format of a response body from an EventStreams stream endpoint.

Requesting /v2/stream/recentchange starts a stream of data in the SSE format. This format is best interpreted using an EventSource client. If you choose not to use one, the raw stream is still human-readable and looks as follows:

event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1532031066001},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}

Each event will be separated by 2 line breaks (\n\n), and have event, id, and data fields.

The event will be message for data events, and error for error events. id is a JSON-formatted array of Kafka topic, partition, and offset|timestamp metadata. The id field can be used to tell EventStreams to start consuming from an earlier position in the stream. This enables clients to automatically resume from where they left off if they are disconnected. EventSource implementations handle this transparently. Note that the topic, partition, and offset|timestamp for all topics and partitions that make up this stream are included in every message's id field. This allows EventSource to be specific about where it left off, even if the consumed stream is composed of multiple Kafka topic-partitions.
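To make the field layout concrete, a raw frame like the one above can be split into its event, id, and data fields with a few lines of Python. This is a simplified parser for illustration only; real SSE also allows multi-line data fields and comments, which EventSource clients handle for you:

```python
import json

# A raw SSE frame, mirroring the example above (single-line fields only).
frame = (
    'event: message\n'
    'id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":1532031066001}]\n'
    'data: {"event": "data", "is": "here"}'
)

fields = {}
for line in frame.split('\n'):
    name, _, value = line.partition(': ')
    fields[name] = value

# The id field is itself JSON: one assignment object per Kafka topic-partition.
assignments = json.loads(fields['id'])
data = json.loads(fields['data'])

print(fields['event'], assignments[0]['topic'])
```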

Note that offsets and timestamps may be used interchangeably in the SSE id field. WMF runs stream.wikimedia.org in a multi-DC active/active setup, backed by multiple Kafka clusters. Since Kafka offsets are unique per cluster, using them in a multi-DC setup is not reliable. Instead, id fields will always use timestamps instead of offsets. This is not as precise as using offsets, but allows for a reliable multi-DC service.

You may request that EventStreams begins streaming to you from different offsets by setting an array of topic, partition, offset|timestamp objects in the Last-Event-ID HTTP header.

Canary Events

The WMF Data Engineering team produces artificial 'canary' events into each stream multiple times an hour. The presence of these canary events in a stream allows us to differentiate between a broken event stream and an empty one. If a stream has canary_events_enabled=true, then we should expect at least one event in the stream's Kafka topics every hour. If we get no events in an hour, then we can trigger an alert that the stream is broken.

These events are not filtered out in the streams available at stream.wikimedia.org. As a user of these streams, you should discard all canary events; i.e. all events where meta.domain === 'canary'.

If you are not using canary events for alerting, discard them! Discard all events where
meta.domain === 'canary'

The content of most canary event fields is copied directly from the first example event in the event's schema. E.g. mediawiki/recentchange example, mediawiki/revision/create example. These examples can also be seen in the OpenAPI docs for the streams, e.g. mediawiki.page-move example value. The code that creates canary events can be found here (as of 2023-11).
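Dropping canary events takes only a small guard in consumer code. A minimal helper (the sample events are illustrative):

```python
def is_canary(event):
    """True for WMF's artificial monitoring events, which consumers should discard."""
    return event.get('meta', {}).get('domain') == 'canary'

events = [
    {'meta': {'domain': 'canary'}, 'title': 'Artificial event'},
    {'meta': {'domain': 'en.wikipedia.org'}, 'title': 'Sandbox'},
]

real_events = [e for e in events if not is_canary(e)]
print([e['title'] for e in real_events])
```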

Filtering

EventStreams does not have $wgServerName (or any other) server-side filtering capabilities. You'll need to do your filtering client side, e.g.

/**
 * Calls cb(event) for every event where recentchange event.server_name == server_name.
 */
function filterWiki(event, server_name, cb) {
    if (event.server_name === server_name) {
        cb(event);
    }
}

eventSource.onmessage = function(event) {
    // Print only events that come from Wikimedia Commons.
    filterWiki(JSON.parse(event.data), 'commons.wikimedia.org', console.log);
};

Architecture

SSE vs. WebSockets/Socket.IO

The previous "RCStream" service was written for consumption via Socket.IO, so why did we change the protocol for its replacement?

The WebSocket protocol doesn't use plain HTTP, which makes it different from most other services we run at the Wikimedia Foundation. WebSockets are powerful and can, for example, let clients and servers communicate asynchronously over a bi-directional pipe. EventStreams, on the other hand, is read-only and only needs to send events from the server to a client. By using 100% standard HTTP, EventStreams can be consumed from any HTTP client, without the need to program several RPC-like initialization steps.

We originally prototyped a Kafka -> Socket.io library (Kasocki). After doing so we decided that HTTP-SSE was a better fit, and developed KafkaSSE instead.

KafkaSSE

KafkaSSE is a library that glues a Kafka consumer to a connected HTTP SSE client. The Kafka consumer is assigned topics, partitions, and offsets, and events are then streamed from the consumer to the HTTP client using chunked transfer encoding. EventStreams maps stream routes (e.g. /v2/stream/recentchange) to specific topics in Kafka.

Kafka

WMF maintains several internal Kafka clusters that handle hundreds of thousands of messages per second. Kafka has proved to be highly scalable and feature-rich, supporting multiple producers and multiple consumers. Our internal events are already produced through Kafka, so using it as the EventStreams backend was a natural choice.

Kafka allows us to begin consuming from any message offset (that is still present on the backend Kafka cluster). This feature is what allows connected EventStreams clients to auto-resume (via EventSource) when they disconnect.

Notes

Server side enforced timeout

WMF's HTTP connection termination layer enforces a connection timeout of 15 minutes. A good SSE / EventSource client should be able to automatically reconnect and begin consuming at the right location using the Last-Event-ID header.

See this Phabricator discussion for more info.
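The reconnect behaviour a good client needs can be sketched generically. In this hypothetical Python sketch, connect stands in for any function that yields (id, event) pairs starting after a given position and may fail mid-stream; the fake source simulates one dropped connection:

```python
def consume_with_resume(connect, handle, max_retries=3):
    """Reconnect after failures, resuming from the last seen event id."""
    last_id = None
    retries = 0
    while retries <= max_retries:
        try:
            for event_id, event in connect(last_id):
                last_id = event_id  # remember our position as we go
                handle(event)
            return last_id  # the source ended cleanly
        except ConnectionError:
            retries += 1  # a real client would also back off here
    return last_id

# Fake source: fails once after two events, then resumes from last_id.
def fake_connect(last_id):
    for event_id, event in [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]:
        if last_id is not None and event_id <= last_id:
            continue  # already delivered before the disconnect
        if last_id is None and event_id == 3:
            raise ConnectionError('server closed the connection')
        yield event_id, event

seen = []
consume_with_resume(fake_connect, seen.append)
print(seen)  # ['a', 'b', 'c', 'd'] -- nothing lost, nothing duplicated
```

An SSE/EventSource client library does the same thing for you, using the id field of the last received event as the Last-Event-ID on reconnect.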

See also

Further reading

External links