Event Platform/EventStreams HTTP Service

From Wikitech
Revision as of 16:34, 14 June 2018

Example client at codepen.io/ottomata/pen/VKNyEw/

EventStreams is a web service that exposes continuous streams of structured event data. It does so over HTTP using chunked transfer encoding following the Server-Sent Events protocol. EventStreams can be consumed directly via HTTP, but is more commonly used via a client library.

EventStreams provides access to arbitrary streams of data, including MediaWiki RecentChanges. It replaces RCStream, and may in the future replace irc.wikimedia.org. EventStreams is backed by Kafka.

Note: Often 'SSE' and EventSource are used interchangeably. This document refers to SSE as the server-side protocol, and EventSource as the client-side interface.

Examples

JavaScript

Node.js (with eventsource)

var EventSource = require('eventsource');
var url = 'https://stream.wikimedia.org/v2/stream/recentchange';

console.log(`Connecting to EventStreams at ${url}`);
var eventSource = new EventSource(url);

eventSource.onopen = function(event) {
    console.log('--- Opened connection.');
};

eventSource.onerror = function(event) {
    console.error('--- Encountered error', event);
};

eventSource.onmessage = function(event) {
    // event.data will be a JSON string containing the message event.
    console.log(JSON.parse(event.data));
};

Server-side filtering is not (yet) supported, so if you need to filter on something like a wiki name, you'll need to do this client side, e.g.

var wiki = 'commonswiki';
eventSource.onmessage = function(event) {
    // event.data will be a JSON string containing the message event.
    var change = JSON.parse(event.data);
    if (change.wiki == wiki)
        console.log(`Got commons wiki change on page ${change.title}`);
};

Python

Using sseclient.

There is also a more asynchronous-friendly version, which requires Python 3.6's async generator capability.

import json
from sseclient import SSEClient as EventSource

url = 'https://stream.wikimedia.org/v2/stream/recentchange'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            pass
        else:
            print('{user} edited {title}'.format(**change))

Server-side filtering is not (yet) supported, so if you need to filter on something like a wiki name, you'll need to do this client side, e.g.

wiki = 'commonswiki'
for event in EventSource(url):
    if event.event == 'message':
        try:
            change = json.loads(event.data)
        except ValueError:
            continue
        if change['wiki'] == wiki:
            print('{user} edited {title}'.format(**change))

Pywikibot supports EventStreams with freely configurable client side filtering and automatic reconnection.

Usage sample:

>>> from pywikibot.comms.eventstreams import EventStreams
>>> stream = EventStreams(stream='recentchange')
>>> stream.register_filter(server_name='es.wikipedia.org', type='edit')
>>> change = next(iter(stream))
>>> print('{type} on page {title} by {user}.'.format(**change))
edit on page Pirarajá by CITY MVD.

Command-line

With curl and jq. Grep for the data part of each event, strip the data: prefix with sed, and pretty-print the JSON with jq.

curl -s  https://stream.wikimedia.org/v2/stream/recentchange | 
  grep data |
  sed 's/^data: //g' |
  jq .

API

The list of streams that are available will change over time, so they will not be documented here. To see the current list of available streams, visit the swagger-ui documentation, or request the swagger spec directly from https://stream.wikimedia.org/?spec. The available stream URI paths all begin with /v2/stream, e.g.

"/v2/stream/recentchange": {
    "get": {
      "produces": [
        "text/event-stream; charset=utf-8"
      ],
      "description": "Mediawiki RecentChanges feed. Schema: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki/recentchange"
    }
  },
"/v2/stream/revision-create": {
      "get": {
        "produces": [
          "text/event-stream; charset=utf-8"
        ],
        "description": "Mediawiki Revision Create feed. Schema: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki/revision/create"
      }
    }
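The spec can also be inspected programmatically: fetch it and filter for paths that begin with /v2/stream. A minimal sketch, using a hard-coded spec fragment modeled on the excerpt above (in practice you would fetch https://stream.wikimedia.org/?spec with an HTTP client and parse the JSON):

```python
# Discover available stream paths from a swagger spec.
# This spec dict is an illustrative fragment; in practice, fetch it
# from https://stream.wikimedia.org/?spec (e.g. with urllib or requests).
spec = {
    "paths": {
        "/v2/stream/recentchange": {"get": {"produces": ["text/event-stream; charset=utf-8"]}},
        "/v2/stream/revision-create": {"get": {"produces": ["text/event-stream; charset=utf-8"]}},
        "/_info": {"get": {}},  # non-stream endpoints are filtered out
    }
}

stream_paths = sorted(p for p in spec["paths"] if p.startswith("/v2/stream"))
print(stream_paths)
# ['/v2/stream/recentchange', '/v2/stream/revision-create']
```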

Stream selection

Streams are addressable either individually, e.g. /v2/stream/revision-create, or as a comma separated list of streams to compose, e.g. /v2/stream/page-create,page-delete,page-undelete.

Timestamp Historical Consumption

Since 2018-06, EventStreams supports timestamp-based historical consumption. A timestamp can be provided as individual assignment objects in the Last-Event-ID header by setting a timestamp field instead of an offset field. Or, more simply, a since query parameter can be provided in the stream URL, e.g. since=2018-06-14T00:00:00Z. since can be given either as a milliseconds UTC Unix epoch timestamp or as anything parseable by JavaScript Date.parse(), e.g. a UTC ISO-8601 datetime string.

When given a timestamp, EventStreams asks Kafka for the message offset in the stream(s) that most closely matches the timestamp. Kafka guarantees that all events after the returned offset will be after the given timestamp. NOTE: Stream history is not kept indefinitely. Depending on the stream configuration, there will likely be between 7 and 31 days of history available. Please be considerate when providing timestamps: there may be a lot of historical data available, and reading and sending it all is compute intensive. Please consume only the minimum of data you need.

Example URL: https://stream.wikimedia.org/v2/stream/revision-create?since=2016-06-14T00:00:00Z.
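Building a since URL in either accepted form can be sketched as follows; the base URL matches the example above, and both forms address the same instant:

```python
from datetime import datetime, timezone

base = 'https://stream.wikimedia.org/v2/stream/revision-create'
dt = datetime(2018, 6, 14, tzinfo=timezone.utc)

# ISO-8601 datetime form, as in the example URL above.
iso_url = f"{base}?since={dt.strftime('%Y-%m-%dT%H:%M:%SZ')}"

# Equivalent milliseconds-since-epoch form.
epoch_ms_url = f"{base}?since={int(dt.timestamp() * 1000)}"

print(iso_url)       # ...?since=2018-06-14T00:00:00Z
print(epoch_ms_url)  # ...?since=1528934400000
```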

Response Format

All examples here will consume recent changes from https://stream.wikimedia.org/v2/stream/recentchange. This section describes the format of a response body from an EventStreams stream endpoint.

Requesting /v2/stream/recentchange will start a stream of data in the SSE format. This format is best interpreted using an EventSource client. If you choose not to use one of these, the raw stream is still human readable and looks as follows:

event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}

Each event will be separated by 2 line breaks (\n\n), and have event, id, and data fields.
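A rough sketch of extracting those fields from one raw event block by hand, assuming you are not using an EventSource client (a real SSE parser also handles comment lines, multi-line data fields, and retry fields, which a library does for you):

```python
import json

# One raw SSE event block, shaped like the example above.
raw_event = (
    'event: message\n'
    'id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965}]\n'
    'data: {"event": "data", "is": "here"}\n'
)

# Split each line into field name and value at the first ': '.
fields = {}
for line in raw_event.splitlines():
    name, _, value = line.partition(': ')
    fields[name] = value

offsets = json.loads(fields['id'])    # Kafka position metadata
payload = json.loads(fields['data'])  # the event payload itself
print(fields['event'], payload['is'])
# message here
```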

The event field will be message for data events, and error for error events. id is a JSON-formatted array of Kafka topic, partition, and offset metadata. The id field can be used to tell EventStreams to start consuming from an earlier position in the stream, which enables clients to automatically resume from where they left off if they are disconnected. EventSource implementations handle this transparently. Note that the topic, partition, and offset for all topic-partitions that make up this stream are included in every message's id field. This allows EventSource to be specific about where it left off even if the consumed stream is composed of multiple Kafka topic-partitions.

You may request that EventStreams begins streaming to you from different offsets by setting an array of topic, partition, offset objects in the Last-Event-ID HTTP header.
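Constructing that header is just a matter of JSON-encoding the id array from the last event you received. A minimal sketch, assuming your HTTP client lets you set request headers (the topic/partition/offset values here are illustrative, copied from the example id above):

```python
import json

# Resume from specific Kafka positions by sending them in the
# Last-Event-ID header. In practice, use the id field from the
# last event your client received before disconnecting.
last_position = [
    {"topic": "eqiad.mediawiki.recentchange", "partition": 0, "offset": 142461965},
    {"topic": "codfw.mediawiki.recentchange", "partition": 0, "offset": -1},
]

headers = {"Last-Event-ID": json.dumps(last_position)}
print(headers["Last-Event-ID"])
```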

Filtering

Unlike RCStream, EventStreams does not (yet) have $wgServerName (or any other) server-side filtering capabilities. We would like to add this, but exactly how it would fit into the API, and which filtering features to support, is still under debate.

Until server-side filtering exists, you'll need to do your filtering client side, e.g.

/**
 * Calls cb(event) for every event where recentchange event.server_name == server_name.
 */
function filterWiki(event, server_name, cb) {
    if (event.server_name == server_name) {
        cb(event);
    }
}

eventSource.onmessage = function(event) {
    // Print only events that come from Wikimedia Commons.
    filterWiki(JSON.parse(event.data), 'commons.wikimedia.org', console.log);
};

Architecture

SSE vs. WebSockets/Socket.IO

RCStream was written for consumption via Socket.IO, so why not continue to use it for its replacement?

WebSockets doesn't use HTTP, which makes it different from most of the other services that Wikimedia runs. It is especially powerful when clients and servers need a bi-directional pipe to communicate with each other asynchronously. EventStreams only needs to send events from the server to clients, and is 100% HTTP. As such, it can be consumed using any HTTP client out there, without the need to program several RPC-like initialization steps.

We did originally build a Kafka -> Socket.io library (Kasocki), but after doing so we decided that SSE was a better fit and built KafkaSSE.

KafkaSSE

KafkaSSE is a library that glues a Kafka Consumer to a connected HTTP SSE client. A Kafka Consumer is assigned topics, partitions, and offsets, and events are then streamed from the consumer to the HTTP client using chunked transfer encoding. EventStreams maps stream routes (e.g. /v2/stream/recentchange) to specific topics in Kafka.
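That route-to-topic mapping can be pictured as a simple lookup. The sketch below is hypothetical, not the actual EventStreams configuration; the datacenter-prefixed topic names follow the pattern visible in the id field example earlier (eqiad./codfw. prefixes):

```python
# Hypothetical sketch of mapping a stream route to its backing Kafka topics.
# Topic names are illustrative; the real configuration lives in EventStreams.
stream_to_topics = {
    '/v2/stream/recentchange': [
        'eqiad.mediawiki.recentchange',
        'codfw.mediawiki.recentchange',
    ],
    '/v2/stream/revision-create': [
        'eqiad.mediawiki.revision-create',
        'codfw.mediawiki.revision-create',
    ],
}

def topics_for(route):
    """Return the Kafka topics a KafkaSSE consumer would be assigned for a route."""
    return stream_to_topics[route]

print(topics_for('/v2/stream/recentchange'))
# ['eqiad.mediawiki.recentchange', 'codfw.mediawiki.recentchange']
```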

Kafka

WMF maintains several internal Kafka clusters, handling hundreds of thousands of messages per second. Kafka has proved to be highly scalable and feature-rich: it supports multiple producers and multiple consumers. Our internal events are already produced through Kafka, so using it as the EventStreams backend was a natural choice.

Kafka allows us to begin consuming from any message offset (that is still present on the backend Kafka cluster). This feature is what allows connected EventStreams clients to auto-resume (via EventSource) when they disconnect. It is also what enables the timestamp-based historical consumption described above.

WMF Administration

EventStreams/Administration

Uses

Check out the Powered By page for a list of EventStreams client uses.

See also

Source code: EventStreams
Source code: kafka-sse