13.8. The Kafka Connector

The Kafka connector receives serialized data from the export tables and writes it to a message queue using the Apache Kafka version 0.8 protocols. Apache Kafka is a distributed messaging service that lets you set up message queues which are written to and read from by "producers" and "consumers", respectively. In the Apache Kafka model, VoltDB export acts as a "producer".

Before using the Kafka connector, we strongly recommend reading the Kafka documentation and becoming familiar with the software, since you will need to set up a Kafka 0.8 service and appropriate "consumer" clients to make use of VoltDB's Kafka export functionality. The instructions in this section assume a working knowledge of Kafka and the Kafka operational model.

When the Kafka connector receives data from the VoltDB export tables, it establishes a connection to the Kafka messaging service as a Kafka producer. It then writes records to the service using the VoltDB table name and a predetermined prefix as the Kafka "topic". How and when the data is transmitted to Kafka and the name of the topic prefix are controlled by the export connector properties.
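The topic name is simply the configured prefix concatenated with the export table name. A minimal sketch of this rule (the prefix shown is the documented default; the table name is illustrative):

```python
def kafka_topic(table_name, prefix="voltdbexport"):
    # The connector publishes each table's rows to {topic.prefix}{table-name};
    # "voltdbexport" is the documented default prefix.
    return prefix + table_name

print(kafka_topic("ALERTS"))  # voltdbexportALERTS
```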

The majority of the Kafka properties are identical in both name and content to the Kafka producer properties listed in the Kafka documentation. All but one of these properties are optional for the Kafka connector and will use the standard Kafka default value. For example, if you do not specify the queue.buffering.max.ms property it defaults to 5000 milliseconds.

The only required property is metadata.broker.list, which lists the Kafka servers that the VoltDB export connector should connect to. You must specify this property so VoltDB knows where to send the export data.
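For reference, a minimal export configuration might look like the following sketch. The broker host names and port are placeholders, and the exact layout of the surrounding deployment-file elements depends on your VoltDB version:

```xml
<export enabled="true" target="kafka">
    <configuration>
        <property name="metadata.broker.list">kafka1.example.com:9092,kafka2.example.com:9092</property>
    </configuration>
</export>
```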

In addition to the standard Kafka producer properties, there are several custom properties specific to VoltDB. The properties binaryencoding, skipinternals, and timezone affect the format of the data. The topic.prefix and batch.mode properties affect how and when the data is written to Kafka.

The topic.prefix property specifies the text that precedes the table name when constructing the Kafka topic. If you do not specify a prefix, it defaults to "voltdbexport". Note that unless you configure the Kafka brokers with the auto.create.topics.enable property set to true, you must create the topics for every export table manually before starting the export process. Enabling auto-creation of topics when setting up the Kafka brokers is recommended.
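If auto-creation is disabled, each topic can be created ahead of time with the standard Kafka 0.8 tooling. For example, for an export table named ALERTS using the default prefix (the ZooKeeper address, replication factor, and partition count here are illustrative):

```shell
bin/kafka-topics.sh --create --zookeeper zk1.example.com:2181 \
    --replication-factor 2 --partitions 4 \
    --topic voltdbexportALERTS
```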

The batch.mode property specifies whether messages are sent in batches, like the JDBC connector, or one message at a time. When configuring the export connector, it is important to understand the relationship between batch mode and synchronous versus asynchronous processing and their effect on database latency.

Using batch mode reduces the number of packets that must be sent to the Kafka servers, optimizing network bandwidth. If the export data is sent asynchronously, by setting the property producer.type to "async", the impact of export on the database is further reduced, since the export connector does not wait for the Kafka server to respond. However, with asynchronous processing, VoltDB is not able to resend the data if the message fails after it is sent.

If export to Kafka is done synchronously, the export connector waits for acknowledgement of each message sent to the Kafka server before processing the next packet. This allows the connector to resend any packets that fail. The drawback to synchronous processing is that on a heavily loaded database, the latency it introduces means export may not be able to keep up with the influx of export data and may have to write to overflow.

VoltDB guarantees that at least one copy of all export data is sent by the export connector. But when operating in asynchronous mode, the Kafka connector cannot guarantee that the packet is actually received and accepted by the Kafka broker. By operating in synchronous mode, VoltDB can catch errors returned by the Kafka broker and resend any failed packets. However, you pay the penalty of additional latency and possible export overflow.

To balance performance with durability of the exported data, the following are the two recommended configurations for producer type and batch mode:

  • Synchronous with batch mode — Using synchronous mode ensures all packets are received by the Kafka system while batch mode reduces the possible latency impact by decreasing the number of packets that get sent.

    <property name="producer.type">sync</property>
    <property name="batch.mode">true</property>
  • Asynchronous without batch mode — Using asynchronous mode eliminates latency due to waiting for responses from the Kafka infrastructure while not using batch mode ensures that if a request fails, only one row of export data is affected, reducing the durability impact.

    <property name="producer.type">async</property>
    <property name="batch.mode">false</property>

Finally, the actual export data is sent to Kafka as a comma-separated values (CSV) formatted string. The message includes six columns of metadata (such as the transaction ID and timestamp) followed by the column values of the export table.
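To give a concrete picture of the wire format, the sketch below splits a hypothetical export row into its metadata and payload portions. The sample values and the quoting style are illustrative, not taken from an actual export stream:

```python
import csv
import io

# A hypothetical exported row: six metadata columns (such as transaction ID
# and timestamp) followed by the export table's own column values.
sample = '"4972578197932417024","1389993466","1","0","0","1","42","Jane","2023-01-15"'

row = next(csv.reader(io.StringIO(sample)))
metadata, payload = row[:6], row[6:]
print(metadata)  # the six VoltDB metadata values
print(payload)   # prints ['42', 'Jane', '2023-01-15']
```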

Table 13.4, “Kafka Export Properties” lists the supported properties for the Kafka connector, including the standard Kafka producer properties and the properties unique to VoltDB.

Table 13.4. Kafka Export Properties

metadata.broker.list* (string)
    A comma-separated list of Kafka brokers.

batch.mode (true, false)
    Whether to submit multiple rows as a single request or send each export row separately. The default is true.

partition.key ({table}.{column}[,...])
    Specifies which table column value to use as the Kafka partitioning key for each table. Kafka uses the partition key to distribute messages across multiple servers.

    By default, the value of the table's partitioning column is used as the Kafka partition key. Using this property you can specify a list of table column names, where the table name and column name are separated by a period and the list of table references is separated by commas. If the table is not partitioned and you do not specify a key, the server partition ID is used as a default.

binaryencoding (hex, base64)
    Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal.

skipinternals (true, false)
    Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported table data. The default is false.

timezone (string)
    The time zone to use when formatting the timestamp. Specify the time zone as a Java timezone identifier. The default is GMT.

topic.prefix (string)
    The prefix to use when constructing the topic name. Each row is sent to a topic identified by {prefix}{table-name}. The default prefix is "voltdbexport".

metadata.broker.list, request.required.acks, request.timeout.ms, producer.type, serializer.class, key.serializer.class, partitioner.class, compression.codec, compressed.topics, message.send.max.retries, retry.backoff.ms, topic.metadata.refresh.interval.ms, queue.buffering.max.ms, queue.buffering.max.messages, queue.enqueue.timeout.ms, batch.num.messages, send.buffer.bytes, client.id (various)
    Standard Kafka producer properties can be specified as properties to the VoltDB Kafka connector.

*Required
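As an illustration of the partition.key syntax, the following hypothetical property routes a CUSTOMER table's messages by its CUSTOMER_ID column and an ORDERS table's messages by its ORDER_ID column (both table and column names are examples, not part of any schema in this document):

```xml
<property name="partition.key">CUSTOMER.CUSTOMER_ID,ORDERS.ORDER_ID</property>
```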