Just as VoltDB can export data from selected tables to various targets, it can import data into selected tables from external sources. Import works in two ways:
One-time import of data using one of several data loading utilities VoltDB provides. These data loaders support multiple standard input protocols and can be run from any server, including servers remote from the database itself.
Streaming import as part of the database server process. For data that is imported on an ongoing basis, use of the built-in import functionality ensures that import starts and stops with the database.
The following sections discuss these two approaches to data import.
Often, when migrating data from one database to another or when pre-loading a set of data into VoltDB as a starting point, you just want to perform the import once and then use the data natively within VoltDB. For these one-time uses, VoltDB provides separate data loader utilities that you can run once and then stop.
Each data loader supports a different source format. You can load data from text files — such as comma-separated value (CSV) files — using the csvloader utility. You can load data from another JDBC-compliant database using the jdbcloader utility. Or you can load data from a streaming message service with the Kafka loader utility, kafkaloader.
All of the data loaders operate in much the same way. For each utility you specify the source for the import and either a table that the data will be loaded into or a stored procedure that will be used to load the data. So, for example, to load records from a CSV file named staff.csv into the table EMPLOYEES, the command might be the following:
$ csvloader employees --file=staff.csv
If instead you are copying the data from a JDBC-compliant database, the command might look like this:
$ jdbcloader employees \
   --jdbcurl=jdbc:postgresql://remotesvr/corphr \
   --jdbctable=employees \
   --jdbcdriver=org.postgresql.Driver
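Or, to load records from a Kafka topic into the same table using kafkaloader, the command might look like the following. This is only a sketch, assuming a Zookeeper host at kafkasvr:2181 and a topic named employees; see Appendix D for the exact arguments the utility accepts:

$ kafkaloader employees \
   --zookeeper=kafkasvr:2181 \
   --topic=employees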
Each utility has arguments unique to the data source (such as --jdbcurl) that allow you to properly configure and connect to the source. See the description of each utility in Appendix D, VoltDB CLI Commands for details.
If importing data is an ongoing business process, rather than a one-time event, then it is desirable to make it an integral part of the database system. This can be done by building a custom application to push data into VoltDB using one of its standard APIs, such as the JDBC interface. Or you can take advantage of VoltDB's built-in import infrastructure.
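For example, a minimal custom import application using the JDBC interface might look like the following sketch. The table, columns, and values are illustrative only; it assumes a two-column EMPLOYEES table and a database running on localhost. The VoltDB JDBC driver class is org.voltdb.jdbc.Driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EmployeeImporter {
    public static void main(String[] args) throws Exception {
        // Load the VoltDB JDBC driver and connect to the database.
        Class.forName("org.voltdb.jdbc.Driver");
        Connection conn =
            DriverManager.getConnection("jdbc:voltdb://localhost:21212");

        // Insert one record. A real import application would loop
        // over its source data, inserting each record in turn.
        PreparedStatement stmt =
            conn.prepareStatement("INSERT INTO EMPLOYEES VALUES (?, ?)");
        stmt.setString(1, "A123");
        stmt.setString(2, "Jane Smith");
        stmt.executeUpdate();

        stmt.close();
        conn.close();
    }
}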
The built-in importers work in much the same way as the data loading utilities, where incoming data is written into one or more database tables using an existing stored procedure. The difference is that the built-in importers start automatically whenever the database starts and stop when the database stops, making import an integral part of the database process.
You configure the built-in importers in the deployment file the same way you configure export connections. Within the <import> element, you declare each import stream using separate <configuration> elements. Within the <configuration> tag you use attributes to specify the type and format of data being imported and whether the import configuration is enabled or not. Then enclosed within the <configuration> tags you use <property> elements to provide information required by the specific importer. For example:
<import>
   <configuration type="kafka" format="csv" enabled="true">
      <property name="brokers">kafkasvr:9092</property>
      <property name="topics">employees</property>
      <property name="procedure">EMPLOYEE.insert</property>
   </configuration>
</import>
For the initial release of the built-in importers, Kafka is the only supported import type. VoltDB supports two import formats: comma-separated values (csv) and tab-separated values (tsv). Comma-separated values are the default format, so if you are using CSV-formatted input you can leave out the format attribute, as in the following examples.
When the database starts, the import infrastructure starts any enabled configurations. If you are importing multiple streams to separate tables through separate procedures, you must include multiple configurations, even if they come from the same source. For example, the following configuration imports data from two Kafka topics on the same Kafka servers into separate VoltDB tables.
<import>
   <configuration type="kafka" enabled="true">
      <property name="brokers">kafkasvr:9092</property>
      <property name="topics">employees</property>
      <property name="procedure">EMPLOYEE.insert</property>
   </configuration>
   <configuration type="kafka" enabled="true">
      <property name="brokers">kafkasvr:9092</property>
      <property name="topics">managers</property>
      <property name="procedure">MANAGER.insert</property>
   </configuration>
</import>
The following section describes the Kafka importer in more detail.
The Kafka importer connects to the specified Kafka messaging service, fetches records from one or more Kafka topics, and writes them into the VoltDB database. The data is decoded according to the specified format (comma-separated values, by default) and is inserted into the database using the specified stored procedure.
You must specify the following properties for each configuration:
brokers — Identifies one or more Kafka brokers. That is, servers hosting the Kafka service and desired topics. Specify a single server or a comma-separated list of brokers.
topics — Identifies the Kafka topics that will be imported. The property value can be a single topic name or a comma-separated list of topics.
procedure — Identifies the stored procedure that is invoked to insert the records into the VoltDB database.
When import starts, the importer first checks to make sure the specified stored procedure exists in the database schema. If it does not (for example, when you first create a database, before a schema is loaded), the importer issues periodic warnings to the console.
Once the specified stored procedure is declared, the importer looks for the specified Kafka brokers and topics. If the brokers cannot be found or the topics do not exist on the brokers, the importer reports an error and stops. You must restart import once the error condition is corrected, which you can do using any of the following methods (example commands follow the list):
Stop and restart or recover the database
Pause and resume the database using the voltadmin pause and voltadmin resume commands
Update the deployment file using the voltadmin update command
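For example, the second and third methods look like the following, assuming your deployment file is named deployment.xml:

$ voltadmin pause
$ voltadmin resume

$ voltadmin update deployment.xml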
If the brokers are found and the topics exist, the importer starts fetching data from the Kafka topics and submitting it to the stored procedure to insert into the database. In the simplest case, you can use the default insert procedure for a table to insert records into a single table. For more complex data you can write your own import stored procedure to interpret the data and insert it into the appropriate table(s).
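For example, a custom import procedure that splits each incoming record between two tables might look like the following sketch. The table names, columns, and parameters are hypothetical; the importer passes the decoded fields of each record as the procedure's parameters.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class ImportEmployee extends VoltProcedure {
    // Hypothetical tables; adjust the SQL to match your schema.
    public final SQLStmt insertEmployee = new SQLStmt(
        "INSERT INTO EMPLOYEES (EMP_ID, EMP_NAME) VALUES (?, ?);");
    public final SQLStmt insertContact = new SQLStmt(
        "INSERT INTO CONTACTS (EMP_ID, PHONE) VALUES (?, ?);");

    // One parameter per field in the incoming record.
    public VoltTable[] run(String id, String name, String phone) {
        voltQueueSQL(insertEmployee, id, name);
        voltQueueSQL(insertContact, id, phone);
        return voltExecuteSQL(true);
    }
}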
Table 15.7, “Kafka Import Properties” lists the allowable properties for the Kafka importer.
Table 15.7. Kafka Import Properties
| Property | Allowable Values | Description |
|---|---|---|
| brokers* | string | A comma-separated list of Kafka brokers. |
| procedure* | string | The stored procedure to invoke to insert the incoming data into the database. |
| topics* | string | A comma-separated list of Kafka topics. |
| fetch.message.max.bytes | integer | The maximum size, in bytes, of the message that is fetched from Kafka. The Kafka default for this property is 64 kilobytes. |
| groupid | string | A user-defined name for the group that the client belongs to. Kafka maintains a single pointer for the current position within the stream for all clients in the same group. The default group ID is "voltdb". In the rare case where you have two or more databases importing data from the same Kafka brokers and topics, be sure to set this property to give each database a unique group ID and avoid the databases interfering with each other. |
| socket.timeout.ms | integer | The time, in milliseconds, before the socket times out if no response is received. The Kafka default for this property is 30,000 (30 seconds). If the socket times out when the importer first tries to connect to the brokers, import will stop. If it times out after the initial connection is made, the importer retries the connection until it succeeds. |

*Required
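For example, the following configuration sketch sets the optional properties from the table as well as the required ones. The values shown are illustrative only:

<import>
   <configuration type="kafka" enabled="true">
      <property name="brokers">kafkasvr:9092</property>
      <property name="topics">employees</property>
      <property name="procedure">EMPLOYEE.insert</property>
      <property name="groupid">hrdb</property>
      <property name="fetch.message.max.bytes">131072</property>
      <property name="socket.timeout.ms">60000</property>
   </configuration>
</import>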