csvloader

Documentation

VoltDB Home » Documentation » Using VoltDB

csvloader

csvloader — Imports the contents of a CSV file and inserts it into a VoltDB table.

Synopsis

csvloader table-name [arguments]

csvloader -p procedure-name [arguments]

Description

The csvloader command reads comma-separated values and inserts each valid line of data into the specified table in a VoltDB database. The most common way to use csvloader is to specify the database table to be loaded and a CSV file containing the data, like so:

$ csvloader employees -f acme_employees.csv

Alternately, you can use standard input as the source of the data:

$ csvloader employees < acme_employees.csv

In addition to inserting all valid content into the specified database table, csvloader creates three output files:

  • Error log — The error log provides details concerning any errors that occur while processing the input file. This includes errors in the format of the input as well as errors that occur attempting the insert into VoltDB. For example, if two rows contain the same value for a column that is declared as unique, the error log indicates that the second insert fails due to a constraint violation.

  • Failed input — A separate file contains the contents of each line that failed to load. This file is useful because it allows you to correct any formatting issues and retry just the failed content, rather than having to restart and reload the entire table.

  • Summary report — Once all input lines are processed, csvloader generates a summary report listing how many lines were read, how many were successfully loaded and how long the operation took.

All three files are created, by default, in the current working directory using "csvloader" and the table name as prefixes. For example, using csvloader to insert contestants into the sample voter database creates the following files:

csvloader_contestants_insert_log.log
csvloader_contestants_invalidrows.csv
csvloader_contestants_insert_report.log

It is possible to use csvloader to load text files other than CSV files, using the --separator, --quotechar, and --escape flags. Note that csvloader uses Python to process the command line arguments. So to enter certain non-alphanumeric characters, you must use the appropriate escaping mechanism for Python command lines. For example, to use a tab-delimited file as input, you need to use the --separator flag, escaping the tab character like so:

$ csvloader --separator=$'\t'  \
            -f employees.tab  employees

Arguments

--batch {integer}

Specifies the number of rows to submit in a batch. If you do not specify an insert procedure, rows of input are sent in batches to maximize overall throughput. You can specify how many rows are sent in each batch using the --batch flag. The default batch size is 200. If you use the --procedure flag, no batching occurs and each row is sent separately.

--blank {error | null | empty }

Specifies what to do with missing values in the input. By default, if a line contains a missing value, it is interpreted as a null value in the appropriate datatype. If you do not want missing values to be interpreted as nulls, you can use the --blank argument to specify other behaviors. Specifying --blank error results in an error if a line contains any missing values and the line is not inserted. Specifying --blank empty returns the corresponding "empty" value in the appropriate datatype. An empty value is interpreted as the following:

  • Zero for all numeric columns

  • Zero, or the Unix epoch value, for timestamp columns

  • An empty or zero-length string for VARCHAR and VARBINARY columns

--columnsizelimit {integer}

Specifies the maximum size of quoted column input, in bytes. Mismatched quotation marks in the input can cause csvloader to read all subsequent input — including line breaks — as part of the column. To avoid excessive memory use in this situation, the flag sets a limit on the maximum number of bytes that will be accepted as input for a column that is enclosed in quotation marks and spans multiple lines. The default is 16777216 (that is, 16MB).

--escape {character}

Specifies the escape character that must precede a separator or quotation character that is supposed to be interpreted as a literal character in the CSV input. The default escape character is the backslash (\).

-f, --file {file-specification}

Specifies the location of a CSV file to read as input. If you do not specify an input file, csvloader reads input from standard input.

--limitrows {integer}

Specifies the maximum number of rows to be read from the input stream. This argument (along with --skip) lets you load a subset of a larger CSV file.

-m, --maxerrors {integer}

Specifies the target number of errors before csvloader stops processing input. Once csvloader encounters the specified number of errors while trying to insert rows, it will stop reading input and end the process. Note that, since csvloader performs inserts asynchronously, it often attempts more inserts before the target number of exceptions are returned from the database. So it is possible more errors could be returned after the target is met. This argument lets you conditionally stop a large loading process if more than an acceptable number of errors occur.

--nowhitespace

Specifies that the CSV input must not contain any whitespace between data values and separators. By default, csvloader ignores extra space between values, quotation marks, and the value separators. If you use this argument, any input lines containing whitespace will generate an error and not be inserted into the database.

--password {text]

Specifies the password to use when connecting to the database. You must specify a username and password if security is enabled for the database.

--port {port-number}

Specifies the network port to use when connecting to the database. If you do not specify a port, csvloader uses the default client port 21212.

-p, --procedure {procedure-name}

Specifies a stored procedure to use for loading each record from the data file. The named procedure must exist in the database catalog and must accept the fields of the data record as input parameters. By default, csvloader uses a custom procedure to batch multiple rows into a single insert operation. If you explicitly name a procedure, batching does not occur.

--quotechar {character}

Specifies the quotation character that is used to enclose values. By default, the quotation character is the double quotation mark (").

-r, --reportdir {directory}

Specifies the directory where csvloader writes the three output files. By default, csvloader writes output files to the current working directory. This argument lets you redirect output to an alternative location.

--s, --servers=server-id[,...]

Specifies the network address of one or more nodes of a database cluster. By default, csvloader attempts to insert the CSV data into a database on the local system (localhost). To load data into a remote database, use the --servers argument to specify the database nodes the loader should connect to.

--separator {charactor}

Specifies the character used to separate individual values in the input. By default, the separator character is the comma (,).

--skip {integer}

Specifies the number of lines from the input stream to skip before inserting rows into the database. This argument (along with --limitrows) lets you load a subset of a larger CSV file.

--strictquotes

Specifies that all values in the CSV input must be enclosed in quotation marks. If you use this argument, any input lines containing unquoted values will generate an error and not be inserted into the database.

--user {text}

Specifies the username to use when connecting to the database. You must specify a username and password if security is enabled for the database.

Examples

The following example loads the data from a CSV file, languages.csv, into the helloworld table from the Hello World example database and redirects the output files to the ./logs subfolder.

$ csvloader helloworld -f languages.csv -r ./logs

The following example performs the same function, providing the input interactively.

$ csvloader helloworld -r ./logs
"Hello", "World", "English"
"Bonjour", "Monde", "French"
"Hola", "Mundo", "Spanish"
"Hej", "Verden", "Danish"
"Ciao", "Mondo", "Italian"
CTRL-D