DuckDB Docs

Contents

Summary
Documentation
    Connect
        Connect
        Concurrency
    Data Import
        Importing Data
        CSV Files
            CSV Import
            CSV Auto Detection
            Reading Faulty CSV Files
            CSV Import Tips
        JSON Files
            JSON Loading
        Multiple Files
            Reading Multiple Files
            Combining Schemas
        Parquet Files
            Reading and Writing Parquet Files
            Querying Parquet Metadata
            Parquet Encryption
            Parquet Tips
        Partitioning
            Hive Partitioning
            Partitioned Writes
        Appender
        INSERT Statements
    Client APIs
        Client APIs Overview
        C
            Overview
            Startup & Shutdown
            Configuration
            Query
            Data Chunks
            Values
            Types
            Prepared Statements
            Appender
            Table Functions
    Configuration
        Configuration
        Pragmas
        Secrets Manager
    SQL
        SQL Introduction
        Statements
            Statements Overview
    Extensions
        Extensions
        Official Extensions
Guides
    Performance
        Performance Guide
        Schema
        Indexing
        Environment
        File Formats
        Tuning Workloads
        My Workload Is Slow
        Benchmarks
    ODBC
        ODBC 101: A Duck Themed Guide to ODBC
    Python
        Installing the Python Client
        Executing SQL in Python
        Jupyter Notebooks
        SQL on Pandas
        Import from Pandas
        Export to Pandas
        SQL on Apache Arrow
        Import from Apache Arrow
        Export to Apache Arrow
        Relational API on Pandas
        Multiple Python Threads
        Integration with Ibis
        Integration with Polars
        Using fsspec Filesystems
Internals
    Overview of DuckDB Internals
    Storage
    Execution Format
Acknowledgments
Summary
This document contains DuckDB's official documentation and guides in a single‑file easy‑to‑search form. If you find any issues, please
report them as a GitHub issue. Contributions are very welcome in the form of pull requests. If you are considering submitting a contribution
to the documentation, please consult our contributor guide.
Code repositories:
• DuckDB source code: https://github.com/duckdb/duckdb
• DuckDB documentation: https://github.com/duckdb/duckdb-web
Documentation
Connect
Connect
To use DuckDB, you must first create a connection to a database. The exact syntax varies between the client APIs but it typically involves
passing an argument to configure persistence.
Persistence
DuckDB can operate in both persistent mode, where the data is saved to disk, and in in‑memory mode, where the entire data set is stored
in the main memory.
Persistent Database To create or open a persistent database, set the path of the database file, e.g., my_database.duckdb, when
creating the connection. This path can point to an existing database or to a file that does not yet exist and DuckDB will open or create a
database at that location as needed. The file may have an arbitrary extension, but .db or .duckdb are two common choices.
Tip: Running on a persistent database allows spilling to disk, thus facilitating larger-than-memory workloads (i.e., out-of-core processing).
Starting with v0.10, DuckDB's storage format is backwards-compatible, i.e., DuckDB is able to read database files produced by older
versions of DuckDB. For example, DuckDB v0.10 can read and operate on files created by the previous DuckDB version, v0.9. For more
details on DuckDB's storage format, see the storage page.
In‑Memory Database DuckDB can operate in in‑memory mode. In most clients, this can be activated by passing the special value
:memory: as the database file or by omitting the database file argument. In in‑memory mode, no data is persisted to disk; therefore, all
data is lost when the process finishes.
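For example, in SQL both modes can be selected with the ATTACH statement (a minimal sketch; the file name is illustrative):

-- attach (and create, if needed) a persistent database file
ATTACH 'my_database.duckdb' AS my_db;
-- attach a separate in-memory database
ATTACH ':memory:' AS memory_db;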
Concurrency
Handling Concurrency
DuckDB has two configurable options for concurrency:
1. One process can both read and write to the database.
2. Multiple processes can read from the database, but no processes can write (access_mode = 'READ_ONLY').
When using option 1, DuckDB supports multiple writer threads using a combination of MVCC (Multi‑Version Concurrency Control) and
optimistic concurrency control (see Concurrency within a Single Process), but all within that single writer process. The reason for this
concurrency model is to allow for the caching of data in RAM for faster analytical queries, rather than going back and forth to disk during
each query. It also allows the caching of function pointers, the database catalog, and other items so that subsequent queries on the same
connection are faster.
Note. DuckDB is optimized for bulk operations, so executing many small transactions is not a primary design goal.
DuckDB supports concurrency within a single process according to the following rules. As long as there are no write conflicts, multiple
concurrent writes will succeed. Appends will never conflict, even on the same table. Multiple threads can also simultaneously update
separate tables or separate subsets of the same table. Optimistic concurrency control comes into play when two threads attempt to edit
(update or delete) the same row at the same time. In that situation, the second thread to attempt the edit will fail with a conflict error.
Writing to DuckDB from multiple processes is not supported automatically and is not a primary design goal (see Handling Concurrency).
If multiple processes must write to the same file, several design patterns are possible, but would need to be implemented in application
logic. For example, each process could acquire a cross‑process mutex lock, then open the database in read/write mode and close it when the
query is complete. Instead of using a mutex lock, each process could instead retry the connection if another process is already connected to
the database (being sure to close the connection upon query completion). Another alternative would be to do multi‑process transactions
on a MySQL, PostgreSQL, or SQLite database, and use DuckDB's MySQL, PostgreSQL, or SQLite extensions to execute analytical queries on
that data periodically.
Additional options include writing data to Parquet files and using DuckDB's ability to read multiple Parquet files, taking a similar approach
with CSV files, or creating a web server to receive requests and manage reads and writes to DuckDB.
Data Import
Importing Data
The first step to using a database system is to insert data into that system. DuckDB provides several data ingestion methods that allow you
to easily and efficiently fill up the database. In this section, we provide an overview of these methods so you can select which one is correct
for you.
Insert Statements
Insert statements are the standard way of loading data into a database system. They are suitable for quick prototyping, but should be
avoided for bulk loading as they have significant per‑row overhead.
For a more detailed description, see the page on the INSERT statement.
CSV Loading
Data can be efficiently loaded from CSV files using the read_csv function or the COPY statement.
You can also load data from compressed (e.g., compressed with gzip) CSV files, for example:
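-- read a gzip-compressed CSV file (illustrative file name)
SELECT * FROM read_csv('test.csv.gz');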
Parquet Loading
Parquet files can be efficiently loaded and queried using the read_parquet function.
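For example, assuming a file named test.parquet:

SELECT * FROM read_parquet('test.parquet');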
JSON Loading
JSON files can be efficiently loaded and queried using the read_json_auto function.
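For example, assuming a file named todos.json:

SELECT * FROM read_json_auto('todos.json');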
Appender
In several APIs (C, C++, Go, Java, and Rust), the Appender can be used as an alternative for bulk data loading. This class can be used to
efficiently add rows to the database system without using SQL statements.
CSV Files
CSV Import
Examples
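The following sketches illustrate common usage (file and table names are illustrative):

-- read a CSV file from disk, auto-inferring options
SELECT * FROM 'flights.csv';
-- use the read_csv function with custom options
SELECT * FROM read_csv('flights.csv',
    delim = '|',
    header = true,
    columns = {'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR', 'OriginCityName': 'VARCHAR', 'DestCityName': 'VARCHAR'});
-- write the result of a query to a CSV file
COPY (SELECT * FROM ontime) TO 'flights.csv' WITH (HEADER, DELIMITER '|');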
CSV Loading
CSV loading, i.e., importing CSV files to the database, is a very common, and yet surprisingly tricky, task. While CSVs seem simple on the
surface, there are a lot of inconsistencies found within CSV files that can make loading them a challenge. CSV files come in many different
varieties, are often corrupt, and do not have a schema. The CSV reader needs to cope with all of these different situations.
The DuckDB CSV reader can automatically infer which configuration flags to use by analyzing the CSV file using the CSV sniffer. This will
work correctly in most situations, and should be the first option attempted. In rare situations where the CSV reader cannot figure out the
correct configuration it is possible to manually configure the CSV reader to correctly parse the CSV file. See the auto detection page for
more information.
Parameters
Below are the parameters that can be passed to the CSV reader. These parameters are accepted by both the COPY statement and the
read_csv function.
all_varchar (BOOL, default false): Option to skip type detection for CSV parsing and assume all columns to be of type VARCHAR.
allow_quoted_nulls (BOOL, default true): Option to allow the conversion of quoted values to NULL values.
auto_detect (BOOL, default true): Enables auto detection of CSV parameters.
auto_type_candidates (TYPE[], default ['SQLNULL', 'BOOLEAN', 'BIGINT', 'DOUBLE', 'TIME', 'DATE', 'TIMESTAMP', 'VARCHAR']): This option allows you to specify the types that the sniffer will use when detecting CSV column types, e.g., SELECT * FROM read_csv('csv_file.csv', auto_type_candidates = ['BIGINT', 'DATE']). The VARCHAR type is always included in the detected types (as a fallback option).
columns (STRUCT, default empty): A struct that specifies the column names and column types contained within the CSV file (e.g., {'col1': 'INTEGER', 'col2': 'VARCHAR'}). Using this option implies that auto detection is not used.
compression (VARCHAR, default auto): The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.csv.gz will use gzip, t.csv will use none). Options are none, gzip, zstd.
dateformat (VARCHAR, default empty): Specifies the date format to use when parsing dates. See Date Format.
decimal_separator (VARCHAR, default .): The decimal separator of numbers.
delim or sep (VARCHAR, default ,): Specifies the string that separates columns within each row (line) of the file.
escape (VARCHAR, default "): Specifies the string that should appear before a data character sequence that matches the quote value.
filename (BOOL, default false): Whether or not an extra filename column should be included in the result.
force_not_null (VARCHAR[], default []): Do not match the specified columns' values against the NULL string. In the default case where the NULL string is empty, this means that empty values will be read as zero-length strings rather than NULLs.
header (BOOL, default false): Specifies that the file contains a header line with the names of each column in the file.
hive_partitioning (BOOL, default false): Whether or not to interpret the path as a Hive partitioned path.
ignore_errors (BOOL, default false): Option to ignore any parsing errors encountered and instead skip rows with errors.
max_line_size (BIGINT, default 2097152): The maximum line size in bytes.
names (VARCHAR[], default empty): The column names as a list, see example.
new_line (VARCHAR, default empty): Set the new line character(s) in the file. Options are '\r', '\n', or '\r\n'.
normalize_names (BOOL, default false): Boolean value that specifies whether or not column names should be normalized, removing any non-alphanumeric characters from them.
null_padding (BOOL, default false): If this option is enabled, when a row lacks columns, it will pad the remaining columns on the right with NULL values.
nullstr (VARCHAR, default empty): Specifies the string that represents a NULL value.
parallel (BOOL, default true): Whether or not the parallel CSV reader is used.
quote (VARCHAR, default "): Specifies the quoting string to be used when a data value is quoted.
sample_size (BIGINT, default 20480): The number of sample rows for auto detection of parameters.
skip (BIGINT, default 0): The number of lines at the top of the file to skip.
timestampformat (VARCHAR, default empty): Specifies the format to use when parsing timestamps. See Date Format.
types or dtypes (VARCHAR[] or STRUCT, default empty): The column types as either a list (by position) or a struct (by name). Example here.
union_by_name (BOOL, default false): Whether the columns of multiple schemas should be unified by name, rather than by position.
CSV Functions
Deprecated: DuckDB v0.10.0 introduced breaking changes to the read_csv function. Namely, the read_csv function now
attempts to auto-detect the CSV parameters, making its behavior identical to the old read_csv_auto function. If you would like to
use read_csv with its old behavior, turn off the auto‑detection manually by using read_csv(..., auto_detect = false).
The read_csv function automatically attempts to figure out the correct configuration of the CSV reader using the CSV sniffer. It also automatically
deduces types of columns. If the CSV file has a header, it will use the names found in that header to name the columns. Otherwise, the
columns will be named column0, column1, column2, .... An example with the flights.csv file:
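SELECT * FROM read_csv('flights.csv');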
The path can either be a relative path (relative to the current working directory) or an absolute path.
If we set delim/sep, quote, escape, or header explicitly, we can bypass the automatic detection of this particular parameter:
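-- sketch: override the header and delimiter detection, keeping auto detection for the rest
SELECT * FROM read_csv('flights.csv', header = true, delim = '|');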
Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
The COPY statement can be used to load data from a CSV file into a table. This statement has the same syntax as the one used in PostgreSQL.
To load the data using the COPY statement, we must first create a table with the correct schema (which matches the order of the columns
in the CSV file and uses types that fit the values in the CSV file). COPY detects the CSV's configuration options automatically.
If we want to manually specify the CSV format, we can do so using the configuration options of COPY.
CREATE TABLE ontime (flightdate DATE, uniquecarrier VARCHAR, origincityname VARCHAR, destcityname VARCHAR);
COPY ontime FROM 'flights.csv' (DELIMITER '|', HEADER);
SELECT * FROM ontime;
DuckDB supports reading erroneous CSV files. For details, see the Reading Faulty CSV Files page.
Limitations
The CSV reader only supports input files using UTF‑8 character encoding. For CSV files using different encodings, use e.g. the iconv
command‑line tool to convert them to UTF‑8.
CSV Auto Detection
When using read_csv, the system tries to automatically infer how to read the CSV file using the CSV sniffer. This step is necessary because
CSV files are not self‑describing and come in many different dialects. The auto‑detection works roughly as follows:
• Detect the dialect of the CSV file (delimiter, quoting rule, escape)
• Detect the types of each of the columns
• Detect whether or not the file has a header row
By default the system will try to auto‑detect all options. However, options can be individually overridden by the user. This can be useful in
case the system makes a mistake. For example, if the delimiter is chosen incorrectly, we can override it by calling the read_csv with an
explicit delimiter (e.g., read_csv('file.csv', delim = '|')).
The detection works by operating on a sample of the file. The size of the sample can be modified by setting the sample_size parameter.
The default sample size is 20480 rows. Setting the sample_size parameter to -1 means the entire file is read for sampling. The way
sampling is performed depends on the type of file. If we are reading from a regular file on disk, we will jump into the file and try to sample
from different locations in the file. If we are reading from a file in which we cannot jump ‑ such as a .gz compressed CSV file or stdin ‑
samples are taken only from the beginning of the file.
sniff_csv Function
It is possible to run the CSV sniffer as a separate step using the sniff_csv(filename) function, which returns the detected CSV
properties as a table with a single row. The sniff_csv function accepts an optional sample_size parameter to configure the number of
rows sampled.
FROM sniff_csv('my_file.csv');
FROM sniff_csv('my_file.csv', sample_size = 1000);
The returned row contains the following columns (example values shown):
Delimiter: the detected delimiter, e.g., ,
Quote: the quote character, e.g., "
Escape: the escape character, e.g., \
NewLineDelimiter: the new-line delimiter, e.g., \r\n
SkipRow: the number of rows skipped, e.g., 1
HasHeader: whether the CSV has a header, e.g., true
Columns: the column types encoded as a LIST of STRUCTs, e.g., ({'name': 'VARCHAR', 'age': 'BIGINT'})
DateFormat: the date format, e.g., %d/%m/%Y
TimestampFormat: the timestamp format, e.g., %Y-%m-%dT%H:%M:%S.%f
UserArguments: the arguments used to invoke sniff_csv, e.g., sample_size = 1000
Prompt: a SQL prompt ready to be used to read the CSV, e.g., FROM read_csv('my_file.csv', auto_detect=false, delim=',', ...)
Prompt The Prompt column contains a SQL command with the configurations detected by the sniffer.
Detection Steps
Dialect Detection Dialect detection works by attempting to parse the samples using the set of considered values. The detected dialect
is the dialect that has (1) a consistent number of columns for each row, and (2) the highest number of columns for each row.
The following dialect candidates are considered:
delim: , | ; \t
quote: " ' (empty)
escape: " ' \ (empty)
Consider the following example file:
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York, NY|Los Angeles, CA
1988-01-02|AA|New York, NY|Los Angeles, CA
1988-01-03|AA|New York, NY|Los Angeles, CA
In this example, the system selects the | as the delimiter. All rows are split into the same number of columns, and there is more than one
column per row, meaning the delimiter was actually found in the CSV file.
Type Detection After detecting the dialect, the system will attempt to figure out the types of each of the columns. Note that this step
is only performed if we are calling read_csv. In case of the COPY statement the types of the table that we are copying into will be used
instead.
The type detection works by attempting to convert the values in each column to the candidate types. If the conversion is unsuccessful, the
candidate type is removed from the set of candidate types for that column. After all samples have been handled ‑ the remaining candidate
type with the highest priority is chosen. The set of considered candidate types in order of priority is given below:
Types
BOOLEAN
BIGINT
DOUBLE
TIME
DATE
TIMESTAMP
VARCHAR
Note everything can be cast to VARCHAR. This type has the lowest priority ‑ i.e., columns are converted to VARCHAR if they cannot be cast
to anything else. In flights.csv the FlightDate column will be cast to a DATE, while the other columns will be cast to VARCHAR.
The detected types can be individually overridden using the types option. This option either takes a list of types (e.g., types = [INT,
VARCHAR, DATE]), which overrides the types of the columns in order of occurrence in the CSV file, or a name -> type map, which
overrides the types of individual columns (e.g., types = {'quarter': INT}).
The type detection can be entirely disabled by using the all_varchar option. If this is set all columns will remain as VARCHAR (as they
originally occur in the CSV file).
Header Detection
Header detection works by checking if the candidate header row deviates from the other rows in the file in terms of types. For example,
in flights.csv, we can see that the header row consists of only VARCHAR columns ‑ whereas the values contain a DATE value for the
FlightDate column. As such ‑ the system defines the first row as the header row and extracts the column names from the header row.
In files that do not have a header row, the column names are generated as column0, column1, etc.
Note that headers cannot be detected correctly if all columns are of type VARCHAR ‑ as in this case the system cannot distinguish the header
row from the other rows in the file. In this case the system assumes the file has no header. This can be overridden using the header
option.
Dates and Timestamps DuckDB supports the ISO 8601 format by default for timestamps, dates and times. Unfortunately, not all
dates and times are formatted using this standard. For that reason, the CSV reader also supports the dateformat and timestampformat
options. Using these options, the user can specify a format string that determines how the date or timestamp should be read.
As part of the auto‑detection, the system tries to figure out if dates and times are stored in a different representation. This is not always
possible ‑ as there are ambiguities in the representation. For example, the date 01-02-2000 can be parsed as either January 2nd or
February 1st. Often these ambiguities can be resolved. For example, if we later encounter the date 21-02-2000 then we know that the
format must have been DD-MM-YYYY. MM-DD-YYYY is no longer possible as there is no 21st month.
If the ambiguities cannot be resolved by looking at the data, the system has a list of preferences for which date format to use. If the system
chooses incorrectly, the user can specify the dateformat and timestampformat options manually.
The system considers the following formats for dates (dateformat). Higher entries are chosen over lower entries in case of ambiguities
(i.e., ISO 8601 is preferred over MM-DD-YYYY).
dateformat
ISO 8601
%y-%m-%d
%Y-%m-%d
%d-%m-%y
%d-%m-%Y
%m-%d-%y
%m-%d-%Y
The system considers the following formats for timestamps (timestampformat). Higher entries are chosen over lower entries in case of
ambiguities.
timestampformat
ISO 8601
%y-%m-%d %H:%M:%S
%Y-%m-%d %H:%M:%S
%d-%m-%y %H:%M:%S
%d-%m-%Y %H:%M:%S
%m-%d-%y %I:%M:%S %p
%m-%d-%Y %I:%M:%S %p
%Y-%m-%d %H:%M:%S.%f
Reading Faulty CSV Files
Reading erroneous CSV files is possible by utilizing the ignore_errors option. With that option set, rows containing data that would
otherwise cause the CSV Parser to generate an error will be ignored.
For example, consider the following CSV file, faulty.csv:
Pedro,31
Oogie Boogie, three
If you read the CSV file, specifying that the first column is a VARCHAR and the second column is an INTEGER, loading the file would fail, as
the string three cannot be converted to an INTEGER.
However, with ignore_errors set, the second row of the file is skipped, outputting only the complete first row. For example:
FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
ignore_errors = true
);
Outputs:
name age
Pedro 31
One should note that the CSV Parser is affected by the projection pushdown optimization. Hence, if we were to select only the name column,
both rows would be considered valid, as the casting error on the age would never occur. For example:
SELECT name
FROM read_csv('faulty.csv', columns = {'name': 'VARCHAR', 'age': 'INTEGER'});
Outputs:
name
Pedro
Oogie Boogie
Being able to read faulty CSV files is important, but for many data cleaning operations, it is also necessary to know exactly which lines are
corrupted and what errors the parser discovered on them. For scenarios like these, it is possible to use DuckDB's CSV Rejects Table feature.
It is important to note that the Rejects Table can only be used when ignore_errors is set and that it currently only stores casting errors;
it does not save errors when the number of columns differs.
Parameters
The parameters listed below are used in the read_csv function to configure the CSV Rejects Table.
rejects_table (VARCHAR, default empty): Name of a temporary table where the information of the faulty lines of a CSV file is stored.
rejects_limit (BIGINT, default 0): Upper limit on the number of faulty records from a CSV file that will be recorded in the rejects table. 0 is used when no limit should be applied.
rejects_recovery_columns (VARCHAR[], default empty): Column values that serve as a primary key to the CSV file. They are stored in the CSV Rejects Table to help identify the faulty tuples.
To store the information of the faulty CSV lines in a rejects table, the user must simply provide the rejects table name in the
rejects_table option. For example:
FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
rejects_table = 'rejects_table',
ignore_errors = true
);
You can then query the rejects_table table to retrieve information about the rejected tuples. For example:
FROM rejects_table;
Outputs:
file line column column_name parsed_value error
faulty.csv 2 1 age three Could not convert string ' three' to 'INTEGER'
Additionally, the name column could also be provided as a primary key via the rejects_recovery_columns option to provide more
information about the faulty lines. For example:
FROM read_csv(
'faulty.csv',
columns = {'name': 'VARCHAR', 'age': 'INTEGER'},
rejects_table = 'rejects_table',
rejects_recovery_columns = '[name]',
ignore_errors = true
);
file line column column_name parsed_value recovery_columns error
faulty.csv 2 1 age three {'name': 'Oogie Boogie'} Could not convert string ' three' to 'INTEGER'
CSV Import Tips
Below is a collection of tips to help when attempting to import complex CSV files. In the examples, we use the flights.csv file.
Override the Header Flag if the Header Is Not Correctly Detected If a file contains only string columns the header auto‑detection
might fail. Provide the header option to override this behavior.
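For example:

SELECT * FROM read_csv('flights.csv', header = true);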
Provide Names if the File Does Not Contain a Header If the file does not contain a header, names will be auto‑generated by default.
You can provide your own names with the names option.
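For example (the names are illustrative):

SELECT * FROM read_csv('flights.csv', names = ['DateOfFlight', 'CarrierName', 'OriginCity', 'DestinationCity']);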
Override the Types of Specific Columns The types flag can be used to override types of only certain columns by providing a struct of
name -> type mappings.
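For example:

SELECT * FROM read_csv('flights.csv', types = {'FlightDate': 'DATE'});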
Use COPY When Loading Data into a Table The COPY statement copies data directly into a table. The CSV reader uses the schema of
the table instead of auto‑detecting types from the file. This speeds up the auto‑detection, and prevents mistakes from being made during
auto‑detection.
Use union_by_name When Loading Files with Different Schemas The union_by_name option can be used to unify the schema of
files that have different or missing columns. For files that do not have certain columns, NULL values are filled in.
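For example:

SELECT * FROM read_csv('flights*.csv', union_by_name = true);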
JSON Files
JSON Loading
Examples
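The following sketches illustrate common usage (file names are illustrative):

-- read a JSON file, auto-inferring the format and column types
SELECT * FROM read_json_auto('todos.json');
-- read a newline-delimited JSON file with explicitly specified columns
SELECT * FROM read_json('todos.json',
    format = 'newline_delimited',
    columns = {userId: 'UBIGINT', completed: 'BOOLEAN'});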
JSON Loading
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects
consisting of attribute-value pairs and arrays (or other serializable values). While it is not a very efficient format for tabular data, it is very
commonly used, especially as a data interchange format.
The DuckDB JSON reader can automatically infer which configuration flags to use by analyzing the JSON file. This will work correctly in most
situations, and should be the first option attempted. In rare situations where the JSON reader cannot figure out the correct configuration,
it is possible to manually configure the JSON reader to correctly parse the JSON file.
Parameters
auto_detect (BOOL, default false): Whether to auto-detect the names of the keys and the data types of the values automatically.
columns (STRUCT, default empty): A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled these will be inferred.
compression (VARCHAR, default 'auto'): The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are 'none', 'gzip', 'zstd', and 'auto'.
convert_strings_to_integers (BOOL, default false): Whether strings representing integer values should be converted to a numerical type.
dateformat (VARCHAR, default 'iso'): Specifies the date format to use when parsing dates. See Date Format.
filename (BOOL, default false): Whether or not an extra filename column should be included in the result.
format (VARCHAR, default 'array'): Can be one of ['auto', 'unstructured', 'newline_delimited', 'array'].
hive_partitioning (BOOL, default false): Whether or not to interpret the path as a Hive partitioned path.
ignore_errors (BOOL, default false): Whether to ignore parse errors (only possible when format is 'newline_delimited').
maximum_depth (BIGINT, default -1): Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types.
maximum_object_size (UINTEGER, default 16777216): The maximum size of a JSON object (in bytes).
records (VARCHAR, default 'records'): Can be one of ['auto', 'true', 'false'].
sample_size (UBIGINT, default 20480): Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file.
timestampformat (VARCHAR, default 'iso'): Specifies the format to use when parsing timestamps. See Date Format.
union_by_name (BOOL, default false): Whether the schemas of multiple JSON files should be unified.
The JSON extension can attempt to determine the format of a JSON file when setting format to auto. Here are some example JSON files
and the corresponding format settings that should be used.
In each of the below cases, the format setting was not needed, as DuckDB was able to infer it correctly, but it is included for illustrative
purposes. A query of this shape would work in each case:
SELECT *
FROM 'filename.json';
Format: newline_delimited With format = 'newline_delimited', newline-delimited JSON can be parsed. Each line contains a
single JSON object, as in the following records.json file:

{"key1":"value1", "key2": "value1"}
{"key1":"value2", "key2": "value2"}
{"key1":"value3", "key2": "value3"}
SELECT *
FROM read_json_auto('records.json', format = 'newline_delimited');
key1 key2
value1 value1
value2 value2
value3 value3
Format: array If the JSON file contains a JSON array of objects (pretty-printed or not), format = 'array' may be used.
[
{"key1":"value1", "key2": "value1"},
{"key1":"value2", "key2": "value2"},
{"key1":"value3", "key2": "value3"}
]
SELECT *
FROM read_json_auto('array.json', format = 'array');
key1 key2
value1 value1
value2 value2
value3 value3
Format: unstructured If the JSON file contains JSON that is not newline-delimited or an array, format = 'unstructured' may be used.
{
"key1":"value1",
"key2":"value1"
}
{
"key1":"value2",
"key2":"value2"
}
{
"key1":"value3",
"key2":"value3"
}
SELECT *
FROM read_json_auto('unstructured.json', format = 'unstructured');
key1 key2
value1 value1
value2 value2
value3 value3
The JSON extension can attempt to determine whether a JSON file contains records when setting records = auto. When records =
true, the JSON extension expects JSON objects, and will unpack the fields of JSON objects into individual columns.
SELECT *
FROM read_json_auto('records.json', records = true);
key1 key2
value1 value1
value2 value2
value3 value3
When records = false, the JSON extension will not unpack the top‑level objects, and create STRUCTs instead:
SELECT *
FROM read_json_auto('records.json', records = false);
json
{'key1': value1, 'key2': value1}
{'key1': value2, 'key2': value2}
{'key1': value3, 'key2': value3}

When the file contains non-record JSON instead, for example an arrays.json file consisting of JSON arrays:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]

reading with records = false returns each array as a single value:

SELECT *
FROM read_json_auto('arrays.json', records = false);

json
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
Writing
The contents of tables or the result of queries can be written directly to a JSON file using the COPY statement. See the COPY documentation
for more information.
read_json_auto Function
The read_json_auto function is the simplest method of loading JSON files: it automatically attempts to figure out the correct configuration of
the JSON reader. It also automatically deduces types of columns.
SELECT *
FROM read_json_auto('todos.json')
LIMIT 5;
The path can either be a relative path (relative to the current working directory) or an absolute path.
If we specify the columns, we can bypass the automatic detection. Note that not all columns need to be specified:
SELECT *
FROM read_json_auto('todos.json',
columns = {userId: 'UBIGINT',
completed: 'BOOLEAN'});
Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
COPY Statement
The COPY statement can be used to load data from a JSON file into a table. For the COPY statement, we must first create a table with the
correct schema to load the data into. We then specify the JSON file to load from plus any configuration options separately.
CREATE TABLE todos (userId UBIGINT, id UBIGINT, title VARCHAR, completed BOOLEAN);
COPY todos FROM 'todos.json';
SELECT * FROM todos LIMIT 5;
Multiple Files
Reading Multiple Files
DuckDB can read multiple files of different types (CSV, Parquet, JSON files) at the same time using either the glob syntax, or by providing a
list of files to read. See the combining schemas page for tips on reading files with different schemas.
CSV
-- read all files with a name ending in ".csv" in the folder "dir"
SELECT * FROM 'dir/*.csv';
-- read all files with a name ending in ".csv", two directories deep
SELECT * FROM '*/*/*.csv';
-- read all files with a name ending in ".csv", at any depth in the folder "dir"
SELECT * FROM 'dir/**/*.csv';
-- read the CSV files 'flights1.csv' and 'flights2.csv'
SELECT * FROM read_csv(['flights1.csv', 'flights2.csv']);
-- read the CSV files 'flights1.csv' and 'flights2.csv', unifying schemas by name and outputting a `filename` column
SELECT * FROM read_csv(['flights1.csv', 'flights2.csv'], union_by_name = true, filename = true);
Parquet
DuckDB can also read a series of Parquet files and treat them as if they were a single table. Note that this only works if the Parquet files
have the same schema. You can specify which Parquet files you want to read using a list parameter, glob pattern matching syntax, or a
combination of both.
List Parameter The read_parquet function can accept a list of filenames as the input parameter.
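For example (the file names are illustrative):

SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet']);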
Glob Syntax Any file name input to the read_parquet function can either be an exact filename, or use a glob syntax to read multiple files
that match a pattern.
Wildcard Description
* matches any number of any characters (including none)
** matches any number of subdirectories (including none)
? matches any single character
[abc] matches one character given in the bracket
[a-z] matches one character from the range given in the bracket
Note that the ? wildcard in globs is not supported for reads over S3 due to HTTP encoding issues.
Here is an example that reads all the files that end with .parquet located in the test folder:
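SELECT * FROM read_parquet('test/*.parquet');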
List of Globs The glob syntax and the list input parameter can be combined to scan files that meet one of multiple patterns.
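For example (the folder names are illustrative):

SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);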
Filename
The filename argument can be used to add an extra filename column to the result that indicates which row came from which file. For
example:
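SELECT * FROM read_parquet('test/*.parquet', filename = true);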
The glob pattern matching syntax can also be used to search for filenames using the glob table function. It accepts one parameter: the
path to search (which may include glob patterns).
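For example, listing the files in the current directory (the output below reflects an illustrative directory):

SELECT * FROM glob('*');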
file
duckdb.exe
test.csv
test.json
test.parquet
test2.csv
test2.parquet
todos.json
Combining Schemas
Examples
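For example (the file names are illustrative):

-- read a set of CSV files, combining columns by position
SELECT * FROM read_csv('flights*.csv');
-- read a set of CSV files, combining columns by name
SELECT * FROM read_csv('flights*.csv', union_by_name = true);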
Combining Schemas
When reading from multiple files, we have to combine schemas from those files. That is because each file has its own schema that can
differ from the other files. DuckDB offers two ways of unifying schemas of multiple files: by column position and by column name.
By default, DuckDB reads the schema of the first file provided, and then unifies columns in subsequent files by column position. This works
correctly as long as all files have the same schema. If the schema of the files differs, you might want to use the union_by_name option
to allow DuckDB to construct the schema by reading all of the names instead.
Union by Position
By default, DuckDB unifies the columns of these different files by position. This means that the first column in each file is combined
together, as well as the second column in each file, etc. For example, consider the following two files.
flights1.csv:
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York, NY|Los Angeles, CA
1988-01-02|AA|New York, NY|Los Angeles, CA
flights2.csv:
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-03|AA|New York, NY|Los Angeles, CA
Reading the two files at the same time will produce the following result set:
FlightDate UniqueCarrier OriginCityName DestCityName
1988-01-01 AA New York, NY Los Angeles, CA
1988-01-02 AA New York, NY Los Angeles, CA
1988-01-03 AA New York, NY Los Angeles, CA
Union by Name
If you are processing multiple files that have different schemas, perhaps because columns have been added or renamed, it might be
desirable to unify the columns of different files by name instead. This can be done by providing the union_by_name option. For example,
consider the following two files, where flights4.csv has an extra column (UniqueCarrier).
flights3.csv:
FlightDate|OriginCityName|DestCityName
1988-01-01|New York, NY|Los Angeles, CA
1988-01-02|New York, NY|Los Angeles, CA
flights4.csv:
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-03|AA|New York, NY|Los Angeles, CA
Reading these when unifying column names by position results in an error ‑ as the two files have a different number of columns. When
specifying the union_by_name option, the columns are correctly unified, and any missing values are set to NULL.
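For example:

SELECT * FROM read_csv(['flights3.csv', 'flights4.csv'], union_by_name = true);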
Parquet Files
Reading and Writing Parquet Files
Examples
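-- read a single Parquet file (illustrative file name)
SELECT * FROM 'test.parquet';
-- figure out which columns and types are contained within a Parquet file
DESCRIBE SELECT * FROM 'test.parquet';
-- create a table from a Parquet file
CREATE TABLE test AS SELECT * FROM 'test.parquet';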
-- write the results of a query to a Parquet file using the default compression (Snappy)
COPY
(SELECT * FROM tbl)
TO 'result-snappy.parquet'
(FORMAT 'parquet');
-- write the results from a query to a Parquet file with specific compression and row group size
COPY
(FROM generate_series(100_000))
TO 'test.parquet'
(FORMAT 'parquet', COMPRESSION 'zstd', ROW_GROUP_SIZE 100_000);
Parquet Files
Parquet files are compressed columnar files that are efficient to load and process. DuckDB provides support for both reading and writing
Parquet files in an efficient manner, as well as support for pushing filters and projections into the Parquet file scans.
Note. Parquet data sets differ based on the number of files, the size of individual files, the compression algorithm used, row group
size, etc. These have a significant effect on performance. Please consult the Performance Guide for details.
read_parquet Function
If your file ends in .parquet, the function syntax is optional. The system will automatically infer that you are reading a Parquet file.
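For example, the following two queries are equivalent (the file name is illustrative):

SELECT * FROM 'test.parquet';
SELECT * FROM read_parquet('test.parquet');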
Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
Parameters There are a number of options exposed that can be passed to the read_parquet function or the COPY statement.
binary_as_string (BOOL, default false): Parquet files generated by legacy writers do not correctly set the UTF8 flag for strings, causing string columns to be loaded as BLOB instead. Set this to true to load binary columns as strings.
encryption_config (STRUCT, no default): Configuration for Parquet encryption.
filename (BOOL, default false): Whether or not an extra filename column should be included in the result.
file_row_number (BOOL, default false): Whether or not to include the file_row_number column.
hive_partitioning (BOOL, default false): Whether or not to interpret the path as a Hive partitioned path.
union_by_name (BOOL, default false): Whether the columns of multiple schemas should be unified by name, rather than by position.
Partial Reading
DuckDB supports projection pushdown into the Parquet file itself. That is to say, when querying a Parquet file, only the columns required
for the query are read. This allows you to read only the part of the Parquet file that you are interested in. This will be done automatically
by DuckDB.
DuckDB also supports filter pushdown into the Parquet reader. When you apply a filter to a column that is scanned from a Parquet file,
the filter will be pushed down into the scan, and can even be used to skip parts of the file using the built‑in zonemaps. Note that this will
depend on whether or not your Parquet file contains zonemaps.
Filter and projection pushdown provide significant performance benefits. See our blog post on this for more information.
You can also insert the data into a table or create a table from the Parquet file directly. This will load the data from the Parquet file and
insert it into the database.
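For example (the table names are illustrative):

-- create a new table from the contents of a Parquet file
CREATE TABLE new_tbl AS SELECT * FROM read_parquet('test.parquet');
-- insert the contents of a Parquet file into an existing table
INSERT INTO existing_tbl SELECT * FROM read_parquet('test.parquet');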
If you wish to keep the data stored inside the Parquet file, but want to query the Parquet file directly, you can create a view over the
read_parquet function. You can then query the Parquet file as if it were a built-in table.
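For example:

CREATE VIEW parquet_view AS SELECT * FROM read_parquet('test.parquet');
SELECT * FROM parquet_view;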
DuckDB also has support for writing to Parquet files using the COPY statement syntax. See the COPY Statement page for details, including
all possible parameters for the COPY statement.
-- write a query to a Parquet file with ZSTD compression (same as CODEC) and row_group_size
COPY
(FROM generate_series(100_000))
TO 'row-groups-zstd.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100_000);
DuckDB's EXPORT command can be used to export an entire database to a series of Parquet files. See the Export statement documentation
for more details.
Encryption
DuckDB supports reading and writing encrypted Parquet files; see the Parquet Encryption section for details.
Installing the Extension
The support for Parquet files is enabled via extension. The parquet extension is bundled with almost all clients. However, if your client
does not bundle the parquet extension, the extension must be installed and loaded separately.
INSTALL parquet;
LOAD parquet;
Parquet Metadata
The parquet_metadata function can be used to query the metadata contained within a Parquet file, which reveals various internal
details of the Parquet file such as the statistics of the different columns. This can be useful for figuring out what kind of skipping is possible
in Parquet files, or even to obtain a quick overview of what the different columns contain.
SELECT *
FROM parquet_metadata('test.parquet');
Field Type
file_name VARCHAR
row_group_id BIGINT
row_group_num_rows BIGINT
row_group_num_columns BIGINT
row_group_bytes BIGINT
column_id BIGINT
file_offset BIGINT
num_values BIGINT
path_in_schema VARCHAR
type VARCHAR
stats_min VARCHAR
stats_max VARCHAR
stats_null_count BIGINT
stats_distinct_count BIGINT
stats_min_value VARCHAR
stats_max_value VARCHAR
compression VARCHAR
encodings VARCHAR
index_page_offset BIGINT
dictionary_page_offset BIGINT
data_page_offset BIGINT
total_compressed_size BIGINT
total_uncompressed_size BIGINT
key_value_metadata MAP(BLOB, BLOB)
Parquet Schema
The parquet_schema function can be used to query the internal schema contained within a Parquet file. Note that this is the schema
as it is contained within the metadata of the Parquet file. If you want to figure out the column names and types contained within a Parquet
file it is easier to use DESCRIBE.
Field Type
file_name VARCHAR
name VARCHAR
type VARCHAR
type_length VARCHAR
repetition_type VARCHAR
num_children BIGINT
converted_type VARCHAR
scale BIGINT
precision BIGINT
field_id BIGINT
logical_type VARCHAR
The parquet_file_metadata function can be used to query file-level metadata such as the format version and the encryption
algorithm used.
SELECT *
FROM parquet_file_metadata('test.parquet');
Field Type
file_name VARCHAR
created_by VARCHAR
num_rows BIGINT
num_row_groups BIGINT
format_version BIGINT
encryption_algorithm VARCHAR
footer_signing_key_metadata VARCHAR
The parquet_kv_metadata function can be used to query custom metadata defined as key‑value pairs.
SELECT *
FROM parquet_kv_metadata('test.parquet');
Field Type
file_name VARCHAR
key BLOB
value BLOB
Parquet Encryption
Starting with version 0.10.0, DuckDB supports reading and writing encrypted Parquet files. DuckDB broadly follows the Parquet Modular
Encryption specification with some limitations.
Using the PRAGMA add_parquet_key function, named encryption keys of 128, 192, or 256 bits can be added to a session. These keys
are stored in‑memory.
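For example, the following sketch registers a hypothetical 256-bit key named key256 (the key string is illustrative):

PRAGMA add_parquet_key('key256', '01234567890123456789012345678901');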
Writing Encrypted Parquet Files After specifying the key (e.g., key256), files can be encrypted as follows:
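-- sketch: encrypt a table tbl when writing (the table name is illustrative)
COPY tbl TO 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});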
Reading Encrypted Parquet Files An encrypted Parquet file using a specific key (e.g., key256) can then be read as follows:
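-- sketch: read an encrypted Parquet file into a table (the table name is illustrative)
COPY tbl FROM 'tbl.parquet' (ENCRYPTION_CONFIG {footer_key: 'key256'});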
Or:
SELECT *
FROM read_parquet('tbl.parquet', encryption_config = {footer_key: 'key256'});
Limitations
1. It is not compatible with the encryption of, e.g., PyArrow, until the missing details are implemented.
2. DuckDB encrypts the footer and all columns using the footer_key. The Parquet specification allows encryption of individual
columns with different keys, e.g.:
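-- sketch of per-column keys as allowed by the Parquet specification;
-- the exact option shape shown here is an assumption for illustration
COPY tbl TO 'tbl.parquet'
    (ENCRYPTION_CONFIG {footer_key: 'key256', column_keys: {key256: ['col0', 'col1']}});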
However, this is currently unsupported and will cause an error to be thrown.
Performance Implications
Note that encryption has some performance implications. Without encryption, reading/writing the lineitem table from TPC-H at SF1,
which is 6M rows and 15 columns, from/to a Parquet file takes 0.26 and 0.99 seconds, respectively. With encryption, this takes 0.64 and 2.21
seconds, both approximately 2.5× slower than the unencrypted version.
Parquet Tips
Use union_by_name When Loading Files with Different Schemas The union_by_name option can be used to unify the schema of
files that have different or missing columns. For files that do not have certain columns, NULL values are filled in.
SELECT *
FROM read_parquet('flights*.parquet', union_by_name = true);
Enabling PER_THREAD_OUTPUT If the final number of Parquet files is not important, writing one file per thread can significantly im‑
prove performance. Using a glob pattern upon read or a Hive partitioning structure are good ways to transparently handle multiple files.
COPY
(FROM generate_series(10_000_000))
TO 'test.parquet'
(FORMAT PARQUET, PER_THREAD_OUTPUT true);
Selecting a ROW_GROUP_SIZE The ROW_GROUP_SIZE parameter specifies the minimum number of rows in a Parquet row group, with
a minimum value equal to DuckDB's vector size (currently 2048, but adjustable when compiling DuckDB), and a default of 122,880. A Parquet
row group is a partition of rows, consisting of a column chunk for each column in the dataset.
Compression algorithms are only applied per row group, so the larger the row group size, the more opportunities to compress the data.
DuckDB can read Parquet row groups in parallel even within the same file and uses predicate pushdown to only scan the row groups whose
metadata ranges match the WHERE clause of the query. However there is some overhead associated with reading the metadata in each
group. A good approach would be to ensure that within each file, the total number of row groups is at least as large as the number of CPU
threads used to query that file. More row groups beyond the thread count would improve the speed of highly selective queries, but slow
down queries that must scan the whole file like aggregations.
Partitioning
Hive Partitioning
Examples
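For example (the orders data set is illustrative):

-- read data from a Hive partitioned data set
SELECT * FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = true);
-- write a table to a Hive partitioned data set of Parquet files
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));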
Hive Partitioning
Hive partitioning is a partitioning strategy that is used to split a table into multiple files based on partition keys. The files are organized
into folders. Within each folder, the partition key has a value that is determined by the name of the folder.
Below is an example of a Hive partitioned file hierarchy. The files are partitioned on two keys (year and month).
orders
├── year=2021
│ ├── month=1
│ │ ├── file1.parquet
│ │ └── file2.parquet
│ └── month=2
│ └── file3.parquet
└── year=2022
├── month=11
│ ├── file4.parquet
│ └── file5.parquet
└── month=12
└── file6.parquet
Files stored in this hierarchy can be read using the hive_partitioning flag.
SELECT *
FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = true);
When we specify the hive_partitioning flag, the values of the columns will be read from the directories.
Filter Pushdown Filters on the partition keys are automatically pushed down into the files. This way the system skips reading files that
are not necessary to answer a query. For example, consider the following query on the above dataset:
SELECT *
FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = true)
WHERE year = 2022 AND month = 11;
When executing this query, only the following files will be read:
orders
└── year=2022
└── month=11
├── file4.parquet
└── file5.parquet
Autodetection By default the system tries to infer if the provided files are in a Hive partitioned hierarchy. If so, the
hive_partitioning flag is enabled automatically. The autodetection will look at the names of the folders and search for a 'key' =
'value' pattern. This behaviour can be overridden by setting the hive_partitioning flag manually.
Hive Types hive_types is a way to specify the logical types of the hive partitions in a struct:
SELECT *
FROM read_parquet(
'dir/**/*.parquet',
hive_partitioning = true,
hive_types = {'release': DATE, 'orders': BIGINT}
);
hive_types will be autodetected for the following types: DATE, TIMESTAMP and BIGINT. To switch off the autodetection, the flag
hive_types_autocast = 0 can be set.
Partitioned Writes
Examples
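For example (the orders table is illustrative):

-- write a table to a Hive partitioned data set of Parquet files
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));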
Partitioned Writes
When the partition_by clause is specified for the COPY statement, the files are written in a Hive partitioned folder hierarchy. The target
is the name of the root directory (in the example above: orders). The files are written in‑order in the file hierarchy. Currently, one file is
written per thread to each directory.
orders
├── year=2021
│ ├── month=1
│ │ ├── data_1.parquet
│ │ └── data_2.parquet
│ └── month=2
│ └── data_1.parquet
└── year=2022
├── month=11
│ ├── data_1.parquet
│ └── data_2.parquet
└── month=12
└── data_1.parquet
The values of the partitions are automatically extracted from the data. Note that it can be very expensive to write many partitions as many
files will be created. The ideal partition count depends on how large your data set is.
Best practice: Writing data into many small partitions is expensive. It is generally recommended to have at least 100 MB of
data per partition.
Overwriting By default the partitioned write will not allow overwriting existing directories. Use the OVERWRITE_OR_IGNORE option
to allow overwriting an existing directory.
Filename Pattern By default, files will be named data_0.parquet or data_0.csv. With the flag FILENAME_PATTERN, a pattern
with {i} or {uuid} can be defined to create specific filenames:
-- write a table to a Hive partitioned data set of .parquet files, with an index in the filename
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE, FILENAME_PATTERN "orders_{i}");
-- write a table to a Hive partitioned data set of .parquet files, with unique filenames
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE, FILENAME_PATTERN "file_{uuid}");
Appender
The Appender can be used to load bulk data into a DuckDB database. It is currently available in the C, C++, Go, Java, and Rust APIs. The
Appender is tied to a connection, and will use the transaction context of that connection when appending. An Appender always appends
to a single table in the database file.
DuckDB db;
Connection con(db);
// create the table
con.Query("CREATE TABLE people (id INTEGER, name VARCHAR)");
// initialize the appender
Appender appender(con, "people");
The AppendRow function is the easiest way of appending data. It uses recursive templates to allow you to put all the values of a single row
within one function call, as follows:
appender.AppendRow(1, "Mark");
Rows can also be individually constructed using the BeginRow, EndRow and Append methods. This is done internally by AppendRow,
and hence has the same performance characteristics.
appender.BeginRow();
appender.Append<int32_t>(2);
appender.Append<string>("Hannes");
appender.EndRow();
Any values added to the appender are cached prior to being inserted into the database system for performance reasons. That means that,
while appending, the rows might not be immediately visible in the system. The cache is automatically flushed when the appender goes
out of scope or when appender.Close() is called. The cache can also be manually flushed using the appender.Flush() method.
After either Flush or Close is called, all the data has been written to the database system.
While numbers and strings are rather self‑explanatory, dates, times and timestamps require some explanation. They can be directly ap‑
pended using the methods provided by duckdb::Date, duckdb::Time or duckdb::Timestamp. They can also be appended using
the internal duckdb::Value type; however, this adds some additional overhead and should be avoided if possible.
If the appender encounters a PRIMARY KEY conflict or a UNIQUE constraint violation, it fails and returns the following error:
Constraint Error: PRIMARY KEY or UNIQUE constraint violated: duplicate key "..."
In this case, the entire append operation fails and no rows are inserted.
The Appender is also available in the following client APIs:
• C
• Go
• JDBC (Java)
• Rust
INSERT Statements
INSERT statements are the standard way of loading data into a relational database. When using INSERT statements, the values are
supplied row‑by‑row. While simple, there is significant overhead involved in parsing and processing individual INSERT statements. This
makes lots of individual row‑by‑row insertions very inefficient for bulk insertion.
Note. Best practice As a rule of thumb, avoid using lots of individual row‑by‑row INSERT statements when inserting more than a
few rows (i.e., avoid using INSERT statements as part of a loop). When bulk inserting data, try to maximize the amount of data that
is inserted per statement.
If you must use INSERT statements to load data in a loop, avoid executing the statements in auto‑commit mode. After every commit,
the database is required to sync the changes made to disk to ensure no data is lost. In auto‑commit mode every single statement will be
wrapped in a separate transaction, meaning fsync will be called for every statement. This is typically unnecessary when bulk loading and
will significantly slow down your program.
Note. If you absolutely must use INSERT statements in a loop to load data, wrap them in calls to BEGIN TRANSACTION and
COMMIT.
Syntax
A more detailed description together with a syntax diagram can be found on the page on the INSERT statement.
Client APIs
• C
• C++
• Go by marcboeker
• Java
• Julia
• Node.js
• Python
• R
• Rust
• WebAssembly/Wasm
• ADBC API
• ODBC API
There are also contributed third‑party DuckDB wrappers, which currently do not have an official documentation page:
• C# by Giorgi
• Common Lisp by ak‑coram
• Crystal by amauryt
• Ruby by suketa
• Zig by karlseguin
Overview
DuckDB implements a custom C API, modelled loosely on the SQLite C API. The API is contained in the duckdb.h header. Con‑
tinue to Startup & Shutdown to get started, or check out the Full API overview.
We also provide a SQLite API wrapper, which means that if your application is programmed against the SQLite C API, you can re‑link to
DuckDB and it should continue working. See the sqlite_api_wrapper folder in our source repository for more information.
Installation
The DuckDB C API can be installed as part of the libduckdb packages. Please see the installation page for details.
To use DuckDB, you must first initialize a duckdb_database handle using duckdb_open(). duckdb_open() takes as parameter the
database file to read and write from. The special value NULL (nullptr) can be used to create an in‑memory database. Note that for an
in‑memory database no data is persisted to disk (i.e., all data is lost when you exit the process).
With the duckdb_database handle, you can create one or many duckdb_connection using duckdb_connect(). While individual
connections are thread‑safe, they will be locked during querying. It is therefore recommended that each thread uses its own connection
to allow for the best parallel performance.
All duckdb_connections have to explicitly be disconnected with duckdb_disconnect() and the duckdb_database has to be
explicitly closed with duckdb_close() to avoid memory and file handle leaking.
Example
duckdb_database db;
duckdb_connection con;
if (duckdb_open(NULL, &db) == DuckDBError) {
    // handle error
}
if (duckdb_connect(db, &con) == DuckDBError) {
    // handle error
}
// run queries...
// cleanup
duckdb_disconnect(&con);
duckdb_close(&db);
API Reference
duckdb_open Creates a new database or opens an existing database file stored at the given path. If no path is given a new in‑memory
database is created instead. The instantiated database should be closed with 'duckdb_close'.
Syntax
duckdb_state duckdb_open(
const char *path,
duckdb_database *out_database
);
Parameters
• path
Path to the database file on disk, or nullptr or :memory: to open an in‑memory database.
• out_database
• returns
duckdb_open_ext Extended version of duckdb_open. Creates a new database or opens an existing database file stored at the given
path. The instantiated database should be closed with 'duckdb_close'.
Syntax
duckdb_state duckdb_open_ext(
const char *path,
duckdb_database *out_database,
duckdb_config config,
char **out_error
);
Parameters
• path
Path to the database file on disk, or nullptr or :memory: to open an in‑memory database.
• out_database
• config
• out_error
If set and the function returns DuckDBError, this will contain the reason why the start‑up failed. Note that the error must be freed using
duckdb_free.
• returns
duckdb_close Closes the specified database and de‑allocates all memory allocated for that database. This should be called after you
are done with any database allocated through duckdb_open or duckdb_open_ext. Note that failing to call duckdb_close (in case
of e.g., a program crash) will not cause data corruption. Still, it is recommended to always correctly close a database object after you are
done with it.
Syntax
void duckdb_close(
duckdb_database *database
);
Parameters
• database
duckdb_connect Opens a connection to a database. Connections are required to query the database, and store transactional state
associated with the connection. The instantiated connection should be closed using 'duckdb_disconnect'.
Syntax
duckdb_state duckdb_connect(
duckdb_database database,
duckdb_connection *out_connection
);
Parameters
• database
• out_connection
• returns
duckdb_interrupt Interrupt the running query of the given connection.
Syntax
void duckdb_interrupt(
duckdb_connection connection
);
Parameters
• connection
duckdb_query_progress Get the progress of the running query on the given connection.
Syntax
duckdb_query_progress_type duckdb_query_progress(
duckdb_connection connection
);
Parameters
• connection
• returns
duckdb_disconnect Closes the specified connection and de‑allocates all memory allocated for that connection.
Syntax
void duckdb_disconnect(
duckdb_connection *connection
);
Parameters
• connection
duckdb_library_version Returns the version of the linked DuckDB, with a version postfix for dev versions.
Usually used for developing C extensions that must return this for a compatibility check.
Syntax
const char *duckdb_library_version(
);
Configuration
Configuration options can be provided to change different settings of the database system. Note that many of these settings can be changed
later on using PRAGMA statements as well. The configuration object should be created, filled with values and passed to duckdb_open_
ext.
Example
duckdb_database db;
duckdb_config config;
// create a configuration object and set options on it
duckdb_create_config(&config);
duckdb_set_config(config, "access_mode", "READ_ONLY");
// open the database using the configuration, then destroy the configuration object
duckdb_open_ext(NULL, &db, config, NULL);
duckdb_destroy_config(&config);
// run queries, then clean up
duckdb_close(&db);
API Reference
duckdb_create_config Initializes an empty configuration object that can be used to provide start‑up options for the DuckDB in‑
stance through duckdb_open_ext. The duckdb_config must be destroyed using 'duckdb_destroy_config'
Syntax
duckdb_state duckdb_create_config(
duckdb_config *out_config
);
Parameters
• out_config
• returns
duckdb_config_count This returns the total number of configuration options available for usage with duckdb_get_config_flag.
This should not be called in a loop as it internally loops over all the options.
Syntax
size_t duckdb_config_count(
);
Parameters
• returns
duckdb_get_config_flag Obtains a human‑readable name and description of a specific configuration option. This can be used to
e.g. display configuration options. This will succeed unless index is out of range (i.e., >= duckdb_config_count).
Syntax
duckdb_state duckdb_get_config_flag(
size_t index,
const char **out_name,
const char **out_description
);
Parameters
• index
• out_name
• out_description
• returns
duckdb_set_config Sets the specified option for the specified configuration. The configuration option is indicated by name. To
obtain a list of config options, see duckdb_get_config_flag.
This can fail if either the name is invalid, or if the value provided for the option is invalid.
Syntax
duckdb_state duckdb_set_config(
duckdb_config config,
const char *name,
const char *option
);
Parameters
• config
• name
• option
• returns
duckdb_destroy_config Destroys the specified configuration object and de‑allocates all memory allocated for the object.
Syntax
void duckdb_destroy_config(
duckdb_config *config
);
Parameters
• config
Query
The duckdb_query method allows SQL queries to be run in DuckDB from C. This method takes two parameters, a (null‑terminated) SQL
query string and a duckdb_result result pointer. The result pointer may be NULL if the application is not interested in the result set
or if the query produces no result. After the result is consumed, the duckdb_destroy_result method should be used to clean up the
result.
Elements can be extracted from the duckdb_result object using a variety of methods. The duckdb_column_count and duckdb_
row_count methods can be used to extract the number of columns and the number of rows, respectively. duckdb_column_name and
duckdb_column_type can be used to extract the names and types of individual columns.
Example
duckdb_state state;
duckdb_result result;
// create a table
state = duckdb_query(con, "CREATE TABLE integers (i INTEGER, j INTEGER);", NULL);
if (state == DuckDBError) {
// handle error
}
// insert three rows into the table
state = duckdb_query(con, "INSERT INTO integers VALUES (3, 4), (5, 6), (7, NULL);", NULL);
if (state == DuckDBError) {
// handle error
}
// query rows again
state = duckdb_query(con, "SELECT * FROM integers", &result);
if (state == DuckDBError) {
// handle error
}
// handle the result
// ...
Value Extraction
Values can be extracted using either the duckdb_column_data/duckdb_nullmask_data functions, or using the duckdb_value
convenience functions. The duckdb_column_data/duckdb_nullmask_data functions directly hand you a pointer to the result
arrays in columnar format, and can therefore be very fast. The duckdb_value functions perform bounds‑ and type‑checking, and will
automatically cast values to the desired type. This makes them more convenient and easier to use, at the expense of being slower.
Note. For optimal performance, use duckdb_column_data and duckdb_nullmask_data to extract data from the query
result. The duckdb_value functions perform internal type‑checking, bounds‑checking and casting which makes them slower.
duckdb_value Below is an example that prints the above result to CSV format using the duckdb_value_varchar function. Note
that the function is generic: we do not need to know about the types of the individual result columns.
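A minimal sketch of such a loop, reusing the result object from the Query example above:
idx_t row_count = duckdb_row_count(&result);
idx_t column_count = duckdb_column_count(&result);
for (idx_t row = 0; row < row_count; row++) {
    for (idx_t col = 0; col < column_count; col++) {
        if (col > 0) {
            printf(",");
        }
        // convert any value to a string; the string must be freed with duckdb_free
        char *str_val = duckdb_value_varchar(&result, col, row);
        printf("%s", str_val);
        duckdb_free(str_val);
    }
    printf("\n");
}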
duckdb_column_data Below is an example that prints the above result to CSV format using the duckdb_column_data function.
Note that the function is NOT generic: we do need to know exactly what the types of the result columns are.
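A sketch of the same output via direct columnar access, assuming both columns of the integers table above are DUCKDB_TYPE_INTEGER:
int32_t *i_data = (int32_t *) duckdb_column_data(&result, 0);
int32_t *j_data = (int32_t *) duckdb_column_data(&result, 1);
bool *i_mask = duckdb_nullmask_data(&result, 0);
bool *j_mask = duckdb_nullmask_data(&result, 1);
idx_t row_count = duckdb_row_count(&result);
for (idx_t row = 0; row < row_count; row++) {
    if (i_mask[row]) { printf("NULL"); } else { printf("%d", i_data[row]); }
    printf(",");
    if (j_mask[row]) { printf("NULL"); } else { printf("%d", j_data[row]); }
    printf("\n");
}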
Note. Warning When using duckdb_column_data, be careful that the type matches exactly what you expect it to be. As the code
directly accesses an internal array, there is no type‑checking. Accessing a DUCKDB_TYPE_INTEGER column as if it was a DUCKDB_
TYPE_BIGINT column will provide unpredictable results!
API Reference
duckdb_query Executes a SQL query within a connection and stores the full (materialized) result in the out_result pointer. If the query
fails to execute, DuckDBError is returned and the error message can be retrieved by calling duckdb_result_error.
Note that after running duckdb_query, duckdb_destroy_result must be called on the result object even if the query fails, other‑
wise the error stored within the result will not be freed correctly.
Syntax
duckdb_state duckdb_query(
duckdb_connection connection,
const char *query,
duckdb_result *out_result
);
Parameters
• connection
• query
• out_result
• returns
duckdb_destroy_result Closes the result and de‑allocates all memory allocated for that result.
Syntax
void duckdb_destroy_result(
duckdb_result *result
);
Parameters
• result
duckdb_column_name Returns the column name of the specified column. The result should not need to be freed; the column names
will automatically be destroyed when the result is destroyed.
Syntax
const char *duckdb_column_name(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
Syntax
duckdb_type duckdb_column_type(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
duckdb_result_statement_type Returns the statement type of the statement that was executed
Syntax
duckdb_statement_type duckdb_result_statement_type(
duckdb_result result
);
Parameters
• result
• returns
Syntax
duckdb_logical_type duckdb_column_logical_type(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
Syntax
idx_t duckdb_column_count(
duckdb_result *result
);
Parameters
• result
• returns
Syntax
idx_t duckdb_row_count(
duckdb_result *result
);
Parameters
• result
• returns
duckdb_rows_changed Returns the number of rows changed by the query stored in the result. This is relevant only for IN‑
SERT/UPDATE/DELETE queries. For other queries the rows_changed will be 0.
Syntax
idx_t duckdb_rows_changed(
duckdb_result *result
);
Parameters
• result
• returns
duckdb_column_data Returns the data of a specific column of a result in columnar format. The function returns a dense array which
contains the result data. The exact type stored in the array depends on the corresponding duckdb_type (as provided by
duckdb_column_type). For the exact type by which the data should be accessed, see the comments in the types section or the
DUCKDB_TYPE enum.
For example, for a column of type DUCKDB_TYPE_INTEGER, rows can be accessed in the following manner:
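// sketch: direct access, assuming the first result column is DUCKDB_TYPE_INTEGER
int32_t *data = (int32_t *) duckdb_column_data(&result, 0);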
Syntax
void *duckdb_column_data(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
duckdb_nullmask_data Returns the nullmask of a specific column of a result in columnar format. The nullmask indicates for every row
whether or not the corresponding row is NULL. If a row is NULL, the values present in the array provided by duckdb_column_data
are undefined.
Syntax
bool *duckdb_nullmask_data(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
duckdb_result_error Returns the error message contained within the result. The error is only set if duckdb_query returns
DuckDBError.
The result of this function must not be freed. It will be cleaned up when duckdb_destroy_result is called.
Syntax
const char *duckdb_result_error(
duckdb_result *result
);
Parameters
• result
• returns
Data Chunks
Data chunks represent a horizontal slice of a table. They hold a number of vectors, each of which can hold up to VECTOR_SIZE rows. The
vector size can be obtained through the duckdb_vector_size function and is configurable, but is usually set to 2048.
Data chunks and vectors are what DuckDB uses natively to store and represent data. For this reason, the data chunk interface is the most
efficient way of interfacing with DuckDB. Be aware, however, that correctly interfacing with DuckDB using the data chunk API does require
knowledge of DuckDB's internal vector format.
The primary manner of interfacing with data chunks is by obtaining the internal vectors of the data chunk using the duckdb_data_
chunk_get_vector method, and subsequently using the duckdb_vector_get_data and duckdb_vector_get_validity
methods to read the internal data and the validity mask of the vector. For composite types (list and struct vectors), duckdb_list_
vector_get_child and duckdb_struct_vector_get_child should be used to read child vectors.
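For illustration, a minimal sketch that scans the first column of a chunk, assuming that column holds BIGINT values:
idx_t row_count = duckdb_data_chunk_get_size(chunk);
duckdb_vector vector = duckdb_data_chunk_get_vector(chunk, 0);
int64_t *data = (int64_t *) duckdb_vector_get_data(vector);
uint64_t *validity = duckdb_vector_get_validity(vector);
for (idx_t row = 0; row < row_count; row++) {
    if (duckdb_validity_row_is_valid(validity, row)) {
        printf("%lld\n", (long long) data[row]);
    } else {
        printf("NULL\n");
    }
}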
API Reference
Vector Interface
Syntax
duckdb_data_chunk duckdb_create_data_chunk(
duckdb_logical_type *types,
idx_t column_count
);
Parameters
• types
• column_count
• returns
duckdb_destroy_data_chunk Destroys the data chunk and de‑allocates all memory allocated for that chunk.
Syntax
void duckdb_destroy_data_chunk(
duckdb_data_chunk *chunk
);
Parameters
• chunk
duckdb_data_chunk_reset Resets a data chunk, clearing the validity masks and setting the cardinality of the data chunk to 0.
Syntax
void duckdb_data_chunk_reset(
duckdb_data_chunk chunk
);
Parameters
• chunk
Syntax
idx_t duckdb_data_chunk_get_column_count(
duckdb_data_chunk chunk
);
Parameters
• chunk
• returns
duckdb_data_chunk_get_vector Retrieves the vector at the specified column index in the data chunk.
The pointer to the vector is valid for as long as the chunk is alive. It does NOT need to be destroyed.
Syntax
duckdb_vector duckdb_data_chunk_get_vector(
duckdb_data_chunk chunk,
idx_t col_idx
);
Parameters
• chunk
• returns
The vector
Syntax
idx_t duckdb_data_chunk_get_size(
duckdb_data_chunk chunk
);
Parameters
• chunk
• returns
Syntax
void duckdb_data_chunk_set_size(
duckdb_data_chunk chunk,
idx_t size
);
Parameters
• chunk
• size
Syntax
duckdb_logical_type duckdb_vector_get_column_type(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_get_data Retrieves the data pointer of the vector. The data pointer can be used to read or write values from the vector. How to read or write values depends on the type of the vector.
Syntax
void *duckdb_vector_get_data(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_get_validity Retrieves the validity mask pointer of the vector. The validity mask is a bitset that signifies null‑ness within the data chunk. It is a series of uint64_t values, where each uint64_t value contains
validity for 64 tuples. The bit is set to 1 if the value is valid (i.e., not NULL) or 0 if the value is invalid (i.e., NULL). Validity of a specific row can be checked like this:
idx_t entry_idx = row_idx / 64;
idx_t idx_in_entry = row_idx % 64;
bool is_valid = validity_mask[entry_idx] & (1ULL << idx_in_entry);
Syntax
uint64_t *duckdb_vector_get_validity(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_ensure_validity_writable Ensures the validity mask of the vector is writable by allocating it if necessary. After this function is called, duckdb_vector_get_validity will ALWAYS return non‑NULL. This allows null values to be written to the
vector, regardless of whether a validity mask was present before.
Syntax
void duckdb_vector_ensure_validity_writable(
duckdb_vector vector
);
Parameters
• vector
Syntax
void duckdb_vector_assign_string_element(
duckdb_vector vector,
idx_t index,
const char *str
);
Parameters
• vector
• index
• str
duckdb_vector_assign_string_element_len Assigns a string element in the vector at the specified location. You may also
use this function to assign BLOBs.
Syntax
void duckdb_vector_assign_string_element_len(
duckdb_vector vector,
idx_t index,
const char *str,
idx_t str_len
);
Parameters
• vector
• index
• str
The string
• str_len
Syntax
duckdb_vector duckdb_list_vector_get_child(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
Syntax
idx_t duckdb_list_vector_get_size(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
duckdb_list_vector_set_size Sets the total size of the underlying child‑vector of a list vector.
Syntax
duckdb_state duckdb_list_vector_set_size(
duckdb_vector vector,
idx_t size
);
Parameters
• vector
• size
• returns
Syntax
duckdb_state duckdb_list_vector_reserve(
duckdb_vector vector,
idx_t required_capacity
);
Parameters
• vector
• required_capacity
• return
Syntax
duckdb_vector duckdb_struct_vector_get_child(
duckdb_vector vector,
idx_t index
);
Parameters
• vector
The vector
• index
• returns
duckdb_array_vector_get_child Retrieves the child vector of an array vector. The resulting vector is valid as long as the parent vector is valid. The resulting vector has the size of the parent vector multiplied by the
array size.
Syntax
duckdb_vector duckdb_array_vector_get_child(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
duckdb_validity_row_is_valid Returns whether or not a row is valid (i.e., not NULL) in the given validity mask.
Syntax
bool duckdb_validity_row_is_valid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
• returns
Syntax
void duckdb_validity_set_row_validity(
uint64_t *validity,
idx_t row,
bool valid
);
Parameters
• validity
• row
• valid
Syntax
void duckdb_validity_set_row_invalid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
Syntax
void duckdb_validity_set_row_valid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
Values
API Reference
duckdb_destroy_value Destroys the value and de‑allocates all memory allocated for that type.
Syntax
void duckdb_destroy_value(
duckdb_value *value
);
Parameters
• value
Syntax
duckdb_value duckdb_create_varchar(
const char *text
);
Parameters
• value
• returns
Syntax
duckdb_value duckdb_create_varchar_length(
const char *text,
idx_t length
);
Parameters
• value
The text
• length
• returns
Syntax
duckdb_value duckdb_create_int64(
int64_t val
);
Parameters
• value
• returns
Syntax
duckdb_value duckdb_create_struct_value(
duckdb_logical_type type,
duckdb_value *values
);
Parameters
• type
• values
• returns
duckdb_create_list_value Creates a list value from a type and an array of values of length value_count
Syntax
duckdb_value duckdb_create_list_value(
duckdb_logical_type type,
duckdb_value *values,
idx_t value_count
);
Parameters
• type
• values
• value_count
• returns
duckdb_create_array_value Creates a array value from a type and an array of values of length value_count
Syntax
duckdb_value duckdb_create_array_value(
duckdb_logical_type type,
duckdb_value *values,
idx_t value_count
);
Parameters
• type
• values
• value_count
• returns
duckdb_get_varchar Obtains a string representation of the given value. The result must be destroyed with duckdb_free.
Syntax
char *duckdb_get_varchar(
duckdb_value value
);
Parameters
• value
The value
• returns
Syntax
int64_t duckdb_get_int64(
duckdb_value value
);
Parameters
• value
The value
• returns
Types
DuckDB is a strongly typed database system. As such, every column has a single type specified. This type is constant over the entire column.
That is to say, a column that is labeled as an INTEGER column will only contain INTEGER values.
DuckDB also supports columns of composite types. For example, it is possible to define an array of integers (INT[]). It is also possible to
define types as arbitrary structs (ROW(i INTEGER, j VARCHAR)). For that reason, native DuckDB type objects are not mere enums,
but a class that can potentially be nested.
Types in the C API are modeled using an enum (duckdb_type) and a complex class (duckdb_logical_type). For most primitive
types, e.g., integers or varchars, the enum is sufficient. For more complex types, such as lists, structs or decimals, the logical type must be
used.
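For example, a minimal sketch that builds the logical type for a list of integers (INT[]):
// primitive type: the type enum alone is enough
duckdb_logical_type int_type = duckdb_create_logical_type(DUCKDB_TYPE_INTEGER);
// composite type: a list of integers (INT[]) requires a logical type
duckdb_logical_type list_type = duckdb_create_list_type(int_type);
// logical types must be destroyed explicitly
duckdb_destroy_logical_type(&list_type);
duckdb_destroy_logical_type(&int_type);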
Functions
The enum type of a column in the result can be obtained using the duckdb_column_type function. The logical type of a column can be
obtained using the duckdb_column_logical_type function.
duckdb_value The duckdb_value functions will auto‑cast values as required. For example, it is no problem to use duckdb_value_double on a column of type DUCKDB_TYPE_INTEGER. The value will be auto‑cast and returned as a double. Note that in certain
cases the cast may fail. For example, this can happen if we request a duckdb_value_int8 and the value does not fit within an int8
value. In this case, a default value will be returned (usually 0 or nullptr). The same default value will also be returned if the corresponding
value is NULL.
The duckdb_value_is_null function can be used to check if a specific value is NULL or not.
The exception to the auto‑cast rule is the duckdb_value_varchar_internal function. This function does not auto‑cast and only
works for VARCHAR columns. The reason this function exists is that the result does not need to be freed.
Note. duckdb_value_varchar and duckdb_value_blob require the result to be de‑allocated using duckdb_free.
duckdb_result_get_chunk The duckdb_result_get_chunk function can be used to read data chunks from a DuckDB result
set, and is the most efficient way of reading data from a DuckDB result using the C API. It is also the only way of reading data of certain types
from a DuckDB result. For example, the duckdb_value functions do not support structural reading of composite types (lists or structs)
or more complex types like enums and decimals.
For more information about data chunks, see the documentation on data chunks.
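A sketch of the typical consumption loop over a materialized result:
idx_t chunk_count = duckdb_result_chunk_count(result);
for (idx_t i = 0; i < chunk_count; i++) {
    duckdb_data_chunk chunk = duckdb_result_get_chunk(result, i);
    idx_t rows = duckdb_data_chunk_get_size(chunk);
    // ... read the vectors of the chunk as shown in the Data Chunks section ...
    duckdb_destroy_data_chunk(&chunk);
}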
API Reference
Date/Time/Timestamp Helpers
Hugeint Helpers
Decimal Helpers
duckdb_result_get_chunk Fetches a data chunk from the duckdb_result. This function should be called repeatedly until the result
is exhausted.
This function supersedes all duckdb_value functions, as well as the duckdb_column_data and duckdb_nullmask_data func‑
tions. It results in significantly better performance, and should be preferred in newer code‑bases.
If this function is used, none of the other result functions can be used and vice versa (i.e., this function cannot be mixed with the legacy
result functions).
Use duckdb_result_chunk_count to figure out how many chunks there are in the result.
Syntax
duckdb_data_chunk duckdb_result_get_chunk(
duckdb_result result,
idx_t chunk_index
);
Parameters
• result
• chunk_index
• returns
The resulting data chunk. Returns NULL if the chunk index is out of bounds.
Syntax
bool duckdb_result_is_streaming(
duckdb_result result
);
Parameters
• result
• returns
Syntax
idx_t duckdb_result_chunk_count(
duckdb_result result
);
Parameters
• result
• returns
Syntax
duckdb_result_type duckdb_result_return_type(
duckdb_result result
);
Parameters
• result
• returns
The return_type
duckdb_from_date Decompose a duckdb_date object into year, month and date (stored as duckdb_date_struct).
Syntax
duckdb_date_struct duckdb_from_date(
duckdb_date date
);
Parameters
• date
• returns
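For example (date_value here is an assumed duckdb_date read from a DUCKDB_TYPE_DATE column):
duckdb_date_struct parts = duckdb_from_date(date_value);
printf("%d-%02d-%02d\n", parts.year, parts.month, parts.day);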
Syntax
duckdb_date duckdb_to_date(
duckdb_date_struct date
);
Parameters
• date
• returns
Syntax
bool duckdb_is_finite_date(
duckdb_date date
);
Parameters
• date
• returns
duckdb_from_time Decompose a duckdb_time object into hour, minute, second and microsecond (stored as duckdb_time_
struct).
Syntax
duckdb_time_struct duckdb_from_time(
duckdb_time time
);
Parameters
• time
• returns
Syntax
duckdb_time_tz duckdb_create_time_tz(
int64_t micros,
int32_t offset
);
Parameters
• micros
• offset
• returns
duckdb_from_time_tz Decompose a duckdb_time_tz object into micros and a timezone offset. Use duckdb_from_time to further decompose the micros into hour, minute, second and microsecond.
Syntax
duckdb_time_tz_struct duckdb_from_time_tz(
duckdb_time_tz micros
);
Parameters
• micros
• out_micros
• out_offset
duckdb_to_time Re‑compose a duckdb_time from hour, minute, second and microsecond (duckdb_time_struct).
Syntax
duckdb_time duckdb_to_time(
duckdb_time_struct time
);
Parameters
• time
• returns
Syntax
duckdb_timestamp_struct duckdb_from_timestamp(
duckdb_timestamp ts
);
Parameters
• ts
• returns
Syntax
duckdb_timestamp duckdb_to_timestamp(
duckdb_timestamp_struct ts
);
Parameters
• ts
• returns
Syntax
bool duckdb_is_finite_timestamp(
duckdb_timestamp ts
);
Parameters
• ts
• returns
duckdb_hugeint_to_double Converts a duckdb_hugeint object (as obtained from a DUCKDB_TYPE_HUGEINT column) into a
double.
Syntax
double duckdb_hugeint_to_double(
duckdb_hugeint val
);
Parameters
• val
• returns
duckdb_double_to_hugeint Converts a double value to a duckdb_hugeint object. If the conversion fails because the double value is too big, the result will be 0.
Syntax
duckdb_hugeint duckdb_double_to_hugeint(
double val
);
Parameters
• val
• returns
duckdb_double_to_decimal Converts a double value to a duckdb_decimal object. If the conversion fails because the double value is too big, or the width/scale are invalid, the result will be 0.
Syntax
duckdb_decimal duckdb_double_to_decimal(
double val,
uint8_t width,
uint8_t scale
);
Parameters
• val
• returns
duckdb_decimal_to_double Converts a duckdb_decimal object (as obtained from a DUCKDB_TYPE_DECIMAL column) into a
double.
Syntax
double duckdb_decimal_to_double(
duckdb_decimal val
);
Parameters
• val
• returns
duckdb_create_logical_type Creates a duckdb_logical_type from a standard primitive type. The resulting type should be
destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_logical_type(
duckdb_type type
);
Parameters
• type
• returns
duckdb_logical_type_get_alias Returns the alias of a duckdb_logical_type, if one is set, else NULL. The result must be de‑
stroyed with duckdb_free.
Syntax
char *duckdb_logical_type_get_alias(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_create_list_type Creates a list type from its child type. The resulting type should be destroyed with duckdb_destroy_
logical_type.
Syntax
duckdb_logical_type duckdb_create_list_type(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_create_array_type Creates a array type from its child type. The resulting type should be destroyed with duckdb_
destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_array_type(
duckdb_logical_type type,
idx_t array_size
);
Parameters
• type
• array_size
• returns
duckdb_create_map_type Creates a map type from its key type and value type. The resulting type should be destroyed with
duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_map_type(
duckdb_logical_type key_type,
duckdb_logical_type value_type
);
Parameters
• key_type
• value_type
• returns
duckdb_create_union_type Creates a UNION type from the passed types array. The resulting type should be destroyed with
duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_union_type(
duckdb_logical_type *member_types,
const char **member_names,
idx_t member_count
);
Parameters
• member_types
• member_names
• member_count
• returns
duckdb_create_struct_type Creates a STRUCT type from the passed member name and type arrays. The resulting type should
be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_struct_type(
duckdb_logical_type *member_types,
const char **member_names,
idx_t member_count
);
Parameters
• member_types
• member_names
• member_count
• returns
duckdb_create_enum_type Creates an ENUM type from the passed member name array. The resulting type should be destroyed
with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_enum_type(
const char **member_names,
idx_t member_count
);
Parameters
• member_names
• member_count
• returns
duckdb_create_decimal_type Creates a duckdb_logical_type of type decimal with the specified width and scale. The re‑
sulting type should be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_decimal_type(
uint8_t width,
uint8_t scale
);
Parameters
• width
• scale
• returns
Syntax
duckdb_type duckdb_get_type_id(
duckdb_logical_type type
);
Parameters
• type
• returns
The type id
Syntax
uint8_t duckdb_decimal_width(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
uint8_t duckdb_decimal_scale(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
duckdb_type duckdb_decimal_internal_type(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
duckdb_type duckdb_enum_internal_type(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
uint32_t duckdb_enum_dictionary_size(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_enum_dictionary_value Retrieves the dictionary value at the specified position from the enum.
Syntax
char *duckdb_enum_dictionary_value(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The string value of the enum type. Must be freed with duckdb_free.
Syntax
duckdb_logical_type duckdb_list_type_child_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The child type of the list type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_array_type_child_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The child type of the array type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
idx_t duckdb_array_type_array_size(
duckdb_logical_type type
);
Parameters
• type
• returns
The fixed number of elements the values of this array type can store.
Syntax
duckdb_logical_type duckdb_map_type_key_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The key type of the map type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_map_type_value_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The value type of the map type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
idx_t duckdb_struct_type_child_count(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
char *duckdb_struct_type_child_name(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
duckdb_struct_type_child_type Retrieves the child type of the given struct type at the specified index.
Syntax
duckdb_logical_type duckdb_struct_type_child_type(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The child type of the struct type. Must be destroyed with duckdb_destroy_logical_type.
duckdb_union_type_member_count Returns the number of members that the union type has.
Syntax
idx_t duckdb_union_type_member_count(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
char *duckdb_union_type_member_name(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
duckdb_union_type_member_type Retrieves the child type of the given union member at the specified index.
Syntax
duckdb_logical_type duckdb_union_type_member_type(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The child type of the union member. Must be destroyed with duckdb_destroy_logical_type.
duckdb_destroy_logical_type Destroys the logical type and de‑allocates all memory allocated for that type.
Syntax
void duckdb_destroy_logical_type(
duckdb_logical_type *type
);
Parameters
• type
Prepared Statements
A prepared statement is a parameterized query. The query is prepared with question marks (?) or dollar symbols ($1) indicating the
parameters of the query. Values can then be bound to these parameters, after which the prepared statement can be executed using those
parameters. A single query can be prepared once and executed many times. Prepared statements are useful to:
• Easily supply parameters to functions while avoiding string concatenation and SQL injection attacks.
• Speed up queries that will be executed many times with different parameters.
DuckDB supports prepared statements in the C API with the duckdb_prepare method. The duckdb_bind family of functions is used
to supply values for subsequent execution of the prepared statement using duckdb_execute_prepared. After we are done with the
prepared statement it can be cleaned up using the duckdb_destroy_prepare method.
Example
duckdb_prepared_statement stmt;
duckdb_result result;
if (duckdb_prepare(con, "INSERT INTO integers VALUES ($1, $2)", &stmt) == DuckDBError) {
    // handle error
}
// bind values to the two parameters (parameter indices start at 1!)
duckdb_bind_int32(stmt, 1, 42);
duckdb_bind_int32(stmt, 2, 43);
// execute; the second argument may be NULL if no result set is needed
duckdb_execute_prepared(stmt, &result);
// clean up
duckdb_destroy_result(&result);
duckdb_destroy_prepare(&stmt);
After calling duckdb_prepare, the prepared statement parameters can be inspected using duckdb_nparams and duckdb_param_
type. In case the prepare fails, the error can be obtained through duckdb_prepare_error.
It is not required that the duckdb_bind family of functions matches the prepared statement parameter type exactly. The values will be
auto‑cast to the required type. For example, calling duckdb_bind_int8 on a parameter type of DUCKDB_TYPE_INTEGER
will work as expected.
Note. Warning Do not use prepared statements to insert large amounts of data into DuckDB. Instead it is recommended to use the
Appender.
API Reference
Note that after calling duckdb_prepare, the prepared statement should always be destroyed using duckdb_destroy_prepare,
even if the prepare fails.
If the prepare fails, duckdb_prepare_error can be called to obtain the reason why the prepare failed.
Syntax
duckdb_state duckdb_prepare(
duckdb_connection connection,
const char *query,
duckdb_prepared_statement *out_prepared_statement
);
Parameters
• connection
• query
• out_prepared_statement
• returns
duckdb_destroy_prepare Closes the prepared statement and de‑allocates all memory allocated for the statement.
Syntax
void duckdb_destroy_prepare(
duckdb_prepared_statement *prepared_statement
);
Parameters
• prepared_statement
duckdb_prepare_error Returns the error message associated with the given prepared statement. If the prepared statement has no
error message, this returns nullptr instead.
The error message should not be freed. It will be de‑allocated when duckdb_destroy_prepare is called.
Syntax
const char *duckdb_prepare_error(
duckdb_prepared_statement prepared_statement
);
Parameters
• prepared_statement
• returns
duckdb_nparams Returns the number of parameters that can be provided to the given prepared statement.
Syntax
idx_t duckdb_nparams(
duckdb_prepared_statement prepared_statement
);
Parameters
• prepared_statement
duckdb_parameter_name Returns the name used to identify the parameter. The returned string should be freed using duckdb_free.
Returns NULL if the index is out of range for the provided prepared statement.
Syntax
const char *duckdb_parameter_name(
duckdb_prepared_statement prepared_statement,
idx_t index
);
Parameters
• prepared_statement
The prepared statement to get the parameter name from.
duckdb_param_type Returns the parameter type for the parameter at the given index.
Returns DUCKDB_TYPE_INVALID if the parameter index is out of range or the statement was not successfully prepared.
Syntax
duckdb_type duckdb_param_type(
duckdb_prepared_statement prepared_statement,
idx_t param_idx
);
Parameters
• prepared_statement
• param_idx
• returns
Syntax
duckdb_state duckdb_clear_bindings(
duckdb_prepared_statement prepared_statement
);
Syntax
duckdb_statement_type duckdb_prepared_statement_type(
duckdb_prepared_statement statement
);
Parameters
• statement
• returns
Appender
Appenders are the most efficient way of loading data into DuckDB from within the C interface, and are recommended for fast data loading.
The appender is much faster than using prepared statements or individual INSERT INTO statements.
Appends are made in row‑wise format. For every column, a duckdb_append_[type] call should be made, after which the row should
be finished by calling duckdb_appender_end_row. After all rows have been appended, duckdb_appender_destroy should be
used to finalize the appender and clean up the resulting memory.
Note that duckdb_appender_destroy should always be called on the resulting appender, even if the function returns DuckDBEr-
ror.
Example
duckdb_appender appender;
if (duckdb_appender_create(con, NULL, "people", &appender) == DuckDBError) {
    // handle error
}
// append the first row (1, Mark)
duckdb_append_int32(appender, 1);
duckdb_append_varchar(appender, "Mark");
duckdb_appender_end_row(appender);
// finish appending: flush all rows to the table and de-allocate the appender
duckdb_appender_destroy(&appender);
API Reference
Syntax
duckdb_state duckdb_appender_create(
duckdb_connection connection,
const char *schema,
const char *table,
duckdb_appender *out_appender
);
Parameters
• connection
• schema
The schema of the table to append to, or nullptr for the default schema.
• table
• out_appender
• returns
duckdb_appender_column_count Returns the number of columns in the table that belongs to the appender.
Syntax
idx_t duckdb_appender_column_count(
duckdb_appender appender
);
Parameters
• appender
• returns
Syntax
duckdb_logical_type duckdb_appender_column_type(
duckdb_appender appender,
idx_t col_idx
);
Parameters
• appender
• col_idx
• returns
duckdb_appender_error Returns the error message associated with the given appender. If the appender has no error message, this
returns nullptr instead.
The error message should not be freed. It will be de‑allocated when duckdb_appender_destroy is called.
Syntax
const char *duckdb_appender_error(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_flush Flush the appender to the table, forcing the cache of the appender to be cleared and the data to be ap‑
pended to the base table.
This should generally not be used unless you know what you are doing. Instead, call duckdb_appender_destroy when you are done
with the appender.
Syntax
duckdb_state duckdb_appender_flush(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_close Close the appender, flushing all intermediate state in the appender to the table and closing it for further
appends.
Syntax
duckdb_state duckdb_appender_close(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_destroy Closes the appender and destroys it, flushing all intermediate state in the appender to the table and
de‑allocating all memory associated with the appender.
Syntax
duckdb_state duckdb_appender_destroy(
duckdb_appender *appender
);
Parameters
• appender
• returns
duckdb_appender_begin_row A no‑op function, provided for backwards compatibility. Does nothing. Only duckdb_
appender_end_row is required.
Syntax
duckdb_state duckdb_appender_begin_row(
duckdb_appender appender
);
duckdb_appender_end_row Finish the current row of appends. After end_row is called, the next row can be appended.
Syntax
duckdb_state duckdb_appender_end_row(
duckdb_appender appender
);
Parameters
• appender
The appender.
• returns
Syntax
duckdb_state duckdb_append_bool(
duckdb_appender appender,
bool value
);
Syntax
duckdb_state duckdb_append_int8(
duckdb_appender appender,
int8_t value
);
Syntax
duckdb_state duckdb_append_int16(
duckdb_appender appender,
int16_t value
);
Syntax
duckdb_state duckdb_append_int32(
duckdb_appender appender,
int32_t value
);
Syntax
duckdb_state duckdb_append_int64(
duckdb_appender appender,
int64_t value
);
Syntax
duckdb_state duckdb_append_hugeint(
duckdb_appender appender,
duckdb_hugeint value
);
Syntax
duckdb_state duckdb_append_uint8(
duckdb_appender appender,
uint8_t value
);
Syntax
duckdb_state duckdb_append_uint16(
duckdb_appender appender,
uint16_t value
);
Syntax
duckdb_state duckdb_append_uint32(
duckdb_appender appender,
uint32_t value
);
Syntax
duckdb_state duckdb_append_uint64(
duckdb_appender appender,
uint64_t value
);
Syntax
duckdb_state duckdb_append_uhugeint(
duckdb_appender appender,
duckdb_uhugeint value
);
Syntax
duckdb_state duckdb_append_float(
duckdb_appender appender,
float value
);
Syntax
duckdb_state duckdb_append_double(
duckdb_appender appender,
double value
);
Syntax
duckdb_state duckdb_append_date(
duckdb_appender appender,
duckdb_date value
);
Syntax
duckdb_state duckdb_append_time(
duckdb_appender appender,
duckdb_time value
);
Syntax
duckdb_state duckdb_append_timestamp(
duckdb_appender appender,
duckdb_timestamp value
);
Syntax
duckdb_state duckdb_append_interval(
duckdb_appender appender,
duckdb_interval value
);
Syntax
duckdb_state duckdb_append_varchar(
duckdb_appender appender,
const char *val
);
Syntax
duckdb_state duckdb_append_varchar_length(
duckdb_appender appender,
const char *val,
idx_t length
);
Syntax
duckdb_state duckdb_append_blob(
duckdb_appender appender,
const void *data,
idx_t length
);
Syntax
duckdb_state duckdb_append_null(
duckdb_appender appender
);
duckdb_append_data_chunk Appends a pre‑filled data chunk to the specified appender. The types of the data chunk must exactly match the types of the table; no casting is performed. If the types do not match or the appender
is in an invalid state, DuckDBError is returned. If the append is successful, DuckDBSuccess is returned.
Syntax
duckdb_state duckdb_append_data_chunk(
duckdb_appender appender,
duckdb_data_chunk chunk
);
Parameters
• appender
• chunk
• returns
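A sketch that builds and appends a chunk for the people table used in the earlier example; the row values are illustrative:
// build a chunk whose schema exactly matches the table (INTEGER, VARCHAR)
duckdb_logical_type types[2];
types[0] = duckdb_create_logical_type(DUCKDB_TYPE_INTEGER);
types[1] = duckdb_create_logical_type(DUCKDB_TYPE_VARCHAR);
duckdb_data_chunk chunk = duckdb_create_data_chunk(types, 2);
duckdb_destroy_logical_type(&types[0]);
duckdb_destroy_logical_type(&types[1]);
// fill a single row
((int32_t *) duckdb_vector_get_data(duckdb_data_chunk_get_vector(chunk, 0)))[0] = 3;
duckdb_vector_assign_string_element(duckdb_data_chunk_get_vector(chunk, 1), 0, "Wilbur");
duckdb_data_chunk_set_size(chunk, 1);
// append the chunk in one call and clean up
if (duckdb_append_data_chunk(appender, chunk) == DuckDBError) {
    // handle error
}
duckdb_destroy_data_chunk(&chunk);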
Table Functions
The table function API can be used to define a table function that can then be called from within DuckDB in the FROM clause of a query.
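To make the flow concrete, below is a hedged sketch of a table function my_range(n) that returns the value 42 exactly n times. The my_* names and state structs are illustrative; only the duckdb_* calls are part of the API.
#include <stdlib.h>
#include "duckdb.h"

// illustrative state structs; not part of the API
typedef struct { idx_t rows_requested; } my_bind_data;
typedef struct { idx_t rows_emitted; } my_init_data;

void my_bind(duckdb_bind_info info) {
    // declare the single BIGINT result column
    duckdb_logical_type type = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
    duckdb_bind_add_result_column(info, "value", type);
    duckdb_destroy_logical_type(&type);
    // read the parameter and stash it in the bind data
    duckdb_value param = duckdb_bind_get_parameter(info, 0);
    my_bind_data *bind_data = (my_bind_data *) malloc(sizeof(my_bind_data));
    bind_data->rows_requested = (idx_t) duckdb_get_int64(param);
    duckdb_destroy_value(&param);
    duckdb_bind_set_bind_data(info, bind_data, free);
}

void my_init(duckdb_init_info info) {
    my_init_data *init_data = (my_init_data *) malloc(sizeof(my_init_data));
    init_data->rows_emitted = 0;
    duckdb_init_set_init_data(info, init_data, free);
}

void my_function(duckdb_function_info info, duckdb_data_chunk output) {
    my_bind_data *bind_data = (my_bind_data *) duckdb_function_get_bind_data(info);
    my_init_data *init_data = (my_init_data *) duckdb_function_get_init_data(info);
    int64_t *data = (int64_t *) duckdb_vector_get_data(duckdb_data_chunk_get_vector(output, 0));
    idx_t count = 0;
    while (count < duckdb_vector_size() && init_data->rows_emitted < bind_data->rows_requested) {
        data[count++] = 42;
        init_data->rows_emitted++;
    }
    // an output chunk of size 0 signals that the function is exhausted
    duckdb_data_chunk_set_size(output, count);
}

void register_my_range(duckdb_connection con) {
    duckdb_table_function tf = duckdb_create_table_function();
    duckdb_table_function_set_name(tf, "my_range");
    duckdb_logical_type param_type = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
    duckdb_table_function_add_parameter(tf, param_type);
    duckdb_destroy_logical_type(&param_type);
    duckdb_table_function_set_bind(tf, my_bind);
    duckdb_table_function_set_init(tf, my_init);
    duckdb_table_function_set_function(tf, my_function);
    duckdb_register_table_function(con, tf);
    duckdb_destroy_table_function(&tf);
}
After registration, SELECT * FROM my_range(3); returns the value 42 three times.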
API Reference
duckdb_table_function duckdb_create_table_function();
void duckdb_destroy_table_function(duckdb_table_function *table_function);
void duckdb_table_function_set_name(duckdb_table_function table_function, const char *name);
void duckdb_table_function_add_parameter(duckdb_table_function table_function, duckdb_logical_type
type);
Table Function
Syntax
duckdb_table_function duckdb_create_table_function(
);
Parameters
• returns
Syntax
void duckdb_destroy_table_function(
duckdb_table_function *table_function
);
Parameters
• table_function
Syntax
void duckdb_table_function_set_name(
duckdb_table_function table_function,
const char *name
);
Parameters
• table_function
• name
Syntax
void duckdb_table_function_add_parameter(
duckdb_table_function table_function,
duckdb_logical_type type
);
Parameters
• table_function
• type
Syntax
void duckdb_table_function_add_named_parameter(
duckdb_table_function table_function,
const char *name,
duckdb_logical_type type
);
Parameters
• table_function
• name
• type
duckdb_table_function_set_extra_info Assigns extra information to the table function that can be fetched during binding,
etc.
Syntax
void duckdb_table_function_set_extra_info(
duckdb_table_function table_function,
void *extra_info,
duckdb_delete_callback_t destroy
);
Parameters
• table_function
• extra_info
• destroy
The callback that will be called to destroy the extra information (if any)
Syntax
void duckdb_table_function_set_bind(
duckdb_table_function table_function,
duckdb_table_function_bind_t bind
);
Parameters
• table_function
• bind
Syntax
void duckdb_table_function_set_init(
duckdb_table_function table_function,
duckdb_table_function_init_t init
);
Parameters
• table_function
• init
Syntax
void duckdb_table_function_set_local_init(
duckdb_table_function table_function,
duckdb_table_function_init_t init
);
Parameters
• table_function
• init
Syntax
void duckdb_table_function_set_function(
duckdb_table_function table_function,
duckdb_table_function_t function
);
Parameters
• table_function
• function
The function
duckdb_table_function_supports_projection_pushdown Sets whether or not the given table function supports projec‑
tion pushdown.
If this is set to true, the system will provide a list of all required columns in the init stage through the duckdb_init_get_column_
count and duckdb_init_get_column_index functions. If this is set to false (the default), the system will expect all columns to be
projected.
Syntax
void duckdb_table_function_supports_projection_pushdown(
duckdb_table_function table_function,
bool pushdown
);
Parameters
• table_function
• pushdown
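A sketch of how an init callback might consume the projection information when pushdown is enabled; the my_init name is illustrative:
void my_init(duckdb_init_info info) {
    // with pushdown enabled, only the projected columns are listed here
    idx_t n = duckdb_init_get_column_count(info);
    for (idx_t i = 0; i < n; i++) {
        // maps output column i to a column index of the bind-time schema
        idx_t source_column = duckdb_init_get_column_index(info, i);
        // ... store source_column in the init data so that the main
        //     function only produces these columns ...
    }
}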
duckdb_register_table_function Register the table function object within the given connection.
The function requires at least a name, a bind function, an init function and a main function.
If the function is incomplete or a function with this name already exists DuckDBError is returned.
Syntax
duckdb_state duckdb_register_table_function(
duckdb_connection con,
duckdb_table_function function
);
Parameters
• con
• function
• returns
Syntax
void *duckdb_bind_get_extra_info(
duckdb_bind_info info
);
Parameters
• info
• returns
Syntax
void duckdb_bind_add_result_column(
duckdb_bind_info info,
const char *name,
duckdb_logical_type type
);
Parameters
• info
• name
• type
Syntax
idx_t duckdb_bind_get_parameter_count(
duckdb_bind_info info
);
Parameters
• info
• returns
Syntax
duckdb_value duckdb_bind_get_parameter(
duckdb_bind_info info,
idx_t index
);
Parameters
• info
• index
• returns
Syntax
duckdb_value duckdb_bind_get_named_parameter(
duckdb_bind_info info,
const char *name
);
Parameters
• info
• name
• returns
duckdb_bind_set_bind_data Sets the user‑provided bind data in the bind object. This object can be retrieved again during exe‑
cution.
Syntax
void duckdb_bind_set_bind_data(
duckdb_bind_info info,
void *bind_data,
duckdb_delete_callback_t destroy
);
Parameters
• info
• bind_data
• destroy
The callback that will be called to destroy the bind data (if any)
duckdb_bind_set_cardinality Sets the cardinality estimate for the table function, used for optimization.
Syntax
void duckdb_bind_set_cardinality(
duckdb_bind_info info,
idx_t cardinality,
bool is_exact
);
Parameters
• info
• cardinality
• is_exact
Syntax
void duckdb_bind_set_error(
duckdb_bind_info info,
const char *error
);
Parameters
• info
• error
Syntax
void *duckdb_init_get_extra_info(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_get_bind_data Gets the bind data set by duckdb_bind_set_bind_data during the bind.
Note that the bind data should be considered as read‑only. For tracking state, use the init data instead.
Syntax
void *duckdb_init_get_bind_data(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_set_init_data Sets the user‑provided init data in the init object. This object can be retrieved again during execu‑
tion.
Syntax
void duckdb_init_set_init_data(
duckdb_init_info info,
void *init_data,
duckdb_delete_callback_t destroy
);
Parameters
• info
• init_data
• destroy
The callback that will be called to destroy the init data (if any)
duckdb_init_get_column_count Returns the number of projected columns. This function must be used if projection pushdown is enabled to figure out which columns to emit.
Syntax
idx_t duckdb_init_get_column_count(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_get_column_index Returns the column index of the projected column at the specified position.
This function must be used if projection pushdown is enabled to figure out which columns to emit.
Syntax
idx_t duckdb_init_get_column_index(
duckdb_init_info info,
idx_t column_index
);
Parameters
• info
• column_index
The index at which to get the projected column index, from 0..duckdb_init_get_column_count(info)
• returns
duckdb_init_set_max_threads Sets how many threads can process this table function in parallel (default: 1)
Syntax
void duckdb_init_set_max_threads(
duckdb_init_info info,
idx_t max_threads
);
Parameters
• info
• max_threads
The maximum amount of threads that can process this table function
Syntax
void duckdb_init_set_error(
duckdb_init_info info,
const char *error
);
Parameters
• info
• error
Syntax
void *duckdb_function_get_extra_info(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_bind_data Gets the bind data set by duckdb_bind_set_bind_data during the bind.
Note that the bind data should be considered as read‑only. For tracking state, use the init data instead.
Syntax
void *duckdb_function_get_bind_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_init_data Gets the init data set by duckdb_init_set_init_data during the init.
Syntax
void *duckdb_function_get_init_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_local_init_data Gets the thread‑local init data set by duckdb_init_set_init_data during the
local_init.
Syntax
void *duckdb_function_get_local_init_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_set_error Report that an error has occurred while executing the function.
Syntax
void duckdb_function_set_error(
duckdb_function_info info,
const char *error
);
Parameters
• info
• error
Replacement Scans
The replacement scan API can be used to register a callback that is called when a table is read that does not exist in the catalog. For example,
when a query such as SELECT * FROM my_table is executed and my_table does not exist, the replacement scan callback will be
called with my_table as parameter. The replacement scan can then insert a table function with a specific parameter to replace the read
of the table.
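A sketch of a replacement callback that redirects any unknown table name to a Parquet file of the same name; my_replacement is an illustrative name, and the callback follows the duckdb_replacement_callback_t signature:
#include <stdio.h>
#include "duckdb.h"

void my_replacement(duckdb_replacement_scan_info info, const char *table_name, void *data) {
    // route the unknown table to read_parquet('<table_name>.parquet')
    duckdb_replacement_scan_set_function_name(info, "read_parquet");
    char filename[4096];
    snprintf(filename, sizeof(filename), "%s.parquet", table_name);
    duckdb_value param = duckdb_create_varchar(filename);
    duckdb_replacement_scan_add_parameter(info, param);
    duckdb_destroy_value(&param);
}

// register the callback on the database handle; no extra data, no delete callback
// duckdb_add_replacement_scan(db, my_replacement, NULL, NULL);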
API Reference
Syntax
void duckdb_add_replacement_scan(
duckdb_database db,
duckdb_replacement_callback_t replacement,
void *extra_data,
duckdb_delete_callback_t delete_callback
);
Parameters
• db
• replacement
• extra_data
• delete_callback
duckdb_replacement_scan_set_function_name Sets the replacement function name. If this function is called in the replace‑
ment callback, the replacement scan is performed. If it is not called, the replacement callback is not performed.
Syntax
void duckdb_replacement_scan_set_function_name(
duckdb_replacement_scan_info info,
const char *function_name
);
Parameters
• info
• function_name
Syntax
void duckdb_replacement_scan_add_parameter(
duckdb_replacement_scan_info info,
duckdb_value parameter
);
Parameters
• info
• parameter
duckdb_replacement_scan_set_error Report that an error has occurred while executing the replacement scan.
Syntax
void duckdb_replacement_scan_set_error(
duckdb_replacement_scan_info info,
const char *error
);
Parameters
• info
• error
Complete API
API Reference
Open/Connect
duckdb_state duckdb_open(const char *path, duckdb_database *out_database);
duckdb_state duckdb_open_ext(const char *path, duckdb_database *out_database, duckdb_config config, char
**out_error);
void duckdb_close(duckdb_database *database);
duckdb_state duckdb_connect(duckdb_database database, duckdb_connection *out_connection);
void duckdb_interrupt(duckdb_connection connection);
duckdb_query_progress_type duckdb_query_progress(duckdb_connection connection);
void duckdb_disconnect(duckdb_connection *connection);
const char *duckdb_library_version();
Configuration
duckdb_state duckdb_create_config(duckdb_config *out_config);
size_t duckdb_config_count();
duckdb_state duckdb_get_config_flag(size_t index, const char **out_name, const char **out_description);
duckdb_state duckdb_set_config(duckdb_config config, const char *name, const char *option);
void duckdb_destroy_config(duckdb_config *config);
Query Execution
duckdb_state duckdb_query(duckdb_connection connection, const char *query, duckdb_result *out_result);
void duckdb_destroy_result(duckdb_result *result);
const char *duckdb_column_name(duckdb_result *result, idx_t col);
duckdb_type duckdb_column_type(duckdb_result *result, idx_t col);
duckdb_statement_type duckdb_result_statement_type(duckdb_result result);
duckdb_logical_type duckdb_column_logical_type(duckdb_result *result, idx_t col);
idx_t duckdb_column_count(duckdb_result *result);
idx_t duckdb_row_count(duckdb_result *result);
idx_t duckdb_rows_changed(duckdb_result *result);
void *duckdb_column_data(duckdb_result *result, idx_t col);
bool *duckdb_nullmask_data(duckdb_result *result, idx_t col);
const char *duckdb_result_error(duckdb_result *result);
Result Functions
duckdb_data_chunk duckdb_result_get_chunk(duckdb_result result, idx_t chunk_index);
bool duckdb_result_is_streaming(duckdb_result result);
idx_t duckdb_result_chunk_count(duckdb_result result);
duckdb_result_type duckdb_result_return_type(duckdb_result result);
Helpers
void *duckdb_malloc(size_t size);
void duckdb_free(void *ptr);
idx_t duckdb_vector_size();
bool duckdb_string_is_inlined(duckdb_string_t string);
Date/Time/Timestamp Helpers
duckdb_date_struct duckdb_from_date(duckdb_date date);
duckdb_date duckdb_to_date(duckdb_date_struct date);
bool duckdb_is_finite_date(duckdb_date date);
duckdb_time_struct duckdb_from_time(duckdb_time time);
duckdb_time_tz duckdb_create_time_tz(int64_t micros, int32_t offset);
duckdb_time_tz_struct duckdb_from_time_tz(duckdb_time_tz micros);
duckdb_time duckdb_to_time(duckdb_time_struct time);
duckdb_timestamp_struct duckdb_from_timestamp(duckdb_timestamp ts);
duckdb_timestamp duckdb_to_timestamp(duckdb_timestamp_struct ts);
bool duckdb_is_finite_timestamp(duckdb_timestamp ts);
Hugeint Helpers
double duckdb_hugeint_to_double(duckdb_hugeint val);
duckdb_hugeint duckdb_double_to_hugeint(double val);
Decimal Helpers
duckdb_decimal duckdb_double_to_decimal(double val, uint8_t width, uint8_t scale);
double duckdb_decimal_to_double(duckdb_decimal val);
Prepared Statements
duckdb_state duckdb_prepare(duckdb_connection connection, const char *query, duckdb_prepared_statement
*out_prepared_statement);
void duckdb_destroy_prepare(duckdb_prepared_statement *prepared_statement);
const char *duckdb_prepare_error(duckdb_prepared_statement prepared_statement);
idx_t duckdb_nparams(duckdb_prepared_statement prepared_statement);
Extract Statements
idx_t duckdb_extract_statements(duckdb_connection connection, const char *query, duckdb_extracted_
statements *out_extracted_statements);
duckdb_state duckdb_prepare_extracted_statement(duckdb_connection connection, duckdb_extracted_
statements extracted_statements, idx_t index, duckdb_prepared_statement *out_prepared_statement);
const char *duckdb_extract_statements_error(duckdb_extracted_statements extracted_statements);
void duckdb_destroy_extracted(duckdb_extracted_statements *extracted_statements);
Value Interface
void duckdb_destroy_value(duckdb_value *value);
duckdb_value duckdb_create_varchar(const char *text);
duckdb_value duckdb_create_varchar_length(const char *text, idx_t length);
duckdb_value duckdb_create_int64(int64_t val);
duckdb_value duckdb_create_struct_value(duckdb_logical_type type, duckdb_value *values);
duckdb_value duckdb_create_list_value(duckdb_logical_type type, duckdb_value *values, idx_t value_
count);
duckdb_value duckdb_create_array_value(duckdb_logical_type type, duckdb_value *values, idx_t value_
count);
char *duckdb_get_varchar(duckdb_value value);
int64_t duckdb_get_int64(duckdb_value value);
Vector Interface
Table Functions
duckdb_table_function duckdb_create_table_function();
void duckdb_destroy_table_function(duckdb_table_function *table_function);
void duckdb_table_function_set_name(duckdb_table_function table_function, const char *name);
void duckdb_table_function_add_parameter(duckdb_table_function table_function, duckdb_logical_type
type);
Table Function
Replacement Scans
Appender
Arrow Interface
Threading Information
void duckdb_execute_tasks(duckdb_database database, idx_t max_tasks);
duckdb_task_state duckdb_create_task_state(duckdb_database database);
void duckdb_execute_tasks_state(duckdb_task_state state);
idx_t duckdb_execute_n_tasks_state(duckdb_task_state state, idx_t max_tasks);
void duckdb_finish_execution(duckdb_task_state state);
bool duckdb_task_state_is_finished(duckdb_task_state state);
void duckdb_destroy_task_state(duckdb_task_state state);
bool duckdb_execution_is_finished(duckdb_connection con);
duckdb_open Creates a new database or opens an existing database file stored at the given path. If no path is given, a new in‑memory database is created instead. The instantiated database should be closed with 'duckdb_close'.
Syntax
duckdb_state duckdb_open(
const char *path,
duckdb_database *out_database
);
Parameters
• path
Path to the database file on disk, or nullptr or :memory: to open an in‑memory database.
• out_database
• returns
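A typical start-up and shutdown sequence looks as follows (a minimal sketch; error handling on the connect step is omitted):
duckdb_database db;
duckdb_connection con;
if (duckdb_open(NULL, &db) == DuckDBError) { // NULL path: in-memory database
    // handle start-up failure
}
duckdb_connect(db, &con);
// ... run queries through con ...
duckdb_disconnect(&con);
duckdb_close(&db);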
duckdb_open_ext Extended version of duckdb_open. Creates a new database or opens an existing database file stored at the given
path. The instantiated database should be closed with 'duckdb_close'.
Syntax
duckdb_state duckdb_open_ext(
const char *path,
duckdb_database *out_database,
duckdb_config config,
char **out_error
);
Parameters
• path
Path to the database file on disk, or nullptr or :memory: to open an in‑memory database.
• out_database
• config
• out_error
If set and the function returns DuckDBError, this will contain the reason why the start‑up failed. Note that the error must be freed using
duckdb_free.
• returns
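For example, to surface the start-up error message (a sketch; the file name is illustrative and <stdio.h> is assumed):
char *error = NULL;
duckdb_database db;
if (duckdb_open_ext("my.db", &db, NULL, &error) == DuckDBError) {
    fprintf(stderr, "failed to open database: %s\n", error);
    duckdb_free(error); // the error string must be freed using duckdb_free
}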
duckdb_close Closes the specified database and de‑allocates all memory allocated for that database. This should be called after you
are done with any database allocated through duckdb_open or duckdb_open_ext. Note that failing to call duckdb_close (in case
of e.g., a program crash) will not cause data corruption. Still, it is recommended to always correctly close a database object after you are
done with it.
Syntax
void duckdb_close(
duckdb_database *database
);
Parameters
• database
duckdb_connect Opens a connection to a database. Connections are required to query the database, and store transactional state
associated with the connection. The instantiated connection should be closed using 'duckdb_disconnect'.
Syntax
duckdb_state duckdb_connect(
duckdb_database database,
duckdb_connection *out_connection
);
Parameters
• database
• out_connection
• returns
Syntax
void duckdb_interrupt(
duckdb_connection connection
);
Parameters
• connection
Syntax
duckdb_query_progress_type duckdb_query_progress(
duckdb_connection connection
);
Parameters
• connection
• returns
duckdb_disconnect Closes the specified connection and de‑allocates all memory allocated for that connection.
Syntax
void duckdb_disconnect(
duckdb_connection *connection
);
Parameters
• connection
duckdb_library_version Returns the version of the linked DuckDB, with a version postfix for dev versions. Usually used for developing C extensions that must return this for a compatibility check.
Syntax
const char *duckdb_library_version(
);
duckdb_create_config Initializes an empty configuration object that can be used to provide start‑up options for the DuckDB in‑
stance through duckdb_open_ext. The duckdb_config must be destroyed using 'duckdb_destroy_config'
Syntax
duckdb_state duckdb_create_config(
duckdb_config *out_config
);
Parameters
• out_config
• returns
duckdb_config_count This returns the total number of configuration options available for usage with duckdb_get_config_flag.
This should not be called in a loop as it internally loops over all the options.
Syntax
size_t duckdb_config_count(
);
Parameters
• returns
duckdb_get_config_flag Obtains a human‑readable name and description of a specific configuration option. This can be used to
e.g. display configuration options. This will succeed unless index is out of range (i.e., >= duckdb_config_count).
Syntax
duckdb_state duckdb_get_config_flag(
size_t index,
const char **out_name,
const char **out_description
);
Parameters
• index
• out_name
• out_description
• returns
duckdb_set_config Sets the specified option for the specified configuration. The configuration option is indicated by name. To
obtain a list of config options, see duckdb_get_config_flag.
This can fail if either the name is invalid, or if the value provided for the option is invalid.
Syntax
duckdb_state duckdb_set_config(
duckdb_config config,
const char *name,
const char *option
);
Parameters
• config
• name
• option
• returns
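Putting the configuration functions together (a sketch; access_mode and threads are two of the standard configuration options):
duckdb_config config;
duckdb_create_config(&config);
duckdb_set_config(config, "access_mode", "READ_WRITE");
duckdb_set_config(config, "threads", "4");
duckdb_database db;
char *error = NULL;
if (duckdb_open_ext("my.db", &db, config, &error) == DuckDBError) {
    duckdb_free(error);
}
duckdb_destroy_config(&config); // the config can be destroyed after duckdb_open_ext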
duckdb_destroy_config Destroys the specified configuration object and de‑allocates all memory allocated for the object.
Syntax
void duckdb_destroy_config(
duckdb_config *config
);
Parameters
• config
duckdb_query Executes a SQL query within a connection and stores the full (materialized) result in the out_result pointer. If the query
fails to execute, DuckDBError is returned and the error message can be retrieved by calling duckdb_result_error.
Note that after running duckdb_query, duckdb_destroy_result must be called on the result object even if the query fails, other‑
wise the error stored within the result will not be freed correctly.
Syntax
duckdb_state duckdb_query(
duckdb_connection connection,
const char *query,
duckdb_result *out_result
);
Parameters
• connection
• query
• out_result
• returns
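For example (a sketch, assuming an open connection con and <stdio.h>):
duckdb_result result;
if (duckdb_query(con, "SELECT 42 AS answer", &result) == DuckDBError) {
    fprintf(stderr, "query failed: %s\n", duckdb_result_error(&result));
}
duckdb_destroy_result(&result); // required even if the query failed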
duckdb_destroy_result Closes the result and de‑allocates all memory allocated for that result.
Syntax
void duckdb_destroy_result(
duckdb_result *result
);
Parameters
• result
duckdb_column_name Returns the column name of the specified column. The result should not need to be freed; the column names
will automatically be destroyed when the result is destroyed.
Syntax
const char *duckdb_column_name(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
Syntax
duckdb_type duckdb_column_type(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
duckdb_result_statement_type Returns the statement type of the statement that was executed.
Syntax
duckdb_statement_type duckdb_result_statement_type(
duckdb_result result
);
Parameters
• result
• returns
Syntax
duckdb_logical_type duckdb_column_logical_type(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
Syntax
idx_t duckdb_column_count(
duckdb_result *result
);
Parameters
• result
• returns
Syntax
idx_t duckdb_row_count(
duckdb_result *result
);
Parameters
• result
• returns
duckdb_rows_changed Returns the number of rows changed by the query stored in the result. This is relevant only for IN‑
SERT/UPDATE/DELETE queries. For other queries the rows_changed will be 0.
Syntax
idx_t duckdb_rows_changed(
duckdb_result *result
);
Parameters
• result
• returns
duckdb_column_data Returns the data of a specific column of a result in columnar format. The function returns a dense array which contains the result data. The exact type stored in the array depends on the corresponding duckdb_type (as provided by duckdb_column_type). For the exact type by which the data should be accessed, see the comments in the types section or the DUCKDB_TYPE enum.
For example, for a column of type DUCKDB_TYPE_INTEGER, rows can be accessed in the following manner:
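int32_t *data = (int32_t *) duckdb_column_data(&result, 0);
printf("data for row 0: %d\n", data[0]); // sketch: assumes at least one row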
Syntax
void *duckdb_column_data(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
duckdb_nullmask_data Returns the nullmask of a specific column of a result in columnar format. The nullmask indicates for every row whether or not the corresponding row is NULL. If a row is NULL, the values present in the array provided by duckdb_column_data are undefined.
Syntax
bool *duckdb_nullmask_data(
duckdb_result *result,
idx_t col
);
Parameters
• result
• col
• returns
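Combined with duckdb_column_data, the nullmask allows NULL-safe access (a sketch, using the same integer column assumption as above):
int32_t *data = (int32_t *) duckdb_column_data(&result, 0);
bool *nullmask = duckdb_nullmask_data(&result, 0);
for (idx_t row = 0; row < duckdb_row_count(&result); row++) {
    if (nullmask[row]) {
        printf("NULL\n");
    } else {
        printf("%d\n", data[row]);
    }
}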
duckdb_result_error Returns the error message contained within the result. The error is only set if duckdb_query returns
DuckDBError.
The result of this function must not be freed. It will be cleaned up when duckdb_destroy_result is called.
Syntax
const char *duckdb_result_error(
duckdb_result *result
);
Parameters
• result
• returns
duckdb_result_get_chunk Fetches a data chunk from the duckdb_result. This function should be called repeatedly until the result
is exhausted.
This function supersedes all duckdb_value functions, as well as the duckdb_column_data and duckdb_nullmask_data func‑
tions. It results in significantly better performance, and should be preferred in newer code‑bases.
If this function is used, none of the other result functions can be used and vice versa (i.e., this function cannot be mixed with the legacy
result functions).
Use duckdb_result_chunk_count to figure out how many chunks there are in the result.
Syntax
duckdb_data_chunk duckdb_result_get_chunk(
duckdb_result result,
idx_t chunk_index
);
Parameters
• result
• chunk_index
• returns
The resulting data chunk. Returns NULL if the chunk index is out of bounds.
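A typical consumption loop over a materialized result (a sketch):
idx_t chunk_count = duckdb_result_chunk_count(result);
for (idx_t i = 0; i < chunk_count; i++) {
    duckdb_data_chunk chunk = duckdb_result_get_chunk(result, i);
    idx_t rows = duckdb_data_chunk_get_size(chunk);
    // ... read the vectors via duckdb_data_chunk_get_vector ...
    duckdb_destroy_data_chunk(&chunk); // each fetched chunk must be destroyed
}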
Syntax
bool duckdb_result_is_streaming(
duckdb_result result
);
Parameters
• result
• returns
Syntax
idx_t duckdb_result_chunk_count(
duckdb_result result
);
Parameters
• result
• returns
Syntax
duckdb_result_type duckdb_result_return_type(
duckdb_result result
);
Parameters
• result
• returns
The return type of the result.
duckdb_value_boolean
Syntax
bool duckdb_value_boolean(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The boolean value at the specified location, or false if the value cannot be converted.
duckdb_value_int8
Syntax
int8_t duckdb_value_int8(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The int8_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_int16
Syntax
int16_t duckdb_value_int16(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The int16_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_int32
Syntax
int32_t duckdb_value_int32(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The int32_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_int64
Syntax
int64_t duckdb_value_int64(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The int64_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_hugeint
Syntax
duckdb_hugeint duckdb_value_hugeint(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_hugeint value at the specified location, or 0 if the value cannot be converted.
duckdb_value_uhugeint
Syntax
duckdb_uhugeint duckdb_value_uhugeint(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_uhugeint value at the specified location, or 0 if the value cannot be converted.
duckdb_value_decimal
Syntax
duckdb_decimal duckdb_value_decimal(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_decimal value at the specified location, or 0 if the value cannot be converted.
duckdb_value_uint8
Syntax
uint8_t duckdb_value_uint8(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The uint8_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_uint16
Syntax
uint16_t duckdb_value_uint16(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The uint16_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_uint32
Syntax
uint32_t duckdb_value_uint32(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The uint32_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_uint64
Syntax
uint64_t duckdb_value_uint64(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The uint64_t value at the specified location, or 0 if the value cannot be converted.
duckdb_value_float
Syntax
float duckdb_value_float(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The float value at the specified location, or 0 if the value cannot be converted.
duckdb_value_double
Syntax
double duckdb_value_double(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The double value at the specified location, or 0 if the value cannot be converted.
duckdb_value_date
Syntax
duckdb_date duckdb_value_date(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_date value at the specified location, or 0 if the value cannot be converted.
duckdb_value_time
Syntax
duckdb_time duckdb_value_time(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_time value at the specified location, or 0 if the value cannot be converted.
duckdb_value_timestamp
Syntax
duckdb_timestamp duckdb_value_timestamp(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_timestamp value at the specified location, or 0 if the value cannot be converted.
duckdb_value_interval
Syntax
duckdb_interval duckdb_value_interval(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_interval value at the specified location, or 0 if the value cannot be converted.
duckdb_value_varchar
Syntax
char *duckdb_value_varchar(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• DEPRECATED
use duckdb_value_string instead. This function does not work correctly if the string contains null bytes.
• returns
The text value at the specified location as a null‑terminated string, or nullptr if the value cannot be converted. The result must be freed
with duckdb_free.
duckdb_value_string
Syntax
duckdb_string duckdb_value_string(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
duckdb_value_varchar_internal
Syntax
char *duckdb_value_varchar_internal(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• DEPRECATED
use duckdb_value_string_internal instead. This function does not work correctly if the string contains null bytes.
• returns
The char* value at the specified location. ONLY works on VARCHAR columns and does not auto‑cast. If the column is NOT a VARCHAR
column this function will return NULL.
duckdb_value_string_internal
Syntax
duckdb_string duckdb_value_string_internal(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• DEPRECATED
use duckdb_value_string instead. This function does not work correctly if the string contains null bytes.
• returns
The char* value at the specified location. ONLY works on VARCHAR columns and does not auto‑cast. If the column is NOT a VARCHAR
column this function will return NULL.
duckdb_value_blob
Syntax
duckdb_blob duckdb_value_blob(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
The duckdb_blob value at the specified location. Returns a blob with blob.data set to nullptr if the value cannot be converted. The resulting field "blob.data" must be freed with duckdb_free.
duckdb_value_is_null
Syntax
bool duckdb_value_is_null(
duckdb_result *result,
idx_t col,
idx_t row
);
Parameters
• returns
Returns true if the value at the specified index is NULL, and false otherwise.
duckdb_malloc Allocate size bytes of memory using the duckdb internal malloc function. Any memory allocated in this manner
should be freed using duckdb_free.
Syntax
void *duckdb_malloc(
size_t size
);
Parameters
• size
• returns
Syntax
void duckdb_free(
void *ptr
);
Parameters
• ptr
duckdb_vector_size The internal vector size used by DuckDB. This is the number of tuples that will fit into a data chunk created by duckdb_create_data_chunk.
Syntax
idx_t duckdb_vector_size(
);
Parameters
• returns
duckdb_string_is_inlined Whether or not the duckdb_string_t value is inlined. This means that the data of the string does not
have a separate allocation.
Syntax
bool duckdb_string_is_inlined(
duckdb_string_t string
);
duckdb_from_date Decompose a duckdb_date object into year, month and day (stored as duckdb_date_struct).
Syntax
duckdb_date_struct duckdb_from_date(
duckdb_date date
);
Parameters
• date
• returns
Syntax
duckdb_date duckdb_to_date(
duckdb_date_struct date
);
Parameters
• date
• returns
Syntax
bool duckdb_is_finite_date(
duckdb_date date
);
Parameters
• date
• returns
duckdb_from_time Decompose a duckdb_time object into hour, minute, second and microsecond (stored as duckdb_time_
struct).
Syntax
duckdb_time_struct duckdb_from_time(
duckdb_time time
);
Parameters
• time
• returns
Syntax
duckdb_time_tz duckdb_create_time_tz(
int64_t micros,
int32_t offset
);
Parameters
• micros
• offset
• returns
duckdb_from_time_tz Decompose a TIME_TZ value into micros and a timezone offset. Use duckdb_from_time to further decompose the micros into hour, minute, second and microsecond.
Syntax
duckdb_time_tz_struct duckdb_from_time_tz(
duckdb_time_tz micros
);
Parameters
• micros
• returns
duckdb_to_time Re‑compose a duckdb_time from hour, minute, second and microsecond (duckdb_time_struct).
Syntax
duckdb_time duckdb_to_time(
duckdb_time_struct time
);
Parameters
• time
• returns
Syntax
duckdb_timestamp_struct duckdb_from_timestamp(
duckdb_timestamp ts
);
Parameters
• ts
• returns
Syntax
duckdb_timestamp duckdb_to_timestamp(
duckdb_timestamp_struct ts
);
Parameters
• ts
• returns
Syntax
bool duckdb_is_finite_timestamp(
duckdb_timestamp ts
);
Parameters
• ts
• returns
duckdb_hugeint_to_double Converts a duckdb_hugeint object (as obtained from a DUCKDB_TYPE_HUGEINT column) into a
double.
Syntax
double duckdb_hugeint_to_double(
duckdb_hugeint val
);
Parameters
• val
• returns
duckdb_double_to_hugeint Converts a double value to a duckdb_hugeint object. If the conversion fails because the double value is too big, the result will be 0.
Syntax
duckdb_hugeint duckdb_double_to_hugeint(
double val
);
Parameters
• val
• returns
duckdb_uhugeint_to_double Converts a duckdb_uhugeint object (as obtained from a DUCKDB_TYPE_UHUGEINT column) into
a double.
Syntax
double duckdb_uhugeint_to_double(
duckdb_uhugeint val
);
Parameters
• val
• returns
duckdb_double_to_uhugeint Converts a double value to a duckdb_uhugeint object. If the conversion fails because the double value is too big, the result will be 0.
Syntax
duckdb_uhugeint duckdb_double_to_uhugeint(
double val
);
Parameters
• val
• returns
duckdb_double_to_decimal Converts a double value to a duckdb_decimal object with the specified width and scale. If the conversion fails because the double value is too big, or the width/scale are invalid, the result will be 0.
Syntax
duckdb_decimal duckdb_double_to_decimal(
double val,
uint8_t width,
uint8_t scale
);
Parameters
• val
• width
• scale
• returns
duckdb_decimal_to_double Converts a duckdb_decimal object (as obtained from a DUCKDB_TYPE_DECIMAL column) into a
double.
Syntax
double duckdb_decimal_to_double(
duckdb_decimal val
);
Parameters
• val
• returns
duckdb_prepare Create a prepared statement object from a query. Note that after calling duckdb_prepare, the prepared statement should always be destroyed using duckdb_destroy_prepare, even if the prepare fails.
If the prepare fails, duckdb_prepare_error can be called to obtain the reason why the prepare failed.
Syntax
duckdb_state duckdb_prepare(
duckdb_connection connection,
const char *query,
duckdb_prepared_statement *out_prepared_statement
);
Parameters
• connection
• query
• out_prepared_statement
• returns
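The usual prepared-statement round trip (a sketch; the table name tbl is illustrative and <stdio.h> is assumed):
duckdb_prepared_statement stmt;
if (duckdb_prepare(con, "SELECT * FROM tbl WHERE id = ?", &stmt) == DuckDBError) {
    fprintf(stderr, "%s\n", duckdb_prepare_error(stmt));
}
duckdb_bind_int64(stmt, 1, 42); // parameter indices are 1-based
duckdb_result result;
duckdb_execute_prepared(stmt, &result);
duckdb_destroy_result(&result);
duckdb_destroy_prepare(&stmt); // required even if the prepare failed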
duckdb_destroy_prepare Closes the prepared statement and de‑allocates all memory allocated for the statement.
Syntax
void duckdb_destroy_prepare(
duckdb_prepared_statement *prepared_statement
);
Parameters
• prepared_statement
duckdb_prepare_error Returns the error message associated with the given prepared statement. If the prepared statement has no
error message, this returns nullptr instead.
The error message should not be freed. It will be de‑allocated when duckdb_destroy_prepare is called.
Syntax
const char *duckdb_prepare_error(
duckdb_prepared_statement prepared_statement
);
Parameters
• prepared_statement
• returns
duckdb_nparams Returns the number of parameters that can be provided to the given prepared statement.
Syntax
idx_t duckdb_nparams(
duckdb_prepared_statement prepared_statement
);
Parameters
• prepared_statement
duckdb_parameter_name Returns the name used to identify the parameter. The returned string should be freed using duckdb_free.
Returns NULL if the index is out of range for the provided prepared statement.
Syntax
const char *duckdb_parameter_name(
duckdb_prepared_statement prepared_statement,
idx_t index
);
Parameters
• prepared_statement
The prepared statement from which to get the parameter name.
duckdb_param_type Returns the parameter type for the parameter at the given index.
Returns DUCKDB_TYPE_INVALID if the parameter index is out of range or the statement was not successfully prepared.
Syntax
duckdb_type duckdb_param_type(
duckdb_prepared_statement prepared_statement,
idx_t param_idx
);
Parameters
• prepared_statement
• param_idx
• returns
Syntax
duckdb_state duckdb_clear_bindings(
duckdb_prepared_statement prepared_statement
);
Syntax
duckdb_statement_type duckdb_prepared_statement_type(
duckdb_prepared_statement statement
);
Parameters
• statement
• returns
Syntax
duckdb_state duckdb_bind_value(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_value val
);
duckdb_bind_parameter_index Retrieve the index of the parameter for the prepared statement, identified by name.
Syntax
duckdb_state duckdb_bind_parameter_index(
duckdb_prepared_statement prepared_statement,
idx_t *param_idx_out,
const char *name
);
duckdb_bind_boolean Binds a bool value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_boolean(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
bool val
);
duckdb_bind_int8 Binds an int8_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_int8(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
int8_t val
);
duckdb_bind_int16 Binds an int16_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_int16(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
int16_t val
);
duckdb_bind_int32 Binds an int32_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_int32(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
int32_t val
);
duckdb_bind_int64 Binds an int64_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_int64(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
int64_t val
);
duckdb_bind_hugeint Binds a duckdb_hugeint value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_hugeint(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_hugeint val
);
duckdb_bind_uhugeint Binds a duckdb_uhugeint value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_uhugeint(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_uhugeint val
);
duckdb_bind_decimal Binds a duckdb_decimal value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_decimal(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_decimal val
);
duckdb_bind_uint8 Binds a uint8_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_uint8(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
uint8_t val
);
duckdb_bind_uint16 Binds a uint16_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_uint16(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
uint16_t val
);
duckdb_bind_uint32 Binds a uint32_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_uint32(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
uint32_t val
);
duckdb_bind_uint64 Binds a uint64_t value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_uint64(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
uint64_t val
);
duckdb_bind_float Binds a float value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_float(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
float val
);
duckdb_bind_double Binds a double value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_double(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
double val
);
duckdb_bind_date Binds a duckdb_date value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_date(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_date val
);
duckdb_bind_time Binds a duckdb_time value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_time(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_time val
);
duckdb_bind_timestamp Binds a duckdb_timestamp value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_timestamp(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_timestamp val
);
duckdb_bind_interval Binds a duckdb_interval value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_interval(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
duckdb_interval val
);
duckdb_bind_varchar Binds a null‑terminated varchar value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_varchar(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
const char *val
);
duckdb_bind_varchar_length Binds a varchar value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_varchar_length(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
const char *val,
idx_t length
);
duckdb_bind_blob Binds a blob value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_blob(
duckdb_prepared_statement prepared_statement,
idx_t param_idx,
const void *data,
idx_t length
);
duckdb_bind_null Binds a NULL value to the prepared statement at the specified index.
Syntax
duckdb_state duckdb_bind_null(
duckdb_prepared_statement prepared_statement,
idx_t param_idx
);
duckdb_execute_prepared Executes the prepared statement with the given bound parameters, and returns a materialized query
result.
This method can be called multiple times for each prepared statement, and the parameters can be modified between calls to this func‑
tion.
Syntax
duckdb_state duckdb_execute_prepared(
duckdb_prepared_statement prepared_statement,
duckdb_result *out_result
);
Parameters
• prepared_statement
• out_result
• returns
duckdb_execute_prepared_streaming Executes the prepared statement with the given bound parameters, and returns an
optionally-streaming query result. To determine if the resulting query was in fact streamed, use duckdb_result_is_streaming.
This method can be called multiple times for each prepared statement, and the parameters can be modified between calls to this func‑
tion.
Syntax
duckdb_state duckdb_execute_prepared_streaming(
duckdb_prepared_statement prepared_statement,
duckdb_result *out_result
);
Parameters
• prepared_statement
• out_result
• returns
duckdb_extract_statements Extract all statements from a query. Note that after calling duckdb_extract_statements, the
extracted statements should always be destroyed using duckdb_destroy_extracted, even if no statements were extracted.
If the extract fails, duckdb_extract_statements_error can be called to obtain the reason why the extract failed.
Syntax
idx_t duckdb_extract_statements(
duckdb_connection connection,
const char *query,
duckdb_extracted_statements *out_extracted_statements
);
Parameters
• connection
• query
• out_extracted_statements
• returns
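For example, to run every statement of a multi-statement script (a sketch; error handling is abbreviated):
duckdb_extracted_statements stmts;
idx_t count = duckdb_extract_statements(con,
    "CREATE TABLE t (i INTEGER); INSERT INTO t VALUES (1);", &stmts);
for (idx_t i = 0; i < count; i++) {
    duckdb_prepared_statement stmt;
    if (duckdb_prepare_extracted_statement(con, stmts, i, &stmt) == DuckDBSuccess) {
        duckdb_result result;
        duckdb_execute_prepared(stmt, &result);
        duckdb_destroy_result(&result);
    }
    duckdb_destroy_prepare(&stmt); // destroy even if the prepare failed
}
duckdb_destroy_extracted(&stmts); // required even if count is 0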
duckdb_prepare_extracted_statement Prepare an extracted statement. Note that after calling duckdb_prepare_extracted_statement, the prepared statement should always be destroyed using duckdb_destroy_prepare, even if the prepare fails. If the prepare fails, duckdb_prepare_error can be called to obtain the reason why the prepare failed.
Syntax
duckdb_state duckdb_prepare_extracted_statement(
duckdb_connection connection,
duckdb_extracted_statements extracted_statements,
idx_t index,
duckdb_prepared_statement *out_prepared_statement
);
Parameters
• connection
• extracted_statements
• index
• out_prepared_statement
• returns
duckdb_extract_statements_error Returns the error message contained within the extracted statements. The result of this
function must not be freed. It will be cleaned up when duckdb_destroy_extracted is called.
Syntax
const char *duckdb_extract_statements_error(
duckdb_extracted_statements extracted_statements
);
Parameters
• extracted_statements
• returns
Syntax
void duckdb_destroy_extracted(
duckdb_extracted_statements *extracted_statements
);
Parameters
• extracted_statements
duckdb_pending_prepared Executes the prepared statement with the given bound parameters, and returns a pending result. The
pending result represents an intermediate structure for a query that is not yet fully executed. The pending result can be used to incremen‑
tally execute a query, returning control to the client between tasks.
Note that after calling duckdb_pending_prepared, the pending result should always be destroyed using duckdb_destroy_
pending, even if this function returns DuckDBError.
Syntax
duckdb_state duckdb_pending_prepared(
duckdb_prepared_statement prepared_statement,
duckdb_pending_result *out_result
);
Parameters
• prepared_statement
• out_result
• returns
duckdb_pending_prepared_streaming Executes the prepared statement with the given bound parameters, and returns a pend‑
ing result. This pending result will create a streaming duckdb_result when executed. The pending result represents an intermediate struc‑
ture for a query that is not yet fully executed.
Note that after calling duckdb_pending_prepared_streaming, the pending result should always be destroyed using duckdb_
destroy_pending, even if this function returns DuckDBError.
Syntax
duckdb_state duckdb_pending_prepared_streaming(
duckdb_prepared_statement prepared_statement,
duckdb_pending_result *out_result
);
Parameters
• prepared_statement
• out_result
• returns
duckdb_destroy_pending Closes the pending result and de‑allocates all memory allocated for the result.
Syntax
void duckdb_destroy_pending(
duckdb_pending_result *pending_result
);
Parameters
• pending_result
duckdb_pending_error Returns the error message contained within the pending result.
The result of this function must not be freed. It will be cleaned up when duckdb_destroy_pending is called.
Syntax
const char *duckdb_pending_error(
duckdb_pending_result pending_result
);
Parameters
• pending_result
• returns
duckdb_pending_execute_task Executes a single task within the query, returning whether or not the query is ready.
If this returns DUCKDB_PENDING_RESULT_READY, the duckdb_execute_pending function can be called to obtain the result. If this returns
DUCKDB_PENDING_RESULT_NOT_READY, the duckdb_pending_execute_task function should be called again. If this returns DUCKDB_
PENDING_ERROR, an error occurred during execution.
Syntax
duckdb_pending_state duckdb_pending_execute_task(
duckdb_pending_result pending_result
);
Parameters
• pending_result
• returns
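An incremental execution loop built on these states might look as follows (a sketch, assuming an already-prepared statement stmt):
duckdb_pending_result pending;
duckdb_pending_prepared(stmt, &pending);
duckdb_pending_state state;
do {
    // control returns to the client between tasks, e.g., to check for interrupts
    state = duckdb_pending_execute_task(pending);
} while (state == DUCKDB_PENDING_RESULT_NOT_READY);
if (state == DUCKDB_PENDING_RESULT_READY) {
    duckdb_result result;
    duckdb_execute_pending(pending, &result);
    duckdb_destroy_result(&result);
}
duckdb_destroy_pending(&pending); // required even on DUCKDB_PENDING_ERROR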
Syntax
duckdb_pending_state duckdb_pending_execute_check_state(
duckdb_pending_result pending_result
);
Parameters
• pending_result
• returns
duckdb_execute_pending Fully execute a pending query result, returning the final query result.
If duckdb_pending_execute_task has been called until DUCKDB_PENDING_RESULT_READY was returned, this will return fast. Otherwise,
all remaining tasks must be executed first.
Syntax
duckdb_state duckdb_execute_pending(
duckdb_pending_result pending_result,
duckdb_result *out_result
);
Parameters
• pending_result
• out_result
• returns
Syntax
bool duckdb_pending_execution_is_finished(
duckdb_pending_state pending_state
);
Parameters
• pending_state
• returns
duckdb_destroy_value Destroys the value and de‑allocates all memory allocated for that value.
Syntax
void duckdb_destroy_value(
duckdb_value *value
);
Parameters
• value
Syntax
duckdb_value duckdb_create_varchar(
const char *text
);
Parameters
• text
• returns
Syntax
duckdb_value duckdb_create_varchar_length(
const char *text,
idx_t length
);
Parameters
• text
The text
• length
• returns
Syntax
duckdb_value duckdb_create_int64(
int64_t val
);
Parameters
• val
• returns
Syntax
duckdb_value duckdb_create_struct_value(
duckdb_logical_type type,
duckdb_value *values
);
Parameters
• type
• values
• returns
duckdb_create_list_value Creates a list value from a type and an array of values of length value_count.
Syntax
duckdb_value duckdb_create_list_value(
duckdb_logical_type type,
duckdb_value *values,
idx_t value_count
);
Parameters
• type
• values
• value_count
• returns
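For example, to build the list value [1, 2] (a sketch; the input values are assumed to be copied into the list, so they are destroyed separately):
duckdb_logical_type bigint = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
duckdb_value elements[2] = {duckdb_create_int64(1), duckdb_create_int64(2)};
duckdb_value list = duckdb_create_list_value(bigint, elements, 2);
// ... use the list value ...
duckdb_destroy_value(&elements[0]);
duckdb_destroy_value(&elements[1]);
duckdb_destroy_value(&list);
duckdb_destroy_logical_type(&bigint);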
duckdb_create_array_value Creates an array value from a type and an array of values of length value_count.
Syntax
duckdb_value duckdb_create_array_value(
duckdb_logical_type type,
duckdb_value *values,
idx_t value_count
);
Parameters
• type
• values
• value_count
• returns
duckdb_get_varchar Obtains a string representation of the given value. The result must be destroyed with duckdb_free.
Syntax
char *duckdb_get_varchar(
duckdb_value value
);
Parameters
• value
The value
• returns
Syntax
int64_t duckdb_get_int64(
duckdb_value value
);
Parameters
• value
The value
• returns
duckdb_create_logical_type Creates a duckdb_logical_type from a standard primitive type. The resulting type should be
destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_logical_type(
duckdb_type type
);
Parameters
• type
• returns
duckdb_logical_type_get_alias Returns the alias of a duckdb_logical_type, if one is set, else NULL. The result must be de‑
stroyed with duckdb_free.
Syntax
char *duckdb_logical_type_get_alias(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_create_list_type Creates a list type from its child type. The resulting type should be destroyed with duckdb_destroy_
logical_type.
Syntax
duckdb_logical_type duckdb_create_list_type(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_create_array_type Creates an array type from its child type. The resulting type should be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_array_type(
duckdb_logical_type type,
idx_t array_size
);
Parameters
• type
• array_size
• returns
duckdb_create_map_type Creates a map type from its key type and value type. The resulting type should be destroyed with
duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_map_type(
duckdb_logical_type key_type,
duckdb_logical_type value_type
);
Parameters
• key_type
• value_type
• returns
duckdb_create_union_type Creates a UNION type from the passed types array. The resulting type should be destroyed with
duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_union_type(
duckdb_logical_type *member_types,
const char **member_names,
idx_t member_count
);
Parameters
• member_types
• member_names
• member_count
• returns
duckdb_create_struct_type Creates a STRUCT type from the passed member name and type arrays. The resulting type should
be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_struct_type(
duckdb_logical_type *member_types,
const char **member_names,
idx_t member_count
);
Parameters
• member_types
• member_names
• member_count
• returns
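For example, to create the type STRUCT(i BIGINT, s VARCHAR) (a sketch):
duckdb_logical_type member_types[2] = {
    duckdb_create_logical_type(DUCKDB_TYPE_BIGINT),
    duckdb_create_logical_type(DUCKDB_TYPE_VARCHAR)
};
const char *member_names[2] = {"i", "s"};
duckdb_logical_type struct_type = duckdb_create_struct_type(member_types, member_names, 2);
duckdb_destroy_logical_type(&member_types[0]);
duckdb_destroy_logical_type(&member_types[1]);
duckdb_destroy_logical_type(&struct_type);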
duckdb_create_enum_type Creates an ENUM type from the passed member name array. The resulting type should be destroyed
with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_enum_type(
const char **member_names,
idx_t member_count
);
Parameters
• member_names
• member_count
• returns
duckdb_create_decimal_type Creates a duckdb_logical_type of type decimal with the specified width and scale. The re‑
sulting type should be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_create_decimal_type(
uint8_t width,
uint8_t scale
);
Parameters
• width
• scale
• returns
Syntax
duckdb_type duckdb_get_type_id(
duckdb_logical_type type
);
Parameters
• type
• returns
The type id
Syntax
uint8_t duckdb_decimal_width(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
uint8_t duckdb_decimal_scale(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
duckdb_type duckdb_decimal_internal_type(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
duckdb_type duckdb_enum_internal_type(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
uint32_t duckdb_enum_dictionary_size(
duckdb_logical_type type
);
Parameters
• type
• returns
duckdb_enum_dictionary_value Retrieves the dictionary value at the specified position from the enum.
Syntax
char *duckdb_enum_dictionary_value(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The string value of the enum type. Must be freed with duckdb_free.
Syntax
duckdb_logical_type duckdb_list_type_child_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The child type of the list type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_array_type_child_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The child type of the array type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
idx_t duckdb_array_type_array_size(
duckdb_logical_type type
);
Parameters
• type
• returns
The fixed number of elements the values of this array type can store.
Syntax
duckdb_logical_type duckdb_map_type_key_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The key type of the map type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_map_type_value_type(
duckdb_logical_type type
);
Parameters
• type
• returns
The value type of the map type. Must be destroyed with duckdb_destroy_logical_type.
Syntax
idx_t duckdb_struct_type_child_count(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
char *duckdb_struct_type_child_name(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
duckdb_struct_type_child_type Retrieves the child type of the given struct type at the specified index.
Syntax
duckdb_logical_type duckdb_struct_type_child_type(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The child type of the struct type. Must be destroyed with duckdb_destroy_logical_type.
duckdb_union_type_member_count Returns the number of members that the union type has.
Syntax
idx_t duckdb_union_type_member_count(
duckdb_logical_type type
);
Parameters
• type
• returns
Syntax
char *duckdb_union_type_member_name(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
duckdb_union_type_member_type Retrieves the child type of the given union member at the specified index.
Syntax
duckdb_logical_type duckdb_union_type_member_type(
duckdb_logical_type type,
idx_t index
);
Parameters
• type
• index
• returns
The child type of the union member. Must be destroyed with duckdb_destroy_logical_type.
duckdb_destroy_logical_type Destroys the logical type and de‑allocates all memory allocated for that type.
Syntax
void duckdb_destroy_logical_type(
duckdb_logical_type *type
);
Parameters
• type
Syntax
duckdb_data_chunk duckdb_create_data_chunk(
duckdb_logical_type *types,
idx_t column_count
);
Parameters
• types
• column_count
• returns
duckdb_destroy_data_chunk Destroys the data chunk and de‑allocates all memory allocated for that chunk.
Syntax
void duckdb_destroy_data_chunk(
duckdb_data_chunk *chunk
);
Parameters
• chunk
duckdb_data_chunk_reset Resets a data chunk, clearing the validity masks and setting the cardinality of the data chunk to 0.
Syntax
void duckdb_data_chunk_reset(
duckdb_data_chunk chunk
);
Parameters
• chunk
Syntax
idx_t duckdb_data_chunk_get_column_count(
duckdb_data_chunk chunk
);
Parameters
• chunk
• returns
duckdb_data_chunk_get_vector Retrieves the vector at the specified column index in the data chunk.
The pointer to the vector is valid for as long as the chunk is alive. It does NOT need to be destroyed.
Syntax
duckdb_vector duckdb_data_chunk_get_vector(
duckdb_data_chunk chunk,
idx_t col_idx
);
Parameters
• chunk
• col_idx
• returns
The vector
Syntax
idx_t duckdb_data_chunk_get_size(
duckdb_data_chunk chunk
);
Parameters
• chunk
• returns
Syntax
void duckdb_data_chunk_set_size(
duckdb_data_chunk chunk,
idx_t size
);
Parameters
• chunk
• size
Syntax
duckdb_logical_type duckdb_vector_get_column_type(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_get_data Retrieves the data pointer of the vector. The data pointer can be used to read or write values from the vector. How to read or write values depends on the type of the vector.
Syntax
void *duckdb_vector_get_data(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_get_validity Retrieves the validity mask pointer of the specified vector. The validity mask is a bitset that signifies null-ness within the data chunk. It is a series of uint64_t values, where each uint64_t value contains validity for 64 tuples. The bit is set to 1 if the value is valid (i.e., not NULL) or 0 if the value is invalid (i.e., NULL). The validity of a specific row can be checked like this:
idx_t entry_idx = row_idx / 64;
idx_t idx_in_entry = row_idx % 64;
bool is_valid = validity_mask[entry_idx] & (1ULL << idx_in_entry);
Syntax
uint64_t *duckdb_vector_get_validity(
duckdb_vector vector
);
Parameters
• vector
• returns
duckdb_vector_ensure_validity_writable Ensures the validity mask is writable by allocating it. After this function is called, duckdb_vector_get_validity will ALWAYS return non‑NULL. This allows null values to be written to the vector, regardless of whether a validity mask was present before.
Syntax
void duckdb_vector_ensure_validity_writable(
duckdb_vector vector
);
Parameters
• vector
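For example, to write a NULL into row 3 of a vector (a sketch):
duckdb_vector_ensure_validity_writable(vector);
uint64_t *validity = duckdb_vector_get_validity(vector); // now guaranteed non-NULL
duckdb_validity_set_row_invalid(validity, 3);
// equivalently: duckdb_validity_set_row_validity(validity, 3, false);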
Syntax
void duckdb_vector_assign_string_element(
duckdb_vector vector,
idx_t index,
const char *str
);
Parameters
• vector
• index
• str
duckdb_vector_assign_string_element_len Assigns a string element in the vector at the specified location. You may also
use this function to assign BLOBs.
Syntax
void duckdb_vector_assign_string_element_len(
duckdb_vector vector,
idx_t index,
const char *str,
idx_t str_len
);
Parameters
• vector
• index
• str
The string
• str_len
Syntax
duckdb_vector duckdb_list_vector_get_child(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
Syntax
idx_t duckdb_list_vector_get_size(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
duckdb_list_vector_set_size Sets the total size of the underlying child‑vector of a list vector.
Syntax
duckdb_state duckdb_list_vector_set_size(
duckdb_vector vector,
idx_t size
);
Parameters
• vector
• size
• returns
Syntax
duckdb_state duckdb_list_vector_reserve(
duckdb_vector vector,
idx_t required_capacity
);
Parameters
• vector
• required_capacity
• returns
Syntax
duckdb_vector duckdb_struct_vector_get_child(
duckdb_vector vector,
idx_t index
);
Parameters
• vector
The vector
• index
• returns
duckdb_array_vector_get_child Retrieves the child vector of an array vector. The resulting vector is valid as long as the parent vector is valid. The resulting vector has the size of the parent vector multiplied by the array size.
Syntax
duckdb_vector duckdb_array_vector_get_child(
duckdb_vector vector
);
Parameters
• vector
The vector
• returns
duckdb_validity_row_is_valid Returns whether or not a row is valid (i.e., not NULL) in the given validity mask.
Syntax
bool duckdb_validity_row_is_valid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
• returns
Syntax
void duckdb_validity_set_row_validity(
uint64_t *validity,
idx_t row,
bool valid
);
Parameters
• validity
• row
• valid
Syntax
void duckdb_validity_set_row_invalid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
Syntax
void duckdb_validity_set_row_valid(
uint64_t *validity,
idx_t row
);
Parameters
• validity
• row
Syntax
duckdb_table_function duckdb_create_table_function(
);
Parameters
• returns
Syntax
void duckdb_destroy_table_function(
duckdb_table_function *table_function
);
Parameters
• table_function
Syntax
void duckdb_table_function_set_name(
duckdb_table_function table_function,
const char *name
);
Parameters
• table_function
• name
Syntax
void duckdb_table_function_add_parameter(
duckdb_table_function table_function,
duckdb_logical_type type
);
Parameters
• table_function
• type
Syntax
void duckdb_table_function_add_named_parameter(
duckdb_table_function table_function,
const char *name,
duckdb_logical_type type
);
Parameters
• table_function
• name
• type
duckdb_table_function_set_extra_info Assigns extra information to the table function that can be fetched during binding,
etc.
Syntax
void duckdb_table_function_set_extra_info(
duckdb_table_function table_function,
void *extra_info,
duckdb_delete_callback_t destroy
);
Parameters
• table_function
• extra_info
• destroy
The callback that will be called to destroy the extra information (if any)
Syntax
void duckdb_table_function_set_bind(
duckdb_table_function table_function,
duckdb_table_function_bind_t bind
);
Parameters
• table_function
• bind
Syntax
void duckdb_table_function_set_init(
duckdb_table_function table_function,
duckdb_table_function_init_t init
);
Parameters
• table_function
• init
Syntax
void duckdb_table_function_set_local_init(
duckdb_table_function table_function,
duckdb_table_function_init_t init
);
Parameters
• table_function
• init
Syntax
void duckdb_table_function_set_function(
duckdb_table_function table_function,
duckdb_table_function_t function
);
Parameters
• table_function
• function
The function
duckdb_table_function_supports_projection_pushdown Sets whether or not the given table function supports projec‑
tion pushdown.
If this is set to true, the system will provide a list of all required columns in the init stage through the duckdb_init_get_column_
count and duckdb_init_get_column_index functions. If this is set to false (the default), the system will expect all columns to be
projected.
Syntax
void duckdb_table_function_supports_projection_pushdown(
duckdb_table_function table_function,
bool pushdown
);
Parameters
• table_function
• pushdown
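With pushdown enabled, the init function can inspect which columns are actually required (a sketch; my_init_with_pushdown is an illustrative name):
static void my_init_with_pushdown(duckdb_init_info info) {
    idx_t n = duckdb_init_get_column_count(info);
    for (idx_t i = 0; i < n; i++) {
        // maps output position i to the column index of the bound schema
        idx_t source_column = duckdb_init_get_column_index(info, i);
        // ... remember source_column so the main function emits only that column ...
    }
}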
duckdb_register_table_function Register the table function object within the given connection.
The function requires at least a name, a bind function, an init function and a main function.
If the function is incomplete or a function with this name already exists, DuckDBError is returned.
Syntax
duckdb_state duckdb_register_table_function(
duckdb_connection con,
duckdb_table_function function
);
Parameters
• con
• function
• returns
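End to end, a minimal table function that emits the numbers 0 to n-1 might be wired up as follows. This is a sketch: all my_* names are illustrative and error handling is omitted.
typedef struct { int64_t n; } my_bind_data_t;
typedef struct { int64_t pos; } my_state_t;

static void my_bind(duckdb_bind_info info) {
    duckdb_logical_type bigint = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
    duckdb_bind_add_result_column(info, "i", bigint);
    duckdb_destroy_logical_type(&bigint);
    duckdb_value arg = duckdb_bind_get_parameter(info, 0);
    my_bind_data_t *bind_data = duckdb_malloc(sizeof(my_bind_data_t));
    bind_data->n = duckdb_get_int64(arg);
    duckdb_destroy_value(&arg);
    duckdb_bind_set_bind_data(info, bind_data, duckdb_free);
}

static void my_init(duckdb_init_info info) {
    my_state_t *state = duckdb_malloc(sizeof(my_state_t));
    state->pos = 0;
    duckdb_init_set_init_data(info, state, duckdb_free);
}

static void my_main(duckdb_function_info info, duckdb_data_chunk output) {
    my_bind_data_t *bind_data = duckdb_function_get_bind_data(info);
    my_state_t *state = duckdb_function_get_init_data(info);
    int64_t *out = duckdb_vector_get_data(duckdb_data_chunk_get_vector(output, 0));
    idx_t count = 0;
    while (state->pos < bind_data->n && count < duckdb_vector_size()) {
        out[count++] = state->pos++;
    }
    duckdb_data_chunk_set_size(output, count); // a size of 0 signals the end of the scan
}

// registration, e.g., inside some setup routine:
// duckdb_table_function f = duckdb_create_table_function();
// duckdb_table_function_set_name(f, "my_range");
// duckdb_logical_type bigint = duckdb_create_logical_type(DUCKDB_TYPE_BIGINT);
// duckdb_table_function_add_parameter(f, bigint);
// duckdb_destroy_logical_type(&bigint);
// duckdb_table_function_set_bind(f, my_bind);
// duckdb_table_function_set_init(f, my_init);
// duckdb_table_function_set_function(f, my_main);
// duckdb_register_table_function(con, f);
// duckdb_destroy_table_function(&f);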
Syntax
void *duckdb_bind_get_extra_info(
duckdb_bind_info info
);
Parameters
• info
• returns
Syntax
void duckdb_bind_add_result_column(
duckdb_bind_info info,
const char *name,
duckdb_logical_type type
);
Parameters
• info
• name
• type
Syntax
idx_t duckdb_bind_get_parameter_count(
duckdb_bind_info info
);
Parameters
• info
• returns
Syntax
duckdb_value duckdb_bind_get_parameter(
duckdb_bind_info info,
idx_t index
);
Parameters
• info
• index
• returns
Syntax
duckdb_value duckdb_bind_get_named_parameter(
duckdb_bind_info info,
const char *name
);
Parameters
• info
• name
• returns
duckdb_bind_set_bind_data Sets the user‑provided bind data in the bind object. This object can be retrieved again during exe‑
cution.
Syntax
void duckdb_bind_set_bind_data(
duckdb_bind_info info,
void *bind_data,
duckdb_delete_callback_t destroy
);
Parameters
• info
• bind_data
• destroy
The callback that will be called to destroy the bind data (if any)
duckdb_bind_set_cardinality Sets the cardinality estimate for the table function, used for optimization.
Syntax
void duckdb_bind_set_cardinality(
duckdb_bind_info info,
idx_t cardinality,
bool is_exact
);
Parameters
• info
• cardinality
• is_exact
Syntax
void duckdb_bind_set_error(
duckdb_bind_info info,
const char *error
);
Parameters
• info
• error
Syntax
void *duckdb_init_get_extra_info(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_get_bind_data Gets the bind data set by duckdb_bind_set_bind_data during the bind.
Note that the bind data should be considered as read‑only. For tracking state, use the init data instead.
Syntax
void *duckdb_init_get_bind_data(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_set_init_data Sets the user‑provided init data in the init object. This object can be retrieved again during execu‑
tion.
Syntax
void duckdb_init_set_init_data(
duckdb_init_info info,
void *init_data,
duckdb_delete_callback_t destroy
);
Parameters
• info
• init_data
• destroy
The callback that will be called to destroy the init data (if any)
duckdb_init_get_column_count Returns the number of projected columns. This function must be used if projection pushdown is enabled to figure out which columns to emit.
Syntax
idx_t duckdb_init_get_column_count(
duckdb_init_info info
);
Parameters
• info
• returns
duckdb_init_get_column_index Returns the column index of the projected column at the specified position.
This function must be used if projection pushdown is enabled to figure out which columns to emit.
Syntax
idx_t duckdb_init_get_column_index(
duckdb_init_info info,
idx_t column_index
);
Parameters
• info
• column_index
The index at which to get the projected column index, from 0..duckdb_init_get_column_count(info)
• returns
duckdb_init_set_max_threads Sets how many threads can process this table function in parallel (default: 1).
Syntax
void duckdb_init_set_max_threads(
duckdb_init_info info,
idx_t max_threads
);
Parameters
• info
• max_threads
The maximum number of threads that can process this table function
Syntax
void duckdb_init_set_error(
duckdb_init_info info,
const char *error
);
Parameters
• info
• error
Syntax
void *duckdb_function_get_extra_info(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_bind_data Gets the bind data set by duckdb_bind_set_bind_data during the bind.
Note that the bind data should be considered as read‑only. For tracking state, use the init data instead.
Syntax
void *duckdb_function_get_bind_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_init_data Gets the init data set by duckdb_init_set_init_data during the init.
Syntax
void *duckdb_function_get_init_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_get_local_init_data Gets the thread‑local init data set by duckdb_init_set_init_data during the
local_init.
Syntax
void *duckdb_function_get_local_init_data(
duckdb_function_info info
);
Parameters
• info
• returns
duckdb_function_set_error Report that an error has occurred while executing the function.
Syntax
void duckdb_function_set_error(
duckdb_function_info info,
const char *error
);
Parameters
• info
• error
Syntax
void duckdb_add_replacement_scan(
duckdb_database db,
duckdb_replacement_callback_t replacement,
void *extra_data,
duckdb_delete_callback_t delete_callback
);
Parameters
• db
• replacement
• extra_data
• delete_callback
duckdb_replacement_scan_set_function_name Sets the replacement function name. If this function is called in the replacement callback, the replacement scan is performed. If it is not called, the replacement scan is not performed.
Syntax
void duckdb_replacement_scan_set_function_name(
duckdb_replacement_scan_info info,
const char *function_name
);
Parameters
• info
• function_name
Syntax
void duckdb_replacement_scan_add_parameter(
duckdb_replacement_scan_info info,
duckdb_value parameter
);
Parameters
• info
• parameter
duckdb_replacement_scan_set_error Report that an error has occurred while executing the replacement scan.
Syntax
void duckdb_replacement_scan_set_error(
duckdb_replacement_scan_info info,
const char *error
);
Parameters
• info
• error
duckdb_appender_create Creates an appender object.
Syntax
duckdb_state duckdb_appender_create(
duckdb_connection connection,
const char *schema,
const char *table,
duckdb_appender *out_appender
);
Parameters
• connection
• schema
The schema of the table to append to, or nullptr for the default schema.
• table
• out_appender
• returns
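A typical append sequence (a sketch, assuming a table tbl(i INTEGER, s VARCHAR) already exists):
duckdb_appender appender;
duckdb_appender_create(con, NULL, "tbl", &appender); // NULL: default schema
duckdb_append_int32(appender, 42);
duckdb_append_varchar(appender, "hello");
duckdb_appender_end_row(appender); // finishes the current row
duckdb_appender_destroy(&appender); // flushes remaining rows and frees the appender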
duckdb_appender_column_count Returns the number of columns in the table that belongs to the appender.
Syntax
idx_t duckdb_appender_column_count(
duckdb_appender appender
);
Parameters
• appender
• returns
Syntax
duckdb_logical_type duckdb_appender_column_type(
duckdb_appender appender,
idx_t col_idx
);
Parameters
• appender
• col_idx
• returns
duckdb_appender_error Returns the error message associated with the given appender. If the appender has no error message, this
returns nullptr instead.
The error message should not be freed. It will be de‑allocated when duckdb_appender_destroy is called.
Syntax
const char *duckdb_appender_error(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_flush Flush the appender to the table, forcing the cache of the appender to be cleared and the data to be ap‑
pended to the base table.
This should generally not be used unless you know what you are doing. Instead, call duckdb_appender_destroy when you are done
with the appender.
Syntax
duckdb_state duckdb_appender_flush(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_close Close the appender, flushing all intermediate state in the appender to the table and closing it for further
appends.
Syntax
duckdb_state duckdb_appender_close(
duckdb_appender appender
);
Parameters
• appender
• returns
duckdb_appender_destroy Close the appender and destroy it, flushing all intermediate state in the appender to the table and de‑allocating all memory associated with the appender.
Syntax
duckdb_state duckdb_appender_destroy(
duckdb_appender *appender
);
Parameters
• appender
• returns
duckdb_appender_begin_row A no‑op function, provided for backwards compatibility reasons. Does nothing. Only duckdb_appender_end_row is required.
Syntax
duckdb_state duckdb_appender_begin_row(
duckdb_appender appender
);
duckdb_appender_end_row Finish the current row of appends. After end_row is called, the next row can be appended.
Syntax
duckdb_state duckdb_appender_end_row(
duckdb_appender appender
);
Parameters
• appender
The appender.
• returns
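Putting these functions together, a minimal sketch of the row‑wise appender workflow (assuming an open connection con and a table created as CREATE TABLE people (id INTEGER, name VARCHAR)):
duckdb_appender appender;
if (duckdb_appender_create(con, NULL, "people", &appender) == DuckDBError) {
    // handle error; NULL means the default schema
}
duckdb_append_int32(appender, 1);
duckdb_append_varchar(appender, "Mark");
duckdb_appender_end_row(appender);
// flushes any remaining rows and de-allocates the appender
duckdb_appender_destroy(&appender);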
The following functions append a value of the indicated type to the current row of the appender.
Syntax
duckdb_state duckdb_append_bool(
duckdb_appender appender,
bool value
);
Syntax
duckdb_state duckdb_append_int8(
duckdb_appender appender,
int8_t value
);
Syntax
duckdb_state duckdb_append_int16(
duckdb_appender appender,
int16_t value
);
Syntax
duckdb_state duckdb_append_int32(
duckdb_appender appender,
int32_t value
);
Syntax
duckdb_state duckdb_append_int64(
duckdb_appender appender,
int64_t value
);
Syntax
duckdb_state duckdb_append_hugeint(
duckdb_appender appender,
duckdb_hugeint value
);
Syntax
duckdb_state duckdb_append_uint8(
duckdb_appender appender,
uint8_t value
);
Syntax
duckdb_state duckdb_append_uint16(
duckdb_appender appender,
uint16_t value
);
Syntax
duckdb_state duckdb_append_uint32(
duckdb_appender appender,
uint32_t value
);
Syntax
duckdb_state duckdb_append_uint64(
duckdb_appender appender,
uint64_t value
);
Syntax
duckdb_state duckdb_append_uhugeint(
duckdb_appender appender,
duckdb_uhugeint value
);
Syntax
duckdb_state duckdb_append_float(
duckdb_appender appender,
float value
);
Syntax
duckdb_state duckdb_append_double(
duckdb_appender appender,
double value
);
Syntax
duckdb_state duckdb_append_date(
duckdb_appender appender,
duckdb_date value
);
Syntax
duckdb_state duckdb_append_time(
duckdb_appender appender,
duckdb_time value
);
Syntax
duckdb_state duckdb_append_timestamp(
duckdb_appender appender,
duckdb_timestamp value
);
Syntax
duckdb_state duckdb_append_interval(
duckdb_appender appender,
duckdb_interval value
);
Syntax
duckdb_state duckdb_append_varchar(
duckdb_appender appender,
const char *val
);
Syntax
duckdb_state duckdb_append_varchar_length(
duckdb_appender appender,
const char *val,
idx_t length
);
Syntax
duckdb_state duckdb_append_blob(
duckdb_appender appender,
const void *data,
idx_t length
);
Syntax
duckdb_state duckdb_append_null(
duckdb_appender appender
);
duckdb_append_data_chunk Appends a pre‑filled data chunk to the specified appender. The types of the data chunk must exactly match the types of the table; no casting is performed. If the types do not match or the appender is in an invalid state, DuckDBError is returned. If the append is successful, DuckDBSuccess is returned.
Syntax
duckdb_state duckdb_append_data_chunk(
duckdb_appender appender,
duckdb_data_chunk chunk
);
Parameters
• appender
• chunk
• returns
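For chunk‑wise appending, a sketch along these lines fills a data chunk manually and hands it to the appender (assumes an appender over a single INTEGER column):
duckdb_logical_type int_type = duckdb_create_logical_type(DUCKDB_TYPE_INTEGER);
duckdb_data_chunk chunk = duckdb_create_data_chunk(&int_type, 1);
int32_t *col = (int32_t *) duckdb_vector_get_data(duckdb_data_chunk_get_vector(chunk, 0));
for (idx_t i = 0; i < 3; i++) {
    col[i] = (int32_t) i; // fill the column data
}
duckdb_data_chunk_set_size(chunk, 3); // mark how many rows the chunk holds
duckdb_append_data_chunk(appender, chunk);
duckdb_destroy_data_chunk(&chunk);
duckdb_destroy_logical_type(&int_type);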
duckdb_query_arrow Executes a SQL query within a connection and stores the full (materialized) result in an arrow structure. If the
query fails to execute, DuckDBError is returned and the error message can be retrieved by calling duckdb_query_arrow_error.
Note that after running duckdb_query_arrow, duckdb_destroy_arrow must be called on the result object even if the query fails,
otherwise the error stored within the result will not be freed correctly.
Syntax
duckdb_state duckdb_query_arrow(
duckdb_connection connection,
const char *query,
duckdb_arrow *out_result
);
Parameters
• connection
• query
• out_result
• returns
duckdb_query_arrow_schema Fetch the internal arrow schema from the arrow result. Remember to call release on the respective
ArrowSchema object.
Syntax
duckdb_state duckdb_query_arrow_schema(
duckdb_arrow result,
duckdb_arrow_schema *out_schema
);
Parameters
• result
• out_schema
• returns
duckdb_prepared_arrow_schema Fetch the internal arrow schema from the prepared statement. Remember to call release on the
respective ArrowSchema object.
Syntax
duckdb_state duckdb_prepared_arrow_schema(
duckdb_prepared_statement prepared,
duckdb_arrow_schema *out_schema
);
Parameters
• prepared
• out_schema
• returns
duckdb_result_arrow_array Convert a data chunk into an arrow struct array. Remember to call release on the respective ArrowAr‑
ray object.
Syntax
void duckdb_result_arrow_array(
duckdb_result result,
duckdb_data_chunk chunk,
duckdb_arrow_array *out_array
);
Parameters
• result
The result object the data chunk has been fetched from.
• chunk
• out_array
duckdb_query_arrow_array Fetch an internal arrow struct array from the arrow result. Remember to call release on the respective
ArrowArray object.
This function can be called multiple times to get the next chunks; each call frees the previous out_array, so consume the out_array before calling this function again.
Syntax
duckdb_state duckdb_query_arrow_array(
duckdb_arrow result,
duckdb_arrow_array *out_array
);
Parameters
• result
• out_array
• returns
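Taken together, a typical flow through the arrow interface looks roughly as follows (a sketch assuming an open connection con; fprintf requires <stdio.h>):
duckdb_arrow result;
if (duckdb_query_arrow(con, "SELECT 42 AS i", &result) == DuckDBError) {
    fprintf(stderr, "%s\n", duckdb_query_arrow_error(result));
}
duckdb_arrow_schema schema = NULL; // wraps an ArrowSchema; call its release when done
duckdb_query_arrow_schema(result, &schema);
duckdb_arrow_array array = NULL;   // wraps an ArrowArray; freed by the next fetch
duckdb_query_arrow_array(result, &array);
// ... hand schema/array to an Arrow consumer here ...
duckdb_destroy_arrow(&result);     // must be called even if the query failed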
duckdb_arrow_column_count Returns the number of columns present in the arrow result object.
Syntax
idx_t duckdb_arrow_column_count(
duckdb_arrow result
);
Parameters
• result
• returns
duckdb_arrow_row_count Returns the number of rows present in the arrow result object.
Syntax
idx_t duckdb_arrow_row_count(
duckdb_arrow result
);
Parameters
• result
• returns
duckdb_arrow_rows_changed Returns the number of rows changed by the query stored in the arrow result. This is relevant only
for INSERT/UPDATE/DELETE queries. For other queries the rows_changed will be 0.
Syntax
idx_t duckdb_arrow_rows_changed(
duckdb_arrow result
);
Parameters
• result
• returns
duckdb_query_arrow_error Returns the error message contained within the result. The error is only set if duckdb_query_
arrow returns DuckDBError.
The error message should not be freed. It will be de‑allocated when duckdb_destroy_arrow is called.
Syntax
const char *duckdb_query_arrow_error(
duckdb_arrow result
);
Parameters
• result
• returns
duckdb_destroy_arrow Closes the result and de‑allocates all memory allocated for the arrow result.
Syntax
void duckdb_destroy_arrow(
duckdb_arrow *result
);
Parameters
• result
duckdb_destroy_arrow_stream Releases the arrow array stream and de‑allocates its memory.
Syntax
void duckdb_destroy_arrow_stream(
duckdb_arrow_stream *stream_p
);
Parameters
• stream
duckdb_execute_prepared_arrow Executes the prepared statement with the given bound parameters, and returns an arrow
query result. Note that after running duckdb_execute_prepared_arrow, duckdb_destroy_arrow must be called on the result
object.
Syntax
duckdb_state duckdb_execute_prepared_arrow(
duckdb_prepared_statement prepared_statement,
duckdb_arrow *out_result
);
Parameters
• prepared_statement
• out_result
• returns
duckdb_arrow_scan Scans the Arrow stream and creates a view with the given name.
Syntax
duckdb_state duckdb_arrow_scan(
duckdb_connection connection,
const char *table_name,
duckdb_arrow_stream arrow
);
Parameters
• connection
• table_name
• arrow
• returns
duckdb_arrow_array_scan Scans the Arrow array and creates a view with the given name. Note that after running duckdb_
arrow_array_scan, duckdb_destroy_arrow_stream must be called on the out stream.
Syntax
duckdb_state duckdb_arrow_array_scan(
duckdb_connection connection,
const char *table_name,
duckdb_arrow_schema arrow_schema,
duckdb_arrow_array arrow_array,
duckdb_arrow_stream *out_stream
);
Parameters
• connection
• table_name
• arrow_schema
• arrow_array
• out_stream
Output array stream that wraps around the passed schema, for releasing/deleting once done.
• returns
duckdb_execute_tasks Execute DuckDB tasks on this thread. Will return after max_tasks have been executed, or if there are no more tasks present.
Syntax
void duckdb_execute_tasks(
duckdb_database database,
idx_t max_tasks
);
Parameters
• database
• max_tasks
duckdb_create_task_state Creates a task state that can be used with duckdb_execute_tasks_state to execute tasks until
duckdb_finish_execution is called on the state.
Syntax
duckdb_task_state duckdb_create_task_state(
duckdb_database database
);
Parameters
• database
• returns
duckdb_execute_tasks_state Execute DuckDB tasks on this thread. The thread will keep on executing tasks forever, until duckdb_finish_execution is called on the state. Multiple threads can share the same duckdb_task_state.
Syntax
void duckdb_execute_tasks_state(
duckdb_task_state state
);
Parameters
• state
duckdb_execute_n_tasks_state Execute DuckDB tasks on this thread. The thread will keep on executing tasks until either duckdb_finish_execution is called on the state, max_tasks tasks have been executed, or there are no more tasks to be executed.
Syntax
idx_t duckdb_execute_n_tasks_state(
duckdb_task_state state,
idx_t max_tasks
);
Parameters
• state
• max_tasks
• returns
duckdb_finish_execution Finish execution on a specific task state.
Syntax
void duckdb_finish_execution(
duckdb_task_state state
);
Parameters
• state
duckdb_task_state_is_finished Check if the provided duckdb_task_state has finished execution.
Syntax
bool duckdb_task_state_is_finished(
duckdb_task_state state
);
Parameters
• state
• returns
duckdb_destroy_task_state Destroys the task state returned from duckdb_create_task_state. Note that this should not be called while there is an active duckdb_execute_tasks_state running on the task state.
Syntax
void duckdb_destroy_task_state(
duckdb_task_state state
);
Parameters
• state
duckdb_execution_is_finished Returns true if the execution of the current query on the connection is finished.
Syntax
bool duckdb_execution_is_finished(
duckdb_connection con
);
Parameters
• con
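A sketch of how these task functions can cooperate with a helper thread (POSIX threads are used purely for illustration; assumes an open database db):
#include <pthread.h>

void *worker(void *arg) {
    // blocks until duckdb_finish_execution is called on the state
    duckdb_execute_tasks_state((duckdb_task_state) arg);
    return NULL;
}

duckdb_task_state state = duckdb_create_task_state(db);
pthread_t thread;
pthread_create(&thread, NULL, worker, state);
// ... run queries on connections to db; the worker thread helps execute them ...
duckdb_finish_execution(state);
pthread_join(thread, NULL);
duckdb_destroy_task_state(state);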
duckdb_stream_fetch_chunk Fetches a data chunk from the (streaming) duckdb_result. This function should be called repeatedly
until the result is exhausted.
If this function is used, none of the other result functions can be used and vice versa (i.e., this function cannot be mixed with the legacy
result functions or the materialized result functions).
It is not known beforehand how many chunks will be returned by this result.
Syntax
duckdb_data_chunk duckdb_stream_fetch_chunk(
duckdb_result result
);
Parameters
• result
• returns
The resulting data chunk. Returns NULL if the result has an error.
C++ API
Installation
The DuckDB C++ API can be installed as part of the libduckdb packages. Please see the installation page for details.
DuckDB implements a custom C++ API. This is built around the abstractions of a database instance (DuckDB class), multiple Connections
to the database instance and QueryResult instances as the result of queries. The header file for the C++ API is duckdb.hpp.
Note. The standard source distribution of libduckdb contains an "amalgamation" of the DuckDB sources, which combines all sources into two files, duckdb.hpp and duckdb.cpp. The duckdb.hpp header is much larger in this case. Regardless of whether you are using the amalgamation or not, just include duckdb.hpp.
Startup & Shutdown To use DuckDB, you must first initialize a DuckDB instance using its constructor. DuckDB() takes as parameter
the database file to read and write from. The special value nullptr can be used to create an in‑memory database. Note that for an
in‑memory database no data is persisted to disk (i.e., all data is lost when you exit the process). The second parameter to the DuckDB
constructor is an optional DBConfig object. In DBConfig, you can set various database parameters, for example the read/write mode
or memory limits. The DuckDB constructor may throw exceptions, for example if the database file is not usable.
With the DuckDB instance, you can create one or many Connection instances using the Connection() constructor. While connections
should be thread‑safe, they will be locked during querying. It is therefore recommended that each thread uses its own connection if you
are in a multithreaded environment.
DuckDB db(nullptr);
Connection con(db);
Querying Connections expose the Query() method to send a SQL query string to DuckDB from C++. Query() fully materializes the
query result as a MaterializedQueryResult in memory before returning at which point the query result can be consumed. There is
also a streaming API for queries, see further below.
// create a table
con.Query("CREATE TABLE integers (i INTEGER, j INTEGER)");
The MaterializedQueryResult instance first contains two fields that indicate whether the query was successful. Query will not throw exceptions under normal circumstances. Instead, invalid queries or other issues will lead to the success boolean field in the query result instance being set to false. In this case an error message may be available in error as a string. If successful, other fields are set:
the type of statement that was just executed (e.g., StatementType::INSERT_STATEMENT) is contained in statement_type. The
high‑level ("Logical type"/"SQL type") types of the result set columns are in types. The names of the result columns are in the names
string vector. In case multiple result sets are returned, for example because the query string contained multiple statements, the result sets can be chained using the next field.
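For instance, consuming a materialized result can look like this (a sketch using the fields described above; newer DuckDB versions expose HasError()/GetError() instead of the success/error fields):
auto result = con.Query("SELECT i, j FROM integers");
if (!result->success) {
    std::cerr << result->error << std::endl; // the query failed
} else {
    result->Print(); // print the result to standard output
}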
DuckDB also supports prepared statements in the C++ API with the Prepare() method. This returns an instance of PreparedState-
ment. This instance can be used to execute the prepared statement with parameters. Below is an example:
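For instance (the table name and parameter value are illustrative):
std::unique_ptr<PreparedStatement> prepare = con.Prepare("SELECT count(*) FROM integers WHERE i = $1");
std::unique_ptr<QueryResult> result = prepare->Execute(12);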
Note. Warning Do not use prepared statements to insert large amounts of data into DuckDB. See the data import documentation
for better options.
UDF API The UDF API allows the definition of user‑defined functions. It is exposed in duckdb::Connection through the methods CreateScalarFunction(), CreateVectorizedFunction(), and variants. These methods create UDFs in the temporary schema (TEMP_SCHEMA) of the owner connection, which is the only one allowed to use and change them.
CreateScalarFunction The user can code an ordinary scalar function and invoke CreateScalarFunction() to register it, and afterward use the UDF in a SELECT statement, for instance:
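A minimal sketch (the function and its registered name are illustrative):
// an ordinary scalar function
bool udf_bool(bool a) {
    return a;
}
// register it on the connection, then use it from SQL
con.CreateScalarFunction<bool, bool>("udf_bool", &udf_bool);
con.Query("SELECT udf_bool(true)")->Print();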
The CreateScalarFunction() methods automatically create vectorized scalar UDFs, so they are as efficient as built‑in functions. There are two variants of this method interface, as follows:
1.
• template parameters: the UDF return type and the argument types (up to three) of the function.
This method automatically discovers from the template typenames the corresponding LogicalTypes:
• bool → LogicalType::BOOLEAN
• int8_t → LogicalType::TINYINT
• int16_t → LogicalType::SMALLINT
• int32_t → LogicalType::INTEGER
• int64_t → LogicalType::BIGINT
• float → LogicalType::FLOAT
• double → LogicalType::DOUBLE
• string_t → LogicalType::VARCHAR
In DuckDB, some primitive types are mapped to the same C++ type; e.g., INTEGER, TIME, and DATE all map to int32_t. For disambiguation, users can use the following overloaded method.
2.
int32_t udf_date(int32_t a) {
return a;
}
• template parameters: the UDF return type and the argument types of the function.
This function checks the template types against the LogicalTypes passed as arguments, and they must match as follows:
• LogicalTypeId::BOOLEAN → bool
• LogicalTypeId::TINYINT → int8_t
• LogicalTypeId::SMALLINT → int16_t
• LogicalTypeId::DATE, LogicalTypeId::TIME, LogicalTypeId::INTEGER → int32_t
• LogicalTypeId::BIGINT, LogicalTypeId::TIMESTAMP → int64_t
• LogicalTypeId::FLOAT, LogicalTypeId::DOUBLE, LogicalTypeId::DECIMAL → double
• LogicalTypeId::VARCHAR, LogicalTypeId::CHAR, LogicalTypeId::BLOB → string_t
• LogicalTypeId::VARBINARY → blob_t
/*
* This vectorized function copies the input values to the result vector
*/
template<typename TYPE>
static void udf_vectorized(DataChunk &args, ExpressionState &state, Vector &result) {
    // set the result vector type
    result.vector_type = VectorType::FLAT_VECTOR;
    // get a raw array from the result
    auto result_data = FlatVector::GetData<TYPE>(result);
    // get a unified ("orrified") view of the single input vector
    auto &input = args.data[0];
    VectorData vdata;
    input.Orrify(args.size(), vdata);
    auto input_data = (TYPE *)vdata.data;
    for (idx_t i = 0; i < args.size(); i++) {
        result_data[i] = input_data[vdata.sel->get_index(i)];
    }
}
• args is a DataChunk that holds a set of input vectors for the UDF, all of the same length;
• state is an ExpressionState that provides information on the query's expression state;
• result is a Vector to store the result values.
The vectors involved can be of any of the following vector types:
• ConstantVector;
• DictionaryVector;
• FlatVector;
• ListVector;
• StringVector;
• StructVector;
• SequenceVector.
CreateVectorizedFunction The CreateVectorizedFunction() methods register a vectorized scalar UDF. Again, there are two variants of the method interface:
1.
• template parameters: the UDF return type and the argument types of the function.
This method automatically discovers from the template typenames the corresponding LogicalTypes:
• bool → LogicalType::BOOLEAN;
• int8_t → LogicalType::TINYINT;
• int16_t → LogicalType::SMALLINT
• int32_t → LogicalType::INTEGER
• int64_t → LogicalType::BIGINT
• float → LogicalType::FLOAT
• double → LogicalType::DOUBLE
• string_t → LogicalType::VARCHAR
2.
CLI
CLI API
Installation
The DuckDB CLI (Command Line Interface) is a single, dependency‑free executable. It is precompiled for Windows, Mac, and Linux for both
the stable version and for nightly builds produced by GitHub Actions. Please see the installation page under the CLI tab for download
links.
The DuckDB CLI is based on the SQLite command line shell, so CLI‑client‑specific functionality is similar to what is described in the SQLite
documentation (although DuckDB's SQL syntax follows PostgreSQL conventions).
Note. DuckDB has a tldr page that summarizes the most common uses of the CLI client. If you have tldr installed, you can display
it by running tldr duckdb.
Getting Started
Once the CLI executable has been downloaded, unzip it and save it to any directory. Navigate to that directory in a terminal and enter the
command duckdb to run the executable. If in a PowerShell or POSIX shell environment, use the command ./duckdb instead.
Usage
duckdb [OPTIONS] [FILENAME]
Options The [OPTIONS] part encodes arguments for the CLI client. For a full list of options, see the command line arguments page.
In‑Memory vs. Persistent Database When no [FILENAME] argument is provided, the DuckDB CLI will open a temporary in‑memory
database. You will see DuckDB's version number, information on the connection, and a prompt starting with a D.
$ duckdb
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
To open or create a persistent database, simply include a path as a command line argument like duckdb path/to/my_
database.duckdb or duckdb my_database.db.
Running SQL Statements in the CLI Once the CLI has been opened, enter a SQL statement followed by a semicolon, then hit enter and it will be executed. Results will be displayed in a table in the terminal. If a semicolon is omitted, hitting enter will allow multi‑line SQL statements to be entered.
SELECT 'quack' AS my_column;
┌───────────┐
│ my_column │
│ varchar │
├───────────┤
│ quack │
└───────────┘
The CLI supports all of DuckDB's rich SQL syntax including SELECT, CREATE, and ALTER statements.
Editor Features The CLI supports autocompletion, and has sophisticated editor features and syntax highlighting on certain platforms.
Exiting the CLI To exit the CLI, press Ctrl‑D if your platform supports it. Otherwise, press Ctrl‑C or use the .exit command. If you used a persistent database, DuckDB will automatically checkpoint (save the latest edits to disk) and close. This will remove the .wal file (the write‑ahead log) and consolidate all of your data into the single‑file database.
Dot Commands In addition to SQL syntax, special dot commands may be entered into the CLI client. To use one of these commands,
begin the line with a period (.) immediately followed by the name of the command you wish to execute. Additional arguments to the
command are entered, space separated, after the command. If an argument must contain a space, either single or double quotes may
be used to wrap that parameter. Dot commands must be entered on a single line, and no whitespace may occur before the period. No
semicolon is required at the end of the line.
Frequently‑used configurations can be stored in the file ~/.duckdbrc, which will be loaded when starting the CLI client. See the Config‑
uring the CLI section below for further information on these options.
Below, we summarize a few important dot commands. To see all available commands, see the dot commands page or use the .help
command.
Opening Database Files In addition to connecting to a database when opening the CLI, a new database connection can be made by using
the .open command. If no additional parameters are supplied, a new in‑memory database connection is created. This database will not
be persisted when the CLI connection is closed.
.open
The .open command optionally accepts several options, but the final parameter can be used to indicate a path to a persistent database
(or where one should be created). The special string :memory: can also be used to open a temporary in‑memory database.
.open persistent.duckdb
One important option accepted by .open is the --readonly flag. This disallows any editing of the database. To open in read‑only mode, the database must already exist. This also means that a new in‑memory database can't be opened in read‑only mode, since in‑memory databases are created upon connection.
Output Formats The .mode dot command may be used to change the appearance of the tables returned in the terminal output. These
include the default duckbox mode, csv and json mode for ingestion by other tools, markdown and latex for documents, and insert
mode for generating SQL statements.
Writing Results to a File By default, the DuckDB CLI sends results to the terminal's standard output. However, this can be modified using
either the .output or .once commands. For details, see the documentation for the output dot command.
Reading SQL from a File The DuckDB CLI can read both SQL commands and dot commands from an external file instead of the terminal
using the .read command. This allows for a number of commands to be run in sequence and allows command sequences to be saved
and reused.
The .read command requires only one argument: the path to the file containing the SQL and/or commands to execute. After running the
commands in the file, control will revert back to the terminal. Output from the execution of that file is governed by the same .output and
.once commands that have been discussed previously. This allows the output to be displayed back to the terminal, as in the first example
below, or out to another file, as in the second example.
In this example, the file select_example.sql is located in the same directory as duckdb.exe and contains the following SQL state‑
ment:
SELECT *
FROM generate_series(5);
.read select_example.sql
The output below is returned to the terminal by default. The formatting of the table can be adjusted using the .output or .once com‑
mands.
| generate_series |
|-----------------|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
Multiple commands, including both SQL and dot commands, can also be run in a single .read command. In this example, the file write_
markdown_to_file.sql is located in the same directory as duckdb.exe and contains the following commands:
.mode markdown
.output series.md
SELECT *
FROM generate_series(5);
.read write_markdown_to_file.sql
In this case, no output is returned to the terminal. Instead, the file series.md is created (or replaced if it already existed) with the
markdown‑formatted results shown here:
| generate_series |
|-----------------|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
Several dot commands can be used to configure the CLI. On startup, the CLI reads and executes all commands in the file ~/.duckdbrc,
including dot commands and SQL statements. This allows you to store the configuration state of the CLI. You may also point to a different
initialization file using the -init flag.
Setting a Custom Prompt As an example, a file in the same directory as the DuckDB CLI named prompt.sql can change the DuckDB prompt to a duck head and run a SQL statement. Note that the duck head is built with Unicode characters and does not work in all terminal environments (e.g., in Windows, unless running with WSL and using the Windows Terminal).
Non‑Interactive Usage
To read/process a file and exit immediately, pipe the file contents in to duckdb:
$ duckdb < select_example.sql
To execute a command with SQL text passed in directly from the command line, call duckdb with two arguments: the database location (or :memory:) and a string with the SQL statement to execute, for example:
$ duckdb :memory: "SELECT 42 AS hello"
Loading Extensions
To load extensions, use DuckDB's SQL INSTALL and LOAD commands as you would other SQL statements.
INSTALL fts;
LOAD fts;
When in a Unix environment, it can be useful to pipe data between multiple commands. DuckDB is able to read data from stdin as well
as write to stdout using the file location of stdin (/dev/stdin) and stdout (/dev/stdout) within SQL commands, as pipes act very
similarly to file handles.
First, read a file and pipe it to the duckdb CLI executable. As arguments to the DuckDB CLI, pass in the location of the database to open (in this case, an in‑memory database) and a SQL command that utilizes /dev/stdin as a file location:
$ cat test.csv | duckdb :memory: "SELECT * FROM read_csv('/dev/stdin')"
┌───────┐
│ woot │
│ int32 │
├───────┤
│ 42 │
│ 43 │
└───────┘
To write back to stdout, the copy command can be used with the /dev/stdout file location.
$ cat test.csv | duckdb :memory: "COPY (SELECT * FROM read_csv('/dev/stdin')) TO '/dev/stdout' WITH
(FORMAT 'csv', HEADER)"
woot
42
43
Examples To retrieve the home directory's path from the HOME environment variable, use:
SELECT getenv('HOME') AS home;
┌──────────────────┐
│ home │
│ varchar │
├──────────────────┤
│ /Users/user_name │
└──────────────────┘
The output of the getenv function can be used to set configuration options. For example, to set the NULL order based on the environment variable DEFAULT_NULL_ORDER, use:
SET default_null_order = getenv('DEFAULT_NULL_ORDER');
Restrictions for Reading Environment Variables The getenv function can only be run when the enable_external_access option is set to true (the default setting). It is only available in the CLI client and is not supported in other DuckDB clients.
Prepared Statements
The DuckDB CLI supports executing prepared statements in addition to regular SELECT statements. To create and execute a prepared
statement in the CLI client, use the PREPARE clause and the EXECUTE statement.
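For example (the query itself is illustrative):
PREPARE my_query AS SELECT 42 * $1 AS result;
EXECUTE my_query(2);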
To list all command line options, use the command duckdb -help. For a list of dot commands available in the CLI shell, see the Dot Commands page.
Dot Commands
Dot commands are available in the DuckDB CLI client. To use one of these commands, begin the line with a period (.) immediately followed
by the name of the command you wish to execute. Additional arguments to the command are entered, space separated, after the command.
If an argument must contain a space, either single or double quotes may be used to wrap that parameter. Dot commands must be entered
on a single line, and no whitespace may occur before the period. No semicolon is required at the end of the line. To see available commands,
use the .help command.
The .help text may be filtered by passing in a text string as the second argument.
.help m
.maxrows COUNT Sets the maximum number of rows for display (default: 40). Only for duckbox mode.
.maxwidth COUNT Sets the maximum width in characters. 0 defaults to terminal width. Only for duckbox mode.
.mode MODE ?TABLE? Set output mode
.output: Writing Results to a File By default, the DuckDB CLI sends results to the terminal's standard output. However, this can be
modified using either the .output or .once commands. Pass in the desired output file location as a parameter. The .once command will
only output the next set of results and then revert to standard out, but .output will redirect all subsequent output to that file location.
Note that each result will overwrite the entire file at that destination. To revert back to standard output, enter .output with no file
parameter.
In this example, the output format is changed to markdown, the destination is identified as a Markdown file, and then DuckDB will write
the output of the SQL statement to that file. Output is then reverted to standard output using .output with no parameter.
.mode markdown
.output my_results.md
SELECT 'taking flight' AS output_column;
.output
SELECT 'back to the terminal' AS displayed_column;
| output_column |
|---------------|
| taking flight |
| displayed_column |
|----------------------|
| back to the terminal |
A common output format is CSV, or comma separated values. DuckDB supports SQL syntax to export data as CSV or Parquet, but the CLI‑
specific commands may be used to write a CSV instead if desired.
.mode csv
.once my_output_file.csv
SELECT 1 AS col_1, 2 AS col_2
UNION ALL
SELECT 10 AS col_1, 20 AS col_2;
col_1,col_2
1,2
10,20
By passing special options (flags) to the .once command, query results can also be sent to a temporary file and automatically opened in
the user's default program. Use either the -e flag for a text file (opened in the default text editor), or the -x flag for a CSV file (opened in
the default spreadsheet editor). This is useful for more detailed inspection of query results, especially if there is a relatively large result set.
The .excel command is equivalent to .once -x.
.once -e
SELECT 'quack' AS hello;
The results then open in the default text file editor of the system, for example:
All DuckDB clients support querying the database schema with SQL, but the CLI has additional dot commands that can make it easier to
understand the contents of a database. The .tables command will return a list of tables in the database. It has an optional argument
that will filter the results according to a LIKE pattern.
For example, to filter to only tables that contain an ”l”, use the LIKE pattern %l%.
.tables %l%
fliers walkers
The .schema command will show all of the SQL statements used to define the schema of the database.
.schema
By default the shell includes support for syntax highlighting. The CLI's syntax highlighter can be configured using the following com‑
mands.
.highlight on
.highlight off
.constant [red|green|yellow|blue|magenta|cyan|white|brightblack|brightred|brightgreen|brightyellow|brightblue|brightmagenta|brightcyan|brightwhite]
.constantcode [terminal_code]
.keyword [red|green|yellow|blue|magenta|cyan|white|brightblack|brightred|brightgreen|brightyellow|brightblue|brightmagenta|brightcyan|brightwhite]
.keywordcode [terminal_code]
Note. Deprecated This feature is only included for compatibility reasons and may be removed in the future. Use the read_csv
function or the COPY statement to load CSV files.
DuckDB supports SQL syntax to directly query or import CSV files, but the CLI‑specific commands may be used to import a CSV instead if
desired. The .import command takes two arguments and also supports several options. The first argument is the path to the CSV file,
and the second is the name of the DuckDB table to create. Since DuckDB requires stricter typing than SQLite (upon which the DuckDB CLI
is based), the destination table must be created before using the .import command. To automatically detect the schema and create a
table from a CSV, see the read_csv examples in the import docs.
In this example, a CSV file is generated by changing to CSV mode and setting an output file location:
.mode csv
.output import_example.csv
SELECT 1 AS col_1, 2 AS col_2 UNION ALL SELECT 10 AS col_1, 20 AS col_2;
Now that the CSV has been written, a table can be created with the desired schema and the CSV can be imported. The output is reset to the
terminal to avoid continuing to edit the output file specified above. The --skip N option is used to ignore the first row of data since it is
a header row and the table has already been created with the correct column names.
.mode csv
.output
CREATE TABLE test_table (col_1 INT, col_2 INT);
.import import_example.csv test_table --skip 1
Note that the .import command utilizes the current .mode and .separator settings when identifying the structure of the data to
import. The --csv option can be used to override that behavior.
Output Formats
The .mode dot command may be used to change the appearance of the tables returned in the terminal output. In addition to customizing
the appearance, these modes have additional benefits. This can be useful for presenting DuckDB output elsewhere by redirecting the
terminal output to a file. Using the insert mode will build a series of SQL statements that can be used to insert the data at a later point.
The markdown mode is particularly useful for building documentation and the latex mode is useful for writing academic papers.
.mode markdown
SELECT 'quacking intensifies' AS incoming_ducks;
| incoming_ducks |
|----------------------|
| quacking intensifies |
The output appearance can also be adjusted with the .separator command. If using an export mode that relies on a separator (csv or
tabs for example), the separator will be reset when the mode is changed. For example, .mode csv will set the separator to a comma (,).
Using .separator "|" will then convert the output to be pipe‑separated.
.mode csv
SELECT 1 AS col_1, 2 AS col_2
UNION ALL
SELECT 10 AS col_1, 20 AS col_2;
col_1,col_2
1,2
10,20
.separator "|"
SELECT 1 AS col_1, 2 AS col_2
UNION ALL
SELECT 10 AS col_1, 20 AS col_2;
col_1|col_2
1|2
10|20
Editing
Note. The linenoise‑based CLI editor is currently only available for macOS and Linux.
DuckDB's CLI uses a line‑editing library based on linenoise, which has shortcuts based on the Emacs mode of readline. Below is a list
of available commands.
Moving
History
Changing Text
Completing
Miscellaneous
Key Action
Enter Execute query. If query is not complete, insert a newline at the end of the buffer
Ctrl+J Execute query. If query is not complete, insert a newline at the end of the buffer
Ctrl+C Cancel editing of current query
Ctrl+G Cancel editing of current query
Ctrl+L Clear screen
Ctrl+O Cancel editing of current query
Ctrl+X Insert a newline after the cursor
Ctrl+Z Suspend CLI and return to shell, use fg to re‑open
Using Read‑Line
If you prefer, you can use rlwrap to use read‑line directly with the shell. Then, use Shift+Enter to insert a newline and Enter to execute
the query:
Autocomplete
The shell offers context‑aware autocomplete of SQL queries through the autocomplete extension. Autocomplete is triggered by pressing Tab.
Multiple autocomplete suggestions can be present. You can cycle forwards through the suggestions by repeatedly pressing Tab, or Shift+Tab to cycle backwards. Autocompletion can be reverted by pressing ESC twice.
The shell autocompletes four different groups:
• Keywords
• Table names and table functions
• Column names and scalar functions
• File names
The shell looks at the position in the SQL statement to determine which of these autocompletions to trigger.
Syntax Highlighting
Note. Syntax highlighting in the CLI is currently only available for macOS and Linux.
SQL queries that are written in the shell are automatically highlighted using syntax highlighting.
There are several components of a query that are highlighted in different colors. The colors can be configured using dot commands. Syntax
highlighting can also be disabled entirely using the .highlight off command.
The components can be configured using either a supported color name (e.g., .keyword red), or by directly providing a terminal code to
use for rendering (e.g., .keywordcode \033[31m). Below is a list of supported color names and their corresponding terminal codes.
red \033[31m
green \033[32m
yellow \033[33m
blue \033[34m
magenta \033[35m
cyan \033[36m
white \033[37m
brightblack \033[90m
brightred \033[91m
brightgreen \033[92m
brightyellow \033[93m
brightblue \033[94m
brightmagenta \033[95m
brightcyan \033[96m
brightwhite \033[97m
.keyword brightred
.constant brightwhite
.comment cyan
.error yellow
.cont blue
.cont_sel brightblue
If you wish to start up the CLI with a different set of colors every time, you can place these commands in the ~/.duckdbrc file that is
loaded on start‑up of the CLI.
Error Highlighting
The shell has support for highlighting certain errors. In particular, mismatched brackets and unclosed quotes are highlighted in red (or
another color if specified). This highlighting is automatically disabled for large queries. In addition, it can be disabled manually using the
.render_errors off command.
Go
The DuckDB Go driver, go-duckdb, allows using DuckDB via the database/sql interface. For examples on how to use this interface,
see the official documentation and tutorial.
Installation
go get github.com/marcboeker/go-duckdb
Importing
To import the DuckDB Go package, add the following entries to your imports:
import (
"database/sql"
_ "github.com/marcboeker/go-duckdb"
)
Appender
The DuckDB Go client supports the DuckDB Appender API for bulk inserts. You can obtain a new Appender by supplying a DuckDB connec‑
tion to NewAppenderFromConn(). For example:
// Retrieve appender from connection (note that you have to create the table 'test' beforehand).
appender, err := NewAppenderFromConn(conn, "", "test")
if err != nil {
...
}
defer appender.Close()
err = appender.AppendRow(...)
if err != nil {
...
}
Examples
package main
import (
	"database/sql"
	"errors"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

func main() {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var (
		id   int
		name string
	)
	row := db.QueryRow("SELECT id, name FROM people")
	err = row.Scan(&id, &name)
	if errors.Is(err, sql.ErrNoRows) {
		log.Println("no rows")
	} else if err != nil {
		log.Fatal(err)
	}
	// print the retrieved row
	fmt.Printf("id: %d, name: %s\n", id, name)
}
More Examples For more examples, see the examples in the duckdb-go repository.
Installation
The DuckDB Java JDBC API can be installed from Maven Central. Please see the installation page for details.
DuckDB's JDBC API implements the main parts of the standard Java Database Connectivity (JDBC) API, version 4.1. Describing JDBC is
beyond the scope of this page; see the official documentation for details. Below, we focus on the DuckDB‑specific parts.
Refer to the externally hosted API Reference for more information about our extensions to the JDBC specification, or the below Arrow
Methods.
Startup & Shutdown In JDBC, database connections are created through the standard java.sql.DriverManager class. The driver
should auto‑register in the DriverManager, if that does not work for some reason, you can enforce registration like so:
Class.forName("org.duckdb.DuckDBDriver");
To create a DuckDB connection, call DriverManager with the jdbc:duckdb: JDBC URL prefix, like so:
import java.sql.Connection;
import java.sql.DriverManager;
Connection conn = DriverManager.getConnection("jdbc:duckdb:");
To use DuckDB‑specific features such as the Appender, cast the object to a DuckDBConnection:
import org.duckdb.DuckDBConnection;
When using the jdbc:duckdb: URL alone, an in‑memory database is created. Note that for an in‑memory database no data is persisted
to disk (i.e., all data is lost when you exit the Java program). If you would like to access or create a persistent database, append its file name
after the path. For example, if your database is stored in /tmp/my_database, use the JDBC URL jdbc:duckdb:/tmp/my_database
to create a connection to it.
It is possible to open a DuckDB database file in read‑only mode. This is for example useful if multiple Java processes want to read the same
database file at the same time. To open an existing database file in read‑only mode, set the connection property duckdb.read_only
like so:
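For example (the database path is illustrative):
Properties ro_prop = new Properties();
ro_prop.setProperty("duckdb.read_only", "true");
Connection conn_ro = DriverManager.getConnection("jdbc:duckdb:/tmp/my_database", ro_prop);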
Additional connections can be created using the DriverManager. A more efficient mechanism is to call the DuckDBConnection#duplicate() method, like so:
Connection conn2 = ((DuckDBConnection) conn).duplicate();
Multiple connections are allowed, but mixing read‑write and read‑only connections is unsupported.
Configuring Connections Configuration options can be provided to change different settings of the database system. Note that many
of these settings can be changed later on using PRAGMA statements as well.
Querying DuckDB supports the standard JDBC methods to send queries and retrieve result sets. First a Statement object has to be
created from the Connection; this object can then be used to send queries using execute and executeQuery. execute() is meant for queries where no results are expected, like CREATE TABLE or UPDATE, while executeQuery() is meant for queries that produce results (e.g., SELECT). Below are two examples. See also the JDBC Statement and ResultSet documentation.
// create a table
Statement stmt = conn.createStatement();
stmt.execute("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)");
// insert two items into the table
stmt.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)");
stmt.close();
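And a second, illustrative example that retrieves the inserted rows (assuming the items table from above):
// retrieve the items again
Statement stmt2 = conn.createStatement();
ResultSet rs = stmt2.executeQuery("SELECT * FROM items");
while (rs.next()) {
    System.out.println(rs.getString(1));
    System.out.println(rs.getInt(3));
}
rs.close();
stmt2.close();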
Note. Warning Do not use prepared statements to insert large amounts of data into DuckDB. See the data import documentation
for better options.
Arrow Export The following demonstrates exporting an arrow stream and consuming it using the Java arrow bindings.
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.duckdb.DuckDBResultSet;
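A sketch of such an export (the query and the batch size of 256 are illustrative):
try (var conn = DriverManager.getConnection("jdbc:duckdb:");
     var stmt = conn.prepareStatement("SELECT * FROM generate_series(2000)");
     var resultset = (DuckDBResultSet) stmt.executeQuery();
     var allocator = new RootAllocator()) {
    // export in batches of 256 rows and consume them with the Arrow reader
    try (var reader = (ArrowReader) resultset.arrowExportStream(allocator, 256)) {
        while (reader.loadNextBatch()) {
            System.out.println(reader.getVectorSchemaRoot().getRowCount());
        }
    }
}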
Arrow Import The following demonstrates consuming an arrow stream from the Java arrow bindings.
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.duckdb.DuckDBConnection;
// Arrow stuff
try (var allocator = new RootAllocator();
ArrowStreamReader reader = null; // should not be null of course
var arrow_array_stream = ArrowArrayStream.allocateNew(allocator)) {
Data.exportArrayStream(allocator, reader, arrow_array_stream);
// DuckDB stuff
try (var conn = (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:")) {
conn.registerArrowStream("asdf", arrow_array_stream);
// run a query
try (var stmt = conn.createStatement();
var rs = (DuckDBResultSet) stmt.executeQuery("SELECT count(*) FROM asdf")) {
while (rs.next()) {
System.out.println(rs.getInt(1));
}
}
}
}
Streaming Results Result streaming is opt‑in in the JDBC driver: set the jdbc_stream_results config to true before running a query. The easiest way to do that is to pass it in the Properties object.
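For example:
Properties props = new Properties();
props.setProperty(DuckDBDriver.JDBC_STREAM_RESULTS, String.valueOf(true));
Connection conn = DriverManager.getConnection("jdbc:duckdb:", props);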
Appender The Appender is available in the DuckDB JDBC driver via the org.duckdb.DuckDBAppender class. The constructor of the
class requires the schema name and the table name it is applied to. The Appender is flushed when the close() method is called.
Example:
import org.duckdb.DuckDBConnection;
DuckDBConnection conn = (DuckDBConnection) DriverManager.getConnection("jdbc:duckdb:");
Statement stmt = conn.createStatement();
stmt.execute("CREATE TABLE tbl (x BIGINT, y FLOAT, s VARCHAR)");
// using try-with-resources to automatically close the appender at the end of the scope
try (var appender = conn.createAppender(DuckDBConnection.DEFAULT_SCHEMA, "tbl")) {
appender.beginRow();
appender.append(10);
appender.append(3.2);
appender.append("hello");
appender.endRow();
appender.beginRow();
appender.append(20);
appender.append(-8.1);
appender.append("world");
appender.endRow();
}
stmt.close();
Batch Writer The DuckDB JDBC driver offers batch write functionality. The batch writer supports prepared statements to mitigate the
overhead of query parsing.
Note. The preferred method for bulk inserts is to use the Appender due to its higher performance. However, when using the Appender is not possible, the batch writer is available as an alternative.
// assumes a table: CREATE TABLE test (x INTEGER, y INTEGER, z INTEGER)
PreparedStatement stmt = conn.prepareStatement("INSERT INTO test (x, y, z) VALUES (?, ?, ?);");
stmt.setObject(1, 1);
stmt.setObject(2, 2);
stmt.setObject(3, 3);
stmt.addBatch();
stmt.setObject(1, 4);
stmt.setObject(2, 5);
stmt.setObject(3, 6);
stmt.addBatch();
stmt.executeBatch();
stmt.close();
Batch Writer with Vanilla Statements The batch writer also supports vanilla SQL statements:
import org.duckdb.DuckDBConnection;
Statement stmt = conn.createStatement();
stmt.addBatch("INSERT INTO test (x, y, z) VALUES (1, 2, 3);"); // illustrative statements
stmt.addBatch("INSERT INTO test (x, y, z) VALUES (4, 5, 6);");
stmt.executeBatch();
stmt.close();
Julia Package
The DuckDB Julia package provides a high‑performance front‑end for DuckDB. Much like SQLite, DuckDB runs in‑process within the Julia
client, and provides a DBInterface front‑end.
The package also supports multi‑threaded execution. It uses Julia threads/tasks for this purpose. If you wish to run queries in parallel, you must launch Julia with multi‑threading support (e.g., by setting the JULIA_NUM_THREADS environment variable).
Installation
using Pkg
Pkg.add("DuckDB")
Alternatively, enter the package manager using the ] key, and issue the following command:
add DuckDB
Basics
using DuckDB
# create a new in-memory database
con = DBInterface.connect(DuckDB.DB, ":memory:")
# create a table
DBInterface.execute(con, "CREATE TABLE integers (i INTEGER)")
Scanning DataFrames
The DuckDB Julia package also provides support for querying Julia DataFrames. Note that the DataFrames are directly read by DuckDB ‑
they are not inserted or copied into the database itself.
If you wish to load data from a DataFrame into a DuckDB table you can run a CREATE TABLE ... AS or INSERT INTO query.
using DuckDB
using DataFrames
# create a DataFrame
df = DataFrame(a = [1, 2, 3], b = [42, 84, 42])
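The DataFrame can then be registered as a named view and queried (a sketch; the name my_df is arbitrary):
# register the DataFrame as a view in the database
DuckDB.register_data_frame(con, df, "my_df")
# run a SQL query over the DataFrame
results = DBInterface.execute(con, "SELECT * FROM my_df")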
Appender API
The DuckDB Julia package also supports the Appender API, which is much faster than using prepared statements or individual INSERT INTO statements. Appends are made in row‑wise format. For every column, an append() call should be made, after which the row should be finished by calling end_row(). After all rows have been appended, flush() should be called, followed by close() to finalize the appender and clean up the resulting memory.
# append data row-wise; assumes a table `data` created beforehand and a DataFrame `df` with matching columns
appender = DuckDB.Appender(db, "data")
for i in eachrow(df)
for j in i
DuckDB.append(appender, j)
end
DuckDB.end_row(appender)
end
# flush the appender after all rows
DuckDB.flush(appender)
DuckDB.close(appender)
Concurrency
Within a Julia process, tasks can concurrently read and write to the database, as long as each task maintains its own connection to the database. In the example below, a single task is spawned to periodically read the database, and many tasks are spawned to write to the database using both INSERT statements and the Appender API.
function run_reader(db)
# create a DuckDB connection specifically for this task
conn = DBInterface.connect(db)
while true
println(DBInterface.execute(conn,
"SELECT id, count(date) as count, max(date) as max_date
FROM data group by id order by id") |> DataFrames.DataFrame)
Threads.sleep(1)
end
DBInterface.close(conn)
end
# spawn one reader task
Threads.@spawn run_reader(db)
DuckDB.flush(appender);
end
DuckDB.close(appender);
end
# spawn many appender tasks
for i in 1:100
Threads.@spawn run_appender(db, 2)
end
Node.js
Node.js API
This package provides a Node.js API for DuckDB. The API for this client is somewhat compatible with the SQLite Node.js client, for easier transition.
Initializing
All options as described on Database configuration can be (optionally) supplied to the Database constructor as second argument. The
third argument can be optionally supplied to get feedback on the given options.
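A minimal initialization looks like this (':memory:' can be replaced by a file name for a persistent database):
const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');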
Running a Query
The following code snippet runs a simple query using the Database.all() method.
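For instance (the query is illustrative):
db.all('SELECT 42 AS fortytwo', function(err, res) {
  if (err) {
    console.warn(err);
    return;
  }
  console.log(res[0].fortytwo);
});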
Other available methods are each, where the callback is invoked for each row, run to execute a single statement without results and
exec, which can execute several SQL commands at once but also does not return results. All those commands can work with prepared
statements, taking the values for the parameters as additional arguments. For example like so:
db.all('SELECT ?::INTEGER AS fortytwo, ?::STRING AS hello', 42, 'Hello, World', function(err, res) {
if (err) {
console.warn(err);
return;
}
console.log(res[0].fortytwo)
console.log(res[0].hello)
});
Connections
A database can have multiple Connections, which are created using db.connect():
var con = db.connect();
You can create multiple connections, each with their own transaction context.
Connection objects also contain shorthands to directly call run(), all() and each() with parameters and callbacks, respectively,
for example:
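For instance (a sketch, assuming the connection con created above):
con.all('SELECT 42 AS fortytwo', function(err, res) {
  if (err) {
    console.warn(err);
    return;
  }
  console.log(res[0].fortytwo);
});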
Prepared Statements
From connections, you can create prepared statements (and only that) using con.prepare():
var stmt = con.prepare('SELECT ?::INTEGER AS fortytwo');
To execute this statement, you can call, for example, all() on the stmt object:
stmt.all(42, function(err, res) {
  if (err) {
    console.warn(err);
  } else {
    console.log(res[0].fortytwo);
  }
});
You can also execute the prepared statement multiple times. This is for example useful to fill a table with data:
con.run('CREATE TABLE a (i INTEGER)');
var stmt = con.prepare('INSERT INTO a VALUES (?)');
for (var i = 0; i < 10; i++) {
  stmt.run(i);
}
stmt.finalize();
con.all('SELECT * FROM a', function(err, res) {
  if (err) {
    console.warn(err);
  } else {
    console.log(res)
  }
});
prepare() can also take a callback which gets the prepared statement as an argument:
var stmt = con.prepare('SELECT ?::INTEGER AS fortytwo', function(err, stmt) {
  stmt.all(42, function(err, res) {
    if (err) {
      console.warn(err);
    } else {
      console.log(res[0].fortytwo);
    }
  });
});
Apache Arrow can be used to insert data into DuckDB without making a copy:
const jsonData = [
{"userId":1,"id":1,"title":"delectus aut autem","completed":false},
{"userId":1,"id":2,"title":"quis ut nam facilis et officia qui","completed":false}
];
Node.js API
Modules
Typedefs
duckdb
• duckdb
– ~Connection
* .sql ⇒
* .get()
* .run(sql, ...params, callback) ⇒ void
* .all(sql, ...params, callback) ⇒ void
* .arrowIPCAll(sql, ...params, callback) ⇒ void
* .each(sql, ...params, callback) ⇒ void
* .finalize(sql, ...params, callback) ⇒ void
* .stream(sql, ...params)
* .columns() ⇒ Array.<ColumnInfo>
– ~QueryResult
* .nextChunk() ⇒
* .nextIpcBuffer() ⇒
* .asyncIterator()
– ~Database
* .close(callback) ⇒ void
* .close_internal(callback) ⇒ void
* .wait(callback) ⇒ void
* .serialize(callback) ⇒ void
* .parallelize(callback) ⇒ void
* .connect(path) ⇒ Connection
* .interrupt(callback) ⇒ void
* .prepare(sql) ⇒ Statement
* .run(sql, ...params, callback) ⇒ void
* .scanArrowIpc(sql, ...params, callback) ⇒ void
* .each(sql, ...params, callback) ⇒ void
* .all(sql, ...params, callback) ⇒ void
• ~Connection
connection.run(sql, ...params, callback) ⇒ void Run a SQL statement and trigger a callback when done
Param Type
sql
...params *
callback
connection.all(sql, ...params, callback) ⇒ void Run a SQL query and triggers the callback once for all result rows
Param Type
sql
...params *
callback
connection.arrowIPCAll(sql, ...params, callback) ⇒ void Run a SQL query and serialize the result into the Apache Arrow IPC format
(requires arrow extension to be loaded)
Param Type
sql
...params *
callback
connection.arrowIPCStream(sql, ...params, callback) ⇒ Run a SQL query; returns an IpcResultStreamIterator that allows streaming the result into the Apache Arrow IPC format (requires arrow extension to be loaded)
Param Type
sql
...params *
callback
connection.each(sql, ...params, callback) ⇒ void Runs a SQL query and triggers the callback for each result row
Param Type
sql
...params *
callback
Param Type
sql
...params *
Param
name
return_type
fun
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param
name
return_type
callback
Param
name
return_type
Param
callback
connection.register_buffer(name, array, force, callback) ⇒ void Register a Buffer to be scanned using the Apache Arrow IPC scanner
(requires arrow extension to be loaded)
Param
name
array
force
callback
Param
name
callback
Param
callback
• ~Statement
– .sql ⇒
– .get()
– .run(sql, ...params, callback) ⇒ void
– .all(sql, ...params, callback) ⇒ void
– .arrowIPCAll(sql, ...params, callback) ⇒ void
– .each(sql, ...params, callback) ⇒ void
– .finalize(sql, ...params, callback) ⇒ void
– .stream(sql, ...params)
– .columns() ⇒ Array.<ColumnInfo>
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param Type
sql
...params *
• ~QueryResult
– .nextChunk() ⇒
– .nextIpcBuffer() ⇒
– .asyncIterator()
queryResult.nextIpcBuffer() ⇒ Function to fetch the next result blob of an Arrow IPC Stream in a zero‑copy way (requires arrow extension to be loaded)
Param Description
• ~Database
– .close(callback) ⇒ void
– .close_internal(callback) ⇒ void
– .wait(callback) ⇒ void
– .serialize(callback) ⇒ void
– .parallelize(callback) ⇒ void
– .connect(path) ⇒ Connection
– .interrupt(callback) ⇒ void
– .prepare(sql) ⇒ Statement
– .run(sql, ...params, callback) ⇒ void
– .scanArrowIpc(sql, ...params, callback) ⇒ void
– .each(sql, ...params, callback) ⇒ void
– .all(sql, ...params, callback) ⇒ void
– .arrowIPCAll(sql, ...params, callback) ⇒ void
– .arrowIPCStream(sql, ...params, callback) ⇒ void
– .exec(sql, ...params, callback) ⇒ void
– .register_udf(name, return_type, fun) ⇒ this
– .register_buffer(name) ⇒ this
– .unregister_buffer(name) ⇒ this
– .unregister_udf(name) ⇒ this
– .registerReplacementScan(fun) ⇒ this
– .tokenize(text) ⇒ ScriptTokens
– .get()
Param
callback
Param
callback
database.wait(callback) ⇒ void Triggers callback when all scheduled database tasks have completed.
Param
callback
Param
callback
Param
callback
Param Description
database.interrupt(callback) ⇒ void Supposedly interrupt queries, but currently does not do anything.
Param
callback
Param
sql
database.run(sql, ...params, callback) ⇒ void Convenience method for Connection#run using a built‑in default connection
Param Type
sql
...params *
callback
database.scanArrowIpc(sql, ...params, callback) ⇒ void Convenience method for Connection#scanArrowIpc using a built‑in default
connection
Param Type
sql
...params *
Param Type
callback
Param Type
sql
...params *
callback
database.all(sql, ...params, callback) ⇒ void Convenience method for Connection#all using a built‑in default connection
Param Type
sql
...params *
callback
database.arrowIPCAll(sql, ...params, callback) ⇒ void Convenience method for Connection#arrowIPCAll using a built‑in default con‑
nection
Param Type
sql
...params *
callback
database.arrowIPCStream(sql, ...params, callback) ⇒ void Convenience method for Connection#arrowIPCStream using a built‑in de‑
fault connection
Param Type
sql
...params *
callback
Param Type
sql
...params *
callback
Param
name
return_type
fun
database.register_buffer(name) ⇒ this Register a buffer containing serialized data to be scanned from DuckDB.
Param
name
Param
name
Param
name
Param Description
Param
text
duckdb~ERROR : number Check that errno attribute equals this to check for a duckdb error
ColumnInfo : object
TypeInfo : object
id string Type ID
[alias] string SQL type alias
sql_type string SQL type name
DuckDbError : object
HTTPError : object
Python
Python API
Installation
The DuckDB Python API can be installed using pip: pip install duckdb. Please see the installation page for details. It is also possible
to install DuckDB using conda: conda install python-duckdb -c conda-forge.
The most straightforward way of running SQL queries using DuckDB is to use the duckdb.sql command.
import duckdb
duckdb.sql("SELECT 42").show()
This will run queries using an in‑memory database that is stored globally inside the Python module. The result of the query is returned
as a Relation. A relation is a symbolic representation of the query. The query is not executed until the result is fetched or requested to be
printed to the screen.
Relations can be referenced in subsequent queries by storing them inside variables, and using them as tables. This way queries can be
constructed incrementally.
import duckdb
r1 = duckdb.sql("SELECT 42 AS i")
duckdb.sql("SELECT i * 2 AS k FROM r1").show()
Data Input
DuckDB can ingest data from a wide variety of formats – both on‑disk and in‑memory. See the data ingestion page for more information.
import duckdb
duckdb.read_csv("example.csv") # read a CSV file into a Relation
duckdb.read_parquet("example.parquet") # read a Parquet file into a Relation
duckdb.read_json("example.json") # read a JSON file into a Relation
DataFrames DuckDB can also directly query Pandas DataFrames, Polars DataFrames and Arrow tables.
import duckdb
import pandas as pd

# directly query a Pandas DataFrame
pandas_df = pd.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM pandas_df").show()
Result Conversion
DuckDB supports converting query results efficiently to a variety of formats. See the result conversion page for more information.
import duckdb
duckdb.sql("SELECT 42").fetchall() # Python objects
duckdb.sql("SELECT 42").df() # Pandas DataFrame
duckdb.sql("SELECT 42").pl() # Polars DataFrame
duckdb.sql("SELECT 42").arrow() # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays
DuckDB supports writing Relation objects directly to disk in a variety of formats. The COPY statement can be used to write data to disk
using SQL as an alternative.
import duckdb
duckdb.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
duckdb.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
duckdb.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file
When using DuckDB through duckdb.sql(), it operates on an in‑memory database, i.e., no tables are persisted on disk. Invoking the
duckdb.connect() method without arguments returns a connection, which also uses an in‑memory database:
import duckdb
con = duckdb.connect()
con.sql("SELECT 42 AS x").show()
Persistent Storage
The duckdb.connect(dbname) call creates a connection to a persistent database. Any data written to that connection will be persisted, and can be reloaded by reconnecting to the same file, both from Python and from other DuckDB clients.
import duckdb
con = duckdb.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections are also closed implicitly when they go out of scope
You can also use a context manager to ensure that the connection is closed:
import duckdb
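# A minimal sketch of the context-manager form (the file name is illustrative):
with duckdb.connect("file.db") as con:
    con.sql("CREATE TABLE IF NOT EXISTS test (i INTEGER)")
    con.sql("INSERT INTO test VALUES (42)")
    con.table("test").show()
# the connection is closed automatically when the block exits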
The connection object and the duckdb module can be used interchangeably – they support the same methods. The only difference is that
when using the duckdb module a global in‑memory database is used.
Note that if you are developing a package designed for others to use, and you use DuckDB in the package, it is recommended that you create connection objects instead of using the methods on the duckdb module. That is because the duckdb module uses a shared global database, which can cause hard-to-debug issues if used from within multiple different packages.
The DuckDBPyConnection object is not thread‑safe. If you would like to write to the same database from multiple threads, create a
cursor for each thread with the DuckDBPyConnection.cursor() method.
DuckDB's Python API provides functions for installing and loading extensions, which perform the equivalent operations to running the INSTALL and LOAD SQL commands, respectively. An example that installs and loads the spatial extension is as follows:
import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")
Note. To load unsigned extensions, add the config = {"allow_unsigned_extensions": "true"} argument to the
duckdb.connect() method.
Data Ingestion
CSV Files
CSV files can be read using the read_csv function, called either from within Python or directly from within SQL. By default, the read_
csv function attempts to auto‑detect the CSV settings by sampling from the provided file.
import duckdb
# read from a file using fully auto-detected settings
duckdb.read_csv("example.csv")
# read multiple CSV files from a folder
duckdb.read_csv("folder/*.csv")
# specify options on how the CSV is formatted internally
duckdb.read_csv("example.csv", header = False, sep = ",")
# override types of the first two columns
duckdb.read_csv("example.csv", dtype = ["int", "varchar"])
# use the (experimental) parallel CSV reader
duckdb.read_csv("example.csv", parallel = True)
# directly read a CSV file from within SQL
duckdb.sql("SELECT * FROM 'example.csv'")
# call read_csv from within SQL
duckdb.sql("SELECT * FROM read_csv('example.csv')")
Parquet Files
Parquet files can be read using the read_parquet function, called either from within Python or directly from within SQL.
import duckdb
# read from a single Parquet file
duckdb.read_parquet("example.parquet")
# read multiple Parquet files from a folder
duckdb.read_parquet("folder/*.parquet")
# read a Parquet file over HTTPS
duckdb.read_parquet("https://some.url/some_file.parquet")
# read a list of Parquet files
duckdb.read_parquet(["file1.parquet", "file2.parquet", "file3.parquet"])
# directly read a Parquet file from within SQL
duckdb.sql("SELECT * FROM 'example.parquet'")
# call read_parquet from within SQL
duckdb.sql("SELECT * FROM read_parquet('example.parquet')")
JSON Files
JSON files can be read using the read_json function, called either from within Python or directly from within SQL. By default, the read_
json function will automatically detect if a file contains newline‑delimited JSON or regular JSON, and will detect the schema of the objects
stored within the JSON file.
import duckdb
# read from a single JSON file
duckdb.read_json("example.json")
# read multiple JSON files from a folder
duckdb.read_json("folder/*.json")
# directly read a JSON file from within SQL
duckdb.sql("SELECT * FROM 'example.json'")
# call read_json from within SQL
duckdb.sql("SELECT * FROM read_json_auto('example.json')")
DuckDB is automatically able to query a Pandas DataFrame, Polars DataFrame, or Arrow object that is stored in a Python variable by name.
Accessing these is made possible by replacement scans.
DuckDB supports querying multiple types of Apache Arrow objects including tables, datasets, RecordBatchReaders, and scanners. See the
Python guides for more examples.
import duckdb
import pandas as pd
test_df = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three", "four"]})
duckdb.sql("SELECT * FROM test_df").fetchall()
# [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
DuckDB also supports "registering" a DataFrame or Arrow object as a virtual table, comparable to a SQL VIEW. This is useful when querying a DataFrame/Arrow object that is stored in another way (as a class variable, or a value in a dictionary). Below is a Pandas example of manually registering a DataFrame that is stored inside a dictionary:
import duckdb
import pandas as pd
my_dictionary = {}
my_dictionary["test_df"] = pd.DataFrame.from_dict({"i": [1, 2, 3, 4], "j": ["one", "two", "three",
"four"]})
duckdb.register("test_df_view", my_dictionary["test_df"])
duckdb.sql("SELECT * FROM test_df_view").fetchall()
# [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
You can also create a persistent table in DuckDB from the contents of the DataFrame (or the view):
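For example, a minimal sketch (assuming the test_df_view registration above; the table name is illustrative):
import duckdb
# materialize the view's contents as a table
duckdb.sql("CREATE TABLE test_table AS SELECT * FROM test_df_view")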
Pandas DataFrames – object Columns pandas.DataFrame columns of an object dtype require some special care, since such a column can store values of arbitrary type. To convert these columns to DuckDB, we first go through an analyze phase before converting the values. In this analyze phase, a sample of the rows of the column is analyzed to determine the target type. This sample size is by default set to 1000. If the type picked during the analyze step is incorrect, this will result in a "Failed to cast value:" error, in which case you will need to increase the sample size. The sample size can be changed by setting the pandas_analyze_sample config option.
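For example, a minimal sketch (the sample size value is illustrative):
import duckdb
# increase the number of rows sampled when analyzing object columns
duckdb.execute("SET GLOBAL pandas_analyze_sample = 100000")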
Object Conversion
int Since integers can be of arbitrary size in Python, a one-to-one conversion is not possible for ints. Instead we perform these casts in order until one succeeds:
• BIGINT
• INTEGER
• UBIGINT
• UINTEGER
• DOUBLE
When using the DuckDB Value class, it's possible to set a target type, which will influence the conversion.
float These casts are tried in order until one succeeds:
• DOUBLE
• FLOAT
datetime.datetime For datetime we will check pandas.isnull if it's available and return NULL if it returns true. We check
against datetime.datetime.min and datetime.datetime.max to convert to -inf and +inf respectively.
If the datetime has tzinfo, we will use TIMESTAMPTZ, otherwise it becomes TIMESTAMP.
datetime.time If the time has tzinfo, we will use TIMETZ, otherwise it becomes TIME.
datetime.date date converts to the DATE type. We check against datetime.date.min and datetime.date.max to convert
to -inf and +inf respectively.
bytes bytes converts to BLOB by default; when it's used to construct a Value object of type BITSTRING, it maps to BITSTRING instead.
list list becomes a LIST type of the "most permissive" type of its children, for example:
my_list_value = [
12345,
"test"
]
This list will become VARCHAR[], because 12345 can be converted to VARCHAR but 'test' cannot be converted to INTEGER; the resulting value is:
['12345', 'test']
dict The dict object can convert to either STRUCT(...) or MAP(..., ...) depending on its structure. If the dict has a structure
similar to:
my_map_dict = {
"key": [
1, 2, 3
],
"value": [
"one", "two", "three"
]
}
Then we'll convert it to a MAP of key-value pairs of the two lists zipped together. The example above becomes a MAP(INTEGER, VARCHAR).
Note. The names of the fields matter and the two lists need to have the same size.
my_struct_dict = {
1: "one",
"2": 2,
"three": [1, 2, 3],
False: True
}
This dict becomes a STRUCT, with the keys converted to VARCHAR field names.
tuple tuple converts to LIST by default; when it's used to construct a Value object of type STRUCT, it will convert to STRUCT instead.
numpy.ndarray and numpy.datetime64 ndarray and datetime64 are converted by calling tolist() and converting the
result of that.
Result Conversion
DuckDB's Python client provides multiple additional methods that can be used to efficiently retrieve data.
• NumPy
• Pandas
• Apache Arrow
• Polars
Below are some examples using this functionality. See the Python guides for more examples.
# fetch as an Arrow table. Converting to Pandas afterwards just for pretty printing
tbl = con.execute("SELECT * FROM items").fetch_arrow_table()
print(tbl.to_pandas())
# item value count
# 0 jeans 20.00 1
# 1 hammer 42.20 2
# 2 laptop 2000.00 1
# 3 chainsaw 500.00 10
# 4 iphone 300.00 2
Python DB API
The standard DuckDB Python API provides a SQL interface compliant with the DB‑API 2.0 specification described by PEP 249 similar to the
SQLite Python API.
Connection
To use the module, you must first create a DuckDBPyConnection object that represents the database. The connection object takes as
a parameter the database file to read and write from. If the database file does not exist, it will be created (the file extension may be .db,
.duckdb, or anything else). The special value :memory: (the default) can be used to create an in‑memory database. Note that for an
in‑memory database no data is persisted to disk (i.e., all data is lost when you exit the Python process). If you would like to connect to an
existing database in read‑only mode, you can set the read_only flag to True. Read‑only mode is required if multiple Python processes
want to access the same database file at the same time.
By default we create an in-memory database that lives inside the duckdb module. Every method of DuckDBPyConnection is also available on the duckdb module; this global connection is what those methods use. You can also get a reference to this connection by providing the special value :default: to connect.
import duckdb
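# A minimal sketch (assumption): create a table through the module's global
# connection, then read it back through the :default: connection.
duckdb.sql("CREATE TABLE tbl AS SELECT 42 AS a")
con = duckdb.connect(":default:")
con.sql("SELECT * FROM tbl")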
┌───────┐
│ a │
│ int32 │
├───────┤
│ 42 │
└───────┘
import duckdb
# to start an in-memory database
con = duckdb.connect(database = ":memory:")
# to use a database file (not shared between processes)
con = duckdb.connect(database = "my-db.duckdb", read_only = False)
# to use a database file (shared between processes)
con = duckdb.connect(database = "my-db.duckdb", read_only = True)
# to explicitly get the default connection
con = duckdb.connect(database = ":default:")
If you want to create a second connection to an existing database, you can use the cursor() method. This might be useful for example
to allow parallel threads running queries independently. A single connection is thread‑safe but is locked for the duration of the queries,
effectively serializing database access in this case.
Connections are closed implicitly when they go out of scope or if they are explicitly closed using close(). Once the last connection to a
database instance is closed, the database instance is closed as well.
Querying
SQL queries can be sent to DuckDB using the execute() method of connections. Once a query has been executed, results can be re‑
trieved using the fetchone and fetchall methods on the connection. fetchall will retrieve all results and complete the transaction.
fetchone will retrieve a single row of results each time that it is invoked until no more results are available. The transaction will only close
once fetchone is called and there are no more results remaining (the return value will be None). As an example, in the case of a query
only returning a single row, fetchone should be called once to retrieve the results and a second time to close the transaction. Below are
some short examples:
# create a table
con.execute("CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")
The description property of the connection object contains the column names as per the standard.
Prepared Statements DuckDB also supports prepared statements in the API with the execute and executemany methods. The val‑
ues may be passed as an additional parameter after a query that contains ? or $1 (dollar symbol and a number) placeholders. Using the ?
notation adds the values in the same sequence as passed within the Python parameter. Using the $ notation allows for values to be reused
within the SQL statement based on the number and index of the value found within the Python parameter.
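For example, a minimal sketch (the values are illustrative and assume the items table created above):
con.execute("INSERT INTO items VALUES (?, ?, ?)", ["laptop", 2000, 1])
# the $ notation allows a value to be reused by its index
print(con.execute("SELECT $1, $1, $2", ["duck", "goose"]).fetchall())
# [('duck', 'duck', 'goose')]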
Note. Warning Do not use executemany to insert large amounts of data into DuckDB. See the data ingestion page for better
options.
Named Parameters
Besides the standard unnamed parameters, like $1, $2, etc., it's also possible to supply named parameters, like $my_parameter. When using named parameters, you have to provide a dictionary mapping of str to value in the parameters argument.
An example use:
import duckdb
res = duckdb.execute("""
SELECT
$my_param,
$other_param,
$also_param
""",
{
"my_param": 5,
"other_param": "DuckDB",
"also_param": [42]
}
).fetchall()
print(res)
# [(5, 'DuckDB', [42])]
Relational API
The Relational API is an alternative API that can be used to incrementally construct queries. The API is centered around DuckDBPyRelation nodes. The relations can be seen as symbolic representations of SQL queries. They do not hold any data, and nothing is executed, until a method that triggers execution is called.
Constructing Relations
Relations can be created from SQL queries using the duckdb.sql method. Alternatively, they can be created from the various data inges‑
tion methods (read_parquet, read_csv, read_json).
import duckdb
rel = duckdb.sql("SELECT * FROM range(10_000_000_000) tbl(id)")
rel.show()
┌────────────────────────┐
│ id │
│ int64 │
├────────────────────────┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ · │
│ · │
│ · │
│ 9990 │
│ 9991 │
│ 9992 │
│ 9993 │
│ 9994 │
│ 9995 │
│ 9996 │
│ 9997 │
│ 9998 │
│ 9999 │
├────────────────────────┤
│ ? rows │
│ (>9999 rows, 20 shown) │
└────────────────────────┘
Note how we are constructing a relation that computes an immense amount of data (10B rows, or 74 GB of data). The relation is constructed instantly, and we can even print it instantly.
When printing a relation using show or displaying it in the terminal, the first 10K rows are fetched. If there are more than 10K rows, the output window will show >9999 rows (as the number of rows in the relation is unknown).
Data Ingestion
Outside of SQL queries, the following methods are provided to construct relation objects from external data.
• from_arrow
• from_df
• read_csv
• read_json
• read_parquet
SQL Queries
Relation objects can be queried with SQL using so-called replacement scans. If you have a relation object stored in a variable, you can refer to that variable as if it were a SQL table (in the FROM clause). This allows you to incrementally build queries using relation objects.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
duckdb.sql("SELECT sum(id) FROM rel").show()
┌──────────────┐
│ sum(id) │
│ int128 │
├──────────────┤
│ 499999500000 │
└──────────────┘
Operations
There are a number of operations that can be performed on relations. These are all shorthand for running SQL queries, and will themselves return relations again.
aggregate(expr, groups = {}) Apply an (optionally grouped) aggregate over the relation. The system will automatically group
by any columns that are not aggregates.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
rel.aggregate("id % 2 AS g, sum(id), min(id), max(id)")
┌───────┬──────────────┬─────────┬─────────┐
│ g │ sum(id) │ min(id) │ max(id) │
│ int64 │ int128 │ int64 │ int64 │
├───────┼──────────────┼─────────┼─────────┤
│ 0 │ 249999500000 │ 0 │ 999998 │
│ 1 │ 250000000000 │ 1 │ 999999 │
└───────┴──────────────┴─────────┴─────────┘
except_(rel) Select all rows in the first relation, that do not occur in the second relation. The relations must have the same number
of columns.
import duckdb
r1 = duckdb.sql("SELECT * FROM range(10) tbl(id)")
r2 = duckdb.sql("SELECT * FROM range(5) tbl(id)")
r1.except_(r2).show()
┌───────┐
│ id │
│ int64 │
├───────┤
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───────┘
filter(condition) Apply the given condition to the relation, filtering any rows that do not satisfy the condition.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
rel.filter("id > 5").limit(3).show()
┌───────┐
│ id │
│ int64 │
├───────┤
│ 6 │
│ 7 │
│ 8 │
└───────┘
intersect(rel) Select the intersection of two relations ‑ returning all rows that occur in both relations. The relations must have the
same number of columns.
import duckdb
r1 = duckdb.sql("SELECT * FROM range(10) tbl(id)")
r2 = duckdb.sql("SELECT * FROM range(5) tbl(id)")
r1.intersect(r2).show()
┌───────┐
│ id │
│ int64 │
├───────┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└───────┘
join(rel, condition, type = "inner") Combine two relations, joining them based on the provided condition.
import duckdb
r1 = duckdb.sql("SELECT * FROM range(5) tbl(id)").set_alias("r1")
r2 = duckdb.sql("SELECT * FROM range(10, 15) tbl(id)").set_alias("r2")
r1.join(r2, "r1.id + 10 = r2.id").show()
┌───────┬───────┐
│ id │ id │
│ int64 │ int64 │
├───────┼───────┤
│ 0 │ 10 │
│ 1 │ 11 │
│ 2 │ 12 │
│ 3 │ 13 │
│ 4 │ 14 │
└───────┴───────┘
limit(n, offset = 0) Select the first n rows, optionally offset by offset.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
rel.limit(3).show()
┌───────┐
│ id │
│ int64 │
├───────┤
│ 0 │
│ 1 │
│ 2 │
└───────┘
order(expr) Sort the relation by the given expression.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
rel.order("id DESC").limit(3).show()
┌────────┐
│ id │
│ int64 │
├────────┤
│ 999999 │
│ 999998 │
│ 999997 │
└────────┘
project(expr) Apply the given expression to each row in the relation.
import duckdb
rel = duckdb.sql("SELECT * FROM range(1_000_000) tbl(id)")
rel.project("id + 10 AS id_plus_ten").limit(3).show()
┌─────────────┐
│ id_plus_ten │
│ int64 │
├─────────────┤
│ 10 │
│ 11 │
│ 12 │
└─────────────┘
union(rel) Combine two relations, returning all rows in r1 followed by all rows in r2. The relations must have the same number of
columns.
import duckdb
r1 = duckdb.sql("SELECT * FROM range(5) tbl(id)")
r2 = duckdb.sql("SELECT * FROM range(10, 15) tbl(id)")
r1.union(r2).show()
┌───────┐
│ id │
│ int64 │
├───────┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
└───────┘
Result Output
The result of relations can be converted to various types of Python structures, see the result conversion page for more information.
The result of relations can also be directly written to files using the below methods.
• write_csv
• write_parquet
You can create a DuckDB user‑defined function (UDF) out of a Python function so it can be used in SQL queries. Similarly to regular functions,
they need to have a name, a return type and parameter types.
import duckdb
from duckdb.typing import *
from faker import Faker
def random_name():
fake = Faker()
return fake.name()
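# A sketch of registering and calling the function (VARCHAR comes from duckdb.typing):
duckdb.create_function("random_name", random_name, [], VARCHAR)
print(duckdb.sql("SELECT random_name()").fetchall())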
Creating Functions
To register a Python UDF, simply use the create_function method from a DuckDB connection. Here is the syntax:
import duckdb
con = duckdb.connect()
con.create_function(name, function, argument_type_list, return_type, type, null_handling)
1. name: A string representing the unique name of the UDF within the connection catalog.
2. function: The Python function you wish to register as a UDF.
3. parameters: Scalar functions can operate on one or more columns. This parameter takes a list of column types used as input.
4. return_type: Scalar functions return one element per row. This parameter specifies the return type of the function.
5. type (Optional): DuckDB supports both built‑in Python types and PyArrow Tables. By default, built‑in types are assumed, but you
can specify type = 'arrow' to use PyArrow Tables.
6. null_handling (Optional): By default, null values are automatically handled as Null‑In Null‑Out. Users can specify a desired behavior
for null values by setting null_handling = 'special'.
7. exception_handling (Optional): By default, when an exception is thrown from the Python function, it will be re-thrown in Python. Users can disable this behavior, and instead return null, by setting this parameter to 'return_null'.
8. side_effects (Optional): By default, functions are expected to produce the same result for the same input. If the result of a function
is impacted by any type of randomness, side_effects must be set to True.
To unregister a UDF, you can call the remove_function method with the UDF name:
con.remove_function(name)
Type Annotation
When the function has type annotations, it's often possible to leave out all of the optional parameters. Using DuckDBPyType, we can implicitly convert many known types to DuckDB's type system. For example:
import duckdb
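# an annotated function (a sketch consistent with the output shown below):
def my_function(x: int) -> str:
    return str(x)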
duckdb.create_function("my_func", my_function)
duckdb.sql("SELECT my_func(42)")
┌─────────────┐
│ my_func(42) │
│ varchar │
├─────────────┤
│ 42 │
└─────────────┘
If only the parameter list types can be inferred, you'll need to pass in None as argument_type_list.
Null Handling
By default, when functions receive a NULL value, they instantly return NULL as part of the default NULL handling. When this is not desired, you need to explicitly set the null_handling parameter to "special".
import duckdb
from duckdb.typing import *
def dont_intercept_null(x):
return 5
duckdb.create_function("dont_intercept", dont_intercept_null, [BIGINT], BIGINT, null_handling="special")
res = duckdb.sql("SELECT dont_intercept(NULL)").fetchall()
print(res)
# [(5,)]
Exception Handling
By default, when an exception is thrown from the Python function, we'll forward (re-throw) the exception. If you want to disable this behavior and instead return null, set the exception_handling parameter to "return_null".
import duckdb
from duckdb.typing import *
def will_throw():
raise ValueError("ERROR")
# a sketch of the elided registration and call (names follow the snippet above):
duckdb.create_function("will_throw", will_throw, [], BIGINT)
try:
    res = duckdb.sql("SELECT will_throw()").fetchall()
except duckdb.InvalidInputException as e:
    print(e)
Side Effects
By default DuckDB will assume the created function is a pure function, meaning it will produce the same output when given the same
input. If your function does not follow that rule, for example when your function makes use of randomness, then you will need to mark this
function as having side_effects.
For example, this function will produce a new count for every invocation (a minimal sketch of such a counter):
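def count() -> int:
    # return the current value, then increment the counter stored on the function
    old = count.counter
    count.counter += 1
    return old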
count.counter = 0
If we create this function without marking it as having side effects, the result will be the following:
con = duckdb.connect()
con.create_function("my_counter", count, side_effects = False)
res = con.sql("SELECT my_counter() FROM range(10)").fetchall()
print(res)
# [(0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,)]
This is obviously not the desired result. When we add side_effects = True, the result is as we would expect:
con.remove_function("my_counter")
count.counter = 0
con.create_function("my_counter", count, side_effects = True)
res = con.sql("SELECT my_counter() FROM range(10)").fetchall()
print(res)
# [(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]
Currently, two function types are supported: native (the default) and arrow.
Arrow If the function is expected to receive arrow arrays, set the type parameter to 'arrow'.
This will let the system know to provide arrow arrays of up to STANDARD_VECTOR_SIZE tuples to the function, and also expect an array
of the same amount of tuples to be returned from the function.
Native When the function type is set to native the function will be provided with a single tuple at a time, and expect only a single value
to be returned. This can be useful to interact with Python libraries that don't operate on Arrow, such as faker:
import duckdb
from faker import Faker
def random_date():
fake = Faker()
return fake.date_between()
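# A sketch of registering the native function and calling it:
from duckdb.typing import DATE
duckdb.create_function("random_date", random_date, [], DATE, type = "native")
print(duckdb.sql("SELECT random_date()").fetchall())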
Types API
To make the API as easy to use as possible, we have added implicit conversions from existing type objects to a DuckDBPyType instance.
This means that wherever a DuckDBPyType object is expected, it is also possible to provide any of the options listed below.
Python Built‑ins The table below shows the mapping of Python Built‑in types to DuckDB type.
bool BOOLEAN
bytearray BLOB
bytes BLOB
float DOUBLE
int BIGINT
str VARCHAR
Numpy DTypes The table below shows the mapping of Numpy DType to DuckDB type.
bool BOOLEAN
float32 FLOAT
float64 DOUBLE
int16 SMALLINT
int32 INTEGER
int64 BIGINT
int8 TINYINT
uint16 USMALLINT
uint32 UINTEGER
uint64 UBIGINT
uint8 UTINYINT
Nested Types
list[child_type] list type objects map to a LIST type of the child type, which can also be arbitrarily nested.
import duckdb
from typing import Union
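# a sketch: list types map to (possibly nested) LIST types
duckdb.typing.DuckDBPyType(list[str])
# VARCHAR[]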
dict[key_type, value_type] dict type objects map to a MAP type of the key type and the value type.
import duckdb
duckdb.typing.DuckDBPyType(dict[str, int])
# MAP(VARCHAR, BIGINT)
{'a': field_one, 'b': field_two, .., 'n': field_n} dict objects map to a STRUCT composed of the keys and
values of the dict.
import duckdb
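# a sketch: the dict's keys become STRUCT field names
duckdb.typing.DuckDBPyType({'a': str, 'b': int})
# STRUCT(a VARCHAR, b BIGINT)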
Union[type_1, ..., type_n] typing.Union objects map to a UNION type of the provided types.
import duckdb
from typing import Union
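# a sketch: each Union member becomes a UNION member type
duckdb.typing.DuckDBPyType(Union[int, str])
# UNION(u1 BIGINT, u2 VARCHAR)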
Creation Functions For the built‑in types, you can use the constants defined in duckdb.typing:
DuckDB type
BIGINT
BIT
BLOB
BOOLEAN
DATE
DOUBLE
FLOAT
HUGEINT
INTEGER
INTERVAL
SMALLINT
SQLNULL
TIME_TZ
TIME
TIMESTAMP_MS
DuckDB type
TIMESTAMP_NS
TIMESTAMP_S
TIMESTAMP_TZ
TIMESTAMP
TINYINT
UBIGINT
UHUGEINT
UINTEGER
USMALLINT
UTINYINT
UUID
VARCHAR
For the complex types there are methods available on the DuckDBPyConnection object or the duckdb module. Anywhere a Duck-
DBPyType is accepted, we will also accept one of the type objects that can implicitly convert to a DuckDBPyType.
list_type Parameters:
• child_type: DuckDBPyType
map_type Parameters:
• key_type: DuckDBPyType
• value_type: DuckDBPyType
decimal_type Parameters:
• width: int
• scale: int
union_type Parameters:
string_type Parameters:
• collation: Optional[str]
Expression API
Using this API makes it possible to dynamically build up expressions, which are normally created by the parser from the query string. This allows you to skip the parser and gives you more fine-grained control over the expressions used.
Below is a list of currently supported expressions that can be created through the API.
Column Expression
import duckdb
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4],
'b': [True, None, False, True],
'c': [42, 21, 13, 14]
})
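# a sketch: select a single column by name
col = duckdb.ColumnExpression('a')
res = duckdb.df(df).select(col).fetchall()
print(res)
# [(1,), (2,), (3,), (4,)]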
Star Expression
Optionally it's possible to provide an exclude list to filter out columns of the table. This exclude list can contain either strings or
Expressions.
import duckdb
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4],
'b': [True, None, False, True],
'c': [42, 21, 13, 14]
})
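# a sketch: select all columns, excluding 'b'
star = duckdb.StarExpression(exclude = ['b'])
res = duckdb.df(df).select(star).fetchall()
print(res)
# [(1, 42), (2, 21), (3, 13), (4, 14)]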
Constant Expression
import duckdb
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4],
'b': [True, None, False, True],
'c': [42, 21, 13, 14]
})
const = duckdb.ConstantExpression('hello')
res = duckdb.df(df).select(const).fetchall()
print(res)
# [('hello',), ('hello',), ('hello',), ('hello',)]
Case Expression
This expression contains a CASE WHEN (...) THEN (...) ELSE (...) END expression. By default ELSE is NULL, and it can be set using .otherwise(value = ...). Additional WHEN (...) THEN (...) blocks can be added with .when(condition = ..., value = ...).
import duckdb
import pandas as pd
from duckdb import (
ConstantExpression,
ColumnExpression,
CaseExpression
)
df = pd.DataFrame({
'a': [1, 2, 3, 4],
'b': [True, None, False, True],
'c': [42, 21, 13, 14]
})
hello = ConstantExpression('hello')
world = ConstantExpression('world')
case = \
CaseExpression(condition = ColumnExpression('b') == False, value = world) \
.otherwise(hello)
res = duckdb.df(df).select(case).fetchall()
print(res)
# [('hello',), ('hello',), ('world',), ('hello',)]
Function Expression
This expression contains a function call. It can be constructed by providing the function name and an arbitrary amount of Expressions as
arguments.
import duckdb
import pandas as pd
from duckdb import (
ConstantExpression,
ColumnExpression,
FunctionExpression
)
df = pd.DataFrame({
'a': [
'test',
'pest',
'text',
'rest',
]
})
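# a sketch: apply the ends_with function to column 'a'
fn = FunctionExpression('ends_with', ColumnExpression('a'), ConstantExpression('est'))
res = duckdb.df(df).select(fn).fetchall()
print(res)
# [(True,), (True,), (False,), (True,)]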
Common Operations
The Expression class also contains many operations that can be applied to any Expression type.
.cast(type: DuckDBPyType)
Applies a cast to the provided type on the expression.
.alias(name: str)
Apply an alias to the expression.
.isin(*exprs: Expression)
Create an IN expression against the provided expressions as the list.
.isnotin(*exprs: Expression)
Create a NOT IN expression against the provided expressions as the list.
Order Operations When expressions are provided to DuckDBPyRelation.order() these take effect:
.asc()
Indicates that this expression should be sorted in ascending order.
.desc()
Indicates that this expression should be sorted in descending order.
.nulls_first()
Indicates that the nulls in this expression should precede the non-null values.
.nulls_last()
Indicates that the nulls in this expression should come after the non‑null values.
Spark API
The DuckDB Spark API implements the PySpark API, allowing you to use the familiar Spark API to interact with DuckDB. All statements are
translated to DuckDB's internal plans using our relational API and executed using DuckDB's query engine.
Note. Warning The DuckDB Spark API is currently experimental and features are still missing. We are very interested in feedback.
Please report any functionality that you are missing, either through Discord or on GitHub.
Example
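# a sketch of the imports this example assumes:
import pandas as pd
from duckdb.experimental.spark.sql import SparkSession as session
from duckdb.experimental.spark.sql.functions import col, lit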
spark = session.builder.getOrCreate()
pandas_df = pd.DataFrame({
'age': [34, 45, 23, 56],
'name': ['Joan', 'Peter', 'John', 'Bob']
})
df = spark.createDataFrame(pandas_df)
df = df.withColumn(
'location', lit('Seattle')
)
res = df.select(
col('age'),
col('location')
).collect()
print(res)
[
Row(age=34, location='Seattle'),
Row(age=45, location='Seattle'),
Row(age=23, location='Seattle'),
Row(age=56, location='Seattle')
]
Contribution Guidelines
Contributions to the experimental Spark API are welcome. When making a contribution, please follow these guidelines.
Unfortunately, there are some issues that are either beyond our control or are very elusive and hard to track down. Below is a list of these issues that you might have to be aware of, depending on your workflow.
When making use of multi-threading and fetching results either directly as NumPy arrays or indirectly through a Pandas DataFrame, it might be necessary to ensure that numpy.core.multiarray is imported. If this module has not been imported from the main thread and a different thread attempts to import it during execution, this causes either a deadlock or a crash.
When DuckDB is run in Jupyter notebooks or in the IPython shell, the output of the EXPLAIN statement contains hard line breaks (\n):
Out[1]:
┌───────────────┬────────────────────────────────────────────────────────────────────────────────────────────────
│ explain_key │ explain_value
│
│ varchar │ varchar
│
├───────────────┼────────────────────────────────────────────────────────────────────────────────────────────────
│ physical_plan │ ┌───────────────────────────┐\n│ PROJECTION │\n│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│\n│ x … │
└───────────────┴────────────────────────────────────────────────────────────────────────────────────────────────
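Printing the plan as a string renders the line breaks correctly; a minimal sketch (assuming the plan above came from SELECT 42 AS x):
print(duckdb.sql("SELECT 42 AS x").explain())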
Out[2]:
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ x │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ DUMMY_SCAN │
└───────────────────────────┘
Please also check out the Jupyter guide for tips on using Jupyter with JupySQL.
When importing DuckDB on Windows, the Python runtime may return the following error:
import duckdb
ImportError: DLL load failed while importing duckdb: The specified module could not be found.
R API
Installation
The DuckDB R API can be installed using install.packages("duckdb"). Please see the installation page for details.
Reference Manual
The standard DuckDB R API implements the DBI interface for R. If you are not familiar with DBI yet, see here for an introduction.
Startup & Shutdown To use DuckDB, you must first create a connection object that represents the database. The connection object
takes as parameter the database file to read and write from. If the database file does not exist, it will be created (the file extension may be
.db, .duckdb, or anything else). The special value :memory: (the default) can be used to create an in‑memory database. Note that
for an in‑memory database no data is persisted to disk (i.e., all data is lost when you exit the R process). If you would like to connect to an
existing database in read‑only mode, set the read_only flag to TRUE. Read‑only mode is required if multiple R processes want to access
the same database file at the same time.
library("duckdb")
# to start an in-memory database
con <- dbConnect(duckdb())
# or
con <- dbConnect(duckdb(), dbdir = ":memory:")
# to use a database file (not shared between processes)
con <- dbConnect(duckdb(), dbdir = "my-db.duckdb", read_only = FALSE)
# to use a database file (shared between processes)
con <- dbConnect(duckdb(), dbdir = "my-db.duckdb", read_only = TRUE)
Connections are closed implicitly when they go out of scope or if they are explicitly closed using dbDisconnect(). To shut down the database instance associated with the connection, use dbDisconnect(con, shutdown = TRUE).
Querying DuckDB supports the standard DBI methods to send queries and retrieve result sets. dbExecute() is meant for queries where no results are expected, like CREATE TABLE or UPDATE, and dbGetQuery() is meant to be used for queries that produce results (e.g., SELECT). Below is an example.
# create a table
dbExecute(con, "CREATE TABLE items (item VARCHAR, value DECIMAL(10, 2), count INTEGER)")
# insert two items into the table
dbExecute(con, "INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")
DuckDB also supports prepared statements in the R API with the dbExecute and dbGetQuery methods. Here is an example:
# if you want to reuse a prepared statement multiple times, use dbSendStatement() and dbBind()
stmt <- dbSendStatement(con, "INSERT INTO items VALUES (?, ?, ?)")
dbBind(stmt, list('iphone', 300, 2))
dbBind(stmt, list('android', 3.5, 1))
dbClearResult(stmt)
Note. Warning Do not use prepared statements to insert large amounts of data into DuckDB. See below for better options.
Efficient Transfer
To write an R data frame into DuckDB, use the standard DBI function dbWriteTable(). This creates a table in DuckDB and populates it with the data frame contents, e.g., dbWriteTable(con, "iris", iris).
It is also possible to "register" an R data frame as a virtual table, comparable to a SQL VIEW, using duckdb_register(con, "name", df). This does not actually transfer data into DuckDB yet.
Note. DuckDB keeps a reference to the R data frame after registration. This prevents the data frame from being garbage‑collected.
The reference is cleared when the connection is closed, but can also be cleared manually using the duckdb_unregister()
method.
Also refer to the data import documentation for more options of efficiently importing data.
dbplyr
DuckDB also plays well with the dbplyr / dplyr packages for programmatic query construction from R. Here is an example:
library("duckdb")
library("dplyr")
con <- dbConnect(duckdb())
duckdb_register(con, "flights", nycflights13::flights)
When using dbplyr, CSV and Parquet files can be read using the dplyr::tbl function.
# Summarize the dataset in DuckDB to avoid reading the entire CSV into R's memory
tbl(con, "mtcars.csv") |>
group_by(cyl) |>
summarise(across(disp:wt, .fns = mean)) |>
collect()
# Summarize the dataset in DuckDB to avoid reading 12 Parquet files into R's memory
tbl(con, "read_parquet('dataset/**/*.parquet', hive_partitioning = true)") |>
filter(month == "3") |>
summarise(delay = mean(dep_time, na.rm = TRUE)) |>
collect()
Rust API
Installation
The DuckDB Rust API can be installed from crates.io. Please see the docs.rs for details.
duckdb-rs is an ergonomic wrapper based on the DuckDB C API; please refer to its README for details.
Startup & Shutdown To use duckdb, you must first initialize a Connection handle using Connection::open(). Connec-
tion::open() takes as parameter the database file to read and write from. If the database file does not exist, it will be created (the file
extension may be .db, .duckdb, or anything else). You can also use Connection::open_in_memory() to create an in‑memory
database. Note that for an in‑memory database no data is persisted to disk (i.e., all data is lost when you exit the process).
You can close the Connection manually with conn.close(), or simply let it fall out of scope: the Drop trait is implemented and will automatically close the underlying database connection for you.
Querying SQL queries can be sent to DuckDB using the execute() method of connections, or we can also prepare the statement and
then query on that.
#[derive(Debug)]
struct Person {
id: i32,
name: String,
data: Option<Vec<u8>>,
}
conn.execute(
"INSERT INTO person (name, data) VALUES (?, ?)",
params![me.name, me.data],
)?;
Appender
The Rust client supports the DuckDB Appender API for bulk inserts.
Swift API
DuckDB offers a Swift API. See the announcement post for details.
Instantiating DuckDB
DuckDB supports both in-memory and persistent databases. To work with an in-memory database, run:
Application Example
The rest of the page is based on the example of our announcement post, which uses raw data from NASA's Exoplanet Archive loaded directly
into DuckDB.
Creating an Application‑Specific Type We first create an application‑specific type that we'll use to house our database and connection
and through which we'll eventually define our app‑specific queries.
import DuckDB
Loading a CSV File We load the data from NASA's Exoplanet Archive:
wget "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv" -O downloaded_exoplanets.csv
Once we have our CSV downloaded locally, we can use a CREATE TABLE ... AS SELECT query over read_csv to load it as a new table in DuckDB.
Let's package this up as a new asynchronous factory method on our ExoplanetStore type:
import DuckDB
import Foundation
Querying the Database The following example queries DuckDB from within Swift via an async function. This means the caller won't be blocked while the query is executing. We'll then cast the result columns to Swift native types using DuckDB's ResultSet cast(to:) family of methods, before finally wrapping them up in a DataFrame from the TabularData framework.
...
import TabularData
extension ExoplanetStore {
GROUP BY disc_year
ORDER BY disc_year
""")
Complete Project For the complete example project, clone the DuckDB Swift repo and open up the runnable app project located in
Examples/SwiftUI/ExoplanetExplorer.xcodeproj.
Wasm
DuckDB Wasm
DuckDB has been compiled to WebAssembly, so it can run inside any browser on any device.
DuckDB-Wasm offers a layered API; it can be embedded as a JavaScript + WebAssembly library, as a Web shell, or built from source according to your needs.
Instantiation
Instantiation
DuckDB-Wasm can be instantiated in several bundling setups; the documentation provides examples for jsDelivr (CDN), webpack, Vite, and statically served bundles.
Data Ingestion
DuckDB‑Wasm has multiple ways to import data, depending on the format of the data.
First, the data file is imported into a local file system using register functions (registerEmptyFileBuffer, registerFileBuffer, registerFileHandle, registerFileText, registerFileURL).
Then, the data file is imported into DuckDB using insert functions (insertArrowFromIPCStream, insertArrowTable, insertCSVFromPath, insertJSONFromPath) or directly using a FROM SQL query (using extensions like Parquet or the Wasm-flavored httpfs).
Data Import
Apache Arrow
// Write EOS
streamInserts.push(c.insertArrowFromIPCStream(EOS, { name: 'streamed' }));
await Promise.all(streamInserts);
CSV
JSON
// From API
const streamResponse = await fetch(`someapi/content.json`);
await db.registerFileBuffer('file.json', new Uint8Array(await streamResponse.arrayBuffer()));
await c.insertJSONFromPath('file.json', { name: 'JSONContent' });
Parquet
httpfs (Wasm‑flavored)
Insert Statement
Query
DuckDB‑Wasm provides functions for querying data. Queries are run sequentially.
First, a connection needs to be created by calling connect. Then, queries can be run by calling query or send.
Query Execution
Prepared Statements
// Query
const arrowResult = await conn.query<{ v: arrow.Int }>(`
SELECT * FROM generate_series(1, 100) t(v)
`);
Export Parquet
// Export Parquet
conn.send(`COPY (SELECT * FROM tbl) TO 'result-snappy.parquet' (FORMAT 'parquet');`);
const parquet_buffer = await this._db.copyFileToBuffer('result-snappy.parquet');
Extensions
DuckDB‑Wasm's (dynamic) extension loading is modeled after the regular DuckDB's extension loading, with a few relevant differences due
to the difference in platform.
Format
Extensions in DuckDB are binaries to be dynamically loaded via dlopen. A cryptographic signature is appended to the binary. An extension in DuckDB-Wasm is a regular Wasm file to be dynamically loaded via Emscripten's dlopen. A cryptographic signature is appended to the Wasm file as a WebAssembly custom section called duckdb_signature. This ensures the file remains a valid WebAssembly file.
Note. Currently, we require this custom section to be the last one, but this can be potentially relaxed in the future.
The INSTALL semantic in native embeddings of DuckDB is to fetch, decompress from gzip, and store the data on local disk. The LOAD semantic in native embeddings of DuckDB is to (optionally) perform signature checks and dynamically load the binary with the main DuckDB binary. In DuckDB-Wasm, INSTALL is a no-op given there is no durable cross-session storage. The LOAD operation will fetch (and decompress on the fly), perform signature checks, and dynamically load via the Emscripten implementation of dlopen.
Autoloading
Autoloading, i.e., the possibility for DuckDB to add extension functionality on‑the‑fly, is enabled by default in DuckDB‑Wasm.
WebAssembly is basically an additional platform, and there might be platform-specific limitations that prevent some extensions from matching their native capabilities, or that require them to work differently. We will document relevant differences for DuckDB-hosted extensions here.
HTTPFS The HTTPFS extension is, at the moment, not available in DuckDB-Wasm. HTTPS protocol capabilities need to go through an additional layer, the browser, which adds both differences and some restrictions compared to native.
Instead, DuckDB-Wasm has a separate implementation that for most purposes is interchangeable, but does not support all use cases (as it must follow security rules imposed by the browser, such as CORS). Due to this CORS restriction, any requests for data made using the HTTPFS extension must be to websites that allow (using CORS headers) the website hosting the DuckDB-Wasm instance to access that data. The MDN website is a great resource for more information regarding CORS.
Extension Signing
As with regular DuckDB extensions, DuckDB-Wasm extensions are by default checked on LOAD to verify that the signature confirms the extension has not been tampered with. Extension signature verification can be disabled via a configuration option. Signing is a property of the binary itself, so copying a DuckDB extension (say, to serve it from a different location) will still keep a valid signature (e.g., for local development).
Official DuckDB extensions are served at extensions.duckdb.org, and this is also the default value for the default_extension_
repository option. When installing extensions, a relevant URL will be built that will look like extensions.duckdb.org/$duckdb_
version_hash/$duckdb_platform/$name.duckdb_extension.gz.
DuckDB-Wasm extensions are fetched only on load, and the URL will look like: extensions.duckdb.org/duckdb-wasm/$duckdb_version_hash/$duckdb_platform/$name.duckdb_extension.wasm.
Note that an additional duckdb-wasm is added to the folder structure, and the file is served as a .wasm file.
DuckDB-Wasm extensions are served pre-compressed using Brotli compression. When fetched from a browser, extensions will be transparently uncompressed. If you want to fetch the duckdb-wasm extension manually, you can use curl --compressed extensions.duckdb.org/<...>/icu.duckdb_extension.wasm.
As with regular DuckDB, if you use SET custom_extension_repository = some.url.com, subsequent loads will be attempted
at some.url.com/duckdb-wasm/$duckdb_version_hash/$duckdb_platform/$name.duckdb_extension.wasm.
Note that GET requests on the extensions need to be CORS-enabled for a browser to allow the connection.
Tooling
Both DuckDB-Wasm and its extensions have been compiled using the latest packaged Emscripten toolchain.
ADBC API
Arrow Database Connectivity (ADBC), similarly to ODBC and JDBC, is a C‑style API that enables code portability between different database
systems. This allows developers to effortlessly build applications that communicate with database systems without using code specific to
that system. The main difference between ADBC and ODBC/JDBC is that ADBC uses Arrow to transfer data between the database system and
the application. DuckDB has an ADBC driver, which takes advantage of the zero‑copy integration between DuckDB and Arrow to efficiently
transfer data.
Please refer to the ADBC documentation page for a more extensive discussion on ADBC and a detailed API explanation.
Implemented Functionality
The DuckDB‑ADBC driver implements the full ADBC specification, with the exception of the ConnectionReadPartition and State-
mentExecutePartitions functions. Both of these functions exist to support systems that internally partition the query results, which
does not apply to DuckDB. In this section, we will describe the main functions that exist in ADBC, along with the arguments they take and
provide examples for each function.
Connection A set of functions that create and destroy a connection to interact with a database.
A set of functions that retrieve metadata about the database. In general, these functions will return Arrow objects, specifically an ArrowAr‑
rayStream.
A set of functions with transaction semantics for the connection. By default, all connections start with auto‑commit mode on, but this can
be turned off via the ConnectionSetOption function.
Statement Statements hold state related to query execution. They represent both one‑off queries and prepared statements. They can
be reused; however, doing so will invalidate prior result sets from that statement.
The functions used to create, destroy, and set options for a statement:
Examples
Regardless of the programming language being used, there are two database options which will be required to utilize ADBC with DuckDB.
The first one is the driver, which takes a path to the DuckDB library. The second option is the entrypoint, which is an exported function
from the DuckDB‑ADBC driver that initializes all the ADBC functions. Once we have configured these two options, we can optionally set the
path option, providing a path on disk to store our DuckDB database. If not set, an in‑memory database is created. After configuring all the
necessary options, we can proceed to initialize our database. Below is how you can do so with various different language environments.
C++ We begin our C++ example by declaring the essential variables for querying data through ADBC. These variables include Error,
Database, Connection, Statement handling, and an Arrow Stream to transfer data between DuckDB and the application.
AdbcError adbc_error;
AdbcDatabase adbc_database;
AdbcConnection adbc_connection;
AdbcStatement adbc_statement;
ArrowArrayStream arrow_stream;
We can then initialize our database variable. Before initializing the database, we need to set the driver and entrypoint
options as mentioned above. Then we set the path option and initialize the database. With the example below, the string
"path/to/libduckdb.dylib" should be the path to the dynamic library for DuckDB. This will be .dylib on macOS, and
.so on Linux.
AdbcDatabaseNew(&adbc_database, &adbc_error);
AdbcDatabaseSetOption(&adbc_database, "driver", "path/to/libduckdb.dylib", &adbc_error);
AdbcDatabaseSetOption(&adbc_database, "entrypoint", "duckdb_adbc_init", &adbc_error);
// By default, we start an in-memory database, but you can optionally define a path to store it on disk.
AdbcDatabaseSetOption(&adbc_database, "path", "test.db", &adbc_error);
AdbcDatabaseInit(&adbc_database, &adbc_error);
After initializing the database, we must create and initialize a connection to it.
AdbcConnectionNew(&adbc_connection, &adbc_error);
AdbcConnectionInit(&adbc_connection, &adbc_database, &adbc_error);
We can now initialize our statement and run queries through our connection. After AdbcStatementExecuteQuery is called, the arrow_stream is populated with the result.
Besides running queries, we can also ingest data via arrow_streams. For this we need to set an option with the table name we want to
insert to, bind the stream and then execute the query.
Python The first thing to do is to use pip and install the ADBC Driver Manager. You will also need to install pyarrow to directly access Apache Arrow formatted result sets (such as using fetch_arrow_table).
Note. For details on the adbc_driver_manager package, see the adbc_driver_manager package documentation.
As with C++, we need to provide initialization options consisting of the location of the libduckdb shared object and entrypoint function.
Notice that the path argument for DuckDB is passed in through the db_kwargs dictionary.
import adbc_driver_duckdb.dbapi
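# A minimal sketch (the database file name is illustrative):
with adbc_driver_duckdb.dbapi.connect("test.db") as conn, conn.cursor() as cur:
    cur.execute("SELECT 42")
    # fetch the result as an Apache Arrow table
    print(cur.fetch_arrow_table())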
Alongside fetch_arrow_table, other methods from DBApi are also implemented on the cursor, such as fetchone and fetchall.
Data can also be ingested via arrow_streams. We just need to set options on the statement to bind the stream of data and execute the
query.
import adbc_driver_duckdb.dbapi
import pyarrow
data = pyarrow.record_batch(
[[1, 2, 3, 4], ["a", "b", "c", "d"]],
names = ["ints", "strs"],
)
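# A sketch of ingesting the record batch into a table (the table name is illustrative):
with adbc_driver_duckdb.dbapi.connect("test.db") as conn, conn.cursor() as cur:
    cur.adbc_ingest("AnswerToEverything", data)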
ODBC
ODBC (Open Database Connectivity) is a C-style API that provides access to different flavors of Database Management Systems (DBMSs).
The ODBC API consists of the Driver Manager (DM) and the ODBC drivers.
The DM is part of the system library, e.g., unixODBC, which manages the communications between the user applications and the ODBC
drivers. Typically, applications are linked against the DM, which uses Data Source Name (DSN) to look up the correct ODBC driver.
The ODBC driver is a DBMS implementation of the ODBC API, which handles all the internals of that DBMS.
The DM maps user application calls of ODBC functions to the correct ODBC driver that performs the specified function and returns the
proper values.
DuckDB supports ODBC version 3.0 according to the Core Interface Conformance.
We release the ODBC driver as assets for Linux and Windows. Users can download them from the Latest Release of DuckDB.
Operating Systems
A driver manager is required to manage communication between applications and the ODBC driver. We tested and support unixODBC
that is a complete ODBC driver manager for Linux. Users can install it from the command line:
On Debian flavors and Fedora flavors, unixODBC is available through the system package manager.
DuckDB releases the ODBC driver as an asset. For Linux, download it from the ODBC Linux Asset, which contains the following artifacts:
mkdir duckdb_odbc
unzip duckdb_odbc-linux-amd64.zip -d duckdb_odbc
The unixodbc_setup.sh script aids the configuration of the DuckDB ODBC Driver. It is based on the unixODBC package, which provides commands to handle the ODBC setup and testing, like odbcinst and isql.
In a terminal window, change to the duckdb_odbc directory and run the script with either the -u or -s level option to configure DuckDB ODBC.
User-Level ODBC Setup (-u) The -u option sets up the ODBC init files based on the user's home directory.
./unixodbc_setup.sh -u
System-Level ODBC Setup (-s) The -s option changes the system-level files, which will be visible to all users; because of that, it requires root privileges.
sudo unixodbc_setup.sh -s
Show Usage (--help) The --help option shows the usage of unixodbc_setup.sh, which provides alternative options for a custom configuration, like -db and -D.
unixodbc_setup.sh --help
Level:
-s: System-level, using 'sudo' to configure DuckDB ODBC at the system level, changing the files: /etc/odbc[inst].ini
-u: User-level, configuring DuckDB ODBC at the user level, changing the files: ~/.odbc[inst].ini
Options:
-db <database_path>: the DuckDB database file path; the default is ':memory:' if not provided
-D <driver_path>: the driver file path (i.e., the path to libduckdb_odbc.so); the default is the base script directory
The ODBC setup on Linux is based on files, the well-known .odbc.ini and .odbcinst.ini. These files can be placed at the system /etc directory or at the user home directory /home/<user> (shortcut: ~/). The DM prioritizes the user configuration files over the system files.
The .odbc.ini File The .odbc.ini contains the DSNs for the drivers, which can have specific knobs.
[DuckDB]
Driver = DuckDB Driver
Database = :memory:
Driver: it describes the driver's name, and other configurations will be placed at the .odbcinst.ini.
Database: it describes the database name used by DuckDB, and it can also be a file path to a .db in the system.
The .odbcinst.ini File The .odbcinst.ini contains general configurations for the ODBC installed drivers in the system. A driver
section starts with the driver name between brackets, and then it follows specific configuration knobs belonging to that driver.
[ODBC]
Trace = yes
TraceFile = /tmp/odbctrace
[DuckDB Driver]
Driver = /home/<user>/duckdb_odbc/libduckdb_odbc.so
Trace: it enables the ODBC trace file using the option yes.
TraceFile: the absolute system file path for the ODBC trace file.
Microsoft Windows requires an ODBC Driver Manager to manage communication between applications and the ODBC drivers. The DM on Windows is provided in a DLL file, odbccp32.dll, along with other files and tools. For detailed information, check out the Common ODBC Component Files.
DuckDB releases the ODBC driver as an asset. For Windows, download it from the Windows Asset, which contains the following artifacts:
duckdb_odbc_setup.dll: a setup DLL used by the Windows ODBC Data Source Administrator tool.
mkdir duckdb_odbc
unzip duckdb_odbc-windows-amd64.zip -d duckdb_odbc
The odbc_install.exe binary aids the configuration of the DuckDB ODBC Driver on Windows. It depends on Odbccp32.dll, which provides functions to configure the ODBC registry entries.
Windows administrator privileges are required; in the case of a non-administrator, a User Account Control prompt will be displayed.
The odbc_install.exe adds a default DSN configuration into the ODBC registries with a default database :memory:.
DSN Windows Setup After the installation, it is possible to change the default DSN configuration or add a new one using the Windows
ODBC Data Source Administrator tool odbcad32.exe.
Default DuckDB DSN The newly installed DSN is visible on the System DSN tab in the Windows ODBC Data Source Administrator tool.
Changing DuckDB DSN When selecting the default DSN (i.e., DuckDB) or adding a new configuration, a setup window will be displayed.
This window allows you to set the DSN and the database file path associated with that DSN.
There are two ways to configure the ODBC driver, either by altering the registry keys as detailed below, or by connecting with SQLDriver-
Connect. A combination of the two is also possible.
Furthermore, the ODBC driver supports all the configuration options included in DuckDB.
Note. If a configuration is set in both the connection string passed to SQLDriverConnect and in the odbc.ini file, the one
passed to SQLDriverConnect will take precedence.
Registry Keys The ODBC setup on Windows is based on registry keys (see Registry Entries for ODBC Components). The ODBC entries can
be placed at the current user registry key (HKCU) or the system registry key (HKLM).
We have tested and used the system entries based on HKLM->SOFTWARE->ODBC. The odbc_install.exe changes this entry, which has two subkeys: ODBC.INI and ODBCINST.INI.
The ODBC.INI is where users usually insert DSN registry entries for the drivers.
For example, the DSN registry for DuckDB would look like this:
The ODBCINST.INI contains one entry for each ODBC driver and other keys predefined for Windows ODBC configuration.
macOS
A driver manager is required to manage communication between applications and the ODBC driver. We tested and support unixODBC, which is a complete ODBC driver manager for macOS (and Linux). Users can install it from the command line via Homebrew:
brew install unixodbc
DuckDB releases the ODBC driver as a downloadable asset. For macOS, download the ODBC macOS asset, which contains the following artifacts:
libduckdb_odbc.dylib: the DuckDB ODBC driver compiled for macOS (with Intel and Apple Silicon support).
mkdir duckdb_odbc
unzip duckdb_odbc-osx-universal.zip -d duckdb_odbc
There are two ways to configure the ODBC driver, either by initializing the configuration files listed below, or by connecting with
SQLDriverConnect. A combination of the two is also possible.
Furthermore, the ODBC driver supports all the configuration options included in DuckDB.
Note. If a configuration is set in both the connection string passed to SQLDriverConnect and in the odbc.ini file, the one
passed to SQLDriverConnect will take precedence.
The odbc.ini or .odbc.ini File The .odbc.ini file contains the DSNs for the drivers, each of which can have specific configuration options.
[DuckDB]
Driver = DuckDB Driver
Database=:memory:
access_mode=read_only
allow_unsigned_extensions=true
The .odbcinst.ini File The .odbcinst.ini file contains general configurations for the ODBC drivers installed in the system. A driver section starts with the driver name between brackets, followed by the configuration options belonging to that driver.
[ODBC]
Trace = yes
TraceFile = /tmp/odbctrace
[DuckDB Driver]
Driver = /Users/⟨user⟩/duckdb_odbc/libduckdb_odbc.dylib
After the configuration, to validate the installation, you can use an ODBC client. unixODBC ships with a command‑line tool called isql.
isql DuckDB
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| echo [string] |
| quit |
| |
+---------------------------------------+
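SQL> SELECT 42;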
+------------+
| 42 |
+------------+
| 42 |
+------------+
SQLRowCount returns -1
1 rows fetched
Configuration
DuckDB has a number of configuration options that can be used to change the behavior of the system.
The configuration options can be set using either the SET statement or the PRAGMA statement. They can be reset to their original values
using the RESET statement. The values of configuration options can be queried via the current_setting() scalar function or using
the duckdb_settings() table function.
Examples
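For example, set the number of threads and query the value via the current_setting() scalar function:
SET threads = 10;
SELECT current_setting('threads') AS threads;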
┌─────────┐
│ threads │
│ int64 │
├─────────┤
│ 10 │
└─────────┘
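The same setting can be inspected with the duckdb_settings() table function:
SELECT * FROM duckdb_settings() WHERE name = 'threads';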
┌─────────┬─────────┬─────────────────────────────────────────────────┬────────────┐
│ name │ value │ description │ input_type │
│ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────────────────────────────────────────────┼────────────┤
│ threads │ 10 │ The number of total threads used by the system. │ BIGINT │
└─────────┴─────────┴─────────────────────────────────────────────────┴────────────┘
Secrets Manager
DuckDB has a Secrets manager, which provides a unified user interface for secrets across all backends (e.g., AWS S3) that use them.
Configuration Reference
ordered_aggregate_threshold (UBIGINT, default 262144): The number of rows to accumulate before sorting, used for tuning.
password (VARCHAR, default NULL): The password to use. Ignored for legacy compatibility.
perfect_ht_threshold (BIGINT, default 12): Threshold in bytes for when to use a perfect hash table.
pivot_filter_threshold (BIGINT, default 10): The threshold to switch from using filtered aggregates to LIST with a dedicated pivot operator.
pivot_limit (BIGINT, default 100000): The maximum number of pivot columns in a pivot statement.
prefer_range_joins (BOOLEAN, default false): Force use of range joins with mixed predicates.
preserve_identifier_case (BOOLEAN, default true): Whether or not to preserve the identifier case. If false, all non‑quoted identifiers are lowercased.
preserve_insertion_order (BOOLEAN, default true): Whether or not to preserve insertion order. If set to false, the system is allowed to re‑order any results that do not contain ORDER BY clauses.
profile_output, profiling_output (VARCHAR, default empty): The file to which profile output should be saved, or empty to print to the terminal.
profiling_mode (VARCHAR, default NULL): The profiling mode (STANDARD or DETAILED).
progress_bar_time (BIGINT, default 2000): Sets the time (in milliseconds) a query needs to take before a progress bar is printed.
s3_access_key_id (VARCHAR): S3 Access Key ID.
s3_endpoint (VARCHAR): S3 Endpoint (empty for the default endpoint).
s3_region (VARCHAR, default us-east-1): S3 Region.
s3_secret_access_key (VARCHAR): S3 Secret Access Key.
s3_session_token (VARCHAR): S3 Session Token.
s3_uploader_max_filesize (VARCHAR, default 800GB): S3 Uploader max filesize (between 50GB and 5TB).
s3_uploader_max_parts_per_file (UBIGINT, default 10000): S3 Uploader max parts per file (between 1 and 10000).
s3_uploader_thread_limit (UBIGINT, default 50): S3 Uploader global thread limit.
s3_url_compatibility_mode (BOOLEAN, default 0): Disable globs and query parameters on S3 URLs.
s3_url_style (VARCHAR, default vhost): S3 URL style ('vhost' or 'path').
s3_use_ssl (BOOLEAN, default 1): S3 use SSL.
schema (VARCHAR, default main): Sets the default search schema. Equivalent to setting search_path to a single value.
search_path (VARCHAR): Sets the default catalog search path as a comma‑separated list of values.
secret_directory (VARCHAR, default ~/.duckdb/stored_secrets): Set the directory to which persistent secrets are stored.
temp_directory (VARCHAR): Set the directory to which to write temp files.
threads, worker_threads (BIGINT, default # cores): The number of total threads used by the system.
username, user (VARCHAR, default NULL): The username to use. Ignored for legacy compatibility.
Pragmas
The PRAGMA statement is an SQL extension adopted by DuckDB from SQLite. PRAGMA statements can be issued in a similar manner to reg‑
ular SQL statements. PRAGMA commands may alter the internal state of the database engine, and can influence the subsequent execution
or behavior of the engine.
PRAGMA statements that assign a value to an option can also be issued using the SET statement and the value of an option can be retrieved
using SELECT current_setting(option_name).
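For example, the following statements are equivalent ways of setting the memory limit:
PRAGMA memory_limit = '1GB';
SET memory_limit = '1GB';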
-- show a list of all attached databases
PRAGMA database_list;
-- show a list of all tables
PRAGMA show_tables;
-- show a list of all tables, together with their columns and column types
PRAGMA show_tables_expanded;
-- show a list of all functions
PRAGMA functions;
-- show information about the columns of a table
PRAGMA table_info('table_name');
CALL pragma_table_info('table_name');
table_info returns information about the columns of the table with name table_name: the column id, name, type, whether it is NOT NULL, its default value, and whether it is part of the primary key.
To show the table structure in a slightly different format (included for compatibility), use:
PRAGMA show('table_name');
Memory Limit Set the memory limit for the buffer manager:
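SET memory_limit = '10GB';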
Note. Warning The specified memory limit is only applied to the buffer manager. For most queries, the buffer manager handles
the majority of the data processed. However, certain in‑memory data structures such as vectors and query results are allocated
outside of the buffer manager. Additionally, aggregate functions with complex state (e.g., list, mode, quantile, string_agg,
and approx functions) use memory outside of the buffer manager. Therefore, the actual memory consumption can be higher than
the specified memory limit.
Threads Set the amount of threads for parallel query execution:
SET threads = 4;
Database Size Get the file and memory size of each database:
PRAGMA database_size;
CALL pragma_database_size();
database_size returns information about the file and memory size of each attached database.
Collations List all available collations:
PRAGMA collations;
Implicit Casting to VARCHAR Prior to version 0.10.0, DuckDB would automatically allow any type to be implicitly cast to VARCHAR during function binding. As a result, it was possible to, e.g., compute the substring of an integer without an explicit cast. For version v0.10.0 and later, an explicit cast is needed instead. To revert to the old behavior that performs implicit casting, set the old_implicit_casting variable to true:
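SET old_implicit_casting = true;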
Default Ordering for NULLs Set the default ordering for NULLs to be either NULLS FIRST or NULLS LAST:
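A sketch of the corresponding statement, assuming the default_null_order option:
SET default_null_order = 'NULLS_LAST';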
Version Show the DuckDB version:
PRAGMA version;
CALL pragma_version();
Platform platform returns an identifier for the platform the current DuckDB executable has been compiled for, e.g., osx_arm64. The
format of this identifier matches the platform name as described on the extension loading explainer.
PRAGMA platform;
CALL pragma_platform();
Progress Bar Enable or disable the progress bar for running queries:
PRAGMA enable_progress_bar;
PRAGMA disable_progress_bar;
Profiling
Enable profiling:
PRAGMA enable_profiling;
PRAGMA enable_profile;
Profiling Format The format of the resulting profiling information can be specified as either json, query_tree, or query_tree_
optimizer. The default format is query_tree, which prints the physical operator tree together with the timings and cardinalities of
each operator in the tree to the screen.
Disable profiling:
PRAGMA disable_profiling;
PRAGMA disable_profile;
Profiling Output By default, profiling information is printed to the console. However, if you prefer to write the profiling information to a
file the PRAGMA profiling_output can be used to write to a specified file. Note that the file contents will be overwritten for every
new query that is issued, hence the file will only contain the profiling information of the last query that is run.
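For example (the file path is illustrative):
PRAGMA profiling_output = '/tmp/profiling_output.json';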
Profiling Mode By default, a limited amount of profiling information is provided (standard). For more details, use the detailed profiling
mode by setting profiling_mode to detailed. The output of this mode shows how long it takes to apply certain optimizers on the
query tree and how long physical planning takes.
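For example:
PRAGMA profiling_mode = 'detailed';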
Optimizer Disable or re‑enable the query optimizer:
PRAGMA disable_optimizer;
PRAGMA enable_optimizer;
Explain Plan Output The output of EXPLAIN can be configured to show only the physical plan. This is the default configuration.
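A sketch of the corresponding option, assuming the explain_output setting:
SET explain_output = 'physical_only';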
Full‑Text Search Indexes The create_fts_index and drop_fts_index options are only available when the fts extension is
loaded. Their usage is documented on the Full‑Text Search extension page.
Verification of External Operators Enable or disable verification of external operators:
PRAGMA verify_external;
PRAGMA disable_verify_external;
Verification of Round‑Trip Capabilities Enable or disable verification of round‑trip capabilities for supported logical plans:
PRAGMA verify_serializer;
PRAGMA disable_verify_serializer;
Object Cache Enable or disable the object cache (used, e.g., to cache Parquet metadata):
PRAGMA enable_object_cache;
PRAGMA disable_object_cache;
Checkpoint
Force Checkpoint When CHECKPOINT is called and no changes have been made, force a checkpoint regardless:
PRAGMA force_checkpoint;
Checkpoint on Shutdown Run a CHECKPOINT on successful shutdown and delete the WAL, to leave only a single database file behind:
PRAGMA enable_checkpoint_on_shutdown;
PRAGMA disable_checkpoint_on_shutdown;
Progress Bar Enable printing of the progress bar (if it's possible):
PRAGMA enable_print_progress_bar;
PRAGMA disable_print_progress_bar;
Temp Directory for Spilling Data to Disk By default, DuckDB uses a temporary directory named ⟨database_file_name⟩.tmp to spill to disk, located in the same directory as the database file. To change this, use:
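SET temp_directory = '/path/to/temp_dir.tmp/'; -- the path is illustrative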
Storage Information To get storage information for a specific table, run:
PRAGMA storage_info('table_name');
CALL pragma_storage_info('table_name');
This call returns the following information for the given table:
row_group_id (BIGINT)
column_name (VARCHAR)
column_id (BIGINT)
column_path (VARCHAR)
segment_id (BIGINT)
segment_type (VARCHAR)
start (BIGINT): the start row id of this chunk
count (BIGINT): the number of entries in this storage chunk
compression (VARCHAR): the compression type used for this column (see the related blog post)
stats (VARCHAR)
has_updates (BOOLEAN)
persistent (BOOLEAN): false if the table is temporary
block_id (BIGINT): empty unless persistent
block_offset (BIGINT): empty unless persistent
Show Databases The following statement is equivalent to the SHOW DATABASES statement:
PRAGMA show_databases;
User Agent The following statement returns the user agent information, e.g., duckdb/v0.10.0(osx_arm64).
PRAGMA user_agent;
Metadata Information The following statement returns information on the metadata store (block_id, total_blocks, free_
blocks, and free_list).
PRAGMA metadata_info;
Selectively Disabling Optimizers The disabled_optimizers option allows selectively disabling optimization steps. For example,
to disable filter_pushdown and statistics_propagation, run:
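SET disabled_optimizers = 'filter_pushdown,statistics_propagation';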
The available optimizations can be queried using the duckdb_optimizers() table function.
Note. Warning The disabled_optimizers option should only be used for debugging performance issues and should be
avoided in production.
Returning Errors as JSON The errors_as_json option can be set to obtain error information in raw JSON format. For certain errors,
extra information or decomposed information is provided for easier machine processing. For example:
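SET errors_as_json = true;
SELECT * FROM nonexistent_tbl;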
{
"exception_type":"Catalog",
"exception_message":"Table with name nonexistent_tbl does not exist!\nDid you mean
\"temp.information_schema.tables\"?",
"name":"nonexistent_tbl",
"candidates":"temp.information_schema.tables",
"position":"14",
"type":"Table",
"error_subtype":"MISSING_ENTRY"
}
Query Verification (for Development) The following PRAGMAs are mostly used for development and internal testing.
PRAGMA enable_verification;
PRAGMA disable_verification;
PRAGMA verify_parallelism;
PRAGMA disable_verify_parallelism;
Secrets Manager
The Secrets Manager provides a unified user interface for secrets across all backends that use them. Secrets can be scoped, so different storage prefixes can have different secrets, allowing, for example, joining data across organizations in a single query. Secrets can also be persisted, so that they do not need to be specified every time DuckDB is launched.
Note. Warning Persistent secrets are stored in unencrypted binary format on the disk.
Secrets
Types of Secrets Secrets are typed; the type identifies which service a secret is for. Currently, the following secret types are available: S3, GCS, R2, and AZURE.
For each type, there are one or more ”secret providers” that specify how the secret is created. Secrets can also have an optional scope,
which is a file path prefix that the secret applies to. When fetching a secret for a path, the secret scopes are compared to the path, returning
the matching secret for the path. In the case of multiple matching secrets, the longest prefix is chosen.
Creating a Secret Secrets can be created using the CREATE SECRET SQL statement. Secrets can be temporary or persistent. Tem‑
porary secrets are used by default – and are stored in‑memory for the life span of the DuckDB instance similar to how settings worked
previously. Persistent secrets are stored in unencrypted binary format in the ~/.duckdb/stored_secrets directory. On startup of
DuckDB, persistent secrets are read from this directory and automatically loaded.
Secret Providers To create a secret, a Secret Provider needs to be used. A Secret Provider is a mechanism through which a secret
is generated. To illustrate this, for the S3, GCS, R2, and AZURE secret types, DuckDB currently supports two providers: CONFIG and
CREDENTIAL_CHAIN. The CONFIG provider requires the user to pass all configuration information into the CREATE SECRET, whereas
the CREDENTIAL_CHAIN provider will automatically try to fetch credentials. When no Secret Provider is specified, the CONFIG provider
is used. For more details on how to create secrets using different providers, check out the respective pages on httpfs and azure.
Temporary Secrets To create a temporary unscoped secret to access S3, we can now use the following:
CREATE SECRET (
TYPE S3,
KEY_ID 'mykey',
SECRET 'mysecret',
REGION 'myregion'
);
Note that we implicitly use the default CONFIG secret provider here.
Persistent Secrets In order to persist secrets between DuckDB database instances, we can now use the CREATE PERSISTENT SECRET
command, e.g.:
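-- a sketch; the secret name and values are placeholders
CREATE PERSISTENT SECRET my_persistent_secret (
    TYPE S3,
    KEY_ID 'my_secret_key',
    SECRET 'my_secret_value'
);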
Deleting Secrets Secrets can be deleted using the DROP SECRET statement, e.g.:
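-- the secret names are placeholders
DROP SECRET my_temporary_secret;
DROP PERSISTENT SECRET my_persistent_secret;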
Creating Multiple Secrets for the Same Service Type If two secrets exist for a service type, the scope can be used to decide which one
should be used. For example:
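-- a sketch; keys and secret values are placeholders
CREATE SECRET secret1 (TYPE S3, KEY_ID 'my_key1', SECRET 'my_secret1', SCOPE 's3://my-bucket');
CREATE SECRET secret2 (TYPE S3, KEY_ID 'my_key2', SECRET 'my_secret2', SCOPE 's3://my-other-bucket');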
Now, if the user queries something from s3://my-other-bucket/something, secret secret2 will be chosen automatically for
that request. To see which secret is being used, the which_secret scalar function can be used, which takes a path and a secret type as
parameters:
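SELECT which_secret('s3://my-other-bucket/something', 's3');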
Listing Secrets Secrets can be listed using the built‑in table‑producing function, e.g., by using the duckdb_secrets() table func‑
tion:
FROM duckdb_secrets();
SQL
SQL Introduction
Here we provide an overview of how to perform simple operations in SQL. This tutorial is only intended to give you an introduction and is
in no way a complete tutorial on SQL. This tutorial is adapted from the PostgreSQL tutorial.
In the examples that follow, we assume that you have installed the DuckDB Command Line Interface (CLI) shell. See the installation page
for information on how to install the CLI.
Concepts
DuckDB is a relational database management system (RDBMS). That means it is a system for managing data stored in relations. A relation
is essentially a mathematical term for a table.
Each table is a named collection of rows. Each row of a given table has the same set of named columns, and each column is of a specific
data type. Tables themselves are stored inside schemas, and a collection of schemas constitutes the entire database that you can access.
You can create a new table by specifying the table name, along with all column names and their types:
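CREATE TABLE weather (
    city    VARCHAR,
    temp_lo INTEGER, -- minimum temperature on a day
    temp_hi INTEGER, -- maximum temperature on a day
    prcp    REAL,
    date    DATE
);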
You can enter this into the shell with the line breaks. The command is not terminated until the semicolon.
White space (i.e., spaces, tabs, and newlines) can be used freely in SQL commands. That means you can type the command aligned differ‑
ently than above, or even all on one line. Two dash characters (--) introduce comments. Whatever follows them is ignored up to the end
of the line. SQL is case insensitive about key words and identifiers.
In the SQL command, we first specify the type of command that we want to perform: CREATE TABLE. After that follows the parameters
for the command. First, the table name, weather, is given. Then the column names and column types follow.
city VARCHAR specifies that the table has a column called city that is of type VARCHAR. VARCHAR specifies a data type that can store
text of arbitrary length. The temperature fields are stored in an INTEGER type, a type that stores integer numbers (i.e., whole numbers
without a decimal point). REAL columns store single precision floating‑point numbers (i.e., numbers with a decimal point). DATE stores a
date (i.e., year, month, day combination). DATE only stores the specific day, not a time associated with that day.
DuckDB supports the standard SQL types INTEGER, SMALLINT, REAL, DOUBLE, DECIMAL, CHAR(n), VARCHAR(n), DATE, TIME and
TIMESTAMP.
The second example will store cities and their associated geographical location:
CREATE TABLE cities (
    name VARCHAR,
    lat  DECIMAL,
    lon  DECIMAL
);
Finally, it should be mentioned that if you don't need a table any longer or want to recreate it differently you can remove it using the
following command:
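DROP TABLE tablename;
Populating a Table with Rows
The INSERT statement is used to populate a table with rows: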
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
Constants that are not numeric values (e.g., text and dates) must be surrounded by single quotes (''), as in the example. Input dates for
the date type must be formatted as 'YYYY-MM-DD'.
The syntax used so far requires you to remember the order of the columns. An alternative syntax allows you to list the columns explicitly:
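INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');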
You can list the columns in a different order if you wish or even omit some columns, e.g., if the prcp is unknown:
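INSERT INTO weather (date, city, temp_hi, temp_lo)
VALUES ('1994-11-29', 'Hayward', 54, 37);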
Many developers consider explicitly listing the columns better style than relying on the order implicitly.
Please enter all the commands shown above so you have some data to work with in the following sections.
You could also have used COPY to load large amounts of data from CSV files. This is usually faster because the COPY command is optimized
for this application while allowing less flexibility than INSERT. An example with weather.csv would be:
COPY weather
FROM 'weather.csv';
Where the file name for the source file must be available on the machine running the process. There are many other ways of loading data
into DuckDB, see the corresponding documentation section for more information.
Querying a Table
To retrieve data from a table, the table is queried. A SQL SELECT statement is used to do this. The statement is divided into a select list
(the part that lists the columns to be returned), a table list (the part that lists the tables from which to retrieve the data), and an optional
qualification (the part that specifies any restrictions). For example, to retrieve all the rows of table weather, type:
SELECT *
FROM weather;
Here * is a shorthand for ”all columns”. So the same result would be had with:
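SELECT city, temp_lo, temp_hi, prcp, date
FROM weather;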
┌───────────────┬─────────┬─────────┬───────┬────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │
│ varchar │ int32 │ int32 │ float │ date │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │
│ San Francisco │ 43 │ 57 │ 0.0 │ 1994-11-29 │
│ Hayward │ 37 │ 54 │ │ 1994-11-29 │
└───────────────┴─────────┴─────────┴───────┴────────────┘
You can write expressions, not just simple column references, in the select list. For example, you can do:
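SELECT city, (temp_hi + temp_lo) / 2 AS temp_avg, date
FROM weather;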
┌───────────────┬──────────┬────────────┐
│ city │ temp_avg │ date │
│ varchar │ double │ date │
├───────────────┼──────────┼────────────┤
│ San Francisco │ 48.0 │ 1994-11-27 │
│ San Francisco │ 50.0 │ 1994-11-29 │
│ Hayward │ 45.5 │ 1994-11-29 │
└───────────────┴──────────┴────────────┘
Notice how the AS clause is used to relabel the output column. (The AS clause is optional.)
A query can be ”qualified” by adding a WHERE clause that specifies which rows are wanted. The WHERE clause contains a Boolean (truth
value) expression, and only rows for which the Boolean expression is true are returned. The usual Boolean operators (AND, OR, and NOT)
are allowed in the qualification. For example, the following retrieves the weather of San Francisco on rainy days:
SELECT *
FROM weather
WHERE city = 'San Francisco' AND prcp > 0.0;
Result:
┌───────────────┬─────────┬─────────┬───────┬────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │
│ varchar │ int32 │ int32 │ float │ date │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │
└───────────────┴─────────┴─────────┴───────┴────────────┘
You can request that the results of a query be returned in sorted order:
SELECT *
FROM weather
ORDER BY city;
┌───────────────┬─────────┬─────────┬───────┬────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │
│ varchar │ int32 │ int32 │ float │ date │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ Hayward │ 37 │ 54 │ │ 1994-11-29 │
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │
│ San Francisco │ 43 │ 57 │ 0.0 │ 1994-11-29 │
└───────────────┴─────────┴─────────┴───────┴────────────┘
In this example, the sort order isn't fully specified, and so you might get the San Francisco rows in either order. But you'd always get the
results shown above if you do:
SELECT *
FROM weather
ORDER BY city, temp_lo;
You can request that duplicate rows be removed from the result of a query:
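SELECT DISTINCT city
FROM weather;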
┌───────────────┐
│ city │
│ varchar │
├───────────────┤
│ Hayward │
│ San Francisco │
└───────────────┘
Here again, the result row ordering might vary. You can ensure consistent results by using DISTINCT and ORDER BY together:
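SELECT DISTINCT city
FROM weather
ORDER BY city;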
Thus far, our queries have only accessed one table at a time. Queries can access multiple tables at once, or access the same table in such a
way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables
at one time is called a join query. As an example, say you wish to list all the weather records together with the location of the associated
city. To do that, we need to compare the city column of each row of the weather table with the name column of all rows in the cities
table, and select the pairs of rows where these values match.
SELECT *
FROM weather, cities
WHERE city = name;
┌───────────────┬─────────┬─────────┬───────┬────────────┬───────────────┬───────────────┬───────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │ name │ lat │ lon │
│ varchar │ int32 │ int32 │ float │ date │ varchar │ decimal(18,3) │ decimal(18,3) │
├───────────────┼─────────┼─────────┼───────┼────────────┼───────────────┼───────────────┼───────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │ San Francisco │ -194.000 │ 53.000 │
│ San Francisco │ 43 │ 57 │ 0.0 │ 1994-11-29 │ San Francisco │ -194.000 │ 53.000 │
└───────────────┴─────────┴─────────┴───────┴────────────┴───────────────┴───────────────┴───────────────┘
• There is no result row for the city of Hayward. This is because there is no matching entry in the cities table for Hayward, so the
join ignores the unmatched rows in the weather table. We will see shortly how this can be fixed.
• There are two columns containing the city name. This is correct because the lists of columns from the weather and cities tables
are concatenated. In practice this is undesirable, though, so you will probably want to list the output columns explicitly rather than
using *:
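SELECT city, temp_lo, temp_hi, prcp, date, lon, lat
FROM weather, cities
WHERE city = name;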
┌───────────────┬─────────┬─────────┬───────┬────────────┬───────────────┬───────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │ lon │ lat │
│ varchar │ int32 │ int32 │ float │ date │ decimal(18,3) │ decimal(18,3) │
├───────────────┼─────────┼─────────┼───────┼────────────┼───────────────┼───────────────┤
│ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │        53.000 │      -194.000 │
│ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │        53.000 │      -194.000 │
└───────────────┴─────────┴─────────┴───────┴────────────┴───────────────┴───────────────┘
Since the columns all had different names, the parser automatically found which table they belong to. If there were duplicate column
names in the two tables you'd need to qualify the column names to show which one you meant, as in:
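SELECT weather.city, weather.temp_lo, weather.temp_hi, weather.prcp, weather.date, cities.lon, cities.lat
FROM weather, cities
WHERE cities.name = weather.city;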
It is widely considered good style to qualify all column names in a join query, so that the query won't fail if a duplicate column name is later
added to one of the tables.
Join queries of the kind seen thus far can also be written in this alternative form:
SELECT *
FROM weather
INNER JOIN cities ON weather.city = cities.name;
This syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.
Now we will figure out how we can get the Hayward records back in. What we want the query to do is to scan the weather table and for
each row to find the matching cities row(s). If no matching row is found we want some ”empty values” to be substituted for the cities
table's columns. This kind of query is called an outer join. (The joins we have seen so far are inner joins.) The command looks like this:
SELECT *
FROM weather
LEFT OUTER JOIN cities ON weather.city = cities.name;
┌───────────────┬─────────┬─────────┬───────┬────────────┬───────────────┬───────────────┬───────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │ name │ lat │ lon │
│ varchar │ int32 │ int32 │ float │ date │ varchar │ decimal(18,3) │ decimal(18,3) │
├───────────────┼─────────┼─────────┼───────┼────────────┼───────────────┼───────────────┼───────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │ San Francisco │ -194.000 │ 53.000 │
│ San Francisco │ 43 │ 57 │ 0.0 │ 1994-11-29 │ San Francisco │ -194.000 │ 53.000 │
│ Hayward │ 37 │ 54 │ │ 1994-11-29 │ │ │ │
└───────────────┴─────────┴─────────┴───────┴────────────┴───────────────┴───────────────┴───────────────┘
This query is called a left outer join because the table mentioned on the left of the join operator will have each of its rows in the output
at least once, whereas the table on the right will only have those rows output that match some row of the left table. When outputting a
left‑table row for which there is no right‑table match, empty (null) values are substituted for the right‑table columns.
Aggregate Functions
Like most other relational database products, DuckDB supports aggregate functions. An aggregate function computes a single result from
multiple input rows. For example, there are aggregates to compute the count, sum, avg (average), max (maximum) and min (minimum)
over a set of rows.
SELECT max(temp_lo)
FROM weather;
┌──────────────┐
│ max(temp_lo) │
│ int32 │
├──────────────┤
│ 46 │
└──────────────┘
If we wanted to know what city (or cities) that reading occurred in, we might try:
SELECT city
FROM weather
WHERE temp_lo = max(temp_lo); -- WRONG
but this will not work since the aggregate max cannot be used in the WHERE clause. (This restriction exists because the WHERE clause
determines which rows will be included in the aggregate calculation; so obviously it has to be evaluated before aggregate functions are
computed.) However, as is often the case the query can be restated to accomplish the desired result, here by using a subquery:
SELECT city
FROM weather
WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
┌───────────────┐
│ city │
│ varchar │
├───────────────┤
│ San Francisco │
└───────────────┘
This is OK because the subquery is an independent computation that computes its own aggregate separately from what is happening in
the outer query.
Aggregates are also very useful in combination with GROUP BY clauses. For example, we can get the maximum low temperature observed
in each city with:
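SELECT city, max(temp_lo)
FROM weather
GROUP BY city;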
┌───────────────┬──────────────┐
│ city │ max(temp_lo) │
│ varchar │ int32 │
├───────────────┼──────────────┤
│ San Francisco │ 46 │
│ Hayward │ 37 │
└───────────────┴──────────────┘
Which gives us one output row per city. Each aggregate result is computed over the table rows matching that city. We can filter these
grouped rows using HAVING:
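SELECT city, max(temp_lo)
FROM weather
GROUP BY city
HAVING max(temp_lo) < 40;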
┌─────────┬──────────────┐
│ city │ max(temp_lo) │
│ varchar │ int32 │
├─────────┼──────────────┤
│ Hayward │ 37 │
└─────────┴──────────────┘
which gives us the same results for only the cities that have all temp_lo values below 40. Finally, if we only care about cities whose names
begin with ”S”, we can use the LIKE operator:
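SELECT city, max(temp_lo)
FROM weather
WHERE city LIKE 'S%'
GROUP BY city
HAVING max(temp_lo) < 40;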
More information about the LIKE operator can be found in the pattern matching page.
It is important to understand the interaction between aggregates and SQL's WHERE and HAVING clauses. The fundamental difference
between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows
go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE
clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the
aggregates. On the other hand, the HAVING clause always contains aggregate functions.
In the previous example, we can apply the city name restriction in WHERE, since it needs no aggregate. This is more efficient than adding
the restriction to HAVING, because we avoid doing the grouping and aggregate calculations for all rows that fail the WHERE check.
Updates
You can update existing rows using the UPDATE command. Suppose you discover the temperature readings are all off by 2 degrees after
November 28. You can correct the data as follows:
UPDATE weather
SET temp_hi = temp_hi - 2, temp_lo = temp_lo - 2
WHERE date > '1994-11-28';
SELECT *
FROM weather;
┌───────────────┬─────────┬─────────┬───────┬────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │
│ varchar │ int32 │ int32 │ float │ date │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │
│ San Francisco │ 41 │ 55 │ 0.0 │ 1994-11-29 │
│ Hayward │ 35 │ 52 │ │ 1994-11-29 │
└───────────────┴─────────┴─────────┴───────┴────────────┘
Deletions
Rows can be removed from a table using the DELETE command. Suppose you are no longer interested in the weather of Hayward. Then
you can do the following to delete those rows from the table:
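DELETE FROM weather
WHERE city = 'Hayward';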
SELECT *
FROM weather;
┌───────────────┬─────────┬─────────┬───────┬────────────┐
│ city │ temp_lo │ temp_hi │ prcp │ date │
│ varchar │ int32 │ int32 │ float │ date │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │ 46 │ 50 │ 0.25 │ 1994-11-27 │
│ San Francisco │ 41 │ 55 │ 0.0 │ 1994-11-29 │
└───────────────┴─────────┴─────────┴───────┴────────────┘
Without a qualification, DELETE will remove all rows from the given table, leaving it empty. The system will not request confirmation
before doing this!
Statements
Statements Overview
ALTER TABLE Statement
The ALTER TABLE statement changes the schema of an existing table in the catalog.
Examples
-- add a new column with name "k" to the table "integers", it will be filled with the default value NULL
ALTER TABLE integers ADD COLUMN k INTEGER;
-- add a new column with name "l" to the table integers, it will be filled with the default value 10
ALTER TABLE integers ADD COLUMN l INTEGER DEFAULT 10;
-- change the type of the column "i" to the type "VARCHAR" using a standard cast
ALTER TABLE integers ALTER i TYPE VARCHAR;
-- change the type of the column "i" to the type "VARCHAR", using the specified expression to convert the data for each row
ALTER TABLE integers ALTER i SET DATA TYPE VARCHAR USING concat(i, '_', j);
-- rename a table
ALTER TABLE integers RENAME TO integers_old;
Syntax
ALTER TABLE changes the schema of an existing table. All the changes made by ALTER TABLE fully respect the transactional semantics,
i.e., they will not be visible to other transactions until committed, and can be fully reverted through a rollback.
RENAME TABLE
-- rename a table
ALTER TABLE integers RENAME TO integers_old;
The RENAME TO clause renames an entire table, changing its name in the schema. Note that any views that rely on the table are not
automatically updated.
RENAME COLUMN
The RENAME COLUMN clause renames a single column within a table. Any constraints that rely on this name (e.g., CHECK constraints) are
automatically updated. However, note that any views that rely on this column name are not automatically updated.
ADD COLUMN
-- add a new column with name "k" to the table "integers", it will be filled with the default value NULL
ALTER TABLE integers ADD COLUMN k INTEGER;
-- add a new column with name "l" to the table integers, it will be filled with the default value 10
ALTER TABLE integers ADD COLUMN l INTEGER DEFAULT 10;
The ADD COLUMN clause can be used to add a new column of a specified type to a table. The new column will be filled with the specified
default value, or NULL if none is specified.
DROP COLUMN
The DROP COLUMN clause can be used to remove a column from a table. Note that columns can only be removed if they do not have any
indexes that rely on them. This includes any indexes created as part of a PRIMARY KEY or UNIQUE constraint. Columns that are part of
multi‑column check constraints cannot be dropped either.
ALTER TYPE
-- change the type of the column "i" to the type "VARCHAR" using a standard cast
ALTER TABLE integers ALTER i TYPE VARCHAR;
-- change the type of the column "i" to the type "VARCHAR", using the specified expression to convert the data for each row
ALTER TABLE integers ALTER i SET DATA TYPE VARCHAR USING concat(i, '_', j);
The SET DATA TYPE clause changes the type of a column in a table. Any data present in the column is converted according to the
provided expression in the USING clause, or, if the USING clause is absent, cast to the new data type. Note that columns can only have
their type changed if they do not have any indexes that rely on them and are not part of any CHECK constraints.
The SET/DROP DEFAULT clause modifies the DEFAULT value of an existing column. Note that this does not modify any existing data in
the column. Dropping the default is equivalent to setting the default value to NULL.
Note. Warning At the moment DuckDB will not allow you to alter a table if there are any dependencies. That means that if you
have an index on a column you will first need to drop the index, alter the table, and then recreate the index. Otherwise you will get a
”Dependency Error.”
Note. The ADD CONSTRAINT and DROP CONSTRAINT clauses are not yet supported in DuckDB.
ALTER VIEW Statement
The ALTER VIEW statement changes the schema of an existing view in the catalog.
Examples
-- rename a view
ALTER VIEW v1 RENAME TO v2;
ALTER VIEW changes the schema of an existing view. All the changes made by ALTER VIEW fully respect the transactional semantics, i.e., they will not be visible to other transactions until committed, and can be fully reverted through a rollback. Note that other views that rely on the view are not automatically updated.
ATTACH/DETACH Statement
The ATTACH statement adds a new database file to the catalog that can be read from and written to.
Examples
-- attach the database "file.db" with the alias inferred from the name ("file")
ATTACH 'file.db';
-- attach the database "file.db" with an explicit alias ("file_db")
ATTACH 'file.db' AS file_db;
-- attach the database "file.db" in read only mode
ATTACH 'file.db' (READ_ONLY);
-- attach a SQLite database for reading and writing (see the sqlite extension for more information)
ATTACH 'sqlite_file.db' AS sqlite_db (TYPE SQLITE);
-- attach the database "file.db" if inferred database alias "file_db" does not yet exist
ATTACH IF NOT EXISTS 'file.db';
-- attach the database "file.db" if explicit database alias "file_db" does not yet exist
ATTACH IF NOT EXISTS 'file.db' AS file_db;
-- create a table in the attached database with alias "file"
CREATE TABLE file.new_table (i INTEGER);
-- detach the database with alias "file"
DETACH file;
-- show a list of all attached databases
SHOW DATABASES;
-- change the default database that is used to the database "file"
USE file;
Attach
Attach Syntax ATTACH allows DuckDB to operate on multiple database files, and allows for transfer of data between different database
files.
Detach
The DETACH statement allows previously attached database files to be closed and detached, releasing any locks held on the database file.
It is not possible to detach from the default database: if you would like to do so, issue the USE statement to change the default database
to another one.
Note. Warning Closing the connection, e.g., invoking the close() function in Python, does not release the locks held on the
database files as the file handles are held by the main DuckDB instance (in Python's case, the duckdb module).
Detach Syntax
Name Qualification
The fully qualified name of catalog objects contains the catalog, the schema and the name of the object. For example:
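-- a fully qualified reference follows the pattern ⟨catalog⟩.⟨schema⟩.⟨object⟩; the names below are illustrative
SELECT *
FROM my_db.main.my_table;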
Note that often the fully qualified name is not required. When a name is not fully qualified, the system looks for which entries to reference
using the catalog search path. The default catalog search path includes the system catalog, the temporary catalog and the initially attached
database together with the main schema.
Default Database and Schema When a table is created without any qualifications, the table is created in the default schema of the default
database. The default database is the database that is launched when the system is created ‑ and the default schema is main.
Changing the Default Database and Schema The default database and schema can be changed using the USE command.
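For example:
-- make the attached database "new_db" the default database
USE new_db;
-- make a specific schema the default
USE new_db.main;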
Resolving Conflicts When providing only a single qualification, the system can interpret this as either a catalog or a schema, as long as
there are no conflicts. For example:
ATTACH 'new_db.db';
CREATE SCHEMA my_schema;
-- creates the table "new_db.main.tbl"
CREATE TABLE new_db.tbl (i INTEGER);
-- creates the table "default_db.my_schema.tbl"
CREATE TABLE my_schema.tbl (i INTEGER);
If we create a conflict (i.e., we have both a schema and a catalog with the same name) the system requests that a fully qualified path is used
instead:
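A minimal sketch of such a conflict, continuing the example above:
-- "new_db" is both an attached catalog and (after this statement) a schema in the default database
CREATE SCHEMA new_db;
-- a reference such as "new_db.tbl" is now ambiguous; use a fully qualified name such as "new_db.main.tbl" instead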
Changing the Catalog Search Path The catalog search path can be adjusted by setting the search_path configuration option, which
uses a comma‑separated list of values that will be on the search path. The following example demonstrates searching in two databases:
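ATTACH ':memory:' AS db1;
ATTACH ':memory:' AS db2;
SET search_path = 'db1,db2';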
Transactional Semantics
When running queries on multiple databases, the system opens separate transactions per database. The transactions are started lazily by
default ‑ when a given database is referenced for the first time in a query, a transaction for that database will be started. SET immediate_
transaction_mode = true can be toggled to change this behavior to eagerly start transactions in all attached databases instead.
While multiple transactions can be active at a time, the system only supports writing to a single attached database in a single transaction. If you try to write to multiple attached databases in a single transaction, the following error will be thrown:
Attempting to write to database "db2" in a transaction that has already modified database "db1" -
a single transaction can only write to a single attached database.
The reason for this restriction is that the system does not maintain atomicity for transactions across attached databases. Transactions are
only atomic within each database file. By restricting the global transaction to write to only a single database file the atomicity guarantees
are maintained.
CALL Statement
The CALL statement invokes the given table function and returns the results.
Examples
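-- invoke the pragma_table_info table function on the table 'my_table' (name illustrative)
CALL pragma_table_info('my_table');
-- list all functions
CALL duckdb_functions();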
Syntax
CHECKPOINT Statement
The CHECKPOINT statement synchronizes data in the write‑ahead log (WAL) to the database data file. For in‑memory databases this
statement will succeed with no effect.
Examples
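-- synchronize data for the default database
CHECKPOINT;
-- synchronize data for the database with alias "file_db"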
CHECKPOINT file_db;
-- Abort any in-progress transactions to synchronize the data
FORCE CHECKPOINT;
Syntax
Checkpoint operations happen automatically based on the WAL size (see Configuration). This statement is for manual checkpoint ac‑
tions.
Behavior
The default CHECKPOINT command will fail if there are any running transactions. Including FORCE will abort any transactions and execute
the checkpoint operation.
Also see the related PRAGMA option for further behavior modification.
Reclaiming Space When performing a checkpoint (automatic or otherwise), the space occupied by deleted rows is partially reclaimed.
Note that this does not remove all deleted rows, but rather merges row groups that have a significant amount of deletes together. In the
current implementation this requires ~25% of rows to be deleted in adjacent row groups.
When running in in‑memory mode, checkpointing has no effect, hence it does not reclaim space after deletes in in‑memory databases.
Note. Warning The VACUUM statement does not trigger vacuuming deletes and hence does not reclaim space.
COMMENT ON Statement
The COMMENT ON statement allows adding metadata to catalog entries (tables, columns, etc.). It follows the PostgreSQL syntax.
Examples
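-- the table and column names are illustrative
COMMENT ON TABLE test_table IS 'very nice table';
COMMENT ON COLUMN test_table.test_column IS 'very nice column';
-- to unset a comment, set it to NULL
COMMENT ON TABLE test_table IS NULL;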
Reading Comments
Comments can be read by querying the comment column of the respective metadata functions:
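-- e.g., comments on tables and columns:
SELECT comment FROM duckdb_tables();
SELECT comment FROM duckdb_columns();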
Limitations
Syntax
COPY Statement
Examples
-- read a CSV file into the lineitem table, using auto-detected CSV options
COPY lineitem FROM 'lineitem.csv';
-- read a CSV file into the lineitem table, using manually specified CSV options
COPY lineitem FROM 'lineitem.csv' (DELIMITER '|');
-- read a Parquet file into the lineitem table
COPY lineitem FROM 'lineitem.pq' (FORMAT PARQUET);
-- read a JSON file into the lineitem table, using auto-detected options
COPY lineitem FROM 'lineitem.json' (FORMAT JSON, AUTO_DETECT true);
-- read a CSV file into the lineitem table, using double quotes
COPY lineitem FROM "lineitem.csv";
-- read a CSV file into the lineitem table, omitting quotes
COPY lineitem FROM lineitem.csv;
Overview
COPY moves data between DuckDB and external files. COPY ... FROM imports data into DuckDB from an external file. COPY ... TO
writes data from DuckDB to an external file. The COPY command can be used for CSV, PARQUET and JSON files.
COPY ... FROM imports data from an external file into an existing table. The data is appended to whatever data is in the table already. The number of columns in the file must match the number of columns in the table table_name, and the contents of the columns must be convertible to the column types of the table. If this is not possible, an error will be thrown.
If a list of columns is specified, COPY will only copy the data in the specified columns from the file. If there are any columns in the table that are not in the column list, COPY ... FROM will insert the default values for those columns.
-- Copy the contents of a comma-separated file 'test.csv' without a header into the table 'test'
COPY test FROM 'test.csv';
-- Copy the contents of a comma-separated file with a header into the 'category' table
COPY category FROM 'categories.csv' (HEADER);
-- Copy the contents of 'lineitem.tbl' into the 'lineitem' table, where the contents are delimited by a pipe character ('|')
COPY lineitem FROM 'lineitem.tbl' (DELIMITER '|');
-- Copy the contents of 'lineitem.tbl' into the 'lineitem' table, where the delimiter, quote character, and presence of a header are automatically detected
COPY lineitem FROM 'lineitem.tbl' (AUTO_DETECT true);
-- Read the contents of a comma-separated file 'names.csv' into the 'name' column of the 'category' table. Any other columns of this table are filled with their default value.
COPY category(name) FROM 'names.csv';
-- Read the contents of a Parquet file 'lineitem.parquet' into the lineitem table
COPY lineitem FROM 'lineitem.parquet' (FORMAT PARQUET);
-- Read the contents of a newline-delimited JSON file 'lineitem.ndjson' into the lineitem table
COPY lineitem FROM 'lineitem.ndjson' (FORMAT JSON);
-- Read the contents of a JSON file 'lineitem.json' into the lineitem table
COPY lineitem FROM 'lineitem.json' (FORMAT JSON, ARRAY true);
Syntax
COPY ... TO
COPY ... TO exports data from DuckDB to an external CSV or Parquet file. It has mostly the same set of options as COPY ... FROM,
however, in the case of COPY ... TO the options specify how the file should be written to disk. Any file created by COPY ... TO can
be copied back into the database by using COPY ... FROM with a similar set of options.
The COPY ... TO function can be called specifying either a table name, or a query. When a table name is specified, the contents of the
entire table will be written into the resulting file. When a query is specified, the query is executed and the result of the query is written to
the resulting file.
-- Copy the contents of the 'lineitem' table to a CSV file with a header
COPY lineitem TO 'lineitem.csv';
-- Copy the contents of the 'lineitem' table to the file 'lineitem.tbl',
-- where the columns are delimited by a pipe character ('|'), including a header line.
COPY lineitem TO 'lineitem.tbl' (DELIMITER '|');
-- Use tab separators to create a TSV file without a header
COPY lineitem TO 'lineitem.tsv' (DELIMITER '\t', HEADER false);
-- Copy the l_orderkey column of the 'lineitem' table to the file 'orderkey.tbl'
COPY lineitem(l_orderkey) TO 'orderkey.tbl' (DELIMITER '|');
-- Copy the result of a query to the file 'query.csv', including a header with column names
COPY (SELECT 42 AS a, 'hello' AS b) TO 'query.csv' (DELIMITER ',');
-- Copy the result of a query to the Parquet file 'query.parquet'
COPY (SELECT 42 AS a, 'hello' AS b) TO 'query.parquet' (FORMAT PARQUET);
-- Copy the result of a query to the newline-delimited JSON file 'query.ndjson'
COPY (SELECT 42 AS a, 'hello' AS b) TO 'query.ndjson' (FORMAT JSON);
-- Copy the result of a query to the JSON file 'query.json'
COPY (SELECT 42 AS a, 'hello' AS b) TO 'query.json' (FORMAT JSON, ARRAY true);
COPY ... TO Options Zero or more copy options may be provided as a part of the copy operation. The WITH specifier is optional, but
if any options are specified, the parentheses are required. Parameter values can be passed in with or without wrapping in single quotes.
Any option that is a Boolean can be enabled or disabled in multiple ways. You can write true, ON, or 1 to enable the option, and false,
OFF, or 0 to disable it. The BOOLEAN value can also be omitted, e.g., by only passing (HEADER), in which case true is assumed.
The below options are applicable to all formats written with COPY.
Syntax
COPY FROM DATABASE ... TO
The COPY FROM DATABASE ... TO statement copies the entire content of one attached database to another attached database. This includes the schema, including constraints, indexes, sequences, and macros, as well as the data itself.
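A minimal sketch, assuming a table db1.tbl with an integer column z containing the value 87:
ATTACH 'db1.db' AS db1;
ATTACH 'db2.db' AS db2;
COPY FROM DATABASE db1 TO db2;
SELECT z FROM db2.tbl;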
┌───────┐
│ z │
│ int32 │
├───────┤
│ 87 │
└───────┘
To only copy the schema of db1 to db2 but omit copying the data, add SCHEMA to the statement:
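COPY FROM DATABASE db1 TO db2 (SCHEMA);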
Syntax
Format‑Specific Options
CSV Options The below options are applicable when writing CSV files.
compression (VARCHAR, default auto): The compression type for the file. By default this will be detected automatically from the file extension (e.g., file.csv.gz will use gzip, file.csv will use none). Options are none, gzip, zstd.
force_quote (VARCHAR[], default []): The list of columns to always add quotes to, even if not required.
dateformat (VARCHAR, default empty): Specifies the date format to use when writing dates. See Date Format.
delim or sep (VARCHAR, default ,): The character that is written to separate columns within each row.
escape (VARCHAR, default "): The character that should appear before a character that matches the quote value.
header (BOOL, default true): Whether or not to write a header for the CSV file.
nullstr (VARCHAR, default empty): The string that is written to represent a NULL value.
quote (VARCHAR, default "): The quoting character to be used when a data value is quoted.
timestampformat (VARCHAR, default empty): Specifies the date format to use when writing timestamps. See Date Format.
Parquet Options The below options are applicable when writing Parquet files.
JSON Options The below options are applicable when writing JSON files.
compression (VARCHAR, default auto): The compression type for the file. By default this will be detected automatically from the file extension (e.g., file.csv.gz will use gzip, file.csv will use none). Options are none, gzip, zstd.
dateformat (VARCHAR, default empty): Specifies the date format to use when writing dates. See Date Format.
timestampformat (VARCHAR, default empty): Specifies the date format to use when writing timestamps. See Date Format.
array (BOOL, default false): Whether to write a JSON array. If true, a JSON array of records is written; if false, newline‑delimited JSON is written.
CREATE MACRO Statement
The CREATE MACRO statement can create a scalar or table macro (function) in the catalog. A macro may only be a single SELECT statement
(similar to a VIEW), but it has the benefit of accepting parameters. For a scalar macro, CREATE MACRO is followed by the name of the
macro, and optionally parameters within a set of parentheses. The keyword AS is next, followed by the text of the macro. By design, a scalar
macro may only return a single value. For a table macro, the syntax is similar to a scalar macro except AS is replaced with AS TABLE. A
table macro may return a table of arbitrary size and shape.
Note. If a MACRO is temporary, it is only usable within the same database connection and is deleted when the connection is closed.
Examples
Scalar Macros
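-- a sketch of a scalar macro that adds two expressions
CREATE MACRO add(a, b) AS a + b;
SELECT add(1, 2); -- returns 3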
Table Macros
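-- a sketch of a table macro returning a constant table
CREATE MACRO static_table() AS TABLE
    SELECT 'Hello' AS column1, 'World' AS column2;
SELECT * FROM static_table();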
Syntax
Macros can have default parameters. Unlike some languages, default parameters must be named when the macro is invoked.
-- b is a default parameter
CREATE MACRO add_default(a, b := 5) AS a + b;
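SELECT add_default(37); -- returns 42: b defaults to 5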
When macros are used, they are expanded (i.e., replaced with the original expression), and the parameters within the expanded expression are replaced with the supplied arguments.
Limitations
Using Named Parameters Currently, positional macro parameters can only be used positionally, and named parameters can only be
used by supplying their name. Therefore, the following will not work:
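-- a sketch: one positional parameter and one default parameter
CREATE MACRO test(a, b := 5) AS a + b;
-- supplying both arguments positionally fails:
SELECT test(32, 52);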
Error: Binder Error: Macro function 'test(a)' requires a single positional argument, but 2 positional
arguments were provided.
LINE 1: SELECT test(32, 52);
^
Using Subquery Macros If a MACRO is defined as a subquery, it cannot be invoked in a table function; DuckDB will return a binder error.
CREATE SCHEMA Statement
The CREATE SCHEMA statement creates a schema in the catalog. The default schema is main.
Examples
-- create a schema
CREATE SCHEMA s1;
-- create a schema if it does not exist yet
CREATE SCHEMA IF NOT EXISTS s2;
-- create table in the schemas
CREATE TABLE s1.t (id INTEGER PRIMARY KEY, other_id INTEGER);
CREATE TABLE s2.t (id INTEGER PRIMARY KEY, j VARCHAR);
Syntax
CREATE SECRET Statement
The CREATE SECRET statement creates a new secret in the Secrets Manager. See the Secrets Manager page for examples.
CREATE SEQUENCE Statement
The CREATE SEQUENCE statement creates a new sequence number generator.
Examples
-- generate an ascending sequence starting from 1
CREATE SEQUENCE serial;
-- generate sequence from a given start number
CREATE SEQUENCE serial START 101;
-- generate odd numbers using INCREMENT BY
CREATE SEQUENCE serial START WITH 1 INCREMENT BY 2;
-- generate a descending sequence starting from 99
CREATE SEQUENCE serial START WITH 99 INCREMENT BY -1 MAXVALUE 99;
-- by default, cycles are not allowed and will result in a Serialization Error, e.g.:
-- reached maximum value of sequence "serial" (10)
CREATE SEQUENCE serial START WITH 1 MAXVALUE 10;
-- CYCLE allows cycling through the same sequence repeatedly
CREATE SEQUENCE serial START WITH 1 MAXVALUE 10 CYCLE;
Creating and Dropping Sequences Sequences can be created and dropped similarly to other catalog items:
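CREATE SEQUENCE serial;
DROP SEQUENCE serial;
-- with IF EXISTS, dropping does not fail if the sequence is missing
DROP SEQUENCE IF EXISTS serial;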
Using Sequences for Primary Keys Sequences can provide an integer primary key for a table. For example:
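-- a sketch; the sequence name is illustrative
CREATE SEQUENCE id_sequence START 1;
CREATE TABLE tbl (id INTEGER DEFAULT nextval('id_sequence'), s VARCHAR);
INSERT INTO tbl (s) VALUES ('hello'), ('world');
SELECT * FROM tbl;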
┌───────┬─────────┐
│ id │ s │
│ int32 │ varchar │
├───────┼─────────┤
│ 1 │ hello │
│ 2 │ world │
└───────┴─────────┘
Sequences can also be added using the ALTER TABLE statement. The following example adds an id column and fills it with values
generated by the sequence:
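-- assuming a sequence as created above
CREATE SEQUENCE id_sequence START 1;
ALTER TABLE tbl ADD COLUMN id INTEGER DEFAULT nextval('id_sequence');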
Selecting the Next Value Select the next number from a sequence:
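SELECT nextval('serial') AS nextval;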
┌─────────┐
│ nextval │
│ int64 │
├─────────┤
│ 1 │
└─────────┘
Selecting the Current Value You may also view the current number from the sequence. Note that the nextval function must have
already been called before calling currval, otherwise a Serialization Error (”sequence is not yet defined in this session”) will be thrown.
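SELECT currval('serial') AS currval;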
┌─────────┐
│ currval │
│ int64 │
├─────────┤
│ 1 │
└─────────┘
If a schema name is given then the sequence is created in the specified schema. Otherwise it is created in the current schema. Temporary
sequences exist in a special schema, so a schema name may not be given when creating a temporary sequence. The sequence name must
be distinct from the name of any other sequence in the same schema.
After a sequence is created, you use the function nextval to operate on the sequence.
Parameters
CYCLE or NO CYCLE: The CYCLE option allows the sequence to wrap around when the maxvalue or minvalue has been reached by an ascending or descending sequence, respectively. If the limit is reached, the next number generated will be the minvalue or maxvalue, respectively. If NO CYCLE is specified, any calls to nextval after the sequence has reached its maximum value will return an error. If neither CYCLE nor NO CYCLE is specified, NO CYCLE is the default.
increment: The optional clause INCREMENT BY increment specifies which value is added to the current sequence value to create a new value. A positive value will make an ascending sequence, a negative one a descending sequence. The default value is 1.
maxvalue: The optional clause MAXVALUE maxvalue determines the maximum value for the sequence. If this clause is not supplied or NO MAXVALUE is specified, then default values will be used. The defaults are 2^63 - 1 and -1 for ascending and descending sequences, respectively.
minvalue: The optional clause MINVALUE minvalue determines the minimum value a sequence can generate. If this clause is not supplied or NO MINVALUE is specified, then defaults will be used. The defaults are 1 and -(2^63 - 1) for ascending and descending sequences, respectively.
name: The name (optionally schema‑qualified) of the sequence to be created.
start: The optional clause START WITH start allows the sequence to begin anywhere. The default starting value is minvalue for ascending sequences and maxvalue for descending ones.
TEMPORARY or TEMP: If specified, the sequence object is created only for this session, and is automatically dropped on session exit. Existing permanent sequences with the same name are not visible (in this session) while the temporary sequence exists, unless they are referenced with schema‑qualified names.
Note. Sequences are based on BIGINT arithmetic, so the range cannot exceed the range of an eight‑byte integer (‑9223372036854775808
to 9223372036854775807).
CREATE TABLE Statement
Examples
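(A sketch of typical usage; table and column names are illustrative.)
-- create a table with two integer columns (i and j)
CREATE TABLE t1 (i INTEGER, j INTEGER);
-- create a table with a primary key
CREATE TABLE t1 (id INTEGER PRIMARY KEY, j VARCHAR);
-- create a table from the result of a query
CREATE TABLE t1 AS SELECT 42 AS i, 84 AS j;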
Temporary Tables
Temporary tables can be created using the CREATE TEMP TABLE or the CREATE TEMPORARY TABLE statement (see diagram below).
Temporary tables are session scoped (similar to PostgreSQL for example), meaning that only the specific connection that created them can
access them, and once the connection to DuckDB is closed they will be automatically dropped. Temporary tables reside in memory rather
than on disk (even when connecting to a persistent DuckDB), but if the temp_directory configuration is set when connecting or with a
SET command, data will be spilled to disk if memory becomes constrained.
-- create a temporary table from a CSV file (automatically detecting column names and types)
CREATE TEMP TABLE t1 AS SELECT * FROM read_csv('path/file.csv');
Temporary tables are part of the temp.main schema. While discouraged, their names can overlap with the names of the regular database
tables. In these cases, use their fully qualified name, e.g., temp.main.t1, for disambiguation.
CREATE OR REPLACE
The CREATE OR REPLACE syntax allows a new table to be created or for an existing table to be overwritten by the new table. This is
shorthand for dropping the existing table and then creating the new one.
-- create a table with two integer columns (i and j) even if t1 already exists
CREATE OR REPLACE TABLE t1 (i INTEGER, j INTEGER);
IF NOT EXISTS
The IF NOT EXISTS syntax will only proceed with the creation of the table if it does not already exist. If the table already exists, no action
will be taken and the existing table will remain in the database.
-- create a table with two integer columns (i and j) only if t1 does not exist yet
CREATE TABLE IF NOT EXISTS t1 (i INTEGER, j INTEGER);
Check Constraints
A CHECK constraint is an expression that must be satisfied by the values of every row in the table.
CREATE TABLE t1 (
id INTEGER PRIMARY KEY,
percentage INTEGER CHECK (0 <= percentage AND percentage <= 100)
);
INSERT INTO t1 VALUES (1, 5);
INSERT INTO t1 VALUES (2, -1);
-- Error: Constraint Error: CHECK constraint failed: t1
INSERT INTO t1 VALUES (3, 101);
-- Error: Constraint Error: CHECK constraint failed: t1
CREATE TABLE t2 (id INTEGER PRIMARY KEY, x INTEGER, y INTEGER CHECK (x < y));
INSERT INTO t2 VALUES (1, 5, 10);
INSERT INTO t2 VALUES (2, 5, 3);
-- Error: Constraint Error: CHECK constraint failed: t2
CREATE TABLE t3 (
id INTEGER PRIMARY KEY,
x INTEGER,
y INTEGER,
CONSTRAINT x_smaller_than_y CHECK (x < y)
);
INSERT INTO t3 VALUES (1, 5, 10);
INSERT INTO t3 VALUES (2, 5, 3);
-- Error: Constraint Error: CHECK constraint failed: t3
Foreign Keys
A FOREIGN KEY is a column (or set of columns) that references another table's primary key. Foreign keys check referential integrity, i.e., the referred primary key must exist in the other table upon insertion.
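The examples below assume table definitions along these lines (a sketch; the referencing column names are assumptions):
CREATE TABLE t1 (id INTEGER PRIMARY KEY, j VARCHAR);
CREATE TABLE t2 (
    id INTEGER PRIMARY KEY,
    t1_id INTEGER,
    FOREIGN KEY (t1_id) REFERENCES t1 (id)
);
-- foreign keys can also reference a composite primary key
CREATE TABLE t3 (id INTEGER, j VARCHAR, PRIMARY KEY (id, j));
CREATE TABLE t4 (
    id INTEGER PRIMARY KEY,
    t3_id INTEGER,
    t3_j VARCHAR,
    FOREIGN KEY (t3_id, t3_j) REFERENCES t3 (id, j)
);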
-- insert a row into the referenced table, then rows referencing it
INSERT INTO t1 VALUES (1, 'a');
INSERT INTO t2 VALUES (1, 1);
INSERT INTO t2 VALUES (2, 2);
-- Error: Constraint Error: Violates foreign key constraint because key "id: 2" does not exist in the referenced table
-- the same works for composite primary keys
INSERT INTO t3 VALUES (1, 'a');
INSERT INTO t4 VALUES (1, 1, 'a');
INSERT INTO t4 VALUES (2, 1, 'b');
-- Error: Constraint Error: Violates foreign key constraint because key "id: 1, j: b" does not exist in the referenced table
Note. Foreign keys with cascading deletes (FOREIGN KEY ... REFERENCES ... ON DELETE CASCADE) are not supported.
Generated Columns
The [type] [GENERATED ALWAYS] AS (expr) [VIRTUAL|STORED] syntax will create a generated column. The data in this kind of column is generated from its expression, which can reference other (regular or generated) columns of the table. Since they are produced by calculations, these columns cannot be inserted into directly.
DuckDB can infer the type of the generated column based on the expression's return type. This allows you to leave out the type when declaring a generated column. It is possible to explicitly set a type, but insertions into the referenced columns might fail if the type cannot be cast to the type of the generated column.
Generated columns come in two varieties: VIRTUAL and STORED. The data of virtual generated columns is not stored on disk, instead it
is computed from the expression every time the column is referenced (through a select statement).
The data of stored generated columns is stored on disk and is computed every time the data of their dependencies change (through an
insert/update/drop statement).
Currently, only the VIRTUAL kind is supported, and it is also the default option if the last field is left blank.
Syntax
Examples
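(A sketch; the type of two_x is inferred from its expression.)
CREATE TABLE t1 (x FLOAT, two_x AS (2 * x));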
CREATE VIEW Statement
The SQL query behind an existing view can be read using the duckdb_views() function like this:
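(A sketch; the view name v1 is illustrative.)
SELECT sql FROM duckdb_views() WHERE view_name = 'v1';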
Syntax
CREATE VIEW defines a view of a query. The view is not physically materialized. Instead, the query is run every time the view is referenced
in a query.
CREATE OR REPLACE VIEW is similar, but if a view of the same name already exists, it is replaced.
If a schema name is given then the view is created in the specified schema. Otherwise it is created in the current schema. Temporary views
exist in a special schema, so a schema name cannot be given when creating a temporary view. The name of the view must be distinct from
the name of any other view or table in the same schema.
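For example (a sketch; the view and table names are illustrative):
-- create a simple view
CREATE VIEW v1 AS SELECT * FROM tbl;
-- create a view with two renamed columns
CREATE VIEW v1(a, b) AS SELECT i, j FROM tbl;
-- create or replace the view
CREATE OR REPLACE VIEW v1 AS SELECT 42;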
CREATE TYPE Statement
Examples
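(A sketch; the type names are illustrative.)
-- create an enum type
CREATE TYPE mood AS ENUM ('happy', 'sad', 'curious');
-- create a struct type
CREATE TYPE many_things AS STRUCT(k INTEGER, l VARCHAR);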
Syntax
CREATE TYPE defines a new data type available to this DuckDB instance. These new types can then be inspected in the duckdb_types table.
Extending these custom types to support custom operators (such as the PostgreSQL && operator) would require C++ development. To do
this, create an extension.
DELETE Statement
The DELETE statement removes rows from the table identified by the table‑name.
Examples
-- remove the rows matching the condition "i = 2" from the database
DELETE FROM tbl WHERE i = 2;
-- delete all rows in the table "tbl"
DELETE FROM tbl;
-- the TRUNCATE statement removes all rows from a table,
-- acting as an alias for DELETE FROM without a WHERE clause
TRUNCATE tbl;
Syntax
The DELETE statement removes rows from the table identified by the table‑name.
If the WHERE clause is not present, all records in the table are deleted. If a WHERE clause is supplied, then only those rows for which the
WHERE clause results in true are deleted. Rows for which the expression is false or NULL are retained.
The USING clause allows deleting based on the content of other tables or subqueries.
Running DELETE does not mean space is reclaimed. In general, rows are only marked as deleted. DuckDB's support for VACUUM is limited
to vacuuming entire row groups.
DROP Statement
The DROP statement removes a catalog entry added previously with the CREATE command.
Examples
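(A sketch; the object names are illustrative.)
-- delete the table with the name "tbl"
DROP TABLE tbl;
-- drop the view "v1"; do not throw an error if it does not exist
DROP VIEW IF EXISTS v1;
-- drop the schema "sch" and all objects that depend on it
DROP SCHEMA sch CASCADE;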
Syntax
The optional IF EXISTS clause suppresses the error that would normally result if the table does not exist.
By default (or if the RESTRICT clause is provided), the entry will not be dropped if there are any other objects that depend on it. If the
CASCADE clause is provided then all the objects that are dependent on the object will be dropped as well.
Running DROP TABLE should free the memory used by the table, but not always disk space. Even if disk space does not decrease, the free
blocks will be marked as "free". For example, if we have a 2 GB file and we drop a 1 GB table, the file might still be 2 GB, but it should have
1 GB of free blocks in it. To check this, use the following PRAGMA and check the number of free_blocks in the output:
PRAGMA database_size;
The EXPORT DATABASE command allows you to export the contents of the database to a specific directory. The IMPORT DATABASE
command allows you to then read the contents again.
Examples
-- export the database to the directory 'source_directory' as CSV files
EXPORT DATABASE 'source_directory';
-- export as Parquet files, compressed with ZSTD, with a row group size of 100,000
EXPORT DATABASE 'source_directory' (
    FORMAT PARQUET,
    COMPRESSION ZSTD,
    ROW_GROUP_SIZE 100_000
);
-- reload the database again
IMPORT DATABASE 'source_directory';
-- alternatively, use a PRAGMA
PRAGMA import_database('source_directory');
For details regarding the writing of Parquet files, see the Parquet Files page in the Data Import section, and the COPY Statement page.
EXPORT DATABASE
The EXPORT DATABASE command exports the full contents of the database ‑ including schema information, tables, views and sequences
‑ to a specific directory that can then be loaded again. The created directory will be structured as follows:
target_directory/schema.sql
target_directory/load.sql
target_directory/t_1.csv
...
target_directory/t_n.csv
The schema.sql file contains the schema statements that are found in the database. It contains any CREATE SCHEMA, CREATE TABLE,
CREATE VIEW and CREATE SEQUENCE commands that are necessary to re‑construct the database.
The load.sql file contains a set of COPY statements that can be used to read the data from the CSV files again. The file contains a single
COPY statement for every table found in the schema.
Syntax
IMPORT DATABASE
The database can be reloaded by using the IMPORT DATABASE command again, or manually by running schema.sql followed by
load.sql to re‑load the data.
Syntax
INSERT Statement
Examples
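(A sketch; the table names are illustrative.)
-- insert the values (1), (2), (3) into "tbl"
INSERT INTO tbl VALUES (1), (2), (3);
-- insert the result of a query into "tbl"
INSERT INTO tbl SELECT * FROM other_tbl;
-- insert values into the "i" column, filling the other columns with their default values
INSERT INTO tbl (i) VALUES (1), (2), (3);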
Syntax
INSERT INTO inserts new rows into a table. One can insert one or more rows specified by value expressions, or zero or more rows resulting from a query.
It's possible to provide an optional insert column order; this can either be BY POSITION (the default) or BY NAME. Each column not present in the explicit or implicit column list will be filled with a default value, either its declared default value or NULL if there is none.
If the expression for any column is not of the correct data type, automatic type conversion will be attempted.
INSERT INTO ... [BY POSITION] The order that values are inserted into the columns of the table is determined by the order that the columns were declared in. That is, the values supplied by the VALUES clause or query are associated with the column list left-to-right. This is the default option, which can be explicitly specified using the BY POSITION option. For example:
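(A sketch; the table name is illustrative.)
CREATE TABLE tbl (a INTEGER, b INTEGER);
INSERT INTO tbl VALUES (1, 2);
-- the same insert, with the default order stated explicitly
INSERT INTO tbl BY POSITION VALUES (1, 2);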
To use a different order, column names can be provided as part of the target, for example:
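-- continuing the sketch above: provide a column list to insert in a different order
INSERT INTO tbl (b, a) VALUES (2, 1);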
INSERT INTO ... BY NAME The names of the column list of the SELECT statement are matched against the column names of the
table to determine the order that values should be inserted into the table, even if the order of the columns in the table differs from the order
of the values in the SELECT statement. For example:
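(A sketch; the table name is illustrative.)
CREATE TABLE tbl (a INTEGER, b INTEGER);
INSERT INTO tbl BY NAME (SELECT 42 AS b);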
This will insert 42 into b and insert NULL (or its default value) into a.
It's important to note that when using INSERT INTO ... BY NAME, the column names specified in the SELECT statement must match
the column names in the table. If a column name is misspelled or does not exist in the table, an error will occur. Columns that are missing
from the SELECT statement will be filled with the default value.
ON CONFLICT Clause
An ON CONFLICT clause can be used to perform a certain action on conflicts that arise from UNIQUE or PRIMARY KEY constraints. Such a conflict is shown in the following example:
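(A sketch; the table name is illustrative.)
CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
INSERT INTO tbl VALUES (1, 42);
INSERT INTO tbl VALUES (1, 84);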
This raises an error and leaves the table with a single row <i: 1, j: 42>.
Error: Constraint Error: Duplicate key "i: 1" violates primary key constraint.
There are two supported actions: DO NOTHING and DO UPDATE SET ...
DO NOTHING Clause The DO NOTHING clause causes the error(s) to be ignored, and the values are not inserted or updated. For
example:
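(A sketch; the table name is illustrative.)
CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
INSERT INTO tbl VALUES (1, 42);
INSERT INTO tbl VALUES (1, 84) ON CONFLICT DO NOTHING;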
These statements finish successfully and leave the table with the row <i: 1, j: 42>.
Shorthand The INSERT OR IGNORE INTO ... statement is a shorter syntax alternative to INSERT INTO ... ON CONFLICT
DO NOTHING. For example, the following statements are equivalent:
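-- a sketch, continuing with the same tbl as above
INSERT OR IGNORE INTO tbl VALUES (1, 84);
INSERT INTO tbl VALUES (1, 84) ON CONFLICT DO NOTHING;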
DO UPDATE Clause The DO UPDATE clause causes the INSERT to turn into an UPDATE on the conflicting row(s) instead. The SET ex‑
pressions that follow determine how these rows are updated. The expressions can use the special virtual table EXCLUDED, which contains
the conflicting values for the row. Optionally you can provide an additional WHERE clause that can exclude certain rows from the update.
The conflicts that don't meet this condition are ignored instead.
Because we need a way to refer to both the to‑be‑inserted tuple and the existing tuple, we introduce the special EXCLUDED qualifier.
When the EXCLUDED qualifier is provided, the reference refers to the to‑be‑inserted tuple, otherwise it refers to the existing tuple. This
special qualifier can be used within the WHERE clauses and SET expressions of the ON CONFLICT clause.
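(A sketch; the table name is illustrative.)
CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
INSERT INTO tbl VALUES (1, 42);
INSERT INTO tbl VALUES (1, 84) ON CONFLICT DO UPDATE SET j = EXCLUDED.j;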
These statements finish successfully and leave the table with a single row <i: 1, j: 84>.
Shorthand The INSERT OR REPLACE INTO ... statement is a shorter syntax alternative to INSERT INTO ... DO UPDATE
SET c1 = EXCLUDED.c1, c2 = EXCLUDED.c2, .... That is, it updates every column of the existing row to the new values of
the to‑be‑inserted row. For example, given the following input table:
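(A sketch; the table name is illustrative.)
CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
INSERT INTO tbl VALUES (1, 42);
-- either statement replaces the conflicting row, leaving <i: 1, j: 84>
INSERT OR REPLACE INTO tbl VALUES (1, 84);
INSERT INTO tbl VALUES (1, 84) ON CONFLICT DO UPDATE SET j = EXCLUDED.j;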
Defining a Conflict Target A conflict target may be provided as ON CONFLICT (conflict_target). This is a group of columns that an index or uniqueness/key constraint is defined on. If the conflict target is omitted, the PRIMARY KEY constraint(s) on the table are targeted.
Specifying a conflict target is optional unless using a DO UPDATE and there are multiple unique/primary key constraints on the table.
When a conflict target is provided, you can further filter this with a WHERE clause, which must be met by all conflicts.
INSERT INTO tbl VALUES (1, 40, 700) ON CONFLICT (i) DO UPDATE SET k = 2 * EXCLUDED.k WHERE k < 100;
Multiple Tuples Conflicting on the Same Key Having multiple tuples conflicting on the same key is not supported. For example:
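(A sketch; the table name is illustrative.)
CREATE TABLE tbl (i INTEGER PRIMARY KEY, j INTEGER);
INSERT INTO tbl VALUES (1, 42);
INSERT INTO tbl VALUES (1, 84), (1, 168) ON CONFLICT DO UPDATE SET j = EXCLUDED.j;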
Error: Invalid Input Error: ON CONFLICT DO UPDATE can not update the same row twice in the same command.
Ensure that no rows proposed for insertion within the same command have duplicate constrained values
RETURNING Clause
The RETURNING clause may be used to return the contents of the rows that were inserted. This can be useful if some columns are calculated
upon insert. For example, if the table contains an automatically incrementing primary key, then the RETURNING clause will include the
automatically created primary key. This is also useful in the case of generated columns.
Some or all columns can be explicitly chosen to be returned and they may optionally be renamed using aliases. Arbitrary non‑aggregating
expressions may also be returned instead of simply returning a column. All columns can be returned using the * expression, and columns
or expressions can be returned in addition to all columns returned by the *.
For example:
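(A sketch; the table name t2 is illustrative.)
CREATE TABLE t2 (i INTEGER);
INSERT INTO t2 SELECT 42 RETURNING *;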
┌───────┐
│   i   │
├───────┤
│    42 │
└───────┘
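A second sketch returns a computed expression alongside the inserted columns (the table name t2b is illustrative):
CREATE TABLE t2b (i INTEGER, j INTEGER);
INSERT INTO t2b
    SELECT 2 AS i, 3 AS j
    RETURNING *, i * j AS i_times_j;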
i j i_times_j
2 3 6
The next example shows a situation where the RETURNING clause is more helpful. First, a table is created with a primary key column.
Then a sequence is created to allow for that primary key to be incremented as new rows are inserted. When we insert into the table, we
do not already know the values generated by the sequence, so it is valuable to return them. For additional information, see the CREATE
SEQUENCE page.
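(A sketch consistent with the result below; the table and sequence names are illustrative.)
CREATE SEQUENCE t3_key;
CREATE TABLE t3 (i INTEGER DEFAULT nextval('t3_key'), j INTEGER);
INSERT INTO t3 (j) VALUES (42), (43) RETURNING *;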
i j
1 42
2 43
PIVOT Statement
The PIVOT statement allows distinct values within a column to be separated into their own columns. The values within those new columns
are calculated using an aggregate function on the subset of rows that match each distinct value.
DuckDB implements both the SQL Standard PIVOT syntax and a simplified PIVOT syntax that automatically detects the columns to create
while pivoting. PIVOT_WIDER may also be used in place of the PIVOT keyword.
The full syntax diagram is below, but the simplified PIVOT syntax can be summarized using spreadsheet pivot table naming conventions
as:
PIVOT dataset
ON columns
USING values
GROUP BY rows
ORDER BY columns_with_order_directions
LIMIT number_of_rows ;
The ON, USING, and GROUP BY clauses are each optional, but they may not all be omitted.
Example Data All examples use the dataset produced by the queries below:
CREATE TABLE Cities (Country VARCHAR, Name VARCHAR, Year INT, Population INT);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2000, 1005);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2010, 1065);
INSERT INTO Cities VALUES ('NL', 'Amsterdam', 2020, 1158);
INSERT INTO Cities VALUES ('US', 'Seattle', 2000, 564);
INSERT INTO Cities VALUES ('US', 'Seattle', 2010, 608);
INSERT INTO Cities VALUES ('US', 'Seattle', 2020, 738);
INSERT INTO Cities VALUES ('US', 'New York City', 2000, 8015);
INSERT INTO Cities VALUES ('US', 'New York City', 2010, 8175);
INSERT INTO Cities VALUES ('US', 'New York City', 2020, 8772);
FROM Cities;
PIVOT ON and USING Use the PIVOT statement below to create a separate column for each year and calculate the total population in
each. The ON clause specifies which column(s) to split into separate columns. It is equivalent to the columns parameter in a spreadsheet
pivot table.
The USING clause determines how to aggregate the values that are split into separate columns. This is equivalent to the values parameter
in a spreadsheet pivot table. If the USING clause is not included, it defaults to count(*).
PIVOT Cities
ON Year
USING sum(Population);
In the above example, the sum aggregate is always operating on a single value. If we only want to change the orientation of how the data
is displayed without aggregating, use the first aggregate function. In this example, we are pivoting numeric values, but the first
function works very well for pivoting out a text column. (This is something that is difficult to do in a spreadsheet pivot table, but easy in
DuckDB!)
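(A sketch of the first-based pivot.)
PIVOT Cities
ON Year
USING first(Population);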
PIVOT ON, USING, and GROUP BY By default, the PIVOT statement retains all columns not specified in the ON or USING clauses. To
include only certain columns and further aggregate, specify columns in the GROUP BY clause. This is equivalent to the rows parameter of
a spreadsheet pivot table.
In the below example, the Name column is no longer included in the output, and the data is aggregated up to the Country level.
PIVOT Cities
ON Year
USING sum(Population)
GROUP BY Country;
IN Filter for ON Clause To only create a separate column for specific values within a column in the ON clause, use an optional IN expres‑
sion. Let's say for example that we wanted to forget about the year 2020 for no particular reason...
PIVOT Cities
ON Year IN (2000, 2010)
USING sum(Population)
GROUP BY Country;
Country 2000 2010
NL 1005 1065
US 8579 8783
Multiple Expressions per Clause Multiple columns can be specified in the ON and GROUP BY clauses, and multiple aggregate expres‑
sions can be included in the USING clause.
Multiple ON Columns and ON Expressions Multiple columns can be pivoted out into their own columns. DuckDB will find the distinct
values in each ON clause column and create one new column for all combinations of those values (a cartesian product).
In the below example, all combinations of unique countries and unique cities receive their own column. Some combinations may not be
present in the underlying data, so those columns are populated with NULL values.
PIVOT Cities
ON Country, Name
USING sum(Population);
Year NL_Amsterdam NL_New York City NL_Seattle US_Amsterdam US_New York City US_Seattle
2000 1005 NULL NULL NULL 8015 564
2010 1065 NULL NULL NULL 8175 608
2020 1158 NULL NULL NULL 8772 738
To pivot only the combinations of values that are present in the underlying data, use an expression in the ON clause. Multiple expressions
and/or columns may be provided.
Here, Country and Name are concatenated together and the resulting concatenations each receive their own column. Any arbitrary non‑
aggregating expression may be used. In this case, concatenating with an underscore is used to imitate the naming convention the PIVOT
clause uses when multiple ON columns are provided (like in the prior example).
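(A sketch of pivoting on a concatenation expression.)
PIVOT Cities
ON Country || '_' || Name
USING sum(Population);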
Multiple USING Expressions An alias may also be included for each expression in the USING clause. It will be appended to the generated
column names after an underscore (_). This makes the column naming convention much cleaner when multiple expressions are included
in the USING clause.
In this example, both the sum and max of the Population column are calculated for each year and are split into separate columns.
PIVOT Cities
ON Year
USING sum(Population) AS total, max(Population) AS max
GROUP BY Country;
Multiple GROUP BY Columns Multiple GROUP BY columns may also be provided. Note that column names must be used rather than
column positions (1, 2, etc.), and that expressions are not supported in the GROUP BY clause.
PIVOT Cities
ON Year
USING sum(Population)
GROUP BY Country, Name;
Using PIVOT within a SELECT Statement The PIVOT statement may be included within a SELECT statement as a CTE (a Common
Table Expression, or WITH clause), or a subquery. This allows for a PIVOT to be used alongside other SQL logic, as well as for multiple
PIVOTs to be used in one query.
No SELECT is needed within the CTE; the PIVOT keyword can be thought of as taking its place.
WITH pivot_alias AS (
PIVOT Cities
ON Year
USING sum(Population)
GROUP BY Country
)
SELECT * FROM pivot_alias;
A PIVOT may be used in a subquery and must be wrapped in parentheses. Note that this behavior is different than the SQL Standard Pivot,
as illustrated in subsequent examples.
SELECT *
FROM (
PIVOT Cities
ON Year
USING sum(Population)
GROUP BY Country
) pivot_alias;
Multiple PIVOT Statements Each PIVOT can be treated as if it were a SELECT node, so they can be joined together or manipulated in
other ways.
For example, if two PIVOT statements share the same GROUP BY expression, they can be joined together using the columns in the GROUP
BY clause into a wider pivot.
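(A sketch of joining two pivots on their shared GROUP BY column.)
SELECT *
FROM (PIVOT Cities ON Year USING sum(Population) GROUP BY Country) year_pivot
JOIN (PIVOT Cities ON Name USING sum(Population) GROUP BY Country) name_pivot
USING (Country);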
Internals
Pivoting is implemented as a combination of SQL query re‑writing and a dedicated PhysicalPivot operator for higher performance.
Each PIVOT is implemented as a set of aggregations into lists and then the dedicated PhysicalPivot operator converts those lists into
column names and values. Additional pre‑processing steps are required if the columns to be created when pivoting are detected dynami‑
cally (which occurs when the IN clause is not in use).
DuckDB, like most SQL engines, requires that all column names and types be known at the start of a query. In order to automatically detect
the columns that should be created as a result of a PIVOT statement, it must be translated into multiple queries. ENUM types are used to
find the distinct values that should become columns. Each ENUM is then injected into one of the PIVOT statement's IN clauses.
After the IN clauses have been populated with ENUMs, the query is re‑written again into a set of aggregations into lists.
For example:
PIVOT Cities
ON Year
USING sum(Population);
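A sketch of the intermediate list-aggregation rewrite (the exact generated query is internal; this only approximates its shape):
SELECT Country, Name, list(Year), list(population_sum)
FROM (
    SELECT Country, Name, Year, sum(Population) AS population_sum
    FROM Cities
    GROUP BY ALL
)
GROUP BY ALL;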
The PhysicalPivot operator converts those lists into column names and values to return the pivoted result.
The full syntax diagram is below, but the SQL Standard PIVOT syntax can be summarized as:
FROM dataset
PIVOT (
values
FOR
column_1 IN ( in_list )
column_2 IN ( in_list )
...
GROUP BY rows
);
Unlike the simplified syntax, the IN clause must be specified for each column to be pivoted. If you are interested in dynamic pivoting, the
simplified syntax is recommended.
Note that no commas separate the expressions in the FOR clause, but that value and GROUP BY expressions must be comma‑
separated!
Examples
This example uses a single value expression, a single column expression, and a single row expression:
FROM Cities
PIVOT (
sum(Population)
FOR
Year IN (2000, 2010, 2020)
GROUP BY Country
);
This example is somewhat contrived, but serves as an example of using multiple value expressions and multiple columns in the FOR
clause.
FROM Cities
PIVOT (
sum(Population) AS total,
count(Population) AS count
FOR
Year IN (2000, 2010)
Country IN ('NL', 'US')
);
SQL Standard PIVOT Full Syntax Diagram Below is the full syntax diagram of the SQL Standard version of the PIVOT statement.
Profiling Queries
DuckDB supports profiling queries via the EXPLAIN and EXPLAIN ANALYZE statements.
EXPLAIN
EXPLAIN query ;
The output of EXPLAIN contains the estimated cardinalities for each operator.
EXPLAIN ANALYZE
The EXPLAIN ANALYZE statement runs the query, and shows the actual cardinalities for each operator, as well as the cumulative wall‑
clock time spent in each operator.
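For example (a sketch):
EXPLAIN ANALYZE SELECT sum(i) FROM range(1000) t(i);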
SELECT Statement
Examples
-- select all columns from the table "tbl"
SELECT * FROM tbl;
-- select the column "j" from the rows of tbl where "i" equals 3
SELECT j FROM tbl WHERE i = 3;
-- perform an aggregate grouped by the column "i"
SELECT i, sum(j) FROM tbl GROUP BY i;
-- select only the top 3 rows from the tbl
SELECT * FROM tbl ORDER BY i DESC LIMIT 3;
-- join two tables together using the USING clause
SELECT * FROM t1 JOIN t2 USING (a, b);
-- use column indexes to select the first and third column from the table "tbl"
SELECT #1, #3 FROM tbl;
-- select all unique cities from the addresses table
SELECT DISTINCT city FROM addresses;
Syntax The SELECT statement retrieves rows from the database. The canonical order of a select statement is as follows, with less com‑
mon clauses being indented:
SELECT select_list
FROM tables
USING SAMPLE sample_expr
WHERE condition
GROUP BY groups
HAVING group_filter
WINDOW window_expr
QUALIFY qualify_filter
ORDER BY order_expr
LIMIT n;
As the SELECT statement is so complex, we have split up the syntax diagrams into several parts. The full syntax diagram can be found at
the bottom of the page.
SELECT Clause
The SELECT clause specifies the list of columns that will be returned by the query. While it appears first in the query, logically the expressions here are executed only at the end. The SELECT clause can contain arbitrary expressions that transform the output, as well as aggregates and window functions. The DISTINCT keyword ensures that only unique tuples are returned.
Note. Column names are case‑insensitive. See the Rules for Case Sensitivity for more details.
FROM Clause
The FROM clause specifies the source of the data on which the remainder of the query should operate. Logically, the FROM clause is where
the query starts execution. The FROM clause can contain a single table, a combination of multiple tables that are joined together, or another
SELECT query inside a subquery node.
SAMPLE Clause
The SAMPLE clause allows you to run the query on a sample from the base table. This can significantly speed up processing of queries,
at the expense of accuracy in the result. Samples can also be used to quickly see a snapshot of the data when exploring a data set. The
sample clause is applied right after anything in the FROM clause (i.e., after any joins, but before the WHERE clause or any aggregates). See the sample page for more information.
WHERE Clause
The WHERE clause specifies any filters to apply to the data. This allows you to select only a subset of the data in which you are interested.
Logically the WHERE clause is applied immediately after the FROM clause.
GROUP BY Clause
The GROUP BY clause specifies which grouping columns should be used to perform any aggregations in the SELECT clause. If the GROUP
BY clause is specified, the query is always an aggregate query, even if no aggregations are present in the SELECT clause.
WINDOW Clause
The WINDOW clause allows you to specify named windows that can be used within window functions. These are useful when you have
multiple window functions, as they allow you to avoid repeating the same window clause.
QUALIFY Clause
The QUALIFY clause filters the results of window functions, similarly to how the HAVING clause filters the results of aggregates.
ORDER BY and LIMIT Clauses
ORDER BY and LIMIT are output modifiers. Logically they are applied at the very end of the query. The LIMIT clause restricts the amount of rows fetched, and the ORDER BY clause sorts the rows on the sorting criteria in either ascending or descending order.
VALUES List
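A VALUES list supplies literal rows directly, either for an INSERT statement or as a standalone relation. A minimal sketch:
SELECT * FROM (VALUES (1, 'one'), (2, 'two')) t(i, s);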
Row IDs
For each table, the rowid pseudocolumn returns the row identifiers based on the physical storage.
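(A sketch consistent with the result below; the table name is illustrative.)
CREATE TABLE t (id INTEGER, content VARCHAR);
INSERT INTO t VALUES (42, 'hello'), (43, 'world');
SELECT rowid, id, content FROM t;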
┌───────┬────┬─────────┐
│ rowid │ id │ content │
├───────┼────┼─────────┤
│ 0 │ 42 │ hello │
│ 1 │ 43 │ world │
└───────┴────┴─────────┘
In the current storage, these identifiers are contiguous unsigned integers (0, 1, ...) if no rows were deleted. Deletions introduce gaps in the
rowids which may be reclaimed later. Therefore, it is strongly recommended not to use rowids as identifiers.
Note. If there is a user‑defined column named rowid, it shadows the rowid pseudocolumn.
SET/RESET Statements
The SET statement modifies the provided DuckDB configuration option at the specified scope.
Examples
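(A sketch of common settings.)
-- set the memory limit of the system to 10 GB
SET memory_limit = '10GB';
-- configure the system to use 1 thread
SET threads TO 1;
-- reset the memory limit to its default value
RESET memory_limit;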
Syntax
RESET
The RESET statement changes the given DuckDB configuration option to the default value.
Scopes
• GLOBAL: Configuration value is used (or reset) across the entire DuckDB instance.
• SESSION: Configuration value is used (or reset) only for the current session attached to a DuckDB instance.
• LOCAL: Not yet implemented.
When not specified, the default scope for the configuration option is used. For most options this is GLOBAL.
Configuration
See the Configuration page for the full list of configuration options.
Transaction Management
DuckDB supports ACID database transactions. Transactions provide isolation, i.e., changes made by a transaction are not visible from
concurrent transactions until it is committed. A transaction can also be aborted, which discards any changes it made so far.
Statements
Starting a Transaction To start a transaction, run:
BEGIN TRANSACTION;
Committing a Transaction You can commit a transaction to make it visible to other transactions and to write it to persistent storage (if
using DuckDB in persistent mode). To commit a transaction, run:
COMMIT;
If you are not in an active transaction, the COMMIT statement will fail.
Rolling Back a Transaction You can abort a transaction. This operation, also known as rolling back, will discard any changes the trans‑
action made to the database. To abort a transaction, run:
ROLLBACK;
You can also use the ABORT statement, which has identical behavior:
ABORT;
If you are not in an active transaction, the ROLLBACK and ABORT statements will fail.
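The following example assumes a simple person table (a sketch):
CREATE TABLE person (name VARCHAR, age BIGINT);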
BEGIN TRANSACTION;
INSERT INTO person VALUES ('Ada', 52);
COMMIT;
BEGIN TRANSACTION;
DELETE FROM person WHERE name = 'Ada';
INSERT INTO person VALUES ('Bruce', 39);
ROLLBACK;
The first transaction (inserting 'Ada') was committed but the second (deleting 'Ada' and inserting 'Bruce') was aborted. Therefore, the resulting table will only contain <'Ada', 52>.
UNPIVOT Statement
The UNPIVOT statement allows multiple columns to be stacked into fewer columns. In the basic case, multiple columns are stacked into
two columns: a NAME column (which contains the name of the source column) and a VALUE column (which contains the value from the
source column).
DuckDB implements both the SQL Standard UNPIVOT syntax and a simplified UNPIVOT syntax. Both can utilize a COLUMNS expression
to automatically detect the columns to unpivot. PIVOT_LONGER may also be used in place of the UNPIVOT keyword.
The full syntax diagram is below, but the simplified UNPIVOT syntax can be summarized using spreadsheet pivot table naming conventions
as:
UNPIVOT dataset
ON column(s)
INTO
NAME name-column-name
VALUE value-column-name(s)
ORDER BY column(s)-with-order-direction(s)
LIMIT number-of-rows ;
Example Data All examples use the dataset produced by the queries below:
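(A sketch consistent with the rows shown below.)
CREATE OR REPLACE TABLE monthly_sales
    (empid INTEGER, dept TEXT, jan INTEGER, feb INTEGER, mar INTEGER, apr INTEGER, may INTEGER, jun INTEGER);
INSERT INTO monthly_sales VALUES
    (1, 'electronics', 1, 2, 3, 4, 5, 6),
    (2, 'clothes', 10, 20, 30, 40, 50, 60),
    (3, 'cars', 100, 200, 300, 400, 500, 600);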
FROM monthly_sales;
empid dept jan feb mar apr may jun
1 electronics 1 2 3 4 5 6
2 clothes 10 20 30 40 50 60
3 cars 100 200 300 400 500 600
UNPIVOT Manually The most typical UNPIVOT transformation is to take already pivoted data and re‑stack it into a column each for the
name and value. In this case, all months will be stacked into a month column and a sales column.
UNPIVOT monthly_sales
ON jan, feb, mar, apr, may, jun
INTO
NAME month
VALUE sales;
empid dept month sales
1 electronics Jan 1
1 electronics Feb 2
1 electronics Mar 3
1 electronics Apr 4
1 electronics May 5
1 electronics Jun 6
2 clothes Jan 10
2 clothes Feb 20
2 clothes Mar 30
2 clothes Apr 40
2 clothes May 50
2 clothes Jun 60
3 cars Jan 100
3 cars Feb 200
3 cars Mar 300
3 cars Apr 400
3 cars May 500
3 cars Jun 600
UNPIVOT Dynamically Using Columns Expression In many cases, the number of columns to unpivot is not easy to predetermine ahead
of time. In the case of this dataset, the query above would have to change each time a new month is added. The COLUMNS expression can
be used to select all columns that are not empid or dept. This enables dynamic unpivoting that will work regardless of how many months
are added. The query below returns identical results to the one above.
UNPIVOT monthly_sales
ON COLUMNS(* EXCLUDE (empid, dept))
INTO
NAME month
VALUE sales;
empid dept month sales
1 electronics Jan 1
1 electronics Feb 2
1 electronics Mar 3
1 electronics Apr 4
1 electronics May 5
1 electronics Jun 6
2 clothes Jan 10
2 clothes Feb 20
2 clothes Mar 30
2 clothes Apr 40
2 clothes May 50
2 clothes Jun 60
3 cars Jan 100
3 cars Feb 200
3 cars Mar 300
3 cars Apr 400
3 cars May 500
3 cars Jun 600
UNPIVOT into Multiple Value Columns The UNPIVOT statement has additional flexibility: more than 2 destination columns are sup‑
ported. This can be useful when the goal is to reduce the extent to which a dataset is pivoted, but not completely stack all pivoted columns.
To demonstrate this, the query below will generate a dataset with a separate column for the number of each month within the quarter
(month 1, 2, or 3), and a separate row for each quarter. Since there are fewer quarters than months, this does make the dataset longer, but
not as long as the above.
To accomplish this, multiple sets of columns are included in the ON clause. The q1 and q2 aliases are optional. The number of columns in
each set of columns in the ON clause must match the number of columns in the VALUE clause.
UNPIVOT monthly_sales
ON (jan, feb, mar) AS q1, (apr, may, jun) AS q2
INTO
NAME quarter
VALUE month_1_sales, month_2_sales, month_3_sales;
empid dept quarter month_1_sales month_2_sales month_3_sales
1 electronics q1 1 2 3
1 electronics q2 4 5 6
2 clothes q1 10 20 30
2 clothes q2 40 50 60
3 cars q1 100 200 300
3 cars q2 400 500 600
Using UNPIVOT within a SELECT Statement The UNPIVOT statement may be included within a SELECT statement as a CTE (a Com‑
mon Table Expression, or WITH clause), or a subquery. This allows for an UNPIVOT to be used alongside other SQL logic, as well as for
multiple UNPIVOTs to be used in one query.
No SELECT is needed within the CTE; the UNPIVOT keyword can be thought of as taking its place.
WITH unpivot_alias AS (
UNPIVOT monthly_sales
ON COLUMNS(* EXCLUDE (empid, dept))
INTO
NAME month
VALUE sales
)
SELECT * FROM unpivot_alias;
An UNPIVOT may be used in a subquery and must be wrapped in parentheses. Note that this behavior is different than the SQL Standard
Unpivot, as illustrated in subsequent examples.
SELECT *
FROM (
UNPIVOT monthly_sales
ON COLUMNS(* EXCLUDE (empid, dept))
INTO
NAME month
VALUE sales
) unpivot_alias;
Expressions within UNPIVOT Statements DuckDB allows expressions within the UNPIVOT statements, provided that they only involve
a single column. These can be used to perform computations as well as explicit casts. For example:
UNPIVOT
(SELECT 42 as col1, 'woot' as col2)
ON
(col1 * 2)::VARCHAR,
col2;
┌─────────┬─────────┐
│ name │ value │
│ varchar │ varchar │
├─────────┼─────────┤
│ col1 │ 84 │
│ col2 │ woot │
└─────────┴─────────┘
Internals Unpivoting is implemented entirely as rewrites into SQL queries. Each UNPIVOT is implemented as a set of unnest functions,
operating on a list of the column names and a list of the column values. If dynamically unpivoting, the COLUMNS expression is evaluated
first to calculate the column list.
For example:
UNPIVOT monthly_sales
ON jan, feb, mar, apr, may, jun
INTO
NAME month
VALUE sales;
is translated into:
SELECT
empid,
dept,
unnest(['jan', 'feb', 'mar', 'apr', 'may', 'jun']) AS month,
unnest(["jan", "feb", "mar", "apr", "may", "jun"]) AS sales
FROM monthly_sales;
Note the single quotes to build a list of text strings to populate month, and the double quotes to pull the column values for use in sales.
This produces the same result as the initial example:
empid dept month sales
1 electronics jan 1
1 electronics feb 2
1 electronics mar 3
1 electronics apr 4
1 electronics may 5
1 electronics jun 6
2 clothes jan 10
2 clothes feb 20
2 clothes mar 30
2 clothes apr 40
2 clothes may 50
2 clothes jun 60
3 cars jan 100
3 cars feb 200
3 cars mar 300
3 cars apr 400
3 cars may 500
3 cars jun 600
Simplified UNPIVOT Full Syntax Diagram Below is the full syntax diagram of the UNPIVOT statement.
The full syntax diagram is below, but the SQL Standard UNPIVOT syntax can be summarized as:
FROM [dataset]
UNPIVOT [INCLUDE NULLS] (
[value-column-name(s)]
FOR [name-column-name] IN [column(s)]
);
Note that only one column can be included in the name-column-name expression.
SQL Standard UNPIVOT Manually To complete the basic UNPIVOT operation using the SQL standard syntax, only a few additions are
needed.
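A sketch of the standard-syntax query, matching the simplified example above:
FROM monthly_sales UNPIVOT (
    sales
    FOR month IN (jan, feb, mar, apr, may, jun)
);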
empid dept month sales
1 electronics Jan 1
1 electronics Feb 2
1 electronics Mar 3
1 electronics Apr 4
1 electronics May 5
1 electronics Jun 6
2 clothes Jan 10
2 clothes Feb 20
2 clothes Mar 30
2 clothes Apr 40
2 clothes May 50
2 clothes Jun 60
3 cars Jan 100
3 cars Feb 200
3 cars Mar 300
3 cars Apr 400
3 cars May 500
3 cars Jun 600
SQL Standard UNPIVOT Dynamically Using the COLUMNS Expression The COLUMNS expression can be used to determine the IN list
of columns dynamically. This will continue to work even if additional month columns are added to the dataset. It produces the same result
as the query above.
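A sketch of the dynamic variant:
FROM monthly_sales UNPIVOT (
    sales
    FOR month IN (columns(* EXCLUDE (empid, dept)))
);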
SQL Standard UNPIVOT into Multiple Value Columns The UNPIVOT statement has additional flexibility: more than 2 destination
columns are supported. This can be useful when the goal is to reduce the extent to which a dataset is pivoted, but not completely stack
all pivoted columns. To demonstrate this, the query below will generate a dataset with a separate column for the number of each month
within the quarter (month 1, 2, or 3), and a separate row for each quarter. Since there are fewer quarters than months, this does make the
dataset longer, but not as long as the above.
To accomplish this, multiple columns are included in the value-column-name portion of the UNPIVOT statement. Multiple sets of
columns are included in the IN clause. The q1 and q2 aliases are optional. The number of columns in each set of columns in the IN clause
must match the number of columns in the value-column-name portion.
FROM monthly_sales
UNPIVOT (
(month_1_sales, month_2_sales, month_3_sales)
FOR quarter IN (
(jan, feb, mar) AS q1,
(apr, may, jun) AS q2
)
);
empid dept quarter month_1_sales month_2_sales month_3_sales
1 electronics q1 1 2 3
1 electronics q2 4 5 6
2 clothes q1 10 20 30
2 clothes q2 40 50 60
3 cars q1 100 200 300
3 cars q2 400 500 600
SQL Standard UNPIVOT Full Syntax Diagram Below is the full syntax diagram of the SQL Standard version of the UNPIVOT state‑
ment.
UPDATE Statement
Examples
-- for every row where "i" is NULL, set the value to 0 instead
UPDATE tbl
SET i = 0
WHERE i IS NULL;
Syntax
UPDATE changes the values of the specified columns in all rows that satisfy the condition. Only the columns to be modified need be
mentioned in the SET clause; columns not explicitly modified retain their previous values.
A table can be updated based upon values from another table. This can be done by specifying a table in a FROM clause, or using a sub‑select
statement. Both approaches have the benefit of completing the UPDATE operation in bulk for increased performance.
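The example below assumes the following setup (a sketch consistent with the results shown):
CREATE OR REPLACE TABLE original AS
    SELECT 1 AS key, 'original value' AS value
    UNION ALL
    SELECT 2 AS key, 'original value 2' AS value;
CREATE OR REPLACE TABLE new AS
    SELECT 1 AS key, 'new value' AS value
    UNION ALL
    SELECT 2 AS key, 'new value 2' AS value;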
SELECT *
FROM original;
key value
1 original value
2 original value 2
UPDATE original
SET value = new.value
FROM new
WHERE original.key = new.key;
-- OR
UPDATE original
SET value = (
SELECT
new.value
FROM new
WHERE original.key = new.key
);
SELECT *
FROM original;
key value
1 new value
2 new value 2
Update from Same Table
The only difference between this case and the above is that a different table alias must be specified on both the target table and the source table. In this example, AS true_original and AS new are both required.
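(A sketch; the appended string is illustrative.)
UPDATE original AS true_original
SET value = (
    SELECT new.value || ' a change!' AS value
    FROM original AS new
    WHERE true_original.key = new.key
);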
Update with Joins
To select the rows to update, UPDATE statements can use the FROM clause and express joins via the WHERE clause. For example, to increase the revenue of all cities in France, join the city and the country tables, and filter on the latter:
UPDATE city
SET revenue = revenue + 100
FROM country
WHERE city.country_code = country.code
AND country.name = 'France';
SELECT *
FROM city;
┌──────────┬─────────┬──────────────┐
│ name │ revenue │ country_code │
│ varchar │ int64 │ varchar │
├──────────┼─────────┼──────────────┤
│ Paris │ 800 │ FR │
│ Lyon │ 300 │ FR │
│ Brussels │ 400 │ BE │
└──────────┴─────────┴──────────────┘
USE Statement
The USE statement selects a database and optional schema to use as the default.
Examples
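(A sketch; duck is an assumed attached database.)
-- set the default database
USE memory;
-- set the default database and schema
USE duck.main;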
Syntax
The USE statement sets a default database or database/schema combination to use for future operations. For instance, tables created
without providing a fully qualified table name will be created in the default database.
VACUUM Statement
The VACUUM statement alone does nothing and is at present provided for PostgreSQL‑compatibility. The VACUUM ANALYZE statement
recomputes table statistics if they have become stale due to table updates or deletions.
Examples
-- No-op
VACUUM;
-- Rebuild database statistics
VACUUM ANALYZE;
-- Rebuild statistics for the table and column
VACUUM ANALYZE memory.main.my_table(my_column);
-- Not supported
VACUUM FULL; -- error
Reclaiming Space
Syntax
Query Syntax
SELECT Clause
The SELECT clause specifies the list of columns that will be returned by the query. While it appears first in the query, logically the expressions here are executed only at the end. The SELECT clause can contain arbitrary expressions that transform the output, as well as aggregates and window functions.
Examples
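(A sketch; the addresses table is illustrative.)
-- select all unique cities from the addresses table
SELECT DISTINCT city FROM addresses;
-- return the total number of rows in the addresses table
SELECT count(*) FROM addresses;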
Syntax
SELECT List
The SELECT clause contains a list of expressions that specify the result of a query. The select list can refer to any columns in the FROM
clause, and combine them using expressions. As the output of a SQL query is a table ‑ every expression in the SELECT clause also has a
name. The expressions can be explicitly named using the AS clause (e.g., expr AS name). If a name is not provided by the user the
expressions are named automatically by the system.
Note. Column names are case‑insensitive. See the Rules for Case Sensitivity for more details.
Star Expressions
-- select all columns from the table called "table_name"
SELECT *
FROM table_name;
-- select all columns matching the given regex from the table
SELECT COLUMNS('number\d+')
FROM addresses;
The star expression is a special expression that expands to multiple expressions based on the contents of the FROM clause. In the simplest
case, * expands to all expressions in the FROM clause. Columns can also be selected using regular expressions or lambda functions. See
the star expression page for more details.
DISTINCT Clause
-- select all unique cities from the addresses table
SELECT DISTINCT city
FROM addresses;
The DISTINCT clause can be used to return only the unique rows in the result ‑ so that any duplicate rows are filtered out.
Note. Queries starting with SELECT DISTINCT run deduplication, which is an expensive operation. Therefore, only use DISTINCT if necessary.
DISTINCT ON Clause
-- select only the highest population city for each country
SELECT DISTINCT ON(country) city, population
FROM cities
ORDER BY population DESC;
The DISTINCT ON clause returns only one row per unique value in the set of expressions as defined in the ON clause. If an ORDER BY
clause is present, the row that is returned is the first row that is encountered as per the ORDER BY criteria. If an ORDER BY clause is not
present, the first row that is encountered is not defined and can be any row in the table.
Note. When querying large data sets, using DISTINCT on all columns can be expensive. Therefore, consider using DISTINCT ON on a column (or a set of columns) which guarantees a sufficient degree of uniqueness for your results. For example, using DISTINCT ON on the key column(s) of a table guarantees full uniqueness.
Aggregates
Aggregate functions are special functions that combine multiple rows into a single value. When aggregate functions are present in the
SELECT clause, the query is turned into an aggregate query. In an aggregate query, all expressions must either be part of an aggregate
function, or part of a group (as specified by the GROUP BY clause).
Window Functions
Window functions are special functions that allow the computation of values relative to other rows in a result. Window functions are marked
by the OVER clause which contains the window specification. The window specification defines the frame or context in which the window
function is computed. See the window functions page for more information.
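For example (a sketch; the table and column names are illustrative):
SELECT i, row_number() OVER (ORDER BY i) AS rn
FROM tbl;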
unnest Function
The unnest function is a special function that can be used together with arrays, lists, or structs. The unnest function strips one level of
nesting from the type. For example, INT[] is transformed into INT. STRUCT(a INT, b INT) is transformed into a INT, b INT.
The unnest function can be used to transform nested types into regular scalar types, which makes them easier to operate on.
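For example (a sketch):
-- unnest a list, generating one row per element
SELECT unnest([1, 2, 3]) AS x;
-- unnest a struct, generating one column per field
SELECT unnest({'a': 42, 'b': 84});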
FROM Clause
The FROM clause specifies the source of the data on which the remainder of the query should operate. Logically, the FROM clause is where
the query starts execution. The FROM clause can contain a single table, a combination of multiple tables that are joined together using
JOIN clauses, or another SELECT query inside a subquery node. DuckDB also has an optional FROM‑first syntax which enables you to
also query without a SELECT statement.
Examples
-- select all columns using the FROM-first syntax and omitting the SELECT clause
FROM table_name;
-- select all columns from the table called "table_name" through an alias "tn"
SELECT tn.* FROM table_name tn;
-- select all columns from the table "table_name" in the schema "schema_name"
SELECT * FROM schema_name.table_name;
-- select the column "i" from the table function "range",
-- where the first column of the range function is renamed to "i"
SELECT t.i FROM range(100) AS t(i);
-- select all columns from the CSV file called "test.csv"
SELECT * FROM 'test.csv';
-- select all columns from a subquery
SELECT * FROM (SELECT * FROM table_name);
-- select the entire row of the table as a struct
SELECT t FROM t;
-- select the entire row of the subquery as a struct (i.e., a single column)
SELECT t FROM (SELECT unnest(generate_series(41, 43)) AS x, 'hello' AS y) t;
-- join two tables together
SELECT * FROM table_name JOIN other_table ON (table_name.key = other_table.key);
-- select a 10% sample from a table
SELECT * FROM table_name TABLESAMPLE 10%;
-- select a sample of 10 rows from a table
SELECT * FROM table_name TABLESAMPLE 10 ROWS;
-- use the FROM-first syntax with WHERE clause and aggregation
FROM range(100) AS t(i) SELECT sum(t.i) WHERE i % 2 = 0;
Joins
Joins are a fundamental relational operation used to connect two tables or relations horizontally. The relations are referred to as the left
and right sides of the join based on how they are written in the join clause. Each result row has the columns from both relations.
A join uses a rule to match pairs of rows from each relation. Often this is a predicate, but there are other implied rules that may be speci‑
fied.
Outer Joins Rows that do not have any matches can still be returned if an OUTER join is specified. Outer joins can be one of:
• LEFT (All rows from the left relation appear at least once)
• RIGHT (All rows from the right relation appear at least once)
• FULL (All rows from both relations appear at least once)
A join that is not OUTER is INNER (only rows that get paired are returned).
When an unpaired row is returned, the attributes from the other table are set to NULL.
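For example (a sketch; t1, t2, and key are illustrative):
-- keep all rows of t1, padding t2's columns with NULL where there is no match
SELECT * FROM t1 LEFT OUTER JOIN t2 USING (key);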
Cross Product Joins The simplest type of join is a CROSS JOIN. There are no conditions for this type of join, and it just returns all the
possible pairs.
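For example (a sketch):
-- return all pairs of rows from t1 and t2
SELECT * FROM t1 CROSS JOIN t2;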
Conditional Joins Most joins are specified by a predicate that connects attributes from one side to attributes from the other side. The
conditions can be explicitly specified using an ON clause with the join (clearer) or implied by the WHERE clause (old‑fashioned).
We use the l_regions and the l_nations tables from the TPC‑H schema:
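(A sketch; the column names n_regionkey and r_regionkey are assumptions following the usual TPC-H layout.)
SELECT n.n_name, r.r_name
FROM l_nations n
JOIN l_regions r ON (n.n_regionkey = r.r_regionkey);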
If the column names are the same and are required to be equal, then the simpler USING syntax can be used:
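(A sketch; t1, t2, and the shared column key are illustrative.)
SELECT *
FROM t1 JOIN t2 USING (key);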
-- return the pairs of jobs where one ran longer but cost less
SELECT s1.t_id, s2.t_id
FROM west s1, west s2
WHERE s1.time > s2.time
AND s1.cost < s2.cost;
Semi and Anti Joins Semi joins return rows from the left table that have at least one match in the right table. Anti joins return rows from
the left table that have no matches in the right table. When using a semi or anti join the result will never have more rows than the left hand
side table. Semi and anti joins provide the same logic as (NOT) IN statements.
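For example (a sketch; t1, t2, and key are illustrative):
-- return the rows of t1 that have at least one match in t2
SELECT * FROM t1 SEMI JOIN t2 ON (t1.key = t2.key);
-- return the rows of t1 that have no match in t2
SELECT * FROM t1 ANTI JOIN t2 ON (t1.key = t2.key);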
Lateral Joins The LATERAL keyword allows subqueries in the FROM clause to refer to previous subqueries. This feature is also known
as a lateral join.
SELECT *
FROM range(3) t(i), LATERAL (SELECT i + 1) t2(j);
┌───────┬───────┐
│ i │ j │
│ int64 │ int64 │
├───────┼───────┤
│ 0 │ 1 │
│ 1 │ 2 │
│ 2 │ 3 │
└───────┴───────┘
Lateral joins are a generalization of correlated subqueries, as they can return multiple values per input value rather than only a single
value.
SELECT *
FROM
generate_series(0, 1) t(i),
LATERAL (SELECT i + 10 UNION ALL SELECT i + 100) t2(j);
┌───────┬───────┐
│ i │ j │
│ int64 │ int64 │
├───────┼───────┤
│ 0 │ 10 │
│ 1 │ 11 │
│ 0 │ 100 │
│ 1 │ 101 │
└───────┴───────┘
It may be helpful to think about LATERAL as a loop where we iterate through the rows of the first subquery and use it as input to the second
(LATERAL) subquery. In the examples above, we iterate through table t and refer to its column i from the definition of table t2. The rows
of t2 form column j in the result.
It is possible to refer to multiple attributes from the LATERAL subquery. Using the table from the first example:
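(A sketch that materializes the first example's result as t1.)
CREATE TABLE t1 AS
    SELECT * FROM range(3) t(i), LATERAL (SELECT i + 1) t2(j);
SELECT *
FROM t1, LATERAL (SELECT i + j) t2(k)
ORDER BY ALL;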
┌───────┬───────┬───────┐
│ i │ j │ k │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│ 0 │ 1 │ 1 │
│ 1 │ 2 │ 3 │
│ 2 │ 3 │ 5 │
└───────┴───────┴───────┘
Note. DuckDB detects when LATERAL joins should be used, making the use of the LATERAL keyword optional.
Positional Joins When working with data frames or other embedded tables of the same size, the rows may have a natural correspondence
based on their physical order. In scripting languages, this is easily expressed using a loop over the row indices.
It is difficult to express this in standard SQL because relational tables are not ordered, but imported tables (like data frames) or disk files
(like CSVs or Parquet files) do have a natural ordering.
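DuckDB offers the POSITIONAL JOIN to connect such tables by row order; a sketch (df1 and df2 are assumed names):
SELECT df1.*, df2.*
FROM df1 POSITIONAL JOIN df2;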
As‑Of Joins A common operation when working with temporal or similarly‑ordered data is to find the nearest (first) event in a reference
table (such as prices). This is called an as‑of join:
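(A sketch; the trades and prices tables with symbol, when, and price columns follow the convention used below.)
SELECT t.*, p.price
FROM trades t
ASOF JOIN prices p
    ON t.symbol = p.symbol
    AND t."when" >= p."when";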
The ASOF join requires at least one inequality condition on the ordering field. The inequality can be any inequality condition (>=, >, <=,
<) on any data type, but the most common form is >= on a temporal type. Any other conditions must be equalities (or NOT DISTINCT).
This means that the left/right order of the tables is significant.
ASOF joins each left side row with at most one right side row. It can be specified as an OUTER join to find unpaired rows (e.g., trades without
prices or prices which have no trades.)
ASOF joins can also specify join conditions on matching column names with the USING syntax, but the last attribute in the list must be the
inequality, which will be greater than or equal to (>=):
SELECT *
FROM trades t
ASOF JOIN prices p USING (symbol, "when");
-- Returns symbol, trades.when, price (but NOT prices.when)
If you combine USING with a SELECT * like this, the query will return the left side (probe) column values for the matches, not the right
side (build) column values. To get the prices times in the example, you will need to list the columns explicitly:
Syntax
WHERE Clause
The WHERE clause specifies any filters to apply to the data. This allows you to select only a subset of the data in which you are interested.
Logically the WHERE clause is applied immediately after the FROM clause.
Examples
-- select all rows that match the given case-insensitive LIKE expression
SELECT *
FROM table_name
WHERE name ILIKE '%mark%';
Syntax
GROUP BY Clause
The GROUP BY clause specifies which grouping columns should be used to perform any aggregations in the SELECT clause. If the GROUP
BY clause is specified, the query is always an aggregate query, even if no aggregations are present in the SELECT clause.
When a GROUP BY clause is specified, all tuples that have matching data in the grouping columns (i.e., all tuples that belong to the same
group) will be combined. The values of the grouping columns themselves are unchanged, and any other columns can be combined using
an aggregate function (such as count, sum, avg, etc).
GROUP BY ALL
Use GROUP BY ALL to GROUP BY all columns in the SELECT statement that are not wrapped in aggregate functions. This simplifies the
syntax by allowing the columns list to be maintained in a single location, and prevents bugs by keeping the SELECT granularity aligned to
the GROUP BY granularity (e.g., it prevents duplication). See examples below and additional examples in the Friendlier SQL with DuckDB
blog post.
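For example (a sketch; the addresses table is illustrative):
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY ALL;
-- equivalent to: GROUP BY city, street_name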
Multiple Dimensions
Normally, the GROUP BY clause groups along a single dimension. Using the GROUPING SETS, CUBE or ROLLUP clauses it is possible to
group along multiple dimensions. See the GROUPING SETS page for more information.
Examples
-- count the number of entries in the "addresses" table that belong to each different city
SELECT city, count(*)
FROM addresses
GROUP BY city;
Syntax
GROUPING SETS
GROUPING SETS, ROLLUP and CUBE can be used in the GROUP BY clause to perform a grouping over multiple dimensions within the
same query. Note that this syntax is not compatible with GROUP BY ALL.
Examples
-- compute the average income along the provided four different dimensions
-- () signifies the empty set (i.e., computing an ungrouped aggregate)
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY GROUPING SETS ((city, street_name), (city), (street_name), ());
-- compute the average income along the dimensions (city, street_name), (city) and ()
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY ROLLUP (city, street_name);
Description
GROUPING SETS perform the same aggregate across different GROUP BY clauses in a single query.
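For example, assuming a students table with course and type columns (a sketch consistent with the result below):
SELECT course, type, count(*)
FROM students
GROUP BY GROUPING SETS ((course, type), course, type, ());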
┌────────┬──────────┬──────────────┐
│ course │ type │ count_star() │
├────────┼──────────┼──────────────┤
│ CS │ Bachelor │ 2 │
│ CS │ PhD │ 1 │
│ Math │ Masters │ 1 │
│ CS │ NULL │ 2 │
│ Math │ NULL │ 1 │
│ CS │ NULL │ 5 │
│ Math │ NULL │ 2 │
│ NULL │ Bachelor │ 2 │
│ NULL │ PhD │ 1 │
│ NULL │ Masters │ 1 │
│ NULL │ NULL │ 3 │
│ NULL │ NULL │ 7 │
└────────┴──────────┴──────────────┘
In the above query, we group across four different sets: (course, type), course, type, and () (the empty group). The result contains NULL for a group which is not in the grouping set for the result, i.e., the above query is equivalent to the following UNION statement:
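(A sketch of the equivalent UNION, using the same assumed students table.)
-- group by course, type
SELECT course, type, count(*)
FROM students
GROUP BY course, type
UNION ALL
-- group by course only
SELECT course, NULL AS type, count(*)
FROM students
GROUP BY course
UNION ALL
-- group by type only
SELECT NULL AS course, type, count(*)
FROM students
GROUP BY type
UNION ALL
-- group by nothing
SELECT NULL AS course, NULL AS type, count(*)
FROM students;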
CUBE and ROLLUP are syntactic sugar to easily produce commonly used grouping sets.
The ROLLUP clause will produce all "sub-groups" of a grouping set, e.g., ROLLUP (country, city, zip) produces the grouping sets (country, city, zip), (country, city), (country), (). This can be useful for producing different levels of detail of a GROUP BY clause. This produces n+1 grouping sets, where n is the number of terms in the ROLLUP clause.
CUBE produces grouping sets for all combinations of the inputs, e.g., CUBE (country, city, zip) will produce (country, city,
zip), (country, city), (country, zip), (city, zip), (country), (city), (zip), (). This produces 2^n
grouping sets.
The super‑aggregate rows generated by GROUPING SETS, ROLLUP and CUBE can often be identified by NULL‑values returned for the
respective column in the grouping. But if the columns used in the grouping can themselves contain actual NULL‑values, then it can be
challenging to distinguish whether the value in the resultset is a ”real” NULL‑value coming out of the data itself, or a NULL‑value generated
by the grouping construct. The GROUPING_ID() or GROUPING() function is designed to identify which groups generated the super‑
aggregate rows in the result.
GROUPING_ID() is an aggregate function that takes the column expressions that make up the grouping(s). It returns a BIGINT value.
The return value is 0 for the rows that are not super‑aggregate rows. But for the super‑aggregate rows, it returns an integer value that
identifies the combination of expressions that make up the group for which the super‑aggregate is generated. At this point, an example
might help. Consider the following query:
WITH days AS (
SELECT
year("generate_series") AS y,
quarter("generate_series") AS q,
month("generate_series") AS m
FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL 1 DAY)
)
SELECT y, q, m, GROUPING_ID(y, q, m) AS "grouping_id()"
FROM days
GROUP BY GROUPING SETS (
(y, q, m),
(y, q),
(y),
()
)
ORDER BY y, q, m;
┌───────┬───────┬───────┬─────────────┐
│ y │ q │ m │ grouping_id │
│ int64 │ int64 │ int64 │ int64 │
├───────┼───────┼───────┼─────────────┤
│ 2023 │ 1 │ 1 │ 0 │
│ 2023 │ 1 │ 2 │ 0 │
│ 2023 │ 1 │ 3 │ 0 │
│ 2023 │ 1 │ NULL │ 1 │
│ 2023 │ 2 │ 4 │ 0 │
│ 2023 │ 2 │ 5 │ 0 │
│ 2023 │ 2 │ 6 │ 0 │
│ 2023 │ 2 │ NULL │ 1 │
│ 2023 │ 3 │ 7 │ 0 │
│ 2023 │ 3 │ 8 │ 0 │
│ 2023 │ 3 │ 9 │ 0 │
│ 2023 │ 3 │ NULL │ 1 │
│ 2023 │ 4 │ 10 │ 0 │
│ 2023 │ 4 │ 11 │ 0 │
│ 2023 │ 4 │ 12 │ 0 │
│ 2023 │ 4 │ NULL │ 1 │
│ 2023 │ NULL │ NULL │ 3 │
│ NULL │ NULL │ NULL │ 7 │
├───────┴───────┴───────┴─────────────┤
│ 18 rows 4 columns │
└─────────────────────────────────────┘
In this example, the lowest level of grouping is at the month level, defined by the grouping set (y, q, m). Result rows corresponding to
that level are simply aggregate rows and the GROUPING_ID(y, q, m) function returns 0 for those. The grouping set (y, q) results
in super‑aggregate rows over the month level, leaving a NULL‑value for the m column, and for which GROUPING_ID(y, q, m) returns
1. The grouping set (y) results in super‑aggregate rows over the quarter level, leaving NULL‑values for the m and q column, for which
GROUPING_ID(y, q, m) returns 3. Finally, the () grouping set results in one super‑aggregate row for the entire resultset, leaving
NULL‑values for y, q and m and for which GROUPING_ID(y, q, m) returns 7.
To understand the relationship between the return value and the grouping set, you can think of GROUPING_ID(y, q, m) writing to a
bitfield, where the first bit corresponds to the last expression passed to GROUPING_ID(), the second bit to the one‑but‑last expression
passed to GROUPING_ID(), and so on. This may become clearer by casting GROUPING_ID() to BIT:
WITH days AS (
SELECT
year("generate_series") AS y,
quarter("generate_series") AS q,
month("generate_series") AS m
FROM generate_series(DATE '2023-01-01', DATE '2023-12-31', INTERVAL 1 DAY)
)
SELECT
y, q, m,
GROUPING_ID(y, q, m) AS "grouping_id(y, q, m)",
right(GROUPING_ID(y, q, m)::BIT, 3) AS "y_q_m_bits"
FROM days
GROUP BY GROUPING SETS (
(y, q, m),
(y, q),
(y),
()
)
ORDER BY y, q, m;
┌───────┬───────┬───────┬──────────────────────┬────────────┐
│ y │ q │ m │ grouping_id(y, q, m) │ y_q_m_bits │
│ int64 │ int64 │ int64 │ int64 │ varchar │
├───────┼───────┼───────┼──────────────────────┼────────────┤
│ 2023 │ 1 │ 1 │ 0 │ 000 │
│ 2023 │ 1 │ 2 │ 0 │ 000 │
│ 2023 │ 1 │ 3 │ 0 │ 000 │
│ 2023 │ 1 │ NULL │ 1 │ 001 │
│ 2023 │ 2 │ 4 │ 0 │ 000 │
│ 2023 │ 2 │ 5 │ 0 │ 000 │
│ 2023 │ 2 │ 6 │ 0 │ 000 │
│ 2023 │ 2 │ NULL │ 1 │ 001 │
│ 2023 │ 3 │ 7 │ 0 │ 000 │
│ 2023 │ 3 │ 8 │ 0 │ 000 │
│ 2023 │ 3 │ 9 │ 0 │ 000 │
│ 2023 │ 3 │ NULL │ 1 │ 001 │
│ 2023 │ 4 │ 10 │ 0 │ 000 │
│ 2023 │ 4 │ 11 │ 0 │ 000 │
│ 2023 │ 4 │ 12 │ 0 │ 000 │
│ 2023 │ 4 │ NULL │ 1 │ 001 │
│ 2023 │ NULL │ NULL │ 3 │ 011 │
│ NULL │ NULL │ NULL │ 7 │ 111 │
├───────┴───────┴───────┴──────────────────────┴────────────┤
│ 18 rows 5 columns │
└───────────────────────────────────────────────────────────┘
Note that the number of expressions passed to GROUPING_ID(), and the order in which they are passed, are independent from the actual
group definitions appearing in the GROUPING SETS clause (or the groups implied by ROLLUP and CUBE). As long as the expressions
passed to GROUPING_ID() appear somewhere in the GROUPING SETS clause, GROUPING_ID() will set a bit
corresponding to the position of the expression whenever that expression is rolled up to a super‑aggregate.
Syntax
HAVING Clause
The HAVING clause can be used after the GROUP BY clause to provide filter criteria after the grouping has been completed. In terms of
syntax the HAVING clause is identical to the WHERE clause, but while the WHERE clause occurs before the grouping, the HAVING clause
occurs after the grouping.
Examples
-- count the number of entries in the "addresses" table that belong to each different city
-- filtering out cities with a count below 50
SELECT city, count(*)
FROM addresses
GROUP BY city
HAVING count(*) >= 50;
Syntax
ORDER BY Clause
ORDER BY is an output modifier. Logically it is applied near the very end of the query (just prior to LIMIT or OFFSET, if present). The
ORDER BY clause sorts the rows on the sorting criteria in either ascending or descending order. In addition, every order clause can specify
whether NULL values should be moved to the beginning or to the end.
The ORDER BY clause may contain one or more expressions, separated by commas. An error will be thrown if no expressions are included,
since the ORDER BY clause should be removed in that situation. The expressions may begin with either an arbitrary scalar expression
(which could be a column name), a column position number (Ex: 1. Note that it is 1‑indexed), or the keyword ALL. Each expression can
optionally be followed by an order modifier (ASC or DESC, default is ASC), and/or a NULL order modifier (NULLS FIRST or NULLS LAST,
default is NULLS LAST).
ORDER BY ALL
The ALL keyword indicates that the output should be sorted by every column in order from left to right. The direction of this sort may be
modified using either ORDER BY ALL ASC or ORDER BY ALL DESC and/or NULLS FIRST or NULLS LAST. Note that ALL may not
be used in combination with other expressions in the ORDER BY clause ‑ it must be by itself. See examples below.
By default if no modifiers are provided, DuckDB sorts ASC NULLS LAST, i.e., the values are sorted in ascending order and null values are
placed last. This is identical to the default sort order of PostgreSQL. The default sort order can be changed with the following configuration
options.
Note. Using ASC NULLS LAST as the default sorting order was a breaking change in version 0.8.0. Prior to 0.8.0, DuckDB
sorted using ASC NULLS FIRST.
-- change the default null sorting order to either NULLS FIRST or NULLS LAST
SET default_null_order = 'NULLS FIRST';
-- change the default sorting order to either DESC or ASC
SET default_order = 'DESC';
Collations
Text is sorted using the binary comparison collation by default, which means values are sorted on their binary UTF‑8 values. While this
works well for ASCII text (e.g., for English language data), the sorting order can be incorrect for other languages. For this purpose, DuckDB
provides collations. For more information on collations, see the Collation page.
Examples
-- select the addresses, ordered by city name using the default null order and default order
SELECT *
FROM addresses
ORDER BY city;
-- select the addresses, ordered by city name in descending order with nulls at the end
SELECT *
FROM addresses
ORDER BY city DESC NULLS LAST;
-- order by city and then by zip code, both using the default orderings
SELECT *
FROM addresses
ORDER BY city, zip;
-- Order from left to right (by address, then by city, then by zip) in descending order
SELECT *
FROM addresses
ORDER BY ALL DESC;
Syntax
LIMIT Clause
LIMIT is an output modifier. Logically it is applied at the very end of the query. The LIMIT clause restricts the number of rows fetched.
The OFFSET clause indicates at which position to start reading the values, i.e., the first OFFSET values are ignored.
Note that while LIMIT can be used without an ORDER BY clause, the results might not be deterministic without the ORDER BY clause.
This can still be useful, however, for example when you want to inspect a quick snapshot of the data.
Examples
-- select 5 rows from the addresses table, starting at position 5 (i.e., ignoring the first 5 rows)
SELECT *
FROM addresses
LIMIT 5
OFFSET 5;
Syntax
SAMPLE Clause
The SAMPLE clause allows you to run the query on a sample from the base table. This can significantly speed up processing of queries,
at the expense of accuracy in the result. Samples can also be used to quickly see a snapshot of the data when exploring a data set. The
sample clause is applied right after anything in the FROM clause (i.e., after any joins, but before the WHERE clause or any aggregates). See
the SAMPLE page for more information.
Examples
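For example (a minimal sketch; the addresses table is reused from earlier examples):
SELECT *
FROM addresses
USING SAMPLE 1%;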
Syntax
Unnesting
Examples
The unnest function is used to unnest lists or structs by one level. The function can be used as a regular scalar function, but only in the
SELECT clause. Invoking unnest with the recursive parameter will unnest lists and structs of multiple levels.
Unnesting Lists
Using unnest on a list will emit one tuple per entry in the list. When unnest is combined with regular scalar expressions, those ex‑
pressions are repeated for every entry in the list. When multiple lists are unnested in the same SELECT clause, the lists are unnested
side‑by‑side. If one list is longer than the other, the shorter list will be padded with NULL values.
An empty list and a NULL list will both unnest to zero elements.
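For example (a small sketch of these behaviors):
-- emit one row per list entry
SELECT unnest([1, 2, 3]);
-- the scalar expression is repeated for every entry
SELECT 'duck' AS animal, unnest([4, 5]);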
Unnesting Structs
unnest on a struct will emit one column per entry in the struct.
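For example (a minimal sketch):
-- emit one column per struct entry: columns a and b
SELECT unnest({'a': 42, 'b': 'duck'});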
Recursive Unnest
Calling unnest with the recursive setting will fully unnest lists, followed by fully unnesting structs. This can be useful to fully flatten
columns that contain lists within lists, or lists of structs. Note that lists within structs are not unnested.
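For example (a minimal sketch):
-- fully flatten a list of lists into five rows
SELECT unnest([[1, 2, 3], [4, 5]], recursive := true);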
WITH Clause
The WITH clause allows you to specify common table expressions (CTEs). Regular (non‑recursive) common‑table‑expressions are essen‑
tially views that are limited in scope to a particular query. CTEs can reference each‑other and can be nested.
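-- create a CTE called "cte" and use it in the main query
WITH cte AS (SELECT 42 AS x)
SELECT * FROM cte;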
┌────┐
│ x │
├────┤
│ 42 │
└────┘
-- create two CTEs, where the second CTE references the first CTE
WITH cte AS (SELECT 42 AS i),
cte2 AS (SELECT i*100 AS x FROM cte)
SELECT * FROM cte2;
┌──────┐
│ x │
├──────┤
│ 4200 │
└──────┘
Materialized CTEs
By default, CTEs are inlined into the main query. Inlining can result in duplicate work, because the definition is copied for each reference.
Take a query that references a CTE t several times: inlining copies the definition of t into the main query once for each reference. If the
CTE's definition, call it Q_t, is expensive, materializing it with the MATERIALIZED keyword can improve performance, because Q_t is then
evaluated only once. A sketch of the difference follows.
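A sketch (the three references to t are illustrative; Q_t stands for the CTE's defining query):
-- default: t is inlined, so Q_t is evaluated once per reference
WITH t(x) AS (Q_t)
SELECT *
FROM t AS t1, t AS t2, t AS t3;
-- materialized: Q_t is evaluated only once
WITH t(x) AS MATERIALIZED (Q_t)
SELECT *
FROM t AS t1, t AS t2, t AS t3;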
Recursive CTEs
WITH RECURSIVE allows the definition of CTEs which can refer to themselves. Note that the query must be formulated in a way that
ensures termination, otherwise, it may run into an infinite loop.
Tree Traversal WITH RECURSIVE can be used to traverse trees. For example, take a hierarchy of tags:
The following query returns the path from the node Oasis to the root of the tree (Art).
WITH RECURSIVE tag_hierarchy(id, source, path) AS (
    SELECT id, name, [name] AS path
    FROM tag
    WHERE subclassof IS NULL
UNION ALL
SELECT tag.id, tag.name, list_prepend(tag.name, tag_hierarchy.path)
FROM tag, tag_hierarchy
WHERE tag.subclassof = tag_hierarchy.id
)
SELECT path
FROM tag_hierarchy
WHERE source = 'Oasis';
┌───────────────────────────┐
│ path │
├───────────────────────────┤
│ [Oasis, Rock, Music, Art] │
└───────────────────────────┘
Graph Traversal The WITH RECURSIVE clause can be used to express graph traversal on arbitrary graphs. However, if the graph has
cycles, the query must perform cycle detection to prevent infinite loops. One way to achieve this is to store the path of a traversal in a list
and, before extending the path with a new edge, check whether its endpoint has been visited before (see the example later).
Take the following directed graph from the LDBC Graphalytics benchmark:
Note that the graph contains directed cycles, e.g., between nodes 1, 2, and 5.
Enumerate All Paths from a Node The following query returns all paths starting in node 1:
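A sketch of such a query, assuming the graph's edges are stored in a table edge(source, target):
WITH RECURSIVE paths(startNode, endNode, path) AS (
      SELECT source, target, [source, target] AS path
      FROM edge
      WHERE source = 1
   UNION ALL
      SELECT paths.startNode, edge.target, list_append(path, edge.target)
      FROM paths
      JOIN edge ON paths.endNode = edge.source
      -- cycle detection: never extend a path to a node it already contains
      WHERE NOT list_contains(paths.path, edge.target)
)
SELECT startNode, endNode, path
FROM paths;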
┌───────────┬─────────┬───────────────┐
│ startNode │ endNode │ path │
├───────────┼─────────┼───────────────┤
│ 1 │ 3 │ [1, 3] │
│ 1 │ 5 │ [1, 5] │
│ 1 │ 5 │ [1, 3, 5] │
│ 1 │ 8 │ [1, 3, 8] │
│ 1 │ 10 │ [1, 3, 10] │
│ 1 │ 3 │ [1, 5, 3] │
│ 1 │ 4 │ [1, 5, 4] │
│ 1 │ 8 │ [1, 5, 8] │
│ 1 │ 4 │ [1, 3, 5, 4] │
│ 1 │ 8 │ [1, 3, 5, 8] │
│ 1 │ 8 │ [1, 5, 3, 8] │
│ 1 │ 10 │ [1, 5, 3, 10] │
└───────────┴─────────┴───────────────┘
Note that the result of this query is not restricted to shortest paths, e.g., for node 5, the results include paths [1, 5] and [1, 3, 5].
Enumerate Unweighted Shortest Paths from a Node In most cases, enumerating all paths is not practical or feasible. Instead, only the
(unweighted) shortest paths are of interest. To find these, the second half of the WITH RECURSIVE query should be adjusted such
that it only includes a node if it has not yet been visited. This is implemented by using a subquery that checks if any of the previous paths
includes the node:
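A sketch, under the same edge(source, target) assumption:
WITH RECURSIVE paths(startNode, endNode, path) AS (
      SELECT source, target, [source, target] AS path
      FROM edge
      WHERE source = 1
   UNION ALL
      SELECT paths.startNode, edge.target, list_append(path, edge.target)
      FROM paths
      JOIN edge ON paths.endNode = edge.source
      -- only extend to a node if no path found so far has visited it
      WHERE NOT EXISTS (SELECT 1
                        FROM paths previous_paths
                        WHERE list_contains(previous_paths.path, edge.target))
)
SELECT startNode, endNode, path
FROM paths;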
┌───────────┬─────────┬────────────┐
│ startNode │ endNode │ path │
├───────────┼─────────┼────────────┤
│ 1 │ 3 │ [1, 3] │
│ 1 │ 5 │ [1, 5] │
│ 1 │ 8 │ [1, 3, 8] │
│ 1 │ 10 │ [1, 3, 10] │
│ 1 │ 4 │ [1, 5, 4] │
│ 1 │ 8 │ [1, 5, 8] │
└───────────┴─────────┴────────────┘
Enumerate Unweighted Shortest Paths between Two Nodes WITH RECURSIVE can also be used to find all (unweighted) shortest
paths between two nodes. To ensure that the recursive query is stopped as soon as we reach the end node, we use a window function
which checks whether the end node is among the newly added nodes.
The following query returns all unweighted shortest paths between nodes 1 (start node) and 8 (end node):
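A sketch, again assuming an edge(source, target) table:
WITH RECURSIVE paths(startNode, endNode, path, endReached) AS (
      SELECT source, target, [source, target] AS path,
             max((target = 8)::INT) OVER () AS endReached
      FROM edge
      WHERE source = 1
   UNION ALL
      SELECT paths.startNode, edge.target, list_append(path, edge.target),
             -- window function: flag whether any newly added node is the end node
             max((edge.target = 8)::INT) OVER () AS endReached
      FROM paths
      JOIN edge ON paths.endNode = edge.source
      WHERE NOT EXISTS (SELECT 1
                        FROM paths previous_paths
                        WHERE list_contains(previous_paths.path, edge.target))
        -- stop the recursion once the end node has been reached
        AND paths.endReached = 0
)
SELECT startNode, endNode, path
FROM paths
WHERE endNode = 8;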
┌───────────┬─────────┬───────────┐
│ startNode │ endNode │ path │
├───────────┼─────────┼───────────┤
│ 1 │ 8 │ [1, 3, 8] │
│ 1 │ 8 │ [1, 5, 8] │
└───────────┴─────────┴───────────┘
WINDOW Clause
The WINDOW clause allows you to specify named windows that can be used within window functions. These are useful when you have
multiple window functions, as they allow you to avoid repeating the same window clause.
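For example (a sketch reusing the duckdb_functions() table function that also appears in the QUALIFY examples below):
SELECT
    schema_name,
    function_name,
    row_number() OVER my_window AS function_rank,
    count(*) OVER my_window AS running_count
FROM duckdb_functions()
WINDOW my_window AS (PARTITION BY schema_name ORDER BY function_name);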
Syntax
QUALIFY Clause
The QUALIFY clause is used to filter the results of WINDOW functions. This filtering of results is similar to how a HAVING clause filters the
results of aggregate functions applied based on the GROUP BY clause.
The QUALIFY clause avoids the need for a subquery or WITH clause to perform this filtering (much like HAVING avoids a subquery). An
example using a WITH clause instead of QUALIFY is included below the QUALIFY examples.
Note that this is filtering based on WINDOW functions, not necessarily based on the WINDOW clause. The WINDOW clause is optional and
can be used to simplify the creation of multiple WINDOW function expressions.
A QUALIFY clause is specified after the WINDOW clause in a SELECT statement (if present; WINDOW does not need to be
specified), and before the ORDER BY clause.
Examples
Each of the following examples produces the same output, shown below.
-- Filter based on a WINDOW function defined in the QUALIFY clause, but using the WINDOW clause
SELECT
schema_name,
function_name,
-- In this example the function_rank column in the select clause is for reference
row_number() OVER my_window AS function_rank
FROM duckdb_functions()
WINDOW
my_window AS (PARTITION BY schema_name ORDER BY function_name)
QUALIFY
row_number() OVER my_window < 3;
-- Filter based on a WINDOW function defined in the SELECT clause, but using the WINDOW clause
SELECT
schema_name,
function_name,
row_number() OVER my_window AS function_rank
FROM duckdb_functions()
WINDOW
my_window AS (PARTITION BY schema_name ORDER BY function_name)
QUALIFY
function_rank < 3;
schema_name    function_name      function_rank
main           !__postfix         1
main           !~~                2
pg_catalog     col_description    1
pg_catalog     format_pg_type     2
Syntax
VALUES Clause
The VALUES clause is used to specify a fixed number of rows. The VALUES clause can be used as a stand‑alone statement, as part of the
FROM clause, or as input to an INSERT INTO statement.
Examples
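For example (a minimal sketch):
-- generate two rows and directly return them
VALUES ('Amsterdam', 1), ('London', 2);
-- use the VALUES clause in a FROM clause, naming the columns
SELECT * FROM (VALUES ('Amsterdam', 1), ('London', 2)) cities(name, id);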
Syntax
FILTER Clause
The FILTER clause may optionally follow an aggregate function in a SELECT statement. This filters the rows of data that are fed into
the aggregate function in the same way that a WHERE clause filters rows, but localized to the specific aggregate function. FILTER is
currently not supported when the aggregate function is in a windowing context.
There are multiple types of situations where this is useful, including when evaluating multiple aggregates with different filters, and when
creating a pivoted view of a dataset. FILTER provides a cleaner syntax for pivoting data when compared with the more traditional CASE
WHEN approach discussed below.
Some aggregate functions also do not ignore NULL values, so using a FILTER clause will return valid results in situations where the CASE
WHEN approach will not. This occurs with the functions first and last, which are desirable in a non‑aggregating pivot operation where
the goal is to simply re‑orient the data into columns rather than re‑aggregate it. FILTER also improves null handling when using the list
and array_agg functions, as the CASE WHEN approach will include null values in the list result, while the FILTER clause will remove
them.
Examples
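For example (a sketch consistent with the result shown below):
-- the total number of rows, the count where i <= 5, and the count where i is odd
SELECT
    count(*) AS total_rows,
    count(i) FILTER (i <= 5) AS lte_five,
    count(i) FILTER (i % 2 = 1) AS odds
FROM generate_series(1, 10) tbl(i);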
10 5 5
-- Different aggregate functions may be used, and multiple WHERE expressions are also permitted
-- The sum of i for rows where i <= 5
-- The median of i where i is odd
SELECT
sum(i) FILTER (i <= 5) AS lte_five_sum,
median(i) FILTER (i % 2 = 1) AS odds_median,
median(i) FILTER (i % 2 = 1 AND i <= 5) AS odds_lte_five_median
FROM generate_series(1, 10) tbl(i);
15 5.0 3.0
The FILTER clause can also be used to pivot data from rows into columns. This is a static pivot, as columns must be defined prior to
runtime in SQL. However, this kind of statement can be dynamically generated in a host programming language to leverage DuckDB's SQL
engine for rapid, larger than memory pivoting.
-- "Pivot" the data out by year (move each year out to a separate column)
SELECT
count(i) FILTER (year = 2022) AS "2022",
count(i) FILTER (year = 2023) AS "2023",
count(i) FILTER (year = 2024) AS "2024",
count(i) FILTER (year = 2025) AS "2025",
count(i) FILTER (year IS NULL) AS "NULLs"
FROM stacked_data;
-- This syntax produces the same results as the FILTER clauses above
SELECT
count(CASE WHEN year = 2022 THEN i END) AS "2022",
count(CASE WHEN year = 2023 THEN i END) AS "2023",
count(CASE WHEN year = 2024 THEN i END) AS "2024",
count(CASE WHEN year = 2025 THEN i END) AS "2025",
count(CASE WHEN year IS NULL THEN i END) AS "NULLs"
FROM stacked_data;
However, the CASE WHEN approach will not work as expected when using an aggregate function that does not ignore NULL values. The
first function falls into this category, so FILTER is preferred in this case.
-- "Pivot" the data out by year (move each year out to a separate column)
SELECT
first(i) FILTER (year = 2022) AS "2022",
first(i) FILTER (year = 2023) AS "2023",
first(i) FILTER (year = 2024) AS "2024",
first(i) FILTER (year = 2025) AS "2025",
first(i) FILTER (year IS NULL) AS "NULLs"
FROM stacked_data;
-- This will produce NULL values whenever the first evaluation of the CASE WHEN clause returns a NULL
SELECT
    first(CASE WHEN year = 2022 THEN i END) AS "2022",
    first(CASE WHEN year = 2023 THEN i END) AS "2023",
    first(CASE WHEN year = 2024 THEN i END) AS "2024",
    first(CASE WHEN year = 2025 THEN i END) AS "2025",
    first(CASE WHEN year IS NULL THEN i END) AS "NULLs"
FROM stacked_data;
Set Operations
Set operations allow queries to be combined according to set operation semantics. Set operations refer to the UNION [ALL], INTERSECT
[ALL] and EXCEPT [ALL] clauses. The vanilla variants use set semantics, i.e., they eliminate duplicates, while the variants with ALL
use bag semantics.
Traditional set operations unify queries by column position, and require the to‑be‑combined queries to have the same number of input
columns. If the columns are not of the same type, casts may be added. The result will use the column names from the first query.
DuckDB also supports UNION [ALL] BY NAME, which joins columns by name instead of by position. UNION BY NAME does not require
the inputs to have the same number of columns. NULL values will be added in case of missing columns.
UNION
The UNION clause can be used to combine rows from multiple queries. The queries are required to have the same number of columns and
the same column types.
Vanilla UNION (Set Semantics) The vanilla UNION clause follows set semantics, therefore it performs duplicate elimination, i.e., only
unique rows will be included in the result.
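For example (a sketch consistent with the result below):
SELECT * FROM range(2) t1(x)
UNION
SELECT * FROM range(3) t2(x);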
┌───────┐
│ x │
│ int64 │
├───────┤
│ 0 │
│ 2 │
│ 1 │
└───────┘
UNION ALL (Bag Semantics) UNION ALL returns all rows of both queries following bag semantics, i.e., without duplicate elimina‑
tion.
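For example (a sketch consistent with the result below):
SELECT * FROM range(2) t1(x)
UNION ALL
SELECT * FROM range(3) t2(x);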
┌───────┐
│ x │
│ int64 │
├───────┤
│ 0 │
│ 1 │
│ 0 │
│ 1 │
│ 2 │
└───────┘
UNION [ALL] BY NAME The UNION [ALL] BY NAME clause can be used to combine rows from different tables by name, instead
of by position. UNION BY NAME does not require both queries to have the same number of columns. Any columns that are only found in
one of the queries are filled with NULL values for the other query.
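A sketch, assuming a capitals(city, country) table and a weather(city, degrees, date) table:
SELECT * FROM capitals
UNION BY NAME
SELECT * FROM weather;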
┌───────────┬─────────┬─────────┬────────────┐
│ city │ country │ degrees │ date │
│ varchar │ varchar │ int32 │ date │
├───────────┼─────────┼─────────┼────────────┤
│ Amsterdam │ NULL │ 10 │ 2022-10-14 │
│ Seattle │ NULL │ 8 │ 2022-10-12 │
│ Amsterdam │ NL │ NULL │ NULL │
│ Berlin │ Germany │ NULL │ NULL │
└───────────┴─────────┴─────────┴────────────┘
UNION BY NAME follows set semantics (therefore it performs duplicate elimination), whereas UNION ALL BY NAME follows bag
semantics.
INTERSECT
The INTERSECT clause can be used to select all rows that occur in the result of both queries.
Vanilla INTERSECT (Set Semantics) Vanilla INTERSECT performs duplicate elimination, so only unique rows are returned.
-- return the values 0 and 1 (all values that occur in both t1 and t2)
SELECT * FROM range(2) t1(x)
INTERSECT
SELECT * FROM range(6) t2(x);
┌───────┐
│ x │
│ int64 │
├───────┤
│ 0 │
│ 1 │
└───────┘
INTERSECT ALL (Bag Semantics) INTERSECT ALL follows bag semantics, so duplicates are returned.
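A sketch consistent with the result below:
SELECT unnest([5, 5, 6, 6, 6, 6, 7, 8]) AS x
INTERSECT ALL
SELECT unnest([5, 6, 6, 7, 7, 9]) AS x;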
┌───────┐
│ x │
│ int32 │
├───────┤
│ 7 │
│ 6 │
│ 6 │
│ 5 │
└───────┘
EXCEPT
The EXCEPT clause can be used to select all rows that only occur in the left query.
Vanilla EXCEPT (Set Semantics) Vanilla EXCEPT follows set semantics, therefore, it performs duplicate elimination, so only unique
rows are returned.
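A sketch consistent with the result below:
SELECT * FROM range(5) t1(x)
EXCEPT
SELECT * FROM range(2) t2(x);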
┌───────┐
│ x │
│ int64 │
├───────┤
│ 4 │
│ 3 │
│ 2 │
└───────┘
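EXCEPT ALL (Bag Semantics) EXCEPT ALL follows bag semantics, so duplicates are returned. A sketch consistent with the result
below:
SELECT unnest([5, 5, 6, 6, 6, 6, 7, 8]) AS x
EXCEPT ALL
SELECT unnest([5, 6, 6, 7, 7, 9]) AS x;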
┌───────┐
│ x │
│ int32 │
├───────┤
│ 6 │
│ 6 │
│ 8 │
│ 5 │
└───────┘
Syntax
Prepared Statements
DuckDB supports prepared statements where parameters are substituted when the query is executed. This can improve readability and is
useful for preventing SQL injections.
Syntax
There are three syntaxes for denoting parameters in prepared statements: auto‑incremented (?), positional ($1), and named ($param).
Note that not all clients support all of these syntaxes, e.g., the JDBC client only supports auto‑incremented parameters in prepared state‑
ments.
Example Data Set In the following, we introduce the three different syntaxes and illustrate them with examples using the following
table.
In our example query, we'll look for people whose name starts with a ”B” and are at least 40 years old. This will return a single row <'Bob',
41>.
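One possible definition of such a table (a sketch; the exact rows are assumed):
CREATE TABLE person (name VARCHAR, age BIGINT);
INSERT INTO person VALUES ('Alice', 37), ('Ana', 35), ('Bob', 41), ('Bea', 25);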
Auto‑Incremented Parameters: ? DuckDB supports prepared statements with auto‑incremented indexing, i.e., the position of the
parameters in the query corresponds to their position in the execution statement. For example:
PREPARE query_person AS
SELECT *
FROM person
WHERE starts_with(name, ?)
AND age >= ?;
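The statement is then executed by supplying the parameter values in the same order:
EXECUTE query_person('B', 40);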
Positional Parameters: $1 Prepared statements can use positional parameters, where parameters are denoted with an integer ($1, $2).
For example:
PREPARE query_person AS
SELECT *
FROM person
WHERE starts_with(name, $2)
AND age >= $1;
Using the CLI client, the statement is executed as follows. Note that the first parameter corresponds to $1, the second to $2, and so on.
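-- the first argument binds to $1 (the minimum age), the second to $2 (the name prefix)
EXECUTE query_person(40, 'B');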
Named Parameters: $parameter DuckDB also supports named parameters, where parameters are denoted with $parameter_name.
For example:
PREPARE query_person AS
SELECT *
FROM person
WHERE starts_with(name, $name_start_letter)
AND age >= $minimum_age;
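Such a statement is executed by binding each parameter by name:
EXECUTE query_person(name_start_letter := 'B', minimum_age := 40);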
Data Types
Data Types
The table below shows all the built‑in general‑purpose data types. The alternatives listed in the aliases column can be used to refer to
these types as well, however, note that the aliases are not part of the SQL standard and hence might not be accepted by other database
engines.
Implicit and explicit typecasting is possible between numerous types, see the Typecasting page for details.
DuckDB supports five nested data types: ARRAY, LIST, STRUCT, MAP, and UNION. Each supports different use cases and has a different
structure.
ARRAY
    Description: An ordered, fixed‑length sequence of data values of the same type.
    Rules when used in a column: Each row must have the same data type within each instance of the ARRAY and the same number of elements.
    Build from values: [1, 2, 3]
    Define in DDL/CREATE: INT[3]
LIST
    Description: An ordered sequence of data values of the same type.
    Rules when used in a column: Each row must have the same data type within each instance of the LIST, but can have any number of elements.
    Build from values: [1, 2, 3]
    Define in DDL/CREATE: INT[]
MAP
    Description: A dictionary of multiple named values, each key having the same type and each value having the same type. Keys and values can be any type and can be different types from one another.
    Rules when used in a column: Rows may have different keys.
    Build from values: map([1, 2], ['a', 'b'])
    Define in DDL/CREATE: MAP(INT, VARCHAR)
STRUCT
    Description: A dictionary of multiple named values, where each key is a string, but the value can be a different type for each key.
    Rules when used in a column: Each row must have the same keys.
    Build from values: {'i': 42, 'j': 'a'}
    Define in DDL/CREATE: STRUCT(i INT, j VARCHAR)
UNION
    Description: A union of multiple alternative data types, storing one of them in each value at a time. A union also contains a discriminator ”tag” value to inspect and access the currently set member type.
    Rules when used in a column: Rows may be set to different member types of the union.
    Build from values: union_value(num := 2)
    Define in DDL/CREATE: UNION(num INT, text VARCHAR)
Nesting
ARRAY, LIST, MAP, STRUCT, and UNION types can be arbitrarily nested to any depth, so long as the type rules are observed.
Performance Implications
The choice of data types can have a strong effect on performance. Please consult the Performance Guide for details.
Array Type
An ARRAY column stores fixed‑sized arrays. All fields in the column must have the same length and the same underlying type. Arrays are
typically used to store arrays of numbers, but can contain any uniform data type, including ARRAY, LIST and STRUCT types.
Arrays can be used to store vectors such as word embeddings or image embeddings.
To store variable‑length lists, use the LIST type. See the data types overview for a comparison between nested data types.
Note. The ARRAY type in PostgreSQL allows variable‑length fields. DuckDB's ARRAY type is fixed‑length.
Creating Arrays
Arrays can be created using the TYPE_NAME [ LENGTH ] syntax. For example, to create an array field for 3 integers, run:
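-- a minimal sketch: a table with a fixed-size array column of 3 integers
CREATE TABLE array_table (id INTEGER, arr INTEGER[3]);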
Retrieving one or more values from an array can be accomplished using brackets and slicing notation, or through list functions like
list_extract and array_extract. Using the example table defined above, the following queries for extracting the second element of
an array are equivalent:
┌───────┬─────────┐
│ id │ element │
│ int32 │ int32 │
├───────┼─────────┤
│ 10 │ 1 │
│ 20 │ 4 │
└───────┴─────────┘
┌───────┬──────────┐
│ id │ elements │
│ int32 │ int32[] │
├───────┼──────────┤
│ 10 │ [1, 2] │
│ 20 │ [4, 5] │
└───────┴──────────┘
Functions
All LIST functions work with the ARRAY type. Additionally, several ARRAY‑native functions are also supported. In the following, l1
stands for the 3‑element list created by array_value(1.0::FLOAT, 2.0::FLOAT, 3.0::FLOAT) and l2 stands for array_
value(2.0::FLOAT, 3.0::FLOAT, 4.0::FLOAT).
Examples
-- a sketch: compare the vectors of tables x(i, v) and y(i, v) for matching ids
-- (the tables and the choice of array_cosine_similarity are assumptions)
SELECT array_cosine_similarity(x.v, y.v)
FROM x, y
WHERE x.i = y.i;
Ordering
The ordering of ARRAY instances is defined using a lexicographical order. NULL values compare greater than all other values and are
considered equal to each other.
Functions
Bitstring Type
Bitstrings are strings of 1s and 0s. The bit type data is of variable length. A bitstring value requires 1 byte for each group of 8 bits, plus a
fixed amount to store some metadata.
By default bitstrings will not be padded with zeroes. Bitstrings can be very large, having the same size restrictions as BLOBs.
-- Create a bitstring
SELECT '101010'::BIT;
-- Create a bitstring with predefined length.
-- The resulting bitstring will be left-padded with zeroes.
-- This returns 000000101011.
SELECT bitstring('0101011', 12);
Functions
Blob Type
The blob (Binary Large OBject) type represents an arbitrary binary object stored in the database system. The blob type can contain any
type of binary data with no restrictions. What the actual bytes represent is opaque to the database system.
Blobs are typically used to store non‑textual objects that the database does not provide explicit support for, such as images. While blobs
can hold objects up to 4GB in size, typically it is not recommended to store very large objects within the database system. In many situations
it is better to store the large file on the file system, and store the path to the file in the database system in a VARCHAR field.
Functions
Boolean Type
The BOOLEAN type represents a statement of truth (”true” or ”false”). In SQL, the boolean field can also have a third state ”unknown” which
is represented by the SQL NULL value.
Boolean values can be explicitly created using the literals true and false. However, they are most often created as a result of comparisons
or conjunctions. For example, the comparison i > 10 results in a boolean value. Boolean values can be used in the WHERE and HAVING
clauses of a SQL statement to filter out tuples from the result. In this case, tuples for which the predicate evaluates to true will pass the
filter, and tuples for which the predicate evaluates to false or NULL will be filtered out. Consider the following example:
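A minimal sketch (the table and its values are assumed):
CREATE TABLE integers (i INTEGER);
INSERT INTO integers VALUES (5), (15), (NULL);
-- returns only the row with 15: for 5 the predicate is false, for NULL it is NULL
SELECT * FROM integers WHERE i > 10;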
Conjunctions
Below is the truth table for the AND conjunction (i.e., x AND y).
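x        y        x AND y
true     true     true
true     false    false
true     NULL     NULL
false    false    false
false    NULL     false
NULL     NULL     NULL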
Expressions
Date Types
A date specifies a combination of year, month and day. DuckDB follows the SQL standard's lead by counting dates exclusively in the Gre‑
gorian calendar, even for years before that calendar was in use. Dates can be created using the DATE keyword, where the data must be
formatted according to the ISO 8601 format (YYYY-MM-DD).
-- 20 September, 1992
SELECT DATE '1992-09-20';
Special Values
There are also three special date values that can be used on input:
The values infinity and -infinity are specially represented inside the system and will be displayed unchanged; but epoch is simply
a notational shorthand that will be converted to the date value when read.
SELECT
'-infinity'::DATE AS negative,
'epoch'::DATE AS epoch,
'infinity'::DATE AS positive;
┌───────────┬────────────┬──────────┐
│ negative │ epoch │ positive │
│ date │ date │ date │
├───────────┼────────────┼──────────┤
│ -infinity │ 1970-01-01 │ infinity │
└───────────┴────────────┴──────────┘
Functions
Enum Data Type
The ENUM type represents a dictionary data structure with all possible unique values of a column. For example, a column storing the days
of the week can be an Enum holding all possible days. Enums are particularly interesting for string columns with low cardinality (i.e., fewer
distinct values). This is because the column only stores a numerical reference to the string in the Enum dictionary, resulting in immense
savings in disk storage and faster query performance.
Enum Definition
Enum types are created from either a hardcoded set of values or from a SELECT statement that returns a single column of VARCHARs. The set
of values in the SELECT statement will be deduplicated, but if the enum is created from a hardcoded set, that set must not contain any duplicates.
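-- Create an enum from a hardcoded set of values
CREATE TYPE ${enum_name} AS ENUM (${value_1}, ${value_2}, ...);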
-- Create enum using a select statement that returns a single column of varchars
CREATE TYPE ${enum_name} AS ENUM (${SELECT expression});
For example:
-- Create an enum using the unique string values in the my_varchar column
CREATE TYPE birds AS ENUM (SELECT my_varchar FROM my_inputs);
-- Show the available values in the birds enum using the enum_range function
SELECT enum_range(NULL::birds) AS my_enum_range;
my_enum_range
[duck, goose]
Enum Usage
After an enum has been created, it can be used anywhere a standard built‑in type is used. For example, we can create a table with a column
that references the enum.
Creates a table person, with attributes name (string type) and current_mood (mood type):
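A sketch, assuming mood was created as an ENUM with the values 'sad', 'ok', 'happy' and 'anxious':
CREATE TABLE person (
    name TEXT,
    current_mood mood
);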
The following query will fail since the mood type does not have a 'quackity‑quack' value.
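-- fails: 'quackity-quack' is not a value of the mood enum (the row values are illustrative)
INSERT INTO person VALUES ('Duckie', 'quackity-quack');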
The string 'sad' is cast to the type mood, returning a numerical reference value. This makes the comparison a numerical comparison instead
of a string comparison.
SELECT *
FROM person
WHERE current_mood = 'sad';
┌───────────┬───────────────────────────────────────┐
│ name │ current_mood │
│ varchar │ enum('sad', 'ok', 'happy', 'anxious') │
├───────────┼───────────────────────────────────────┤
│ Pagliacci │ sad │
└───────────┴───────────────────────────────────────┘
If you are importing data from a file, you can create an Enum for a VARCHAR column before importing; a subquery in the Enum definition
automatically selects only the distinct values. Then you can create a table with the ENUM type and import using any data import statement:
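A sketch, using a hypothetical dataset.csv file with name and mood columns:
-- the file name and column names here are assumptions for illustration
CREATE TYPE mood AS ENUM (SELECT mood FROM 'dataset.csv');
CREATE TABLE person (name TEXT, current_mood mood);
COPY person FROM 'dataset.csv';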
DuckDB Enums are automatically cast to VARCHAR types whenever necessary. This characteristic allows for ENUM columns to be used in
any VARCHAR function. In addition, it also allows for comparisons between different ENUM columns, or an ENUM and a VARCHAR column.
For example:
┌────────────┐
│ contains_a │
│ boolean │
├────────────┤
│ true │
│ NULL │
│ true │
│ false │
└────────────┘
Since the current_mood and future_mood columns are constructed on different ENUM types, DuckDB will cast both ENUMs to strings
and perform a string comparison:
SELECT *
FROM person_2
WHERE current_mood = future_mood;
When comparing the past_mood column (string), DuckDB will cast the current_mood ENUM to VARCHAR and perform a string compar‑
ison:
SELECT *
FROM person_2
WHERE current_mood = past_mood;
Enum Removal
Enum types are stored in the catalog, and a catalog dependency is added to each table that uses them. It is possible to drop an Enum from
the catalog using the following command:
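DROP TYPE ${enum_name};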
Currently, it is possible to drop Enums that are used in tables without affecting the tables.
Note. Warning This behavior of the Enum Removal feature is subject to change. In future releases, it is expected that any dependent
columns must be removed before dropping the Enum, or the Enum must be dropped with the additional CASCADE parameter.
Comparison of Enums
Enum values are compared according to their order in the Enum's definition. For example:
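-- a sketch, assuming CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')
SELECT 'sad'::mood < 'ok'::mood AS comp;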
┌─────────┐
│ comp │
│ boolean │
├─────────┤
│ true │
└─────────┘
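-- sort enum values by their position in the enum definition
SELECT unnest(['ok'::mood, 'happy'::mood, 'sad'::mood]) AS m
ORDER BY m;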
┌────────────────────────────┐
│ m │
│ enum('sad', 'ok', 'happy') │
├────────────────────────────┤
│ sad │
│ ok │
│ happy │
└────────────────────────────┘
Interval Type
Intervals represent a period of time. This period can be measured in a specific unit or combination of units, for example years, days, or
seconds. Intervals are generally used to modify timestamps or dates by either adding or subtracting them.
An INTERVAL can be constructed by providing an amount together with a unit. Intervals can be added or subtracted from DATE or TIMES-
TAMP values.
-- 1 year
SELECT INTERVAL 1 YEAR;
-- add 1 year to a specific date
SELECT DATE '2000-01-01' + INTERVAL 1 YEAR;
-- subtract 1 year from a specific date
SELECT DATE '2000-01-01' - INTERVAL 1 YEAR;
-- construct an interval from a column, instead of a constant
SELECT INTERVAL (i) YEAR FROM range(1, 5) t(i);
-- construct an interval with mixed units
SELECT INTERVAL '1 month 1 day';
-- intervals greater than 24 hours/12 months/etc. are supported
SELECT '540:58:47.210'::INTERVAL;
SELECT INTERVAL '16 MONTHS';
-- WARNING:
-- If a decimal value is specified, it will be automatically rounded to an integer
-- To use more precise values, simply use a more granular date part
-- (In this example use 18 MONTHS instead of 1.5 YEARS)
-- The statement below is equivalent to to_years(CAST(1.5 AS INTEGER))
SELECT INTERVAL '1.5' YEARS; -- WARNING! This returns 2 years!
Details
The interval class represents a period of time using three distinct components: the month, day and microsecond. These three components
are required because there is no direct translation between them. For example, a month does not correspond to a fixed amount of days.
That depends on which month is referenced. February has fewer days than March.
The division into components makes the interval class suitable for adding or subtracting specific time units to a date. For example, we can
generate a table with the first day of every month using the following SQL query:
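-- a sketch: the first day of every month of the year 2000
SELECT DATE '2000-01-01' + INTERVAL (i) MONTH AS first_day
FROM range(12) t(i);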
If we subtract two timestamps from one another, we obtain an interval describing the difference between the timestamps with the days
and microseconds components. For example:
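-- a sketch consistent with the result below
SELECT TIMESTAMP '2000-02-01 12:00:00' - TIMESTAMP '2000-01-01 11:00:00' AS diff;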
┌──────────────────┐
│ diff │
│ interval │
├──────────────────┤
│ 31 days 01:00:00 │
└──────────────────┘
The datediff function can be used to obtain the difference between two dates for a specific unit.
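-- a sketch: the number of month boundaries crossed between the two dates
SELECT datediff('month', DATE '2000-01-31', DATE '2000-02-01') AS diff;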
┌───────┐
│ diff │
│ int64 │
├───────┤
│ 1 │
└───────┘
Functions
See the Date Part Functions page for a list of available date parts for use with an INTERVAL.
See the Interval Operators page for functions that operate on intervals.
List Type
A LIST column encodes lists of values. Fields in the column can have values with different lengths, but they must all have the same
underlying type. LISTs are typically used to store arrays of numbers, but can contain any uniform data type, including other LISTs and
STRUCTs.
LISTs are similar to PostgreSQL's ARRAY type. DuckDB uses the LIST terminology, but some array functions are provided for PostgreSQL
compatibility.
See the data types overview for a comparison between nested data types.
Note. For storing fixed‑length lists, DuckDB uses the ARRAY type.
Creating Lists
Lists can be created using the list_value(expr, ...) function or the equivalent bracket notation [expr, ...]. The expressions
can be constants or arbitrary expressions. To create a list from a table column, use the list aggregate function.
-- List of integers
SELECT [1, 2, 3];
-- List of strings with a NULL value
SELECT ['duck', 'goose', NULL, 'heron'];
-- List of lists with NULL values
SELECT [['duck', 'goose', 'heron'], NULL, ['frog', 'toad'], []];
-- Create a list with the list_value function
SELECT list_value(1, 2, 3);
-- Create a table with an integer list column and a varchar list column
CREATE TABLE list_table (int_list INT[], varchar_list VARCHAR[]);
Retrieving one or more values from a list can be accomplished using brackets and slicing notation, or through list functions like list_
extract. Multiple equivalent functions are provided as aliases for compatibility with systems that refer to lists as arrays. For example,
the function array_slice.
Note. We wrap the list creation in parentheses so that it happens first. This is only needed in our basic examples here, not when
working with a list column. For example, this can't be parsed: SELECT ['a', 'b', 'c'][1].
Ordering
The ordering is defined positionally. NULL values compare greater than all other values and are considered equal to each other.
Null Comparisons
At the top level, NULL nested values obey standard SQL NULL comparison rules: comparing a NULL nested value to a non‑NULL nested
value produces a NULL result. Comparing nested value members, however, uses the internal nested value rules for NULLs, and a NULL
nested value member will compare above a non‑NULL nested value member.
Functions
Literal Types
DuckDB has special literal types for representing NULL, integer and string literals in queries. These have their own binding and conversion
rules.
Note. Prior to version 0.10.0, integer and string literals behaved identically to the INTEGER and VARCHAR types.
Null Literals
Integer Literals
INTEGER_LITERAL types can be implicitly converted to any integer type in which the value fits. For example, the integer literal 42 can
be implicitly converted to a TINYINT, but the integer literal 1000 cannot be.
String Literals
┌─────────┐
│ result │
│ boolean │
├─────────┤
│ false │
└─────────┘
-- Binder Error: Cannot compare values of type DATE and type VARCHAR –
-- an explicit cast is required
Escape String Literals To include special characters such as newline, prefix the string literal with E. Both the uppercase (E'...') and
lowercase (e'...') variants work.
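SELECT E'Hello\nworld' AS msg;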
┌──────────────┐
│ msg │
│ varchar │
├──────────────┤
│ Hello\nworld │
└──────────────┘
Dollar‑Quoted String Literals DuckDB supports dollar‑quoted string literals, which are surrounded by double‑dollar symbols ($$):
SELECT $$Hello
world$$ AS msg;
┌──────────────┐
│ msg │
│ varchar │
├──────────────┤
│ Hello\nworld │
└──────────────┘
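Dollar quoting is convenient for strings that contain single quotes or dollar signs, for example:
SELECT $$The price is $9.95$$ AS msg;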
┌────────────────────┐
│ msg │
│ varchar │
├────────────────────┤
│ The price is $9.95 │
└────────────────────┘
Map Type
MAPs are similar to STRUCTs in that they are an ordered list of ”entries” where a key maps to a value. However, MAPs do not need to have
the same keys present for each row, and thus are suitable for other use cases. MAPs are useful when the schema is unknown beforehand
or when the schema varies per row; their flexibility is a key differentiator.
MAPs must have a single type for all keys, and a single type for all values. Keys and values can be any type, and the type of the keys does
not need to match the type of the values (Ex: a MAP of VARCHAR to INT is valid). MAPs may not have duplicate keys. MAPs return an empty
list if a key is not found rather than throwing an error as structs do.
In contrast, STRUCTs must have string keys, but each key may have a value of a different type. See the data types overview for a comparison
between nested data types.
To construct a MAP, use the bracket syntax preceded by the MAP keyword.
Creating Maps
-- A map with VARCHAR keys and INTEGER values. This returns {key1=10, key2=20, key3=30}
SELECT MAP {'key1': 10, 'key2': 20, 'key3': 30};
-- Alternatively use the map_from_entries function. This returns {key1=10, key2=20, key3=30}
SELECT map_from_entries([('key1', 10), ('key2', 20), ('key3', 30)]);
-- A map can be also created using two lists: keys and values. This returns {key1=10, key2=20, key3=30}
SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);
-- A map can also use INTEGER keys and NUMERIC values. This returns {1=42.001, 5=-32.100}
SELECT MAP {1: 42.001, 5: -32.1};
-- Keys and/or values can also be nested types.
-- This returns {[a, b]=[1.1, 2.2], [c, d]=[3.3, 4.4]}
SELECT MAP {['a', 'b']: [1.1, 2.2], ['c', 'd']: [3.3, 4.4]};
-- Create a table with a map column that has INTEGER keys and DOUBLE values
CREATE TABLE tbl (col MAP(INTEGER, DOUBLE));
MAPs use bracket notation for retrieving values. Selecting from a MAP returns a LIST rather than an individual value, with an empty LIST
meaning that the key was not found.
-- Use bracket notation to retrieve a list containing the value at a key's location. This returns [5]
-- Note that the expression in bracket notation must match the type of the map's key
SELECT MAP {'key1': 5, 'key2': 43}['key1'];
-- To retrieve the underlying value, use list selection syntax to grab the first element.
-- This returns 5
SELECT MAP {'key1': 5, 'key2': 43}['key1'][1];
-- If the element is not in the map, an empty list will be returned. Returns []
-- Note that the expression in bracket notation must match the type of the map's key, else an error is returned
SELECT MAP {'key1': 5, 'key2': 43}['key3'];
-- The element_at function can also be used to retrieve a map value. This returns [5]
SELECT element_at(MAP {'key1': 5, 'key2': 43}, 'key1');
Comparison Operators
Nested types can be compared using all the comparison operators. These comparisons can be used in logical expressions for both WHERE
and HAVING clauses, as well as for creating Boolean values.
The ordering is defined positionally in the same way that words can be ordered in a dictionary. NULL values compare greater than all other
values and are considered equal to each other.
At the top level, NULL nested values obey standard SQL NULL comparison rules: comparing a NULL nested value to a non‑NULL nested
value produces a NULL result. Comparing nested value members, however, uses the internal nested value rules for NULLs, and a NULL
nested value member will compare above a non‑NULL nested value member.
Functions
NULL Values
NULL values are special values that are used to represent missing data in SQL. Columns of any type can contain NULL values. Logically, a
NULL value can be seen as ”the value of this field is unknown”.
NULL values have special semantics in many parts of the query as well as in many functions:
Note. Any comparison with a NULL value returns NULL, including NULL = NULL.
You can use IS NOT DISTINCT FROM to perform an equality comparison where NULL values compare equal to each other. Use IS
(NOT) NULL to check if a value is NULL.
SELECT cos(NULL);
-- NULL
The coalesce function is an exception to this: it takes any number of arguments, and returns for each row the first argument that is not
NULL. If all arguments are NULL, coalesce also returns NULL.
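SELECT coalesce(NULL, NULL, 1);
-- 1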
NULL values have special semantics in AND/OR conjunctions. For the ternary logic truth tables, see the Boolean Type documentation.
Aggregate functions that do not ignore NULL values include: first, last, list, and array_agg. To exclude NULL values from those
aggregate functions, the FILTER clause can be used.
Numeric Types
Integer Types
The types TINYINT, SMALLINT, INTEGER, BIGINT and HUGEINT store whole numbers, that is, numbers without fractional compo‑
nents, of various ranges. Attempts to store values outside of the allowed range will result in an error. The types UTINYINT, USMALLINT,
UINTEGER, UBIGINT and UHUGEINT store whole unsigned numbers. Attempts to store negative numbers or values outside of the al‑
lowed range will result in an error.
The type integer is the common choice, as it offers the best balance between range, storage size, and performance. The SMALLINT type is
generally only used if disk space is at a premium. The BIGINT and HUGEINT types are designed to be used when the range of the integer
type is insufficient.
Fixed‑Point Decimals
The data type DECIMAL(WIDTH, SCALE) (also available under the alias NUMERIC(WIDTH, SCALE)) represents an exact fixed‑point
decimal value. When creating a value of type DECIMAL, the WIDTH and SCALE can be specified to define which size of decimal values can
be held in the field. The WIDTH field determines how many digits can be held, and the scale determines the amount of digits after the
decimal point. For example, the type DECIMAL(3, 2) can fit the value 1.23, but cannot fit the value 12.3 or the value 1.234. The
default WIDTH and SCALE is DECIMAL(18, 3), if none are specified.
Internally, decimals are represented as integers, with the integer width depending on the specified decimal width:
Width    Internal    Size (bytes)
1‑4      INT16       2
5‑9      INT32       4
10‑18    INT64       8
19‑38    INT128      16
Performance can be impacted by using too large decimals when not required. In particular decimal values with a width above 19 are slow,
as arithmetic involving the INT128 type is much more expensive than operations involving the INT32 or INT64 types. It is therefore
recommended to stick with a width of 18 or below, unless there is a good reason for why this is insufficient.
Floating‑Point Types
The data types REAL and DOUBLE precision are inexact, variable‑precision numeric types. In practice, these types are usually implementa‑
tions of IEEE Standard 754 for Binary Floating‑Point Arithmetic (single and double precision, respectively), to the extent that the underlying
processor, operating system, and compiler support it.
Inexact means that some values cannot be converted exactly to the internal format and are stored as approximations, so that storing and
retrieving a value might show slight discrepancies. Managing these errors and how they propagate through calculations is the subject of
an entire branch of mathematics and computer science and will not be discussed here, except for the following points:
• If you require exact storage and calculations (such as for monetary amounts), use the DECIMAL data type or its NUMERIC alias
instead.
• If you want to do complicated calculations with these types for anything important, especially if you rely on certain behavior in
boundary cases (infinity, underflow), you should evaluate the implementation carefully.
• Comparing two floating‑point values for equality might not always work as expected.
On most platforms, the REAL type has a range of at least 1E‑37 to 1E+37 with a precision of at least 6 decimal digits. The DOUBLE type
typically has a range of around 1E‑307 to 1E+308 with a precision of at least 15 digits. Values that are too large or too small will cause an
error. Rounding might take place if the precision of an input number is too high. Numbers too close to zero that are not representable as
distinct from zero will cause an underflow error.
In addition to ordinary numeric values, the floating‑point types have several special values:
• Infinity
• -Infinity
• NaN
These represent the IEEE 754 special values ”infinity”, ”negative infinity”, and ”not‑a‑number”, respectively. (On a machine whose floating‑
point arithmetic does not follow IEEE 754, these values will probably not work as expected.) When writing these values as constants in an
SQL command, you must put quotes around them, for example: UPDATE table SET x = '-Infinity'. On input, these strings are
recognized in a case‑insensitive manner.
UUID Type
DuckDB supports universally unique identifiers (UUIDs) through the UUID type. These use 128 bits and are represented internally as
HUGEINT values. When printed, they are shown with hexadecimal characters, separated by dashes as follows: 8 characters - 4
characters - 4 characters - 4 characters - 12 characters (using 36 characters in total). For example, 4ac7a9e9-
607c-4c8a-84f3-843f0191e3fd is a valid UUID.
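A new UUID can be generated with the uuid() function:
SELECT uuid();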
Functions
Struct Type
Conceptually, a STRUCT column contains an ordered list of columns called ”entries”. The entries are referenced by name using strings;
this document refers to those entry names as keys. Each row in a STRUCT column must have the same keys, i.e., the same layout, and
the names of the struct entries are part of the schema. The names of the struct entries are case‑insensitive.
STRUCTs are typically used to nest multiple columns into a single column, and the nested column can be of any type, including other
STRUCTs and LISTs.
STRUCTs are similar to PostgreSQL's ROW type. The key difference is that DuckDB STRUCTs require the same keys in each row of a STRUCT
column. This allows DuckDB to provide significantly improved performance by fully utilizing its vectorized execution engine, and also
enforces type consistency for improved correctness. DuckDB includes a row function as a special way to produce a STRUCT, but does not
have a ROW data type. See an example below and the nested functions docs for details.
See the data types overview for a comparison between nested data types.
Creating Structs Structs can be created using the struct_pack(name := expr, ...) function or the equivalent
{'name': expr, ...} notation. The expressions can be constants or arbitrary expressions.
-- Struct of integers
SELECT {'x': 1, 'y': 2, 'z': 3};
-- Struct of strings with a NULL value
SELECT {'yes': 'duck', 'maybe': 'goose', 'huh': NULL, 'no': 'heron'};
-- Struct with a different type for each key
SELECT {'key1': 'string', 'key2': 1, 'key3': 12.345};
-- Struct using the struct_pack function.
-- Note the lack of single quotes around the keys and the use of the := operator
SELECT struct_pack(key1 := 'value1', key2 := 42);
-- Struct of structs with NULL values
SELECT {'birds':
{'yes': 'duck', 'maybe': 'goose', 'huh': NULL, 'no': 'heron'},
'aliens':
NULL,
'amphibians':
{'yes':'frog', 'maybe': 'salamander', 'huh': 'dragon', 'no':'toad'}
};
-- Create a struct from columns and/or expressions using the row function.
-- This returns {'': 1, '': 2, '': a}
SELECT row(x, x + 1, y) FROM (SELECT 1 AS x, 'a' AS y);
-- If using multiple expressions when creating a struct, the row function is optional
-- This also returns {'': 1, '': 2, '': a}
SELECT (x, x + 1, y) FROM (SELECT 1 AS x, 'a' AS y);
Retrieving from Structs Retrieving a value from a struct can be accomplished using dot notation, bracket notation, or through struct
functions like struct_extract.
-- Use dot notation to retrieve the value at a key's location. This returns 1
-- The subquery generates a struct column "a", which we then query with a.x
SELECT a.x FROM (SELECT {'x': 1, 'y': 2, 'z': 3} AS a);
-- If key contains a space, simply wrap it in double quotes. This returns 1
-- Note: Use double quotes not single quotes
-- This is because this action is most similar to selecting a column from within the struct
SELECT a."x space" FROM (SELECT {'x space': 1, 'y': 2, 'z': 3} AS a);
-- Bracket notation may also be used. This returns 1
-- Note: Use single quotes since the goal is to specify a certain string key.
-- Only constant expressions may be used inside the brackets (no columns)
SELECT a['x space'] FROM (SELECT {'x space': 1, 'y': 2, 'z': 3} AS a);
-- The struct_extract function is also equivalent. This returns 1
SELECT struct_extract({'x space': 1, 'y': 2, 'z': 3}, 'x space');
Struct.* Rather than retrieving a single key from a struct, star notation (*) can be used to retrieve all keys from a struct as separate
columns. This is particularly useful when a prior operation creates a struct of unknown shape, or if a query must handle any potential struct
keys.
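For example (a minimal sketch):
SELECT st.* FROM (SELECT {'x': 1, 'y': 2, 'z': 3} AS st);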
x y z
1 2 3
Dot Notation Order of Operations Referring to structs with dot notation can be ambiguous with referring to schemas and tables. In
general, DuckDB looks for columns first, then for struct keys within columns. DuckDB resolves references in these orders, using the first
match to occur:
No Dots
SELECT part1
FROM tbl;
1. part1 is a column
One Dot
SELECT part1.part2
FROM tbl;
Any extra parts (e.g., .part4.part5 etc) are always treated as properties
Creating Structs with the row Function The row function can be used to automatically convert multiple columns to a single struct
column. When using row the keys will be empty strings allowing for easy insertion into a table with a struct column. Columns, however,
cannot be initialized with the row function, and must be explicitly named. For example:
When casting structs, the names of fields have to match. Therefore, the following query will fail:
Error: Mismatch Type Error: Type STRUCT(x INTEGER) does not match with STRUCT(y INTEGER). Cannot cast
STRUCTs with different names
Note. This behavior was introduced in DuckDB v0.9.0. Previously, this query ran successfully and returned struct {'y': 42} as
column b.
Comparison Operators
Nested types can be compared using all the comparison operators. These comparisons can be used in logical expressions for both WHERE
and HAVING clauses, as well as for creating BOOLEAN values.
The ordering is defined positionally in the same way that words can be ordered in a dictionary. NULL values compare greater than all other
values and are considered equal to each other.
At the top level, NULL nested values obey standard SQL NULL comparison rules: comparing a NULL nested value to a non‑NULL nested
value produces a NULL result. Comparing nested value members, however, uses the internal nested value rules for NULLs, and a NULL
nested value member will compare above a non‑NULL nested value member.
Functions
Text Types
In DuckDB, strings can be stored in the VARCHAR field. The field allows storage of Unicode characters. Internally, the data is encoded as
UTF‑8.
Specifying the length for the VARCHAR, STRING, and TEXT types is not required and has no effect on the system. Specifying the length
will not improve performance or reduce storage space of the strings in the database. These variants are supported for compatibility
reasons with other systems that do require a length to be specified for strings.
If you wish to restrict the number of characters in a VARCHAR column for data integrity reasons the CHECK constraint should be used, for
example:
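-- a sketch: restrict val to at most 10 characters (table and column names illustrative)
CREATE TABLE strings (val VARCHAR CHECK (length(val) <= 10));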
Formatting Strings
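Strings are written using single quotes; for instance, a query of this shape (a sketch) produces the output below:
SELECT 'Hello world' AS msg;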
┌─────────────┐
│ msg │
│ varchar │
├─────────────┤
│ Hello world │
└─────────────┘
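A single quote within a string is escaped by doubling it; a query like the following (a sketch) produces the output below:
SELECT 'Hello ''world''' AS msg;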
┌───────────────┐
│ msg │
│ varchar │
├───────────────┤
│ Hello 'world' │
└───────────────┘
To use special characters in strings, use escape string literals or dollar‑quoted string literals. Alternatively, you can use concatenation and the chr character function:
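-- a sketch using concatenation and chr (10 is the code point of the newline character)
SELECT 'Hello' || chr(10) || 'world' AS msg;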
┌──────────────┐
│ msg │
│ varchar │
├──────────────┤
│ Hello\nworld │
└──────────────┘
Double quote characters (") are used to denote table and column names. Surrounding their names allows the use of keywords, e.g.:
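-- a sketch: quoted identifiers may contain keywords and spaces (names illustrative)
CREATE TABLE "my table" ("order" INTEGER);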
While DuckDB occasionally accepts double quotes where a string is expected (e.g., both FROM "filename.csv" and FROM 'filename.csv' work), relying on this is not recommended.
Functions
See the Character Functions and Pattern Matching pages for functions that operate on strings.
Time Types
The TIME and TIMETZ types specify the hour, minute, second, and microsecond of a day.
• TIME, alias TIME WITHOUT TIME ZONE: time of day (ignores time zone)
• TIMETZ, alias TIME WITH TIME ZONE: time of day (uses time zone)
Instances can be created using the type names as a keyword, where the data must be formatted according to the ISO 8601 format
(hh:mm:ss[.zzzzzz][+-TT[:tt]]).
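-- illustrative literals in the ISO 8601 format described above
SELECT TIME '11:30:00.123456';
SELECT TIMETZ '11:30:00.123456+02:00';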
Note. Warning The TIME type should only be used in rare cases, where the date part of the timestamp can be disregarded. Most
applications should use the TIMESTAMP types to represent their timestamps.
Timestamp Types
Timestamps represent points in absolute time, usually called instants. DuckDB represents instants as the number of microseconds (μs)
since 1970-01-01 00:00:00+00.
Timestamp Types
A timestamp specifies a combination of DATE (year, month, day) and a TIME (hour, minute, second, microsecond). Timestamps
can be created using the TIMESTAMP keyword, where the data must be formatted according to the ISO 8601 format (YYYY-MM-DD
hh:mm:ss[.zzzzzz][+-TT[:tt]]).
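-- an illustrative timestamp literal
SELECT TIMESTAMP '1992-09-20 11:30:00.123456';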
Special Values
There are also three special date values that can be used on input:
• epoch: 1970-01-01 00:00:00+00 (the Unix system time zero)
• infinity: later than all other timestamps
• -infinity: earlier than all other timestamps
The values infinity and -infinity are specially represented inside the system and will be displayed unchanged; but epoch is simply a notational shorthand that will be converted to the timestamp value when read.
Functions
See the Timestamp Functions page for functions that operate on timestamps.
Time Zones
The TIMESTAMPTZ type can be binned into calendar and clock bins using a suitable extension. The built‑in ICU extension implements all
the binning and arithmetic functions using the International Components for Unicode time zone and calendar functions.
To set the time zone to use, first load the ICU extension. The ICU extension comes pre‑bundled with several DuckDB clients (including
Python, R, JDBC, and ODBC), so this step can be skipped in those cases. In other cases you might first need to install and load the ICU
extension.
INSTALL icu;
LOAD icu;
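The time zone can then be set; the zone name below is illustrative:
SET TimeZone = 'Europe/Amsterdam';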
Time binning operations for TIMESTAMPTZ will then be implemented using the given time zone.
A list of available time zones can be pulled from the pg_timezone_names() table function:
SELECT
name,
abbrev,
utc_offset
FROM pg_timezone_names()
ORDER BY
name;
Calendars
The ICU extension also supports non‑Gregorian calendars using the SET Calendar command. Note that the INSTALL and LOAD steps
are only required if the DuckDB client does not bundle the ICU extension.
INSTALL icu;
LOAD icu;
SET Calendar = 'japanese';
Time binning operations for TIMESTAMPTZ will then be implemented using the given calendar. In this example, the era part will now
report the Japanese imperial era number.
A list of available calendars can be pulled from the icu_calendar_names() table function:
SELECT name
FROM icu_calendar_names()
ORDER BY 1;
Settings
The current values of the TimeZone and Calendar settings are determined by ICU when it starts up. They can be queried from the duckdb_settings() table function:
SELECT *
FROM duckdb_settings()
WHERE name = 'TimeZone';
┌──────────┬──────────────────┬───────────────────────┬────────────┐
│ name │ value │ description │ input_type │
│ varchar │ varchar │ varchar │ varchar │
├──────────┼──────────────────┼───────────────────────┼────────────┤
│ TimeZone │ Europe/Amsterdam │ The current time zone │ VARCHAR │
└──────────┴──────────────────┴───────────────────────┴────────────┘
SELECT *
FROM duckdb_settings()
WHERE name = 'Calendar';
┌──────────┬───────────┬──────────────────────┬────────────┐
│ name │ value │ description │ input_type │
│ varchar │ varchar │ varchar │ varchar │
├──────────┼───────────┼──────────────────────┼────────────┤
│ Calendar │ gregorian │ The current calendar │ VARCHAR │
└──────────┴───────────┴──────────────────────┴────────────┘
An up‑to‑date version of this list can be pulled from the pg_timezone_names() table function:
name abbrev
ACT ACT
AET AET
AGT AGT
ART ART
AST AST
Africa/Abidjan Iceland
Africa/Accra Iceland
Africa/Addis_Ababa EAT
Africa/Algiers Africa/Algiers
Africa/Asmara EAT
Africa/Asmera EAT
Africa/Bamako Iceland
Africa/Bangui Africa/Bangui
Africa/Banjul Iceland
Africa/Bissau Africa/Bissau
Africa/Blantyre CAT
Africa/Brazzaville Africa/Brazzaville
Africa/Bujumbura CAT
Africa/Cairo ART
Africa/Casablanca Africa/Casablanca
Africa/Ceuta Africa/Ceuta
Africa/Conakry Iceland
Africa/Dakar Iceland
Africa/Dar_es_Salaam EAT
Africa/Djibouti EAT
Africa/Douala Africa/Douala
Africa/El_Aaiun Africa/El_Aaiun
Africa/Freetown Iceland
Africa/Gaborone CAT
Africa/Harare CAT
Africa/Johannesburg Africa/Johannesburg
Africa/Juba Africa/Juba
Africa/Kampala EAT
Africa/Khartoum Africa/Khartoum
Africa/Kigali CAT
Africa/Kinshasa Africa/Kinshasa
Africa/Lagos Africa/Lagos
Africa/Libreville Africa/Libreville
Africa/Lome Iceland
Africa/Luanda Africa/Luanda
Africa/Lubumbashi CAT
Africa/Lusaka CAT
Africa/Malabo Africa/Malabo
Africa/Maputo CAT
Africa/Maseru Africa/Maseru
Africa/Mbabane Africa/Mbabane
Africa/Mogadishu EAT
Africa/Monrovia Africa/Monrovia
Africa/Nairobi EAT
Africa/Ndjamena Africa/Ndjamena
Africa/Niamey Africa/Niamey
Africa/Nouakchott Iceland
Africa/Ouagadougou Iceland
Africa/Porto‑Novo Africa/Porto‑Novo
Africa/Sao_Tome Africa/Sao_Tome
Africa/Timbuktu Iceland
Africa/Tripoli Libya
Africa/Tunis Africa/Tunis
Africa/Windhoek Africa/Windhoek
America/Adak America/Adak
America/Anchorage AST
America/Anguilla PRT
America/Antigua PRT
America/Araguaina America/Araguaina
America/Argentina/Buenos_Aires AGT
America/Argentina/Catamarca America/Argentina/Catamarca
America/Argentina/ComodRivadavia America/Argentina/ComodRivadavia
America/Argentina/Cordoba America/Argentina/Cordoba
America/Argentina/Jujuy America/Argentina/Jujuy
America/Argentina/La_Rioja America/Argentina/La_Rioja
America/Argentina/Mendoza America/Argentina/Mendoza
America/Argentina/Rio_Gallegos America/Argentina/Rio_Gallegos
America/Argentina/Salta America/Argentina/Salta
America/Argentina/San_Juan America/Argentina/San_Juan
America/Argentina/San_Luis America/Argentina/San_Luis
America/Argentina/Tucuman America/Argentina/Tucuman
America/Argentina/Ushuaia America/Argentina/Ushuaia
America/Aruba PRT
America/Asuncion America/Asuncion
America/Atikokan America/Atikokan
America/Atka America/Atka
America/Bahia America/Bahia
America/Bahia_Banderas America/Bahia_Banderas
America/Barbados America/Barbados
America/Belem America/Belem
America/Belize America/Belize
America/Blanc‑Sablon PRT
America/Boa_Vista America/Boa_Vista
America/Bogota America/Bogota
America/Boise America/Boise
America/Buenos_Aires AGT
America/Cambridge_Bay America/Cambridge_Bay
America/Campo_Grande America/Campo_Grande
America/Cancun America/Cancun
America/Caracas America/Caracas
America/Catamarca America/Catamarca
America/Cayenne America/Cayenne
America/Cayman America/Cayman
America/Chicago CST
America/Chihuahua America/Chihuahua
America/Ciudad_Juarez America/Ciudad_Juarez
America/Coral_Harbour America/Coral_Harbour
America/Cordoba America/Cordoba
America/Costa_Rica America/Costa_Rica
America/Creston PNT
America/Cuiaba America/Cuiaba
America/Curacao PRT
America/Danmarkshavn America/Danmarkshavn
America/Dawson America/Dawson
America/Dawson_Creek America/Dawson_Creek
America/Denver Navajo
America/Detroit America/Detroit
America/Dominica PRT
America/Edmonton America/Edmonton
America/Eirunepe America/Eirunepe
America/El_Salvador America/El_Salvador
America/Ensenada America/Ensenada
America/Fort_Nelson America/Fort_Nelson
America/Fort_Wayne IET
America/Fortaleza America/Fortaleza
America/Glace_Bay America/Glace_Bay
America/Godthab America/Godthab
America/Goose_Bay America/Goose_Bay
America/Grand_Turk America/Grand_Turk
America/Grenada PRT
America/Guadeloupe PRT
America/Guatemala America/Guatemala
America/Guayaquil America/Guayaquil
America/Guyana America/Guyana
America/Halifax America/Halifax
America/Havana Cuba
America/Hermosillo America/Hermosillo
America/Indiana/Indianapolis IET
America/Indiana/Knox America/Indiana/Knox
America/Indiana/Marengo America/Indiana/Marengo
America/Indiana/Petersburg America/Indiana/Petersburg
America/Indiana/Tell_City America/Indiana/Tell_City
America/Indiana/Vevay America/Indiana/Vevay
America/Indiana/Vincennes America/Indiana/Vincennes
America/Indiana/Winamac America/Indiana/Winamac
America/Indianapolis IET
America/Inuvik America/Inuvik
America/Iqaluit America/Iqaluit
America/Jamaica Jamaica
America/Jujuy America/Jujuy
America/Juneau America/Juneau
America/Kentucky/Louisville America/Kentucky/Louisville
America/Kentucky/Monticello America/Kentucky/Monticello
America/Knox_IN America/Knox_IN
America/Kralendijk PRT
America/La_Paz America/La_Paz
America/Lima America/Lima
America/Los_Angeles PST
America/Louisville America/Louisville
America/Lower_Princes PRT
America/Maceio America/Maceio
America/Managua America/Managua
America/Manaus America/Manaus
America/Marigot PRT
America/Martinique America/Martinique
America/Matamoros America/Matamoros
America/Mazatlan America/Mazatlan
America/Mendoza America/Mendoza
America/Menominee America/Menominee
America/Merida America/Merida
America/Metlakatla America/Metlakatla
America/Mexico_City America/Mexico_City
America/Miquelon America/Miquelon
America/Moncton America/Moncton
America/Monterrey America/Monterrey
America/Montevideo America/Montevideo
America/Montreal America/Montreal
America/Montserrat PRT
America/Nassau America/Nassau
America/New_York America/New_York
America/Nipigon America/Nipigon
America/Nome America/Nome
America/Noronha America/Noronha
America/North_Dakota/Beulah America/North_Dakota/Beulah
America/North_Dakota/Center America/North_Dakota/Center
America/North_Dakota/New_Salem America/North_Dakota/New_Salem
America/Nuuk America/Nuuk
America/Ojinaga America/Ojinaga
America/Panama America/Panama
America/Pangnirtung America/Pangnirtung
America/Paramaribo America/Paramaribo
America/Phoenix PNT
America/Port‑au‑Prince America/Port‑au‑Prince
America/Port_of_Spain PRT
America/Porto_Acre America/Porto_Acre
America/Porto_Velho America/Porto_Velho
America/Puerto_Rico PRT
America/Punta_Arenas America/Punta_Arenas
America/Rainy_River America/Rainy_River
America/Rankin_Inlet America/Rankin_Inlet
America/Recife America/Recife
America/Regina America/Regina
America/Resolute America/Resolute
America/Rio_Branco America/Rio_Branco
America/Rosario America/Rosario
America/Santa_Isabel America/Santa_Isabel
America/Santarem America/Santarem
America/Santiago America/Santiago
America/Santo_Domingo America/Santo_Domingo
America/Sao_Paulo BET
America/Scoresbysund America/Scoresbysund
America/Shiprock Navajo
America/Sitka America/Sitka
America/St_Barthelemy PRT
America/St_Johns CNT
America/St_Kitts PRT
America/St_Lucia PRT
America/St_Thomas PRT
America/St_Vincent PRT
America/Swift_Current America/Swift_Current
America/Tegucigalpa America/Tegucigalpa
America/Thule America/Thule
America/Thunder_Bay America/Thunder_Bay
America/Tijuana America/Tijuana
America/Toronto America/Toronto
America/Tortola PRT
America/Vancouver America/Vancouver
America/Virgin PRT
America/Whitehorse America/Whitehorse
America/Winnipeg America/Winnipeg
America/Yakutat America/Yakutat
America/Yellowknife America/Yellowknife
Antarctica/Casey Antarctica/Casey
Antarctica/Davis Antarctica/Davis
Antarctica/DumontDUrville Antarctica/DumontDUrville
Antarctica/Macquarie Antarctica/Macquarie
Antarctica/Mawson Antarctica/Mawson
Antarctica/McMurdo NZ
Antarctica/Palmer Antarctica/Palmer
Antarctica/Rothera Antarctica/Rothera
Antarctica/South_Pole NZ
Antarctica/Syowa Antarctica/Syowa
Antarctica/Troll Antarctica/Troll
Antarctica/Vostok Antarctica/Vostok
Arctic/Longyearbyen Arctic/Longyearbyen
Asia/Aden Asia/Aden
Asia/Almaty Asia/Almaty
Asia/Amman Asia/Amman
Asia/Anadyr Asia/Anadyr
Asia/Aqtau Asia/Aqtau
Asia/Aqtobe Asia/Aqtobe
Asia/Ashgabat Asia/Ashgabat
Asia/Ashkhabad Asia/Ashkhabad
Asia/Atyrau Asia/Atyrau
Asia/Baghdad Asia/Baghdad
Asia/Bahrain Asia/Bahrain
Asia/Baku Asia/Baku
Asia/Bangkok Asia/Bangkok
Asia/Barnaul Asia/Barnaul
Asia/Beirut Asia/Beirut
Asia/Bishkek Asia/Bishkek
Asia/Brunei Asia/Brunei
Asia/Calcutta IST
Asia/Chita Asia/Chita
Asia/Choibalsan Asia/Choibalsan
Asia/Chongqing CTT
Asia/Chungking CTT
Asia/Colombo Asia/Colombo
Asia/Dacca BST
Asia/Damascus Asia/Damascus
Asia/Dhaka BST
Asia/Dili Asia/Dili
Asia/Dubai Asia/Dubai
Asia/Dushanbe Asia/Dushanbe
Asia/Famagusta Asia/Famagusta
Asia/Gaza Asia/Gaza
Asia/Harbin CTT
Asia/Hebron Asia/Hebron
Asia/Ho_Chi_Minh VST
Asia/Hong_Kong Hongkong
Asia/Hovd Asia/Hovd
Asia/Irkutsk Asia/Irkutsk
Asia/Istanbul Turkey
Asia/Jakarta Asia/Jakarta
Asia/Jayapura Asia/Jayapura
Asia/Jerusalem Israel
Asia/Kabul Asia/Kabul
Asia/Kamchatka Asia/Kamchatka
Asia/Karachi PLT
Asia/Kashgar Asia/Kashgar
Asia/Kathmandu Asia/Kathmandu
Asia/Katmandu Asia/Katmandu
Asia/Khandyga Asia/Khandyga
Asia/Kolkata IST
Asia/Krasnoyarsk Asia/Krasnoyarsk
Asia/Kuala_Lumpur Singapore
Asia/Kuching Asia/Kuching
Asia/Kuwait Asia/Kuwait
Asia/Macao Asia/Macao
Asia/Macau Asia/Macau
Asia/Magadan Asia/Magadan
Asia/Makassar Asia/Makassar
Asia/Manila Asia/Manila
Asia/Muscat Asia/Muscat
Asia/Nicosia Asia/Nicosia
Asia/Novokuznetsk Asia/Novokuznetsk
Asia/Novosibirsk Asia/Novosibirsk
Asia/Omsk Asia/Omsk
Asia/Oral Asia/Oral
Asia/Phnom_Penh Asia/Phnom_Penh
Asia/Pontianak Asia/Pontianak
Asia/Pyongyang Asia/Pyongyang
Asia/Qatar Asia/Qatar
Asia/Qostanay Asia/Qostanay
Asia/Qyzylorda Asia/Qyzylorda
Asia/Rangoon Asia/Rangoon
Asia/Riyadh Asia/Riyadh
Asia/Saigon VST
Asia/Sakhalin Asia/Sakhalin
Asia/Samarkand Asia/Samarkand
Asia/Seoul ROK
Asia/Shanghai CTT
Asia/Singapore Singapore
Asia/Srednekolymsk Asia/Srednekolymsk
Asia/Taipei ROC
Asia/Tashkent Asia/Tashkent
Asia/Tbilisi Asia/Tbilisi
Asia/Tehran Iran
Asia/Tel_Aviv Israel
Asia/Thimbu Asia/Thimbu
Asia/Thimphu Asia/Thimphu
Asia/Tokyo JST
Asia/Tomsk Asia/Tomsk
Asia/Ujung_Pandang Asia/Ujung_Pandang
Asia/Ulaanbaatar Asia/Ulaanbaatar
Asia/Ulan_Bator Asia/Ulan_Bator
Asia/Urumqi Asia/Urumqi
Asia/Ust‑Nera Asia/Ust‑Nera
Asia/Vientiane Asia/Vientiane
Asia/Vladivostok Asia/Vladivostok
Asia/Yakutsk Asia/Yakutsk
Asia/Yangon Asia/Yangon
Asia/Yekaterinburg Asia/Yekaterinburg
Asia/Yerevan NET
Atlantic/Azores Atlantic/Azores
Atlantic/Bermuda Atlantic/Bermuda
Atlantic/Canary Atlantic/Canary
Atlantic/Cape_Verde Atlantic/Cape_Verde
Atlantic/Faeroe Atlantic/Faeroe
Atlantic/Faroe Atlantic/Faroe
Atlantic/Jan_Mayen Atlantic/Jan_Mayen
Atlantic/Madeira Atlantic/Madeira
Atlantic/Reykjavik Iceland
Atlantic/South_Georgia Atlantic/South_Georgia
Atlantic/St_Helena Iceland
Atlantic/Stanley Atlantic/Stanley
Australia/ACT AET
Australia/Adelaide Australia/Adelaide
Australia/Brisbane Australia/Brisbane
Australia/Broken_Hill Australia/Broken_Hill
Australia/Canberra AET
Australia/Currie Australia/Currie
Australia/Darwin ACT
Australia/Eucla Australia/Eucla
Australia/Hobart Australia/Hobart
Australia/LHI Australia/LHI
Australia/Lindeman Australia/Lindeman
Australia/Lord_Howe Australia/Lord_Howe
Australia/Melbourne Australia/Melbourne
Australia/NSW AET
Australia/North ACT
Australia/Perth Australia/Perth
Australia/Queensland Australia/Queensland
Australia/South Australia/South
Australia/Sydney AET
Australia/Tasmania Australia/Tasmania
Australia/Victoria Australia/Victoria
Australia/West Australia/West
Australia/Yancowinna Australia/Yancowinna
BET BET
BST BST
Brazil/Acre Brazil/Acre
Brazil/DeNoronha Brazil/DeNoronha
Brazil/East BET
Brazil/West Brazil/West
CAT CAT
CET CET
CNT CNT
CST CST
CST6CDT CST6CDT
CTT CTT
Canada/Atlantic Canada/Atlantic
Canada/Central Canada/Central
Canada/East‑Saskatchewan Canada/East‑Saskatchewan
Canada/Eastern Canada/Eastern
Canada/Mountain Canada/Mountain
Canada/Newfoundland CNT
Canada/Pacific Canada/Pacific
Canada/Saskatchewan Canada/Saskatchewan
Canada/Yukon Canada/Yukon
Chile/Continental Chile/Continental
Chile/EasterIsland Chile/EasterIsland
Cuba Cuba
EAT EAT
ECT ECT
EET EET
EST EST
EST5EDT EST5EDT
Egypt ART
Eire Eire
Etc/GMT GMT
Etc/GMT+0 GMT
Etc/GMT+1 Etc/GMT+1
Etc/GMT+10 Etc/GMT+10
Etc/GMT+11 Etc/GMT+11
Etc/GMT+12 Etc/GMT+12
Etc/GMT+2 Etc/GMT+2
Etc/GMT+3 Etc/GMT+3
Etc/GMT+4 Etc/GMT+4
Etc/GMT+5 Etc/GMT+5
Etc/GMT+6 Etc/GMT+6
Etc/GMT+7 Etc/GMT+7
Etc/GMT+8 Etc/GMT+8
Etc/GMT+9 Etc/GMT+9
Etc/GMT‑0 GMT
Etc/GMT‑1 Etc/GMT‑1
Etc/GMT‑10 Etc/GMT‑10
Etc/GMT‑11 Etc/GMT‑11
Etc/GMT‑12 Etc/GMT‑12
Etc/GMT‑13 Etc/GMT‑13
Etc/GMT‑14 Etc/GMT‑14
Etc/GMT‑2 Etc/GMT‑2
Etc/GMT‑3 Etc/GMT‑3
Etc/GMT‑4 Etc/GMT‑4
Etc/GMT‑5 Etc/GMT‑5
Etc/GMT‑6 Etc/GMT‑6
Etc/GMT‑7 Etc/GMT‑7
Etc/GMT‑8 Etc/GMT‑8
Etc/GMT‑9 Etc/GMT‑9
Etc/GMT0 GMT
Etc/Greenwich GMT
Etc/UCT UCT
Etc/UTC UCT
Etc/Universal UCT
Etc/Zulu UCT
Europe/Amsterdam Europe/Amsterdam
Europe/Andorra Europe/Andorra
Europe/Astrakhan Europe/Astrakhan
Europe/Athens Europe/Athens
Europe/Belfast GB
Europe/Belgrade Europe/Belgrade
Europe/Berlin Europe/Berlin
Europe/Bratislava Europe/Bratislava
Europe/Brussels Europe/Brussels
Europe/Bucharest Europe/Bucharest
Europe/Budapest Europe/Budapest
Europe/Busingen Europe/Busingen
Europe/Chisinau Europe/Chisinau
Europe/Copenhagen Europe/Copenhagen
Europe/Dublin Eire
Europe/Gibraltar Europe/Gibraltar
Europe/Guernsey GB
Europe/Helsinki Europe/Helsinki
Europe/Isle_of_Man GB
Europe/Istanbul Turkey
Europe/Jersey GB
Europe/Kaliningrad Europe/Kaliningrad
Europe/Kiev Europe/Kiev
Europe/Kirov Europe/Kirov
Europe/Kyiv Europe/Kyiv
Europe/Lisbon Portugal
Europe/Ljubljana Europe/Ljubljana
Europe/London GB
Europe/Luxembourg Europe/Luxembourg
Europe/Madrid Europe/Madrid
Europe/Malta Europe/Malta
Europe/Mariehamn Europe/Mariehamn
Europe/Minsk Europe/Minsk
Europe/Monaco ECT
Europe/Moscow W‑SU
Europe/Nicosia Europe/Nicosia
Europe/Oslo Europe/Oslo
Europe/Paris ECT
Europe/Podgorica Europe/Podgorica
Europe/Prague Europe/Prague
Europe/Riga Europe/Riga
Europe/Rome Europe/Rome
Europe/Samara Europe/Samara
Europe/San_Marino Europe/San_Marino
Europe/Sarajevo Europe/Sarajevo
Europe/Saratov Europe/Saratov
Europe/Simferopol Europe/Simferopol
Europe/Skopje Europe/Skopje
Europe/Sofia Europe/Sofia
Europe/Stockholm Europe/Stockholm
Europe/Tallinn Europe/Tallinn
Europe/Tirane Europe/Tirane
Europe/Tiraspol Europe/Tiraspol
Europe/Ulyanovsk Europe/Ulyanovsk
Europe/Uzhgorod Europe/Uzhgorod
Europe/Vaduz Europe/Vaduz
Europe/Vatican Europe/Vatican
Europe/Vienna Europe/Vienna
Europe/Vilnius Europe/Vilnius
Europe/Volgograd Europe/Volgograd
Europe/Warsaw Poland
Europe/Zagreb Europe/Zagreb
Europe/Zaporozhye Europe/Zaporozhye
Europe/Zurich Europe/Zurich
Factory Factory
GB GB
GB‑Eire GB
GMT GMT
GMT+0 GMT
GMT‑0 GMT
GMT0 GMT
Greenwich GMT
HST HST
Hongkong Hongkong
IET IET
IST IST
Iceland Iceland
Indian/Antananarivo EAT
Indian/Chagos Indian/Chagos
Indian/Christmas Indian/Christmas
Indian/Cocos Indian/Cocos
Indian/Comoro EAT
Indian/Kerguelen Indian/Kerguelen
Indian/Mahe Indian/Mahe
Indian/Maldives Indian/Maldives
Indian/Mauritius Indian/Mauritius
Indian/Mayotte EAT
Indian/Reunion Indian/Reunion
Iran Iran
Israel Israel
JST JST
Jamaica Jamaica
Japan JST
Kwajalein Kwajalein
Libya Libya
MET MET
MIT MIT
MST MST
MST7MDT MST7MDT
Mexico/BajaNorte Mexico/BajaNorte
Mexico/BajaSur Mexico/BajaSur
Mexico/General Mexico/General
NET NET
NST NZ
NZ NZ
NZ‑CHAT NZ‑CHAT
Navajo Navajo
PLT PLT
PNT PNT
PRC CTT
PRT PRT
PST PST
PST8PDT PST8PDT
Pacific/Apia MIT
Pacific/Auckland NZ
Pacific/Bougainville Pacific/Bougainville
Pacific/Chatham NZ‑CHAT
Pacific/Chuuk Pacific/Chuuk
Pacific/Easter Pacific/Easter
Pacific/Efate Pacific/Efate
Pacific/Enderbury Pacific/Enderbury
Pacific/Fakaofo Pacific/Fakaofo
Pacific/Fiji Pacific/Fiji
Pacific/Funafuti Pacific/Funafuti
Pacific/Galapagos Pacific/Galapagos
Pacific/Gambier Pacific/Gambier
Pacific/Guadalcanal SST
Pacific/Guam Pacific/Guam
Pacific/Honolulu Pacific/Honolulu
Pacific/Johnston Pacific/Johnston
Pacific/Kanton Pacific/Kanton
Pacific/Kiritimati Pacific/Kiritimati
Pacific/Kosrae Pacific/Kosrae
Pacific/Kwajalein Kwajalein
Pacific/Majuro Pacific/Majuro
Pacific/Marquesas Pacific/Marquesas
Pacific/Midway Pacific/Midway
Pacific/Nauru Pacific/Nauru
Pacific/Niue Pacific/Niue
Pacific/Norfolk Pacific/Norfolk
Pacific/Noumea Pacific/Noumea
Pacific/Pago_Pago Pacific/Pago_Pago
Pacific/Palau Pacific/Palau
Pacific/Pitcairn Pacific/Pitcairn
Pacific/Pohnpei SST
Pacific/Ponape SST
Pacific/Port_Moresby Pacific/Port_Moresby
Pacific/Rarotonga Pacific/Rarotonga
Pacific/Saipan Pacific/Saipan
Pacific/Samoa Pacific/Samoa
Pacific/Tahiti Pacific/Tahiti
Pacific/Tarawa Pacific/Tarawa
Pacific/Tongatapu Pacific/Tongatapu
Pacific/Truk Pacific/Truk
Pacific/Wake Pacific/Wake
Pacific/Wallis Pacific/Wallis
Pacific/Yap Pacific/Yap
Poland Poland
Portugal Portugal
ROC ROC
ROK ROK
SST SST
Singapore Singapore
SystemV/AST4 SystemV/AST4
SystemV/AST4ADT SystemV/AST4ADT
SystemV/CST6 SystemV/CST6
SystemV/CST6CDT SystemV/CST6CDT
SystemV/EST5 SystemV/EST5
SystemV/EST5EDT SystemV/EST5EDT
SystemV/HST10 SystemV/HST10
SystemV/MST7 SystemV/MST7
SystemV/MST7MDT SystemV/MST7MDT
SystemV/PST8 SystemV/PST8
SystemV/PST8PDT SystemV/PST8PDT
SystemV/YST9 SystemV/YST9
SystemV/YST9YDT SystemV/YST9YDT
Turkey Turkey
UCT UCT
US/Alaska AST
US/Aleutian US/Aleutian
US/Arizona PNT
US/Central CST
US/East‑Indiana IET
US/Eastern US/Eastern
US/Hawaii US/Hawaii
US/Indiana‑Starke US/Indiana‑Starke
US/Michigan US/Michigan
US/Mountain Navajo
US/Pacific PST
US/Pacific‑New PST
US/Samoa US/Samoa
UTC UCT
Universal UCT
VST VST
W‑SU W‑SU
WET WET
Zulu UCT
Union Type
A UNION type (not to be confused with the SQL UNION operator) is a nested type capable of holding one of multiple ”alternative” values, much like the union in C. The main difference is that these UNION types are tagged unions and thus always carry a discriminator ”tag” which signals which alternative they are currently holding, even if the inner value itself is null. UNION types are thus more similar to C++17's std::variant, Rust's enum, or the ”sum type” present in most functional languages.
UNION types must always have at least one member, and while they can contain multiple members of the same type, the tag names must
be unique. UNION types can have at most 256 members.
Under the hood, UNION types are implemented on top of STRUCT types, and simply keep the ”tag” as the first entry.
UNION values can be created with the union_value(tag := expr) function or by casting from a member type.
Example
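-- a sketch: a union column and three ways to insert into it (names illustrative)
CREATE TABLE tbl1 (u UNION(num INTEGER, str VARCHAR));
INSERT INTO tbl1 VALUES (1), ('two'), (union_value(str := 'three'));
SELECT u FROM tbl1;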
Union Casts
Compared to other nested types, UNIONs allow a set of implicit casts to facilitate unintrusive and natural usage when working with their members as ”subtypes”. However, these casts have been designed with two principles in mind: to avoid ambiguity and to avoid casts that could lead to loss of information. This prevents UNIONs from being completely ”transparent”, while still allowing UNION types to have a ”supertype” relationship with their members.
Thus UNION types can't be implicitly cast to any of their member types in general, since the information in the other members not matching
the target type would be ”lost”. If you want to coerce a UNION into one of its members, you should use the union_extract function
explicitly instead.
The only exception to this is when casting a UNION to VARCHAR, in which case the members will all use their corresponding VARCHAR
casts. Since everything can be cast to VARCHAR, this is ”safe” in a sense.
Casting to Unions A type can always be implicitly cast to a UNION if it can be implicitly cast to one of the UNION member types.
• If there are multiple candidates, the built‑in implicit casting priority rules determine the target type. For example, a FLOAT -> UNION(i INT, v VARCHAR) cast will always cast the FLOAT to the INT member before VARCHAR.
• If the cast still is ambiguous, i.e., there are multiple candidates with the same implicit casting priority, an error is raised. This usually
happens when the UNION contains multiple members of the same type, e.g., a FLOAT -> UNION(i INT, num INT) is always
ambiguous.
So how do we disambiguate if we want to create a UNION with multiple members of the same type? By using the union_value function,
which takes a keyword argument specifying the tag. For example, union_value(num := 2::INT) will create a UNION with a single
member of type INT with the tag num. This can then be used to disambiguate in an explicit (or implicit, read on below!) UNION to UNION
cast, like CAST(union_value(b := 2) AS UNION(a INT, b INT)).
Casting between Unions UNION types can be cast between each other if the source type is a ”subset” of the target type. In other words,
all the tags in the source UNION must be present in the target UNION, and all the types of the matching tags must be implicitly castable
between source and target. In essence, this means that UNION types are covariant with respect to their members.
The following casts are allowed:
• UNION(a A, b B) to UNION(a A, b B, c C)
• UNION(a A, b B) to UNION(a A, b C) if B can be implicitly cast to C
The following casts are not allowed:
• UNION(a A, b B, c C) to UNION(a A, b B)
• UNION(a A, b B) to UNION(a A, b C) if B can't be implicitly cast to C
• UNION(A, B, D) to UNION(A, B, C)
Since UNION types are implemented on top of STRUCT types internally, they can be used with all the comparison operators as well as in
both WHERE and HAVING clauses with the same semantics as STRUCTs. The ”tag” is always stored as the first struct entry, which ensures
that the UNION types are compared and ordered by ”tag” first.
Functions
See the Union Functions section of the Nested Functions page.
Typecasting
Typecasting is an operation that converts a value in one particular data type to the closest corresponding value in another data type. Like
other SQL engines, DuckDB supports both implicit and explicit typecasting.
Explicit Casting
Explicit typecasting is performed by using a CAST expression. For example, CAST(col AS VARCHAR) or col::VARCHAR explicitly
cast the column col to VARCHAR. See the cast page for more information.
Implicit Casting
In many situations, the system will add casts by itself. This is called implicit casting. This happens, for example, when a function is called with an argument that does not match the type of the function but can be cast to the desired type.
Consider the function sin(DOUBLE). This function takes as input argument a column of type DOUBLE, however, it can be called with an
integer as well: sin(1). The integer is converted into a double before being passed to the sin function.
Implicit casts can only be added for a number of type combinations, and are generally only possible when the cast cannot fail. For example, an implicit cast can be added from INT to DOUBLE, but not from DOUBLE to INT.
Values of a particular data type cannot always be cast to any arbitrary target data type. The only exception is the NULL value, which can always be converted between types. The following matrix describes which conversions are supported. When implicit casting is allowed, it implies that explicit casting is also possible.
Even though a casting operation is supported based on the source and target data type, it does not necessarily mean the cast operation
will succeed at runtime.
Note. Deprecated Prior to version 0.10.0, DuckDB allowed any type to be implicitly cast to VARCHAR during function binding. Version 0.10.0 introduced a breaking change which no longer allows implicit casts to VARCHAR. The old_implicit_casting configuration option setting can be used to revert to the old behavior. However, please note that this flag will be deprecated in the future.
Lossy Casts Casting operations that result in loss of precision are allowed. For example, it is possible to explicitly cast a numeric type
with fractional digits like DECIMAL, FLOAT or DOUBLE to an integral type like INTEGER. The number will be rounded.
Overflows Casting operations that would result in a value overflow throw an error. For example, the value 999 is too large to be represented by the TINYINT data type. Therefore, an attempt to cast that value to that type results in a runtime error:
So even though the cast operation from INTEGER to TINYINT is supported, it is not possible for this particular value. TRY_CAST can be
used to convert the value into NULL instead of throwing an error.
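-- a sketch of the overflow described above
SELECT CAST(999 AS TINYINT); -- throws a conversion error
SELECT TRY_CAST(999 AS TINYINT); -- returns NULL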
Varchar The VARCHAR type acts as a universal target: any arbitrary value of any arbitrary type can always be cast to the VARCHAR type. This type is also used for displaying values in the shell.
Casting from VARCHAR to another data type is supported, but can raise an error at runtime if DuckDB cannot parse and convert the provided
text to the target data type.
In general, casting to VARCHAR is a lossless operation and any type can be cast back to the original type after being converted into text.
Literal Types Integer literals (such as 42) and string literals (such as 'string') have special implicit casting rules. See the literal types
page for more information.
Lists / Arrays Lists can be explicitly cast to other lists using the same casting rules. The cast is applied to the children of the list. For example, if we convert an INT[] list to a VARCHAR[] list, the child INT elements are individually cast to VARCHAR and a new list is constructed.
Arrays Arrays follow the same casting rules as lists. In addition, arrays can be implicitly cast to lists of the same type. For example, an
INT[3] array can be implicitly cast to an INT[] list.
Structs Structs can be cast to other structs as long as the names of the child elements match.
The names of the struct's fields can also be in a different order; the fields will be reshuffled based on their names.
Unions Union casting rules can be found on the UNION type page.
Expressions
Expressions
An expression is a combination of values, operators and functions. Expressions are highly composable, and range from very simple to
arbitrarily complex. They can be found in many different parts of SQL statements. In this section, we provide the different types of operators
and functions that can be used within expressions.
CASE Statement
The CASE statement performs a switch based on a condition. The basic form is identical to the ternary condition used in many programming
languages (CASE WHEN cond THEN a ELSE b END is equivalent to cond ? a : b). With a single condition this can be expressed
with IF(cond, a, b).
The WHEN cond THEN expr part of the CASE statement can be chained: whenever any of the conditions returns true for a single tuple, the corresponding expression is evaluated and returned.
The ELSE part of the CASE statement is optional. If no else statement is provided and none of the conditions match, the CASE statement
will return NULL.
After the CASE but before the WHEN an individual expression can also be provided. When this is done, the CASE statement is essentially
transformed into a switch statement.
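For instance (a sketch, using literal values for brevity):
SELECT CASE WHEN 1 > 0 THEN 'positive' ELSE 'non-positive' END; -- positive
SELECT CASE 'b' WHEN 'a' THEN 1 WHEN 'b' THEN 2 ELSE 3 END; -- 2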
Casting
Casting refers to the operation of converting a value in a particular data type to the corresponding value in another data type. Casting can
occur either implicitly or explicitly. The syntax described here performs an explicit cast. More information on casting can be found on the
typecasting page.
Explicit Casting
The standard SQL syntax for explicit casting is CAST(expr AS TYPENAME), where TYPENAME is a name (or alias) of one of DuckDB's
data types. DuckDB also supports the shorthand expr::TYPENAME, which is also present in PostgreSQL.
Casting Rules Not all casts are possible. For example, it is not possible to convert an INTEGER to a DATE. Casts may also throw errors
when the cast could not be successfully performed. For example, trying to cast the string 'hello' to an INTEGER will result in an error
being thrown.
The exact behavior of the cast depends on the source and destination types. For example, when casting from VARCHAR to any other type, an attempt is made to convert the string to the target type.
TRY_CAST TRY_CAST can be used when the preferred behavior is not to throw an error, but instead to return a NULL value. TRY_CAST
will never throw an error, and will instead return NULL if a cast is not possible.
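-- a sketch: an unparseable cast returns NULL instead of raising an error
SELECT TRY_CAST('hello' AS INTEGER); -- NULL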
Collations
Collations provide rules for how text should be sorted or compared in the execution engine. Collations are useful for localization, as the
rules for how text should be ordered are different for different languages or for different countries. These orderings are often incompatible
with one another. For example, in English the letter ”y” comes between ”x” and ”z”. However, in Lithuanian the letter ”y” comes between
the ”i” and ”j”. For that reason, different collations are supported. The user must choose which collation they want to use when performing
sorting and comparison operations.
By default, the BINARY collation is used. That means that strings are ordered and compared based only on their binary contents. This
makes sense for standard ASCII characters (i.e., the letters A‑Z and numbers 0‑9), but generally does not make much sense for special
unicode characters. It is, however, by far the fastest method of performing ordering and comparisons. Hence it is recommended to stick
with the BINARY collation unless required otherwise.
Using Collations
In the stand‑alone installation of DuckDB three collations are included: NOCASE, NOACCENT and NFC. The NOCASE collation compares
characters as equal regardless of their casing. The NOACCENT collation compares characters as equal regardless of their accents. The NFC
collation performs NFC‑normalized comparisons, see Unicode normalization for more information.
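For instance (a sketch of a collation-aware comparison):
SELECT 'hello' = 'hElLO'; -- false
SELECT 'hello' = 'hElLO' COLLATE NOCASE; -- true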
Collations can be combined by chaining them using the dot operator. Note, however, that not all collations can be combined together. In
general, the NOCASE collation can be combined with any other collator, but most other collations cannot be combined.
Default Collations
The collations we have seen so far have all been specified per expression. It is also possible to specify a default collator, either on the global
database level or on a base table column. The PRAGMA default_collation can be used to specify the global default collator. This is
the collator that will be used if no other one is specified.
Collations can also be specified per‑column when creating a table. When that column is then used in a comparison, the per‑column collation
is used to perform that comparison.
Be careful here, however, as different collations cannot be combined. This can be problematic when you want to compare columns that
have a different collation specified.
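The queries below assume a setup along these lines (a sketch; the per‑column collations are the essential part):
CREATE TABLE names (name VARCHAR COLLATE NOACCENT);
INSERT INTO names VALUES ('hännes');
CREATE TABLE other_names (name VARCHAR COLLATE NOCASE);
INSERT INTO other_names VALUES ('HÄNNES');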
SELECT name
FROM names
WHERE name = 'hannes' COLLATE NOCASE;
-- ERROR: Cannot combine types with different collation!
SELECT *
FROM names, other_names
WHERE names.name = other_names.name;
-- ERROR: Cannot combine types with different collation!
SELECT *
FROM names, other_names
WHERE names.name COLLATE NOACCENT.NOCASE = other_names.name COLLATE NOACCENT.NOCASE;
-- hännes|HÄNNES
ICU Collations
The collations we have seen so far are not region‑dependent, and do not follow any specific regional rules. If you wish to follow the rules
of a specific region or language, you will need to use one of the ICU collations. For that, you need to load the ICU extension.
If you are using the C++ API, you may find the extension in the extension/icu folder of the DuckDB project. Using the C++ API, the
extension can be loaded as follows:
DuckDB db;
db.LoadExtension<ICUExtension>();
Loading this extension will add a number of language‑ and region‑specific collations to your database. These can be queried using the PRAGMA collations command, or by querying the pragma_collations function.
PRAGMA collations;
SELECT * FROM pragma_collations();
-- [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en,
en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb,
hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml,
mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl,
smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz,
vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]
These collations can then be used as the other collations would be used before. They can also be combined with the NOCASE collation.
For example, to use the German collation rules you could use the following code snippet:
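-- a sketch, assuming the ICU collations are loaded; values are illustrative
SELECT * FROM (VALUES ('Gabel'), ('Göbel'), ('Goethe')) t(word)
ORDER BY word COLLATE de;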
Comparisons
Comparison Operators
The table below shows the standard comparison operators. Whenever either of the input arguments is NULL, the output of the comparison
is NULL.
The table below shows the standard distinction operators. These operators treat NULL values as equal.
• IS DISTINCT FROM: not equal, treating NULL as a comparable value. Example: 2 IS DISTINCT FROM NULL returns true.
• IS NOT DISTINCT FROM: equal, treating NULL as a comparable value. Example: NULL IS NOT DISTINCT FROM NULL returns true.
Besides the standard comparison operators there are also the BETWEEN and IS (NOT) NULL operators. These behave much like comparison operators, but have special syntax mandated by the SQL standard. They are shown in the table below.
Note that BETWEEN and NOT BETWEEN are only equivalent to the examples below in the cases where a, x, and y are all of the same type, as BETWEEN will cast all of its inputs to the same type.
Note. For the expression BETWEEN x AND y, x is used as the lower bound and y is used as the upper bound. Therefore, if x >
y, the result will always be false.
IN Operator
The IN operator checks containment of the left expression inside the set of expressions on the right hand side (RHS). The IN operator
returns true if the expression is present in the RHS, false if the expression is not in the RHS and the RHS has no NULL values, or NULL if the
expression is not in the RHS and the RHS has NULL values.
NOT IN can be used to check if an element is not present in the set. X NOT IN Y is equivalent to NOT(X IN Y).
The IN operator can also be used with a subquery that returns a single column. See the subqueries page for more information.
Logical Operators
The following logical operators are available: AND, OR and NOT. SQL uses a three‑valued logic system with true, false and NULL. Note that logical operators involving NULL do not always evaluate to NULL. For example, NULL AND false will evaluate to false, and NULL OR true will evaluate to true. Below are the complete truth tables.
a b a AND b a OR b
true true true true
true false false true
true NULL NULL true
false false false false
false NULL false NULL
NULL NULL NULL NULL
a NOT a
true false
false true
NULL NULL
The operators AND and OR are commutative, that is, you can switch the left and right operand without affecting the result.
Star Expression
The * expression can be used in a SELECT statement to select all columns that are projected in the FROM clause.
SELECT *
FROM tbl;
EXCLUDE Clause EXCLUDE allows us to exclude specific columns from the * expression.
REPLACE Clause REPLACE allows us to replace specific columns with different expressions.
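For instance (a sketch, assuming an addresses table with a city column):
SELECT * EXCLUDE (city) FROM addresses;
SELECT * REPLACE (lower(city) AS city) FROM addresses;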
COLUMNS Expression
The COLUMNS expression can be used to execute the same expression on multiple columns. Like the * expression, it can only be used in
the SELECT clause.
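The examples that follow assume a small table along these lines (a sketch):
CREATE TABLE numbers (id INTEGER, number INTEGER);
INSERT INTO numbers VALUES (1, 10), (2, 20), (3, NULL);
SELECT min(COLUMNS(*)), count(COLUMNS(*)) FROM numbers;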
min(id) min(number) count(id) count(number)
1 10 3 2
The * expression in the COLUMNS statement can also contain EXCLUDE or REPLACE, similar to regular star expressions.
SELECT min(COLUMNS(* REPLACE (number + id AS number))), count(COLUMNS(* EXCLUDE (number))) FROM numbers;
1 11 3
COLUMNS expressions can also be combined, as long as the COLUMNS contains the same (star) expression:
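-- a sketch: both COLUMNS expressions expand over the same columns
SELECT COLUMNS(*) + COLUMNS(*) FROM numbers;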
2 20
4 40
6 NULL
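COLUMNS can also take a regular expression that is matched against the column names; a query of this shape (a sketch) returns the result below:
SELECT COLUMNS('(id|number)') FROM numbers;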
id number
1 10
2 20
id number
3 NULL
COLUMNS also supports passing in a lambda function. The lambda function will be evaluated for all columns present in the FROM clause,
and only columns that match the lambda function will be returned. This allows the execution of arbitrary expressions in order to select
columns.
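-- a sketch: keep only the columns whose name matches the lambda predicate
SELECT COLUMNS(c -> c LIKE '%num%') FROM numbers;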
number
10
20
NULL
STRUCT.*
The * expression can also be used to retrieve all keys from a struct as separate columns. This is particularly useful when a prior operation
creates a struct of unknown shape, or if a query must handle any potential struct keys. See the STRUCT data type and nested functions
pages for more details on working with structs.
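For instance (a sketch mirroring the struct example above):
SELECT st.* FROM (SELECT {'x': 1, 'y': 2, 'z': 3} AS st);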
x y z
1 2 3
Subqueries
Scalar Subquery
Scalar subqueries are subqueries that return a single value. They can be used anywhere an expression can be used. If a scalar subquery returns more than a single value, the first value returned will be used.
Grades
grade course
7 Math
9 Math
8 CS
By using a scalar subquery in the WHERE clause, we can figure out for which course this grade was obtained:
SELECT course FROM grades WHERE grade = (SELECT min(grade) FROM grades);
-- {Math}
EXISTS
The EXISTS operator tests for the existence of any row inside the subquery. It returns true when the subquery returns one or more records, and false otherwise. The EXISTS operator is generally the most useful as a correlated subquery to express semijoin operations. However, it can be used as an uncorrelated subquery as well.
For example, we can use it to figure out if there are any grades present for a given course:
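-- a sketch against the grades table above
SELECT EXISTS (SELECT * FROM grades WHERE course = 'Math'); -- true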
NOT EXISTS The NOT EXISTS operator tests for the absence of any row inside the subquery. It returns true when the subquery returns an empty result, and false otherwise. The NOT EXISTS operator is generally the most useful as a correlated subquery to express antijoin operations. For example, to find Person nodes without an interest:
SELECT *
FROM Person
WHERE NOT EXISTS (SELECT * FROM interest WHERE interest.PersonId = Person.id);
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 1 │ Jane │
└───────┴─────────┘
Note. DuckDB automatically detects when a NOT EXISTS query expresses an antijoin operation. There is no need to manually
rewrite such queries to use LEFT OUTER JOIN ... WHERE ... IS NULL.
IN Operator
The IN operator checks containment of the left expression inside the result defined by the subquery or the set of expressions on the right
hand side (RHS). The IN operator returns true if the expression is present in the RHS, false if the expression is not in the RHS and the RHS
has no NULL values, or NULL if the expression is not in the RHS and the RHS has NULL values.
We can use the IN operator in a similar manner as we used the EXISTS operator:
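-- a sketch against the grades table above
SELECT 'Math' IN (SELECT course FROM grades); -- true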
Correlated Subqueries
All the subqueries presented here so far have been uncorrelated subqueries, where the subqueries themselves are entirely self‑contained
and can be run without the parent query. There exists a second type of subqueries called correlated subqueries. For correlated subqueries,
the subquery uses values from the parent subquery.
Conceptually, the subqueries are run once for every single row in the parent query. Perhaps a simple way of envisioning this is that the
correlated subquery is a function that is applied to every row in the source data set.
For example, suppose that we want to find the minimum grade for every course. We could do that as follows:
SELECT *
FROM grades grades_parent
WHERE grade =
(SELECT min(grade)
FROM grades
WHERE grades.course = grades_parent.course);
-- {7, Math}, {8, CS}
The subquery uses a column from the parent query (grades_parent.course). Conceptually, we can see the subquery as a function
where the correlated column is a parameter to that function:
SELECT min(grade)
FROM grades
WHERE course = ?;
Now when we execute this function for each of the rows, we can see that for Math this will return 7, and for CS it will return 8. We then
compare it against the grade for that actual row. As a result, the row (Math, 9) will be filtered out, as 9 <> 7.
Using the name of a subquery in the SELECT clause (without referring to a specific column) turns each row of the subquery into a struct
whose fields correspond to the columns of the subquery. For example:
SELECT t
FROM (SELECT unnest(generate_series(41, 43)) AS x, 'hello' AS y) t;
┌─────────────────────────────┐
│ t │
│ struct(x bigint, y varchar) │
├─────────────────────────────┤
│ {'x': 41, 'y': hello} │
│ {'x': 42, 'y': hello} │
│ {'x': 43, 'y': hello} │
└─────────────────────────────┘
Functions
Function Syntax
Query Functions
The duckdb_functions() table function shows the list of functions currently built into the system.
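The projection below is a sketch of how a listing like the one that follows can be produced (column names follow duckdb_functions()'s schema):
SELECT function_name, function_type, return_type, parameters, parameter_types, description
FROM duckdb_functions()
WHERE function_type = 'scalar' AND function_name LIKE 'b%'
ORDER BY function_name;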
• bar (scalar, returns VARCHAR; parameters [x, min, max, width] of types [DOUBLE, DOUBLE, DOUBLE, DOUBLE]): draws a band whose width is proportional to (x ‑ min) and equal to width characters when x = max; width defaults to 80.
• base64 (scalar, returns VARCHAR; [blob] [BLOB]): converts a blob to a base64 encoded string.
• bin (scalar, returns VARCHAR; [value] [VARCHAR]): converts the value to binary representation.
• bit_count (scalar, returns TINYINT; [x] [TINYINT]): returns the number of bits that are set.
• bit_length (scalar, returns BIGINT; [col0] [VARCHAR]): description NULL.
• bit_position (scalar, returns INTEGER; [substring, bitstring] [BIT, BIT]): returns the first starting index of the specified substring within bits, or zero if it is not present; the first (leftmost) bit is indexed 1.
• bitstring (scalar, returns BIT; [bitstring, length] [VARCHAR, INTEGER]): pads the bitstring until the specified length.
Note. Currently, the description and parameter names of functions are not available in the duckdb_functions() function.
Bitstring Functions
This section describes functions and operators for examining and manipulating bit values. Bitstrings must be of equal length when performing the bitwise operations AND, OR, and XOR. When bit shifting, the original length of the string is preserved.
Bitstring Operators
The table below shows the available mathematical operators for BIT type.
Bitstring Functions
The table below shows the available scalar functions for BIT type.
Bitstring Aggregation The bitstring_agg function takes any integer type as input and returns a bitstring with bits set for each dis‑
tinct value. The left‑most bit represents the smallest value in the column and the right‑most bit the maximum value. If possible, the min
and max are retrieved from the column statistics. Otherwise, it is also possible to provide the min and max values.
The combination of bit_count and bitstring_agg could be used as an alternative to count(DISTINCT ...), with possible
performance improvements in cases of low cardinality and dense values.
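For instance (a sketch; the min and max of the domain are given explicitly):
SELECT bit_count(bitstring_agg(n, 1, 5)) AS distinct_values
FROM (VALUES (1), (2), (2), (5)) t(n); -- 3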
Blob Functions
This section describes functions and operators for examining and manipulating blob values.
Date Format Functions
The strftime and strptime functions can be used to convert between dates/timestamps and strings. This is often required when
parsing CSV files, displaying output to the user or transferring information between programs. Because there are many possible date
representations, these functions accept a format string that describes how the date or timestamp should be structured.
strftime Examples
strftime(timestamp, format) converts timestamps or dates to strings according to the specified pattern.
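-- a sketch of strftime
SELECT strftime(DATE '1992-03-02', '%d/%m/%Y'); -- 02/03/1992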
strptime Examples
SELECT strptime('Monday, 2 March 1992 - 08:32:45 PM', '%A, %-d %B %Y - %I:%M:%S %p');
-- 1992-03-02 20:32:45
CSV Parsing
The date formats can also be specified during CSV parsing, either in the COPY statement or in the read_csv function. This can be done by
either specifying a DATEFORMAT or a TIMESTAMPFORMAT (or both). DATEFORMAT will be used for converting dates, and TIMESTAMP-
FORMAT will be used for converting timestamps. Below are some examples for how to use this:
-- in COPY statement
COPY dates FROM 'test.csv' (DATEFORMAT '%d/%m/%Y', TIMESTAMPFORMAT '%A, %-d %B %Y - %I:%M:%S %p');
-- in read_csv function
SELECT *
FROM read_csv('test.csv', dateformat = '%m/%d/%Y');
Format Specifiers
%W Week number of the year. Week 01 starts on the first Monday of the year, so there can be a week 00. Note that this is not compliant with the week date standard in ISO‑8601. 00, 01, ..., 53
%x ISO date representation 1992‑03‑02
%X ISO time representation 10:30:20
%y Year without century as a zero‑padded decimal number. 00, 01, ..., 99
%-y Year without century as a decimal number. 0, 1, ..., 99
%Y Year with century as a decimal number. 2013, 2019 etc.
%z Time offset from UTC in the form ±HH:MM, ±HHMM, or ±HH. ‑0700
%Z Time zone name. Europe/Amsterdam
%% A literal % character. %
Date Functions
This section describes functions and operators for examining and manipulating date values.
Date Operators
The table below shows the available mathematical operators for DATE types.
Adding to or subtracting from infinite values produces the same infinite value.
Date Functions
The table below shows the available functions for DATE types. Dates can also be manipulated with the timestamp functions through type
promotion.
There are also dedicated extraction functions to get the subfields. A few examples include extracting the day from a date, or the day of the week from a date.
Functions applied to infinite dates will either return the same infinite dates (e.g., greatest) or NULL (e.g., date_part) depending on what ”makes sense”. In general, if the function needs to examine the parts of the infinite date, the result will be NULL.
The date_part, date_diff, and date_trunc functions can be used to manipulate the fields of temporal types. The fields are specified as strings that contain the part name of the field.
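-- a sketch of field extraction and manipulation
SELECT day(DATE '1992-09-20'); -- 20
SELECT date_part('month', DATE '1992-09-20'); -- 9
SELECT date_trunc('month', DATE '1992-09-20'); -- 1992-09-01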
Part Specifiers
Below is a full list of all available date part specifiers. The examples are the corresponding parts of the timestamp 2021-08-03
11:59:44.123456.
Note that the time zone parts are all zero unless a time zone plugin such as ICU has been installed to support TIMESTAMP WITH TIME
ZONE.
Part Functions There are dedicated extraction functions to get certain subfields:
Enum Functions
This section describes functions and operators for examining and manipulating ENUM values. The examples assume an enum type created
as:
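-- a sketch of such an enum type (name and values illustrative)
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');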
These functions can take NULL or a specific value of the type as argument(s). With the exception of enum_range_boundary, the result
depends only on the type of the argument and not on its value.
Interval Functions
This section describes functions and operators for examining and manipulating INTERVAL values.
Interval Operators
The table below shows the available mathematical operators for INTERVAL types.
Interval Functions
The table below shows the available scalar functions for INTERVAL types.
Lambda Functions
Lambda functions enable the use of more complex and flexible expressions in queries. DuckDB supports several scalar functions that
accept lambda functions as parameters in the form (parameter1, parameter2, ...) -> expression. If the lambda function
has only one parameter, then the parentheses can be omitted. The parameters can have any names. For example, the following are all
valid lambda functions:
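For instance (illustrative forms; the parameter names are arbitrary):
x -> x + 1
(x, y) -> x + y
(acc, x) -> acc + x * 2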
SELECT list_transform(
list_filter([0, 1, 2, 3, 4, 5], x -> x % 2 = 0),
y -> y * y
);
[0, 4, 16]
Nested lambda function to add each element of the first list to the sum of the second list:
SELECT list_transform(
[1, 2, 3],
x -> list_reduce([4, 5, 6], (a, b) -> a + b) + x
);
[16, 17, 18]
Indexes as Parameters All lambda functions accept an optional extra parameter that represents the index of the current element. This
is always the last parameter of the lambda function, and is 1‑based (i.e., the first element has index 1).
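A query of this shape (a sketch) keeps the elements strictly greater than their 1‑based index and produces the result below:
SELECT list_filter([1, 3, 1, 5], (x, i) -> x > i);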
[3, 5]
Transform
Description:
list_transform returns a list that is the result of applying the lambda function to each element of the input list.
Aliases:
• array_transform
• apply
• list_apply
• array_apply
Examples:
Incrementing each list element by one:
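-- a sketch reproducing the result below
SELECT list_transform([1, 2, NULL, 3], x -> x + 1);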
[2, 3, NULL, 4]
Transforming strings:
[6, 1, 7]
Filter
Description:
Constructs a list from those elements of the input list for which the lambda function returns true. DuckDB must be able to cast the lambda
function's return type to BOOL.
Aliases:
• array_filter
• filter
Examples:
Filter out negative values:
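-- a sketch reproducing the result below
SELECT list_filter([5, -6, NULL, 7], x -> x > 0);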
[5, 7]
Divisible by 2 and 5:
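-- a sketch: chain two filters (result shown in the comment)
SELECT list_filter(list_filter([2, 4, 3, 1, 20, 10, 3, 30], x -> x % 2 = 0), y -> y % 5 = 0); -- [20, 10, 30]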
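list_filter can also be combined with a table source to produce progressively smaller sublists; a query of this shape (a sketch) yields the rows below:
SELECT list_filter([1, 2, 3, 4], x -> x > i) FROM range(5) t(i);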
[1, 2, 3, 4]
[2, 3, 4]
[3, 4]
[4]
[]
Reduce
Description:
The scalar function returns a single value that is the result of applying the lambda function to each element of the input list, starting with the first element and then repeatedly applying the lambda function to the result of the previous application and the next element of the list. The list must have at least one element.
Aliases:
• array_reduce
• reduce
Examples:
Sum of all list elements:
10
SELECT list_reduce(['DuckDB', 'is', 'awesome'], (x, y) -> concat(x, ' ', y));
DuckDB is awesome
Nested Functions
This section describes functions and operators for examining and manipulating nested values. There are five nested data types: ARRAY,
LIST, MAP, STRUCT, and UNION.
List Functions
• list_sort(list) (alias: array_sort): sorts the elements of the list. See the Sorting Lists section for more details about the sorting order and the NULL sorting order. Example: list_sort([3, 6, 1, 2]) returns [1, 2, 3, 6].
• list_transform(list, lambda) (aliases: array_transform, apply, list_apply, array_apply): returns a list that is the result of applying the lambda function to each element of the input list. See the Lambda Functions page for more details. Example: list_transform(l, x -> x + 1) returns [5, 6, 7].
• list_unique(list) (alias: array_unique): counts the unique elements of a list. Example: list_unique([1, 1, NULL, -3, 1, 5]) returns 3.
• list_value(any, ...) (alias: list_pack): creates a LIST containing the argument values. Example: list_value(4, 5, 6) returns [4, 5, 6].
• list_where(value_list, mask_list) (alias: array_where): returns a list with the BOOLEANs in mask_list applied as a mask to the value_list. Example: list_where([10, 20, 30, 40], [true, false, false, true]) returns [10, 40].
• list_zip(list1, list2, ...) (alias: array_zip): zips k LISTs to a new LIST whose length will be that of the longest list. Its elements are structs of k elements list_1, ..., list_k. Missing elements are replaced with NULL. Example: list_zip([1, 2], [3, 4], [5, 6]) returns [{'list_1': 1, 'list_2': 3, 'list_3': 5}, {'list_1': 2, 'list_2': 4, 'list_3': 6}].
• unnest(list): unnests a list by one level. Note that this is a special function that alters the cardinality of the result. See the unnest page for more details. Example: unnest([1, 2, 3]) returns 1, 2, 3.
List Operators
List Comprehension
Python‑style list comprehension can be used to compute expressions over elements in a list. For example:
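-- a sketch of list comprehension
SELECT [x + 1 FOR x IN [1, 2, 3]]; -- [2, 3, 4]
SELECT [x FOR x IN [1, 2, 3, 4] IF x % 2 = 0]; -- [2, 4]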
Struct Functions
Map Functions
• cardinality(map): returns the size of the map (the number of entries in the map). Example: cardinality(map([4, 2], ['a', 'b'])) returns 2.
• element_at(map, key): returns a list containing the value for a given key, or an empty list if the key is not contained in the map. The type of the key provided in the second parameter must match the type of the map's keys, else an error is returned. Example: element_at(map([100, 5], [42, 43]), 100) returns [42].
• map_entries(map): returns a list of struct(k, v) for each key‑value pair in the map. Example: map_entries(map([100, 5], [42, 43])) returns [{'key': 100, 'value': 42}, {'key': 5, 'value': 43}].
• map_extract(map, key): alias of element_at. Returns a list containing the value for a given key, or an empty list if the key is not contained in the map. Example: map_extract(map([100, 5], [42, 43]), 100) returns [42].
• map_from_entries(STRUCT(k, v)[]): returns a map created from the entries of the array. Example: map_from_entries([{k: 5, v: 'val1'}, {k: 3, v: 'val2'}]) returns {5=val1, 3=val2}.
• map_keys(map): returns a list of all keys in the map. Example: map_keys(map([100, 5], [42, 43])) returns [100, 5].
• map_values(map): returns a list of all values in the map. Example: map_values(map([100, 5], [42, 43])) returns [42, 43].
• map(): returns an empty map. Example: map() returns {}.
• map[entry]: alias for element_at. Example: map([100, 5], ['a', 'b'])[100] returns [a].
Union Functions
Range Functions
The functions range and generate_series create a list of values in the range between start and stop. The start parameter is
inclusive. For the range function, the stop parameter is exclusive, while for generate_series, it is inclusive.
SELECT range(5);
-- [0, 1, 2, 3, 4]
SELECT generate_series(5);
-- [0, 1, 2, 3, 4, 5]
SELECT *
FROM range(DATE '1992-01-01', DATE '1992-03-01', INTERVAL '1' MONTH);
┌─────────────────────┐
│ range │
├─────────────────────┤
│ 1992-01-01 00:00:00 │
│ 1992-02-01 00:00:00 │
└─────────────────────┘
Slicing
The function list_slice can be used to extract a sublist from a list. The following variants exist:
• list[begin:end]
• list[begin:end:step]
The parameters are the list itself, the begin and end positions of the slice, and an optional step.
SELECT ([1, 2, 3, 4, 5])[4:2:-2];
-- [4, 2]
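The same kind of slice can also be taken with the function-call form; in the sketch below, begin and end are treated as 1-based and inclusive:
SELECT list_slice([1, 2, 3, 4, 5], 2, 4);
-- [2, 3, 4]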
List Aggregates
The function list_aggregate allows the execution of arbitrary existing aggregate functions on the elements of a list. Its first argument
is the list (column), its second argument is the aggregate function name, e.g., min, histogram or sum.
list_aggregate accepts additional arguments after the aggregate function name. These extra arguments are passed directly to the
aggregate function, which serves as the second argument of list_aggregate.
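For instance (the extra argument in the second query is passed through as the string_agg separator):
SELECT list_aggregate([1, 2, -4, NULL], 'min');
-- -4
SELECT list_aggregate([2, 4, 8, 42], 'string_agg', '|');
-- 2|4|8|42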
The following is a list of existing rewrites. Rewrites simplify the use of the list aggregate function by only taking the list (column) as their argument: list_avg, list_var_samp, list_var_pop, list_stddev_pop, list_stddev_samp, list_sem, list_approx_count_distinct, list_bit_xor, list_bit_or, list_bit_and, list_bool_and, list_bool_or, list_count, list_entropy, list_last, list_first, list_kurtosis, list_kurtosis_pop, list_min, list_max, list_product, list_skewness, list_sum, list_string_agg, list_mode, list_median, list_mad and list_histogram.
Sorting Lists
The function list_sort sorts the elements of a list either in ascending or descending order. In addition, it allows specifying whether
NULL values should be moved to the beginning or to the end of the list.
By default if no modifiers are provided, DuckDB sorts ASC NULLS FIRST, i.e., the values are sorted in ascending order and NULL values
are placed first. This is identical to the default sort order of SQLite. The default sort order can be changed using PRAGMA statements.
list_sort leaves it open to the user whether they want to use the default sort order or a custom order. list_sort takes up to two additional optional parameters. The second parameter provides the sort order and can be either ASC or DESC. The third parameter provides the NULL sort order and can be either NULLS FIRST or NULLS LAST.
list_reverse_sort has an optional second parameter providing the NULL sort order. It can be either NULLS FIRST or NULLS
LAST.
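For example (list literals arbitrary):
-- default sort order (ASC NULLS FIRST)
SELECT list_sort([1, 3, NULL, 5, NULL, -5]);
-- [NULL, NULL, -5, 1, 3, 5]
-- descending, with NULLs placed last
SELECT list_sort([1, 3, NULL, 2], 'DESC', 'NULLS LAST');
-- [3, 2, 1, NULL]
-- reverse (descending) sort, with NULLs placed last
SELECT list_reverse_sort([1, 3, NULL, 2], 'NULLS LAST');
-- [3, 2, 1, NULL]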
Lambda Functions
DuckDB supports lambda functions in the form (parameter1, parameter2, ...) -> expression. For details, see the lambda
functions page.
Flatten
The flatten function is a scalar function that converts a list of lists into a single list by concatenating each sub‑list together. Note that this
only flattens one level at a time, not all levels of sub‑lists.
In general, the input to the flatten function should be a list of lists (not a single-level list). However, the flatten function behaves in
specific ways when handling empty lists and NULL values.
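For example (list literals arbitrary):
SELECT flatten([[1, 2], [3, 4]]);
-- [1, 2, 3, 4]
-- only one level is flattened at a time
SELECT flatten([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]);
-- [[1, 2], [3, 4], [5, 6], [7, 8]]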
generate_subscripts
The generate_subscripts( arr, dim) function generates indexes along the dimth dimension of array arr.
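For example, the following query (the input array is illustrative) produces the result below:
SELECT generate_subscripts([4, 5, 6], 1) AS i;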
┌───┐
│ i │
├───┤
│ 1 │
│ 2 │
│ 3 │
└───┘
Related Functions
There are also the aggregate functions list and histogram that produce lists and lists of structs. The unnest function is used to unnest
a list by one level.
Numeric Functions
Numeric Operators
The table below shows the available mathematical operators for numeric types.
• + : addition. Example: 2 + 3 = 5
• - : subtraction. Example: 2 - 3 = -1
• * : multiplication. Example: 2 * 3 = 6
• / : float division. Example: 5 / 2 = 2.5
• // : division. Example: 5 // 2 = 2
• % : modulo (remainder). Example: 5 % 4 = 1
• ** : exponent. Example: 3 ** 4 = 81
• ^ : exponent (alias for **). Example: 3 ^ 4 = 81
• & : bitwise AND. Example: 91 & 15 = 11
• | : bitwise OR. Example: 32 | 3 = 35
• << : bitwise shift left. Example: 1 << 4 = 16
• >> : bitwise shift right. Example: 8 >> 2 = 2
• ~ : bitwise negation. Example: ~15 = -16
• ! : factorial. Example: 4! = 24
Division and Modulo Operators There are two division operators: / and //. They are equivalent when at least one of the operands is
a FLOAT or a DOUBLE. When both operands are integers, / performs floating point division (5 / 2 = 2.5) while // performs integer
division (5 // 2 = 2).
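For example:
SELECT 5 / 2;  -- 2.5
SELECT 5 // 2; -- 2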
Supported Types The modulo, bitwise, negation, and factorial operators work only on integral data types, whereas the others are
available for all numeric data types.
Numeric Functions
Pattern Matching
There are four separate approaches to pattern matching provided by DuckDB: the traditional SQL LIKE operator, the more recent SIMILAR
TO operator (added in SQL:1999), a GLOB operator, and POSIX‑style regular expressions.
LIKE
The LIKE expression returns true if the string matches the supplied pattern. (As expected, the NOT LIKE expression returns false if
LIKE returns true, and vice versa. An equivalent expression is NOT (string LIKE pattern).)
If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the
equals operator. An underscore (_) in pattern stands for (matches) any single character; a percent sign (%) matches any sequence of zero
or more characters.
LIKE pattern matching always covers the entire string. Therefore, if it's desired to match a sequence anywhere within a string, the pattern
must start and end with a percent sign.
Some examples:
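The following comparisons use arbitrary strings and patterns to illustrate the matching rules:
SELECT 'abc' LIKE 'abc'; -- true
SELECT 'abc' LIKE 'a%';  -- true
SELECT 'abc' LIKE '_b_'; -- true
SELECT 'abc' LIKE 'c';   -- false
SELECT 'abc' LIKE '%c';  -- true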
The keyword ILIKE can be used instead of LIKE to make the match case‑insensitive according to the active locale.
To search within a string for a character that is a wildcard (% or _), the pattern must use an ESCAPE clause and an escape character to
indicate the wildcard should be treated as a literal character instead of a wildcard. See an example below.
Additionally, the function like_escape has the same functionality as a LIKE expression with an ESCAPE clause, but using function
syntax. See the Text Functions Docs for details.
-- Search for strings with 'a' then a literal percent sign then 'c'
SELECT 'a%c' LIKE 'a$%c' ESCAPE '$'; -- true
SELECT 'azc' LIKE 'a$%c' ESCAPE '$'; -- false
There are also alternative characters that can be used as keywords in place of LIKE expressions. These enhance PostgreSQL compatibil‑
ity.
LIKE‑style PostgreSQL‑style
LIKE ~~
NOT LIKE !~~
ILIKE ~~*
NOT ILIKE !~~*
SIMILAR TO
The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except
that it interprets the pattern using a regular expression. Like LIKE, the SIMILAR TO operator succeeds only if its pattern matches the
entire string; this is unlike common regular expression behavior where the pattern can match any part of the string.
A regular expression is a character sequence that is an abbreviated definition of a set of strings (a regular set). A string is said to match a
regular expression if it is a member of the regular set described by the regular expression. As with LIKE, pattern characters match string
characters exactly unless they are special characters in the regular expression language — but regular expressions use different special
characters than LIKE does.
Some examples:
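The following comparisons use arbitrary strings and regular expressions; note that the pattern must match the entire string:
SELECT 'abc' SIMILAR TO 'abc';       -- true
SELECT 'abc' SIMILAR TO 'a';         -- false
SELECT 'abc' SIMILAR TO '.*(b|d).*'; -- true
SELECT 'abc' SIMILAR TO '(b|c).*';   -- false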
There are also alternative characters that can be used as keywords in place of SIMILAR TO expressions. These follow POSIX syntax.
SIMILAR TO ~
NOT SIMILAR TO !~
GLOB
The GLOB operator returns true or false depending on whether the string matches the GLOB pattern. The GLOB operator is most commonly used when
searching for filenames that follow a specific pattern (for example a specific file extension). Use the question mark (?) wildcard to match
any single character, and use the asterisk (*) to match zero or more characters. In addition, use bracket syntax ([ ]) to match any single
character contained within the brackets, or within the character range specified by the brackets. An exclamation mark (!) may be used
inside the first bracket to search for a character that is not contained within the brackets. To learn more, visit the Glob (programming)
Wikipedia page.
Some examples:
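The following comparisons use arbitrary filenames and patterns to illustrate the wildcards:
SELECT 'best.txt' GLOB '*.txt';        -- true
SELECT 'best.txt' GLOB '????.txt';     -- true
SELECT 'best.txt' GLOB '?.txt';        -- false
SELECT 'best.txt' GLOB '[abc]est.txt'; -- true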
Three tildes (~~~) may also be used in place of the GLOB keyword.
GLOB‑style Symbolic‑style
GLOB ~~~
Glob Function to Find Filenames The glob pattern matching syntax can also be used to search for filenames using the glob table function. It accepts one parameter: the path to search (which may include glob patterns).
SELECT * FROM glob('*');
┌───────────────┐
│     file      │
├───────────────┤
│ duckdb.exe    │
│ test.csv      │
│ test.json     │
│ test.parquet  │
│ test2.csv     │
│ test2.parquet │
│ todos.json    │
└───────────────┘
Regular Expressions
DuckDB offers pattern matching operators (LIKE, SIMILAR TO, GLOB), as well as support for regular expressions via functions.
DuckDB uses the RE2 library as its regular expression engine. For the regular expression syntax, see the RE2 docs.
Functions
• regexp_extract_all(string, regex[, group = 0]): Splits the string along the regex and extracts all occurrences of group. Example: regexp_extract_all('hello_world', '([a-z ]+)_?', 1) returns [hello, world].
• regexp_extract(string, pattern, name_list): If string contains the regexp pattern, returns the capturing groups as a struct with corresponding names from name_list. Example: regexp_extract('2023-04-15', '(\d+)-(\d+)-(\d+)', ['y', 'm', 'd']) returns {'y': '2023', 'm': '04', 'd': '15'}.
• regexp_extract(string, pattern[, idx]): If string contains the regexp pattern, returns the capturing group specified by the optional parameter idx. Example: regexp_extract('hello_world', '([a-z ]+)_?', 1) returns hello.
• regexp_full_match(string, regex): Returns true if the entire string matches the regex. Example: regexp_full_match('anabanana', '(an)*') returns false.
• regexp_matches(string, pattern): Returns true if string contains the regexp pattern, false otherwise. Example: regexp_matches('anabanana', '(an)*') returns true.
• regexp_replace(string, pattern, replacement): If string contains the regexp pattern, replaces the matching part with replacement. Example: regexp_replace('hello', '[lo]', '-') returns he-lo.
• regexp_split_to_array(string, regex): Alias of string_split_regex. Splits the string along the regex. Example: regexp_split_to_array('hello world; 42', ';? ') returns ['hello', 'world', '42'].
• regexp_split_to_table(string, regex): Splits the string along the regex and returns a row for each part. Example: regexp_split_to_table('hello world; 42', ';? ') returns three rows: 'hello', 'world', '42'.
The regexp_matches function is similar to the SIMILAR TO operator, however, it does not require the entire string to match. Instead,
regexp_matches returns true if the string merely contains the pattern (unless the special tokens ^ and $ are used to anchor the regular
expression to the start and end of the string). Below are some examples:
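The following comparisons use arbitrary strings and patterns to illustrate the contains-style matching:
SELECT regexp_matches('abc', 'abc');   -- true
SELECT regexp_matches('abc', '^abc$'); -- true
SELECT regexp_matches('abc', 'a');     -- true
SELECT regexp_matches('abc', '^a$');   -- false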
Options for Regular Expression Functions The regexp_matches and regexp_replace functions also support options, passed as an additional string argument. These include 'c' (case-sensitive matching), 'i' (case-insensitive matching), and, for regexp_replace, the global replace flag 'g'.
Using regexp_matches The regexp_matches function is optimized to the LIKE operator when possible. To achieve the best
performance, the 'c' option (case-sensitive matching) should be passed if applicable. Note that by default the RE2 library does not match
the . character to newline.
Using regexp_replace The regexp_replace function can be used to replace the part of a string that matches the regexp pattern
with a replacement string. The notation \d (where d is a number indicating the group) can be used to refer to groups captured in the regular
expression in the replacement string. Note that by default, regexp_replace only replaces the first occurrence of the regular expression.
To replace all occurrences, use the global replace (g) flag.
Using regexp_extract The regexp_extract function is used to extract a part of a string that matches the regexp pattern. A specific
capturing group within the pattern can be extracted using the idx parameter. If idx is not specified, it defaults to 0, extracting the first
match with the whole pattern.
If idx is a LIST of strings, then regexp_extract will return the corresponding capture groups as fields of a STRUCT:
If the number of column names is less than the number of capture groups, then only the first groups are returned. If the number of column
names is greater, then an error is generated.
Text Functions
This section describes functions and operators for examining and manipulating string values. The symbol ␣ denotes a space character.
These functions are used to measure the similarity of two strings using various similarity measures.
Formatters
fmt Syntax The format( format, parameters...) function formats strings, loosely following the syntax of the {fmt} open‑source
formatting library.
Format Specifiers
Formatting Types
-- Integers
SELECT format('{} + {} = {}', 3, 5, 3 + 5); -- 3 + 5 = 8
-- Booleans
SELECT format('{} != {}', true, false); -- true != false
-- Format datetime values
SELECT format('{}', DATE '1992-01-01'); -- 1992-01-01
SELECT format('{}', TIME '12:01:00'); -- 12:01:00
SELECT format('{}', TIMESTAMP '1992-01-01 12:01:00'); -- 1992-01-01 12:01:00
-- Format BLOB
SELECT format('{}', BLOB '\x00hello'); -- \x00hello
-- Pad integers with 0s
SELECT format('{:04d}', 33); -- 0033
-- Create timestamps from integers
SELECT format('{:02d}:{:02d}:{:02d} {}', 12, 3, 16, 'AM'); -- 12:03:16 AM
-- Convert to hexadecimal
SELECT format('{:x}', 123_456_789); -- 75bcd15
-- Convert to binary
SELECT format('{:b}', 123_456_789); -- 111010110111100110100010101
printf Syntax The printf( format, parameters...) function formats strings using the printf syntax.
Format Specifiers
Formatting Types
-- Integers
SELECT printf('%d + %d = %d', 3, 5, 3 + 5); -- 3 + 5 = 8
-- Booleans
SELECT printf('%s != %s', true, false); -- true != false
-- Format datetime values
SELECT printf('%s', DATE '1992-01-01'); -- 1992-01-01
SELECT printf('%s', TIME '12:01:00'); -- 12:01:00
SELECT printf('%s', TIMESTAMP '1992-01-01 12:01:00'); -- 1992-01-01 12:01:00
-- Format BLOB
SELECT printf('%s', BLOB '\x00hello'); -- \x00hello
-- Pad integers with 0s
SELECT printf('%04d', 33); -- 0033
-- Create timestamps from integers
SELECT printf('%02d:%02d:%02d %s', 12, 3, 16, 'AM'); -- 12:03:16 AM
-- Convert to hexadecimal
SELECT printf('%x', 123_456_789); -- 75bcd15
-- Convert to binary
SELECT printf('%b', 123_456_789); -- 111010110111100110100010101
Thousand Separators
SELECT printf('%,d', 123_456_789); -- 123,456,789
SELECT printf('%.d', 123_456_789); -- 123.456.789
SELECT printf('%''d', 123_456_789); -- 123'456'789
SELECT printf('%_d', 123_456_789); -- 123_456_789
Time Functions
This section describes functions and operators for examining and manipulating TIME values.
Time Operators
The table below shows the available mathematical operators for TIME types.
Time Functions
The table below shows the available scalar functions for TIME types.
The only date parts that are defined for times are epoch, hours, minutes, seconds, milliseconds and microseconds.
Timestamp Functions
This section describes functions and operators for examining and manipulating TIMESTAMP values.
Timestamp Operators
The table below shows the available mathematical operators for TIMESTAMP types.
Adding to or subtracting from infinite values produces the same infinite value.
Timestamp Functions
The table below shows the available scalar functions for TIMESTAMP values.
Functions applied to infinite dates will either return the same infinite dates (e.g., greatest) or NULL (e.g., date_part), depending on
what "makes sense". In general, if the function needs to examine the parts of the infinite date, the result will be NULL.
The table below shows the available table functions for TIMESTAMP types.
This section describes functions and operators for examining and manipulating TIMESTAMP WITH TIME ZONE (or TIMESTAMPTZ)
values.
Despite the name, these values do not store a time zone; they store an instant, just like TIMESTAMP. Instead, they request that the instant be binned
and formatted using the current time zone.
Time zone support is not built in but can be provided by an extension, such as the ICU extension that ships with DuckDB.
In the examples below, the current time zone is presumed to be America/Los_Angeles using the Gregorian calendar.
The table below shows the available scalar functions for TIMESTAMPTZ values. Since these functions do not involve binning or display,
they are always available.
With no time zone extension loaded, TIMESTAMPTZ values will be cast to and from strings using offset notation. This will let you specify
an instant correctly without access to time zone information. For portability, TIMESTAMPTZ values will always be displayed using GMT
offsets:
If a time zone extension such as ICU is loaded, then a time zone can be parsed from a string and cast to a representation in the local time
zone:
The table below shows the available mathematical operators for TIMESTAMP WITH TIME ZONE values provided by the ICU extension.
Adding to or subtracting from infinite values produces the same infinite value.
The table below shows the ICU provided scalar functions for TIMESTAMP WITH TIME ZONE values.
The table below shows the available table functions for TIMESTAMP WITH TIME ZONE types.
The table below shows the ICU provided scalar functions that operate on plain TIMESTAMP values. These functions assume that the
TIMESTAMP is a "local timestamp".
A local timestamp is effectively a way of encoding the part values from a time zone into a single value. They should be used with cau‑
tion because the produced values can contain gaps and ambiguities thanks to daylight savings time. Often the same functionality can be
implemented more reliably using the struct variant of the date_part function.
timezone(text, timestamptz): Uses the date parts of the timestamp in the given time zone to construct a timestamp. Effectively, the result is a "local" time. Example: timezone('America/Denver', TIMESTAMPTZ '2001-02-16 20:38:40-05') returns 2001-02-16 18:38:40.
At Time Zone The AT TIME ZONE syntax is syntactic sugar for the (two argument) timezone function listed above:
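For example, the following expression should yield the same result as the timezone call shown above:
SELECT TIMESTAMPTZ '2001-02-16 20:38:40-05' AT TIME ZONE 'America/Denver';
-- 2001-02-16 18:38:40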
Infinities
Functions applied to infinite dates will either return the same infinite dates (e.g., greatest) or NULL (e.g., date_part), depending on
what "makes sense". In general, if the function needs to examine the parts of the infinite temporal value, the result will be NULL.
Calendars
The ICU extension also supports non‑Gregorian calendars. If such a calendar is current, then the display and binning operations will use
that calendar.
Utility Functions
Utility Functions
The functions below are difficult to categorize into specific function types and are broadly useful.
glob(search_path): Returns the filenames found at the location indicated by the search_path in a single column named file. The search_path may contain glob pattern matching syntax. Example: glob('*').
Aggregate Functions
Examples
Syntax
Aggregates are functions that combine multiple rows into a single value. Aggregates are different from scalar functions and window func‑
tions because they change the cardinality of the result. As such, aggregates can only be used in the SELECT and HAVING clauses of a SQL
query.
DISTINCT Clause in Aggregate Functions When the DISTINCT clause is provided, only distinct values are considered in the compu‑
tation of the aggregate. This is typically used in combination with the count aggregate to get the number of distinct elements; but it can
be used together with any aggregate function in the system.
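As an illustration (the table and column names here are placeholders):
SELECT count(DISTINCT region) FROM sales;
-- counts each distinct region once, no matter how many rows it appears in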
ORDER BY Clause in Aggregate Functions An ORDER BY clause can be provided after the last argument of the function call. Note the
lack of the comma separator before the clause.
This clause ensures that the values being aggregated are sorted before applying the function. Most aggregate functions are order-insensitive; for them, this clause is parsed and applied, which is inefficient but has no effect on the results. However, there are some
order-sensitive aggregates that can have non-deterministic results without ordering, e.g., first, last, list and string_agg /
group_concat / listagg. These can be made deterministic by ordering the arguments.
For example:
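The query below (names illustrative) aggregates the numbers 1 to 3 in descending order to produce the result shown:
SELECT string_agg(x::VARCHAR, ', ' ORDER BY x DESC) AS countdown
FROM range(1, 4) t(x);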
┌───────────┐
│ countdown │
│ varchar │
├───────────┤
│ 3, 2, 1 │
└───────────┘
Approximate Aggregates
Statistical Aggregates
The table below shows the available ”ordered set” aggregate functions. These functions are specified using the WITHIN GROUP (ORDER
BY sort_expression) syntax, and they are converted to an equivalent aggregate function that takes the ordering expression as the
first argument.
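For example, assuming the standard mapping of percentile_cont onto quantile_cont (column name illustrative):
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) FROM sales;
-- equivalent to:
SELECT quantile_cont(amount, 0.5) FROM sales;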
Constraints
In SQL, constraints can be specified for tables. Constraints enforce certain properties over data that is inserted into a table. Constraints can
be specified along with the schema of the table as part of the CREATE TABLE statement. In certain cases, constraints can also be added
to a table using the ALTER TABLE statement, but this is not currently supported for all constraints.
Note. Warning Constraints have a strong impact on performance: they slow down loading and updates but speed up certain queries.
Please consult the Performance Guide for details.
Syntax
Check Constraint
Check constraints allow you to specify an arbitrary boolean expression. Any columns that do not satisfy this expression violate the con‑
straint. For example, we could enforce that the name column does not contain spaces using the following CHECK constraint.
CREATE TABLE students (name VARCHAR CHECK (NOT contains(name, ' ')));
INSERT INTO students VALUES ('this name contains spaces');
-- Constraint Error: CHECK constraint failed: students
A not‑null constraint specifies that the column cannot contain any NULL values. By default, all columns in tables are nullable. Adding NOT
NULL to a column definition enforces that a column cannot contain NULL values.
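For example (table and column names illustrative):
CREATE TABLE students (name VARCHAR, age INTEGER NOT NULL);
INSERT INTO students VALUES ('Student', NULL);
-- fails with a NOT NULL constraint error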
Primary key or unique constraints define a column, or set of columns, that are a unique identifier for a row in the table. The constraint
enforces that the specified columns are unique within a table, i.e., that at most one row contains the given values for the set of columns.
In order to enforce this property efficiently, an ART index is automatically created for every primary key or unique constraint that is defined
in the table.
Primary key constraints and unique constraints are identical except for two points:
• A table can only have one primary key constraint defined, but many unique constraints
• A primary key constraint also enforces the keys to not be NULL.
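For example (schema illustrative):
CREATE TABLE students (id INTEGER PRIMARY KEY, name VARCHAR);
INSERT INTO students VALUES (1, 'Student 1');
INSERT INTO students VALUES (1, 'Student 2');
-- fails: duplicate key 1 violates the primary key constraint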
Note. Warning Indexes have certain limitations that might result in constraints being evaluated too eagerly, see the indexes section
for more details.
Foreign Keys
Foreign keys define a column, or set of columns, that refer to a primary key or unique constraint from another table. The constraint enforces
that the key exists in the other table.
In order to enforce this property efficiently, an ART index is automatically created for every foreign key constraint that is defined in the
table.
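For example (schema illustrative):
CREATE TABLE students (id INTEGER PRIMARY KEY, name VARCHAR);
CREATE TABLE exams (exam_id INTEGER, student_id INTEGER REFERENCES students (id));
INSERT INTO exams VALUES (10, 42);
-- fails: key 42 is not present in students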
Note. Warning Indexes have certain limitations that might result in constraints being evaluated too eagerly, see the indexes section
for more details.
Indexes
Index Types
• A min‑max index (also known as zonemap and block range index) is automatically created for columns of all general‑purpose data
types.
• An Adaptive Radix Tree (ART) is mainly used to ensure primary key constraints and to speed up point and very highly selective (i.e., <
0.1%) queries. Such an index is automatically created for columns with a UNIQUE or PRIMARY KEY constraint and can be defined
using CREATE INDEX.
Note. Warning ART indexes must currently be able to fit in memory. Avoid creating ART indexes if the index does not fit in memory.
Persistence
510
DuckDB Documentation
To create an index, use the CREATE INDEX statement. To drop an index, use the DROP INDEX statement.
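For example (index and table names illustrative):
CREATE INDEX revenue_idx ON films (revenue);
DROP INDEX revenue_idx;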
Index Limitations
ART indexes create a secondary copy of the data in a second location; this complicates processing, particularly when combined with transactions. Certain limitations apply when it comes to modifying data that is also stored in secondary indexes.
Note. As expected, indexes have a strong effect on performance, slowing down loading and updates, but speeding up certain
queries. Please consult the Performance Guide for details.
Updates Become Deletes and Inserts When an update statement is executed on a column that is present in an index, the statement is
transformed into a delete of the original row followed by an insert. This has certain performance implications, particularly for wide tables,
as entire rows are rewritten instead of only the affected columns.
Over-Eager Unique Constraint Checking Due to the presence of transactions, data can only be removed from the index after (1) the
transaction that performed the delete is committed, and (2) no further transactions exist that refer to the old entry still present in the index.
As a result, transactions that perform deletions followed by insertions may trigger unexpected unique constraint violations, as the
deleted tuple has not actually been removed from the index yet. For example:
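A sketch of the failure mode (schema illustrative):
CREATE TABLE students (id INTEGER PRIMARY KEY, name VARCHAR);
INSERT INTO students VALUES (1, 'John Doe');
BEGIN TRANSACTION;
DELETE FROM students WHERE id = 1;
INSERT INTO students VALUES (1, 'Jane Doe');
-- may fail with a unique constraint violation: the deleted key is still in the index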
This, combined with the fact that updates are turned into deletions and insertions within the same transaction, means that updating rows
in the presence of unique or primary key constraints can often lead to unexpected unique constraint violations.
Currently, this is an expected limitation of the system ‑ although we aim to resolve this in the future.
Information Schema
The views in the information_schema are SQL‑standard views that describe the catalog entries of the database. These views can be
filtered to obtain information about a specific column or table.
The top level catalog view is information_schema.schemata. It lists the catalogs and the schemas present in the database and has
the following layout:
511
DuckDB Documentation
The view that describes the catalog information for tables and views is information_schema.tables. It lists the tables present in
the database and has the following layout:
• table_catalog (VARCHAR): The catalog the table or view belongs to. Example: NULL
• table_schema (VARCHAR): The schema the table or view belongs to. Example: 'main'
• table_name (VARCHAR): The name of the table or view. Example: 'widgets'
• table_type (VARCHAR): The type of table. One of: BASE TABLE, LOCAL TEMPORARY, VIEW. Example: 'BASE TABLE'
• self_referencing_column_name (VARCHAR): Applies to a feature not available in DuckDB. Example: NULL
• reference_generation (VARCHAR): Applies to a feature not available in DuckDB. Example: NULL
• user_defined_type_catalog (VARCHAR): If the table is a typed table, the name of the database that contains the underlying data type (always the current database), else null. Currently unimplemented. Example: NULL
• user_defined_type_schema (VARCHAR): If the table is a typed table, the name of the schema that contains the underlying data type, else null. Currently unimplemented. Example: NULL
• user_defined_type_name (VARCHAR): If the table is a typed table, the name of the underlying data type, else null. Currently unimplemented. Example: NULL
• is_insertable_into (VARCHAR): YES if the table is insertable into, NO if not. (Base tables are always insertable into, views not necessarily.) Example: 'YES'
• is_typed (VARCHAR): YES if the table is a typed table, NO if not. Example: 'NO'
• commit_action (VARCHAR): Not yet implemented. Example: 'NO'
Columns
The view that describes the catalog information for columns is information_schema.columns. It lists the columns present in the
database and has the following layout:
Catalog Functions
Several functions are also provided to see details about the catalogs and schemas that are configured in the database.
DuckDB offers a collection of table functions that provide metadata about the current database. These functions reside in the main schema
and their names are prefixed with duckdb_.
The resultset returned by a duckdb_ table function may be used just like an ordinary table or view. For example, you can use a duckdb_
function call in the FROM clause of a SELECT statement, and you may refer to the columns of its returned resultset elsewhere in the state‑
ment, for example in the WHERE clause.
Table functions are still functions, and you should write parentheses after the function name to call it and obtain its returned resultset:
SELECT * FROM duckdb_settings();
Alternatively, you may also execute table functions using the CALL syntax:
CALL duckdb_settings();
Note. For some of the duckdb_% functions, there is also an identically named view available, which also resides in the main
schema. Typically, these views do a SELECT on the duckdb_ table function with the same name, while filtering out those objects
that are marked as internal. We mention it here, because if you accidentally omit the parentheses in your duckdb_ table function
call, you might still get a result, but from the identically named view.
Example:
-- duckdb_views table function: returns all views, including those marked internal
SELECT * FROM duckdb_views();
-- duckdb_views view: returns views that are not marked as internal
SELECT * FROM duckdb_views;
duckdb_columns
The duckdb_columns() function provides metadata about the columns available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains the column object.
• database_oid (BIGINT): Internal identifier of the database that contains the column object.
• schema_name (VARCHAR): The SQL name of the schema that contains the table object that defines this column.
• schema_oid (BIGINT): Internal identifier of the schema object that contains the table of the column.
• table_name (VARCHAR): The SQL name of the table that defines the column.
• table_oid (BIGINT): Internal identifier (name) of the table object that defines the column.
• column_name (VARCHAR): The SQL name of the column.
• column_index (INTEGER): The unique position of the column within its table.
• internal (BOOLEAN): true if this column is built-in, false if it is user-defined.
• column_default (VARCHAR): The default value of the column (expressed in SQL).
• is_nullable (BOOLEAN): true if the column can hold NULL values; false if the column cannot hold NULL values.
• data_type (VARCHAR): The name of the column datatype.
• data_type_id (BIGINT): The internal identifier of the column data type.
• character_maximum_length (INTEGER): Always NULL. DuckDB text types do not enforce a value length restriction based on a length type parameter.
• numeric_precision (INTEGER): The number of units (in the base indicated by numeric_precision_radix) used for storing column values. For integral and approximate numeric types, this is the number of bits. For decimal types, this is the number of digit positions.
• numeric_precision_radix (INTEGER): The number-base of the units in the numeric_precision column. For integral and approximate numeric types, this is 2, indicating the precision is expressed as a number of bits. For the decimal type this is 10, indicating the precision is expressed as a number of decimal positions.
• numeric_scale (INTEGER): Applicable to the decimal type. Indicates the maximum number of fractional digits (i.e., the number of digits that may appear after the decimal separator).
The information_schema.columns system view provides a more standardized way to obtain metadata about database columns, but
the duckdb_columns function also returns metadata about DuckDB internal objects. (In fact, information_schema.columns is
implemented as a query on top of duckdb_columns().)
duckdb_constraints
The duckdb_constraints() function provides metadata about the constraints available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains the constraint.
• database_oid (BIGINT): Internal identifier of the database that contains the constraint.
• schema_name (VARCHAR): The SQL name of the schema that contains the table on which the constraint is defined.
• schema_oid (BIGINT): Internal identifier of the schema object that contains the table on which the constraint is defined.
• table_name (VARCHAR): The SQL name of the table on which the constraint is defined.
• table_oid (BIGINT): Internal identifier (name) of the table object on which the constraint is defined.
• constraint_index (BIGINT): Indicates the position of the constraint as it appears in its table definition.
• constraint_type (VARCHAR): Indicates the type of constraint. Applicable values are CHECK, FOREIGN KEY, PRIMARY KEY, NOT NULL, UNIQUE.
• constraint_text (VARCHAR): The definition of the constraint expressed as a SQL phrase. (Not necessarily a complete or syntactically valid DDL statement.)
• expression (VARCHAR): If the constraint is a check constraint, the definition of the condition being checked, otherwise NULL.
• constraint_column_indexes (BIGINT[]): An array of table column indexes referring to the columns that appear in the constraint definition.
• constraint_column_names (VARCHAR[]): An array of table column names appearing in the constraint definition.
duckdb_databases
The duckdb_databases() function lists the databases that are accessible from within the current DuckDB process. Apart from the
database opened at startup, the list also includes databases that were attached later to the DuckDB process.
• database_name (VARCHAR): The name of the database, or the alias if the database was attached using an ALIAS clause.
• database_oid (VARCHAR): The internal identifier of the database.
• path (VARCHAR): The file path associated with the database.
• internal (BOOLEAN): true indicates a system or built-in database; false indicates a user-defined database.
• type (VARCHAR): The type of RDBMS implemented by the attached database. For DuckDB databases, that value is duckdb.
duckdb_dependencies
The duckdb_dependencies() function provides metadata about the dependencies available in the DuckDB instance.
duckdb_extensions
The duckdb_extensions() function provides metadata about the extensions available in the DuckDB instance.
duckdb_functions
The duckdb_functions() function provides metadata about the functions (including macros) available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this function.
• schema_name (VARCHAR): The SQL name of the schema where the function resides.
• function_name (VARCHAR): The SQL name of the function.
• function_type (VARCHAR): The function kind. Value is one of: table, scalar, aggregate, pragma, macro.
• description (VARCHAR): Description of this function (always NULL).
• return_type (VARCHAR): The logical data type name of the returned value. Applicable for scalar and aggregate functions.
• parameters (VARCHAR[]): If the function has parameters, the list of parameter names.
• parameter_types (VARCHAR[]): If the function has parameters, a list of logical data type names corresponding to the parameter list.
• varargs (VARCHAR): The name of the data type in case the function has a variable number of arguments, or NULL if the function does not have a variable number of arguments.
• macro_definition (VARCHAR): If this is a macro, the SQL expression that defines it.
• has_side_effects (BOOLEAN): false if this is a pure function. true if this function changes the database state (like the sequence functions nextval() and currval()).
• function_oid (BIGINT): The internal identifier for this function.
duckdb_indexes
The duckdb_indexes() function provides metadata about secondary indexes available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this index.
• database_oid (BIGINT): Internal identifier of the database containing the index.
• schema_name (VARCHAR): The SQL name of the schema that contains the table with the secondary index.
• schema_oid (BIGINT): Internal identifier of the schema object.
• index_name (VARCHAR): The SQL name of this secondary index.
• index_oid (BIGINT): The object identifier of this index.
• table_name (VARCHAR): The name of the table with the index.
• table_oid (BIGINT): Internal identifier (name) of the table object.
• is_unique (BOOLEAN): true if the index was created with the UNIQUE modifier, false if it was not.
• is_primary (BOOLEAN): Always false.
• expressions (VARCHAR): Always NULL.
• sql (VARCHAR): The definition of the index, expressed as a CREATE INDEX SQL statement.
Note that duckdb_indexes only provides metadata about secondary indexes ‑ i.e., those indexes created by explicit CREATE IN-
DEX statements. Primary keys, foreign keys, and UNIQUE constraints are maintained using indexes, but their details are included in the
duckdb_constraints() function.
duckdb_keywords
The duckdb_keywords() function provides metadata about DuckDB's keywords and reserved words.
duckdb_memory
• tag (VARCHAR): The memory tag. It has one of the following values: BASE_TABLE, HASH_TABLE, PARQUET_READER, CSV_READER, ORDER_BY, ART_INDEX, COLUMN_DATA, METADATA, OVERFLOW_STRINGS, IN_MEMORY_TABLE, ALLOCATOR, EXTENSION.
duckdb_optimizers
The duckdb_optimizers() function provides metadata about the optimization rules (e.g., expression_rewriter, filter_pushdown) available in the DuckDB instance. These can be selectively turned off using PRAGMA disabled_optimizers.
duckdb_schemas
The duckdb_schemas() function provides metadata about the schemas available in the DuckDB instance.
The information_schema.schemata system view provides a more standardized way to obtain metadata about database schemas.
duckdb_secrets
The duckdb_secrets() function provides metadata about the secrets available in the DuckDB instance.
duckdb_sequences
The duckdb_sequences() function provides metadata about the sequences available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this sequence.
• database_oid (BIGINT): Internal identifier of the database containing the sequence.
• schema_name (VARCHAR): The SQL name of the schema that contains the sequence object.
• schema_oid (BIGINT): Internal identifier of the schema object that contains the sequence object.
• sequence_name (VARCHAR): The SQL name that identifies the sequence within the schema.
• sequence_oid (BIGINT): The internal identifier of this sequence object.
• temporary (BOOLEAN): Whether this sequence is temporary. Temporary sequences are transient and only visible within the current connection.
• start_value (BIGINT): The initial value of the sequence. This value will be returned when nextval() is called for the very first time on this sequence.
• min_value (BIGINT): The minimum value of the sequence.
• max_value (BIGINT): The maximum value of the sequence.
• increment_by (BIGINT): The value that is added to the current value of the sequence to draw the next value from the sequence.
• cycle (BOOLEAN): Whether the sequence should start over when drawing the next value would result in a value outside the range.
• last_value (BIGINT): null if no value was ever drawn from the sequence using nextval(...); 1 if a value was drawn.
• sql (VARCHAR): The definition of this object, expressed as a SQL DDL statement.
Attributes like temporary, start_value etc. correspond to the various options available in the CREATE SEQUENCE statement and
are documented there in full. Note that the attributes will always be filled out in the duckdb_sequences resultset, even if they were not
explicitly specified in the CREATE SEQUENCE statement.
Note.
1. The column name last_value suggests that it contains the last value that was drawn from the sequence, but that is not
the case. It's either null if a value was never drawn from the sequence, or 1 (when there was a value drawn, ever, from the
sequence).
2. If the sequence cycles, then the sequence will start over from the boundary of its range, not necessarily from the value specified
as start value.
duckdb_settings
The duckdb_settings() function provides metadata about the settings available in the DuckDB instance.
duckdb_tables
The duckdb_tables() function provides metadata about the base tables available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this table.
• database_oid (BIGINT): Internal identifier of the database containing the table.
• schema_name (VARCHAR): The SQL name of the schema that contains the base table.
• schema_oid (BIGINT): Internal identifier of the schema object that contains the base table.
• table_name (VARCHAR): The SQL name of the base table.
• table_oid (BIGINT): Internal identifier of the base table object.
• internal (BOOLEAN): false if this is a user-defined table.
• temporary (BOOLEAN): Whether this is a temporary table. Temporary tables are not persisted and only visible within the current connection.
• has_primary_key (BOOLEAN): true if this table object defines a PRIMARY KEY.
• estimated_size (BIGINT): The estimated number of rows in the table.
• column_count (BIGINT): The number of columns defined by this object.
• index_count (BIGINT): The number of indexes associated with this table. This number includes all secondary indexes, as well as internal indexes generated to maintain PRIMARY KEY and/or UNIQUE constraints.
• check_constraint_count (BIGINT): The number of check constraints active on columns within the table.
• sql (VARCHAR): The definition of this object, expressed as a SQL CREATE TABLE statement.
The information_schema.tables system view provides a more standardized way to obtain metadata about database tables that
also includes views. But the resultset returned by duckdb_tables contains a few columns that are not included in information_
schema.tables.
duckdb_types
The duckdb_types() function provides metadata about the data types available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this schema.
• database_oid (BIGINT): Internal identifier of the database that contains the data type.
• schema_name (VARCHAR): The SQL name of the schema containing the type definition. Always main.
• schema_oid (BIGINT): Internal identifier of the schema object.
• type_name (VARCHAR): The name or alias of this data type.
• type_oid (BIGINT): The internal identifier of the data type object. If NULL, then this is an alias of the type (as identified by the value in the logical_type column).
• type_size (BIGINT): The number of bytes required to represent a value of this type in memory.
• logical_type (VARCHAR): The 'canonical' name of this data type. The same logical_type may be referenced by several types having different type_names.
• type_category (VARCHAR): The category to which this type belongs. Data types within the same category generally expose similar behavior when values of this type are used in an expression. For example, the NUMERIC type_category includes integers, decimals, and floating point numbers.
• internal (BOOLEAN): Whether this is an internal (built-in) or a user object.
duckdb_views
The duckdb_views() function provides metadata about the views available in the DuckDB instance.
• database_name (VARCHAR): The name of the database that contains this view.
• database_oid (BIGINT): Internal identifier of the database that contains this view.
• schema_name (VARCHAR): The SQL name of the schema where the view resides.
• schema_oid (BIGINT): Internal identifier of the schema object that contains the view.
• view_name (VARCHAR): The SQL name of the view object.
• view_oid (BIGINT): The internal identifier of this view object.
• internal (BOOLEAN): true if this is an internal (built-in) view, false if this is a user-defined view.
• temporary (BOOLEAN): true if this is a temporary view. Temporary views are not persistent and are only visible within the current connection.
• column_count (BIGINT): The number of columns defined by this view object.
• sql (VARCHAR): The definition of this object, expressed as a SQL DDL statement.
The information_schema.tables system view provides a more standardized way to obtain metadata about database views that
also includes base tables. But the resultset returned by duckdb_views also contains definitions of internal view objects, as well as a few
columns that are not included in information_schema.tables.
duckdb_temporary_files
The duckdb_temporary_files() function provides metadata about the temporary files DuckDB has written to disk, to offload data
from memory. This function mostly exists for debugging and testing purposes.
Identifiers
Similarly to other SQL dialects and programming languages, identifiers in DuckDB's SQL are subject to several rules.
• Unquoted identifiers need to comply with the following rules:
– They must not be a reserved keyword (see duckdb_keywords()), e.g., SELECT 123 AS SELECT will fail.
– They must not start with a number or special character, e.g., SELECT 123 AS 1col is invalid.
– They cannot contain whitespaces (including tabs and newline characters).
• Identifiers can be quoted using double‑quote characters ("). Quoted identifiers can use any keyword, whitespace or special character,
e.g., "SELECT" and " § ¶ " are valid identifiers.
• Quotes themselves can be escaped by repeating the quote character, e.g., to create an identifier named IDENTIFIER "X", use
"IDENTIFIER ""X""".
Deduplicating Identifiers In some cases, duplicate identifiers can occur, e.g., column names may conflict when unnesting a nested data
structure. In these cases, DuckDB automatically deduplicates column names by renaming them: the first occurrence keeps its original name, and each later duplicate receives a numeric suffix (_1, _2, and so on).
For example:
SELECT *
FROM (SELECT UNNEST({'a': 42, 'b': {'a': 88, 'b': 99}}, recursive := true));
┌───────┬───────┬───────┐
│ a │ a_1 │ b │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│ 42 │ 88 │ 99 │
└───────┴───────┴───────┘
Database Names
Additionally, it is best practice to avoid DuckDB's two internal database schema names, system and temp. By default, persistent
databases are named after their filename without the extension. Therefore, the filenames system.db and temp.db (as well as
system.duckdb and temp.duckdb) result in the database names system and temp, respectively. If you need to attach to a database
that has one of these names, use an alias, e.g.:
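For example (the alias name is arbitrary):
ATTACH 'temp.db' AS temp2;
USE temp2;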
Numeric Literals
DuckDB's SQL dialect allows using the underscore character _ in numeric literals as an optional separator. The rules for using underscores
are as follows: underscores may only appear between digits, so they cannot be placed at the start or end of a literal, nor directly next to a decimal point, an exponent marker, or another underscore.
Examples
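A couple of illustrative literals (values arbitrary):
SELECT 100_000_000; -- 100000000
SELECT 1_000.5;     -- 1000.5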
Keywords and Function Names SQL keywords and function names are case‑insensitive in DuckDB.
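For example, a query along these lines, with arbitrary casing of the function names, produces the result below:
SELECT COS(Pi()) AS CosineOfPi;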
┌────────────┐
│ CosineOfPi │
│ double │
├────────────┤
│ -1.0 │
└────────────┘
Case‑Sensitivity of Identifiers Following the convention of the SQL standard, identifiers in DuckDB are case‑insensitive. However, each
character's case (uppercase/lowercase) is maintained as originally specified by the user even if a query uses different cases when referring
to the identifier. For example:
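A query along these lines produces the result below; the original casing of CosineOfPi is preserved even though the reference uses lowercase:
CREATE TABLE tbl AS SELECT cos(pi()) AS CosineOfPi;
SELECT cosineofpi FROM tbl;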
┌────────────┐
│ CosineOfPi │
│ double │
├────────────┤
│ -1.0 │
└────────────┘
Handling Conflicts In case of a conflict, when the same identifier is spelt with different cases, one will be selected randomly. For example:
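A sketch of such a conflict (schema illustrative): two empty tables spell the shared column differently, and a NATURAL JOIN must pick one spelling for the output:
CREATE TABLE t1 (idfield INTEGER, x INTEGER);
CREATE TABLE t2 (IdField INTEGER, y INTEGER);
SELECT * FROM t1 NATURAL JOIN t2;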
┌─────────┬───────┬───────┐
│ idfield │ x │ y │
│ int32 │ int32 │ int32 │
├─────────────────────────┤
│ 0 rows │
└─────────────────────────┘
Disabling Preserving Cases With the preserve_identifier_case configuration option set to false, all identifiers are turned
into lowercase:
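For example:
SET preserve_identifier_case = false;
SELECT cos(pi()) AS CosineOfPi;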
┌────────────┐
│ cosineofpi │
│ double │
├────────────┤
│ -1.0 │
└────────────┘
Samples
Examples
-- select a sample of 5 rows from "tbl" using reservoir sampling
SELECT * FROM tbl USING SAMPLE 5;
-- select a sample of 10% of the table using system sampling (cluster sampling)
SELECT * FROM tbl USING SAMPLE 10%;
-- select a sample of 10% of the table using bernoulli sampling
SELECT * FROM tbl USING SAMPLE 10 PERCENT (bernoulli);
-- select a sample of 50 rows of the table using reservoir sampling with a fixed seed (100)
SELECT * FROM tbl USING SAMPLE reservoir(50 ROWS) REPEATABLE (100);
-- select a sample of 10% of the table using system sampling with a fixed seed (377)
SELECT * FROM tbl USING SAMPLE 10% (system, 377);
-- select a sample of 20% of "tbl" BEFORE the join with tbl2
SELECT * FROM tbl TABLESAMPLE reservoir(20%), tbl2 WHERE tbl.i = tbl2.i;
-- select a sample of 20% of the result AFTER the join of "tbl" with tbl2
SELECT * FROM tbl, tbl2 WHERE tbl.i = tbl2.i USING SAMPLE reservoir(20%);
Syntax Samples allow you to randomly extract a subset of a dataset. Samples are useful for exploring a dataset faster, as often you might
not be interested in the exact answers to queries, but only in rough indications of what the data looks like and what is in the data. Samples
allow you to get approximate answers to queries faster, as they reduce the amount of data that needs to pass through the query engine.
DuckDB supports three different types of sampling methods: reservoir, bernoulli and system. By default, DuckDB uses reservoir sampling when an exact number of rows is sampled, and system sampling when a percentage is specified. The sampling methods are described in detail below.
Samples require a sample size, which is an indication of how many elements will be sampled from the total population. Samples can either
be given as a percentage (10%) or as a fixed number of rows (10 rows). All three sampling methods support sampling over a percentage,
but only reservoir sampling supports sampling a fixed number of rows.
Samples are probabilistic; that is to say, samples can differ between runs unless a seed is explicitly specified. Specifying the seed
only guarantees that the sample is the same if multi-threading is not enabled (i.e., SET threads = 1). In the case of multiple threads
running over a sample, samples are not necessarily consistent even with a fixed seed.
reservoir Reservoir sampling is a stream sampling technique that selects a random sample by keeping a reservoir of size equal to
the sample size, and randomly replacing elements as more elements come in. Reservoir sampling allows us to specify exactly how many
elements we want in the resulting sample (by selecting the size of the reservoir). As a result, reservoir sampling always outputs the same
amount of elements, unlike system and bernoulli sampling.
Reservoir sampling is only recommended for small sample sizes, and is not recommended for use with percentages. That is because reser‑
voir sampling needs to materialize the entire sample and randomly replace tuples within the materialized sample. The larger the sample
size, the higher the performance hit incurred by this process.
Reservoir sampling also incurs an additional performance penalty when multi-processing is used, since the reservoir must be shared
amongst the different threads to ensure unbiased sampling. This is not a big problem when the reservoir is very small, but it becomes costly
when the sample is large.
Note. Best practice Avoid using reservoir sampling with large sample sizes if possible. Reservoir sampling requires the entire sample
to be materialized in memory.
bernoulli Bernoulli sampling can only be used when a sampling percentage is specified. It is rather straightforward: every tuple in the
underlying table is included with a chance equal to the specified percentage. As a result, bernoulli sampling can return a different number
of tuples even if the same percentage is specified. The amount of rows will generally be more or less equal to the specified percentage of
the table, but there will be some variance.
Because bernoulli sampling is completely independent (there is no shared state), there is no penalty for using bernoulli sampling together
with multiple threads.
system System sampling is a variant of bernoulli sampling with one crucial difference: every vector is included with a chance equal
to the sampling percentage. This is a form of cluster sampling. System sampling is more efficient than bernoulli sampling, as no per‑
tuple selections have to be performed. There is almost no extra overhead for using system sampling, whereas bernoulli sampling can add
additional cost as it has to perform random number generation for every single tuple.
System sampling is not suitable for smaller data sets as the granularity of the sampling is on the order of ~1000 tuples. That means that if
system sampling is used for small data sets (e.g., 100 rows) either all the data will be filtered out, or all the data will be included.
Table Samples
The TABLESAMPLE and USING SAMPLE clauses are identical in terms of syntax and effect, with one important difference: tablesamples
sample directly from the table for which they are specified, whereas the sample clause samples after the entire from clause has been
resolved. This is relevant when there are joins present in the query plan.
The TABLESAMPLE clause is essentially equivalent to creating a subquery with the USING SAMPLE clause, i.e., the following two queries
are identical:
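A sketch of the equivalence, reusing tbl and tbl2 from the examples above:
-- sample 20% of tbl BEFORE the join
SELECT * FROM tbl TABLESAMPLE reservoir(20%), tbl2 WHERE tbl.i = tbl2.i;
-- equivalent subquery form
SELECT * FROM (SELECT * FROM tbl USING SAMPLE reservoir(20%)) tbl, tbl2 WHERE tbl.i = tbl2.i;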
-- sample 20% AFTER the join (i.e., sample 20% of the join result)
SELECT * FROM tbl, tbl2 WHERE tbl.i = tbl2.i USING SAMPLE reservoir(20%);
Window Functions
Examples
-- generate a "row_number" column containing incremental identifiers for each row
SELECT row_number() OVER () FROM sales;
-- generate a "row_number" column, by order of time
SELECT row_number() OVER (ORDER BY time) FROM sales;
-- generate a "row_number" column, by order of time partitioned by region
SELECT row_number() OVER (PARTITION BY region ORDER BY time) FROM sales;
-- compute the difference between the current amount, and the previous amount,
-- by order of time
SELECT amount - lag(amount) OVER (ORDER BY time) FROM sales;
-- compute the percentage of the total amount of sales per region for each row
SELECT amount / sum(amount) OVER (PARTITION BY region) FROM sales;
Syntax
Window functions can only be used in the SELECT clause. To share OVER specifications between functions, use the statement's WINDOW
clause and use the OVER window-name syntax.
• lead(expr any[, offset integer[, default any]]) (returns: same type as expr): Returns expr evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead returns default (which must be of the same type as expr). Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null. Example: lead(column, 3, 0).
• nth_value(expr any, nth integer) (returns: same type as expr): Returns expr evaluated at the nth row of the window frame (counting from 1); null if no such row. Example: nth_value(column, 2).
• ntile(num_buckets integer) (returns: BIGINT): An integer ranging from 1 to the argument value, dividing the partition as equally as possible. Example: ntile(4).
• percent_rank() (returns: DOUBLE): The relative rank of the current row: (rank() - 1) / (total partition rows - 1). Example: percent_rank().
• rank_dense() (returns: BIGINT): Alias for dense_rank. Example: rank_dense().
• rank() (returns: BIGINT): The rank of the current row with gaps; same as the row_number of its first peer. Example: rank().
• row_number() (returns: BIGINT): The number of the current row within the partition, counting from 1. Example: row_number().
Ignoring NULLs
Note that there is no comma separating the arguments from the IGNORE NULLS specification.
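A sketch, reusing the sales table from the examples above; the expression skips over NULL amounts:
SELECT lag(amount, 1 IGNORE NULLS) OVER (ORDER BY time) FROM sales;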
528
DuckDB Documentation
The inverse of IGNORE NULLS is RESPECT NULLS, which is the default for all functions.
Evaluation
Windowing works by breaking a relation up into independent partitions, ordering those partitions, and then computing a new column for
each row as a function of the nearby values. Some window functions depend only on the partition boundary and the ordering, but a few
(including all the aggregates) also use a frame. Frames are specified as a number of rows on either side (preceding or following) of the current
row. The distance can either be specified as a number of rows or a range of values using the partition's ordering value and a distance.
The full syntax is shown in the diagram at the top of the page, and this diagram visually illustrates the computation environment:
Partition and Ordering Partitioning breaks the relation up into independent, unrelated pieces. Partitioning is optional, and if none is
specified then the entire relation is treated as a single partition. Window functions cannot access values outside of the partition containing
the row they are being evaluated at.
Ordering is also optional, but without it the results are not well‑defined. Each partition is ordered using the same ordering clause.
Here is a table of power generation data, available as a CSV file (power-plant-generation-history.csv). To load the data, run:
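One way to load it (a sketch; the table name "Generation History" matches the queries below):
CREATE TABLE "Generation History" AS
    SELECT * FROM read_csv_auto('power-plant-generation-history.csv');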
After partitioning by plant and ordering by date, it will have this layout:
In what follows, we shall use this table (or small sections of it) to illustrate various pieces of window function evaluation.
The simplest window function is row_number(). This function just computes the 1‑based row number within the partition using the
query:
SELECT
"Plant",
"Date",
row_number() OVER (PARTITION BY "Plant" ORDER BY "Date") AS "Row"
FROM "Generation History"
ORDER BY 1, 2;
Plant      Date        Row
Boston     2019-01-02  1
Boston     2019-01-03  2
Boston     2019-01-04  3
...        ...         ...
Worcester  2019-01-02  1
Worcester  2019-01-03  2
Worcester  2019-01-04  3
...        ...         ...
Note that even though the function is computed using an ORDER BY clause, the result is not necessarily sorted, so an explicit ORDER BY on the SELECT is needed if sorted output is desired.
Framing Framing specifies a set of rows relative to each row where the function is evaluated. The distance from the current row is given
as an expression either PRECEDING or FOLLOWING the current row. This distance can either be specified as an integral number of ROWS
or as a RANGE delta expression from the value of the ordering expression. For a RANGE specification, there must be only one ordering ex‑
pression, and it has to support addition and subtraction (i.e., numbers or INTERVALs). The default values for frames are from UNBOUNDED
PRECEDING to CURRENT ROW. It is invalid for a frame to start after it ends. Using the EXCLUDE clause, rows around the current row can
be excluded from the frame.
ROW Framing Here is a simple ROW frame query, using an aggregate function:
SELECT points,
sum(points) OVER (
ROWS BETWEEN 1 PRECEDING
AND 1 FOLLOWING) we
FROM results;
This query computes the sum of each point and the points on either side of it:
Notice that at the edge of the partition, there are only two values added together. This is because frames are cropped to the edge of the
partition.
RANGE Framing Returning to the power data, suppose the data is noisy. We might want to compute a 7-day moving average for each plant to smooth out the noise. To do this, we can use this window query:
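A query along these lines does this (a sketch; the name of the measurement column, "MWh", is an assumption):
SELECT "Plant", "Date",
    avg("MWh") OVER (
        PARTITION BY "Plant"
        ORDER BY "Date" ASC
        RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
                  AND INTERVAL 3 DAYS FOLLOWING)
        AS "MWh 7-day Moving Average"
FROM "Generation History"
ORDER BY 1, 2;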
This query partitions the data by Plant (to keep the different power plants' data separate), orders each plant's partition by Date (to put
the energy measurements next to each other), and uses a RANGE frame of three days on either side of each day for the avg (to handle any
missing days). This is the result:
EXCLUDE Clause The EXCLUDE clause allows rows around the current row to be excluded from the frame. It has the following options:
• EXCLUDE NO OTHERS: exclude nothing (the default)
• EXCLUDE CURRENT ROW: exclude the current row from the frame
• EXCLUDE GROUP: exclude the current row and all its peers (rows equal in the ordering) from the frame
• EXCLUDE TIES: exclude the current row's peers from the frame, but keep the current row itself
WINDOW Clauses Multiple different OVER clauses can be specified in the same SELECT, and each will be computed separately. Often,
however, we want to use the same layout for multiple window functions. The WINDOW clause can be used to define a named window that
can be shared between multiple window functions:
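For example, a sketch reusing the 7-day frame from above (the "MWh" column is an assumption):
SELECT "Plant", "Date",
    min("MWh") OVER seven AS "MWh 7-day Minimum",
    avg("MWh") OVER seven AS "MWh 7-day Moving Average",
    max("MWh") OVER seven AS "MWh 7-day Maximum"
FROM "Generation History"
WINDOW seven AS (
    PARTITION BY "Plant"
    ORDER BY "Date" ASC
    RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
              AND INTERVAL 3 DAYS FOLLOWING)
ORDER BY 1, 2;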
The three window functions will also share the data layout, which will improve performance.
Multiple windows can be defined in the same WINDOW clause by comma‑separating them:
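A sketch defining two named windows in one clause (column names as assumed above):
SELECT "Plant", "Date",
    avg("MWh") OVER seven AS "MWh 7-day Moving Average",
    avg("MWh") OVER three AS "MWh 3-day Moving Average"
FROM "Generation History"
WINDOW
    seven AS (PARTITION BY "Plant" ORDER BY "Date" ASC
        RANGE BETWEEN INTERVAL 3 DAYS PRECEDING AND INTERVAL 3 DAYS FOLLOWING),
    three AS (PARTITION BY "Plant" ORDER BY "Date" ASC
        RANGE BETWEEN INTERVAL 1 DAYS PRECEDING AND INTERVAL 1 DAYS FOLLOWING)
ORDER BY 1, 2;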
The queries above do not use a number of clauses commonly found in select statements, like WHERE, GROUP BY, etc. For more complex
queries you can find where WINDOW clauses fall in the canonical order of the SELECT statement.
Filtering the Results of Window Functions Using QUALIFY Window functions are executed after the WHERE and HAVING clauses have already been evaluated, so it is not possible to use these clauses to filter the results of window functions. The QUALIFY clause avoids the need for a subquery or WITH clause to perform this filtering.
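For example, to keep only the first three rows per plant (a sketch against the "Generation History" table):
SELECT "Plant", "Date"
FROM "Generation History"
QUALIFY row_number() OVER (PARTITION BY "Plant" ORDER BY "Date") <= 3
ORDER BY 1, 2;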
Box and Whisker Queries All aggregates can be used as windowing functions, including the complex statistical functions. These function
implementations have been optimised for windowing, and we can use the window syntax to write queries that generate the data for moving
box‑and‑whisker plots:
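A sketch, assuming the "MWh" column and the seven-day window from above; quantile_cont computes the quartiles:
SELECT "Plant", "Date",
    min("MWh") OVER seven AS "MWh 7-day Minimum",
    quantile_cont("MWh", [0.25, 0.5, 0.75]) OVER seven AS "MWh 7-day Quartiles",
    max("MWh") OVER seven AS "MWh 7-day Maximum"
FROM "Generation History"
WINDOW seven AS (
    PARTITION BY "Plant"
    ORDER BY "Date" ASC
    RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
              AND INTERVAL 3 DAYS FOLLOWING)
ORDER BY 1, 2;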
Extensions
Overview
DuckDB has a flexible extension mechanism that allows for dynamically loading extensions. These may extend DuckDB's functionality by providing support for additional file formats, introducing new types, and adding domain-specific functionality.
Note. Extensions are loadable on all clients (e.g., Python and R). Extensions distributed via the official repository are built and
tested on macOS (AMD64 and ARM64), Windows (AMD64) and Linux (AMD64 and ARM64).
Using Extensions
To get a list of extensions and their status, query the duckdb_extensions() table function:
FROM duckdb_extensions();
extension_name  loaded  installed  install_path  description                                                    aliases
arrow           false   false                    A zero-copy data integration between Apache Arrow and DuckDB  []
autocomplete    false   false                    Adds support for autocomplete in the shell                     []
...             ...     ...        ...           ...                                                            ...
Extension Types
Built‑In Extensions Built‑in extensions are loaded at startup and are immediately available for use.
SELECT *
FROM 'test.json';
This will use the json extension to read the JSON file.
Note. To make the DuckDB distribution lightweight, it only contains a few fundamental built‑in extensions (e.g., autocomplete,
json, parquet), which are loaded automatically.
Autoloadable Extensions Some extensions are autoloaded on first use. For example:
SELECT *
FROM 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv';
To access files via the HTTPS protocol, DuckDB will automatically load the httpfs extension. Similarly, other autoloadable extensions
(aws, fts) will be loaded on‑demand. If an extension is not already available locally, it will be installed from the official extension repository
(extensions.duckdb.org).
Explicitly Loadable Extensions Some extensions make several changes to the running DuckDB instance, hence, autoloading them may
not be possible. These extensions have to be installed and loaded using the following SQL statements:
INSTALL spatial;
LOAD spatial;
Extension Handling through the Python API If you are using the Python API client, you can install and load extensions with the install_extension(name: str) and load_extension(name: str) methods.
Extension Signing Extensions are signed with a cryptographic key, which also simplifies distribution (this is why they are served over HTTP and not HTTPS). By default, DuckDB uses its built-in public keys to verify the integrity of extensions before loading them. All extensions provided by the DuckDB core team are signed.
If you wish to load your own extensions or extensions from third‑parties you will need to enable the allow_unsigned_extensions
flag. To load unsigned extensions using the CLI client, pass the -unsigned flag to it on startup. For the Python client, see the Loading
and Installing Extensions section in the Python API documentation.
Sharing Extensions between Clients The shared installation location allows extensions to be shared between the client APIs of the same
DuckDB version, as long as they share the same platform or ABI. For example, if an extension is installed with version 0.10.0 of the CLI
client on macOS, it is available from the Python, R, etc. client libraries provided that they have access to the user's home directory and use
DuckDB version 0.10.0.
See the Working with Extensions page for details on available platforms.
Installation Location
By default, extensions are installed under the user's home directory:
~/.duckdb/extensions/v{duckdb_version}/{platform_name}/
For example, the extensions for DuckDB version 0.10.0 on macOS ARM64 (Apple Silicon) are installed to ~/.duckdb/extensions/v0.10.0/osx_arm64/.
Note. For development builds, the directory of the extensions corresponds to the Git hash of the build, e.g., ~/.duckdb/extensions/fc2e4b…/…_amd64_gcc4.
Changing the Extension Directory To specify a different extension directory, use the extension_directory configuration option:
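For example (the path is illustrative):
SET extension_directory = '/path/to/your/extension/directory';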
Developing Extensions
The same API that the official extensions use is available for developing extensions. This allows users to extend the functionality of DuckDB
such that it suits their domain the best. A template for creating extensions is available in the extension-template repository.
For advanced installation instructions and more details on extensions, see the Working with Extensions page.
Official Extensions
Name          GitHub  Description                                                                          Aliases
arrow         GitHub  A zero-copy data integration between Apache Arrow and DuckDB
autocomplete          Adds support for autocomplete in the shell
aws           GitHub  Provides features that depend on the AWS SDK
azure         GitHub  Adds a filesystem abstraction for Azure blob storage to DuckDB
excel                 Adds support for Excel-like format strings
fts                   Adds support for Full-Text Search Indexes
httpfs                Adds support for reading and writing files over an HTTP(S) or S3 connection         http, https, s3
iceberg       GitHub  Adds support for Apache Iceberg
icu                   Adds support for time zones and collations using the ICU library
inet                  Adds support for IP-related data types and functions
jemalloc              Overwrites system allocator with jemalloc
json                  Adds support for JSON operations
mysql         GitHub  Adds support for reading from and writing to a MySQL database
parquet               Adds support for reading and writing Parquet files
postgres      GitHub  Adds support for reading from and writing to a Postgres database                    postgres_scanner
spatial       GitHub  Geospatial extension that adds support for working with spatial data and functions
sqlite        GitHub  Adds support for reading from and writing to SQLite database files                  sqlite_scanner, sqlite3
substrait     GitHub  Adds support for the Substrait integration
tpcds                 Adds TPC-DS data generation and query support
tpch                  Adds TPC-H data generation and query support
Default Extensions
Different DuckDB clients ship a different set of extensions. We summarize the main distributions in the table below.
The jemalloc extension's availability depends on the operating system: it is a built-in extension on Linux, optionally available on macOS (via compiling from source), and not available on Windows.
Downloading Extensions Directly
Downloading an extension directly can be helpful when building a Lambda service or container that uses DuckDB. DuckDB extensions are stored in public S3 buckets, but the directory structure of those buckets is not searchable. As a result, a direct URL to the file must be used.
To directly download an extension file, use the following format:
http://extensions.duckdb.org/v{duckdb_version}/{platform_name}/{extension_name}.duckdb_extension.gz
For example:
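An illustrative URL, substituting the json extension, DuckDB 0.10.0, and the windows_amd64 platform:
http://extensions.duckdb.org/v0.10.0/windows_amd64/json.duckdb_extension.gz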
Platforms
Extension binaries must be built for each platform. We distribute pre‑built binaries for several platforms (see below). For platforms where
packages for certain extensions are not available, users can build them from source and install the resulting binaries manually.
Note. For some Linux ARM distributions (e.g., the Python client), two different binaries are distributed, targeting either the linux_arm64 or the linux_arm64_gcc4 platform. Extension binaries are distributed for the first, but not the second. Effectively, this means that on these platforms your glibc version needs to be 2.28 or higher to use the distributed extension binaries.
Extension binaries are also distributed for the following platforms:
• windows_amd64_rtools
• wasm_eh and wasm_mvp (see DuckDB-Wasm's extensions)
For platforms outside the ones listed above, we do not officially distribute extensions (e.g., linux_arm64_gcc4, windows_amd64_mingw).
To load extensions from a custom extension repository, set the following configuration option.
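For example (the repository URL is illustrative):
SET custom_extension_repository = 'http://nightly-extensions.duckdb.org';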
Local Files
folder
└── 0fd6fb9198
└── osx_arm64
├── autocomplete.duckdb_extension
├── httpfs.duckdb_extension
├── icu.duckdb_extension.gz
├── inet.duckdb_extension
├── json.duckdb_extension
├── parquet.duckdb_extension
├── tpcds.duckdb_extension
├── tpcds.duckdb_extension.gz
└── tpch.duckdb_extension.gz
The first level is the DuckDB version, the second is the DuckDB platform, and within that, extensions are stored either as name.duckdb_extension or as gzip-compressed name.duckdb_extension.gz files.
INSTALL icu;
The execution of this statement will first look for icu.duckdb_extension.gz, then for icu.duckdb_extension, in the folder's file structure. If it finds either of the extension binaries, it will install the extension to the location specified by the extension_directory option (which defaults to ~/.duckdb/extensions). If the file is compressed, it is decompressed at this step.
Remote Repositories
Remote extension repositories work the same way as local ones and expect the same folder structure; both gzip-compressed and uncompressed extension files are supported. The only special case is that the httpfs extension must already be available locally, since it is needed to actually access the remote files. You can install it from the default repository first:
RESET custom_extension_repository;
INSTALL httpfs;
INSTALL x FROM y You can also use the INSTALL command's FROM clause to specify the path of the custom extension repository.
For example:
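A sketch (the repository URL is illustrative):
FORCE INSTALL azure FROM 'http://nightly-extensions.duckdb.org';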
This will force install the azure extension from the specified URL.
Installing Extensions from an Explicit Path INSTALL can be used with the path to either a .duckdb_extension file or a .duckdb_extension.gz file. For example, if the file is in the same directory where DuckDB is executed, you can install it as follows:
-- uncompressed file
INSTALL 'path/to/httpfs.duckdb_extension';
-- gzip-compressed file
INSTALL 'path/to/httpfs.duckdb_extension.gz';
When DuckDB installs an extension, it is copied to a local directory to be cached, avoiding any network traffic. Any subsequent calls to INSTALL extension_name will use the local version instead of downloading the extension again. To force re-downloading the extension, run:
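For example, for the httpfs extension used above:
FORCE INSTALL httpfs;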
LOAD can be used with the path to a .duckdb_extension file. For example, if the file was available at the (relative) path path/to/httpfs.duckdb_extension, you can load it as follows:
-- uncompressed file
LOAD 'path/to/httpfs.duckdb_extension';
This will skip any currently installed file in the specified path.
For building and installing extensions from source, see the building guide.
To statically link extensions, follow the developer documentation's ”Using extension config files” section.
Versioning of Extensions
Extension Versions
Note. An extension can have in‑version upgrades. You can run FORCE INSTALL extension to ensure you're on the latest
version of the extension.
DuckDB extensions currently don't have an internal version. This means that, in general, when staying on a DuckDB version, the version of
the extension will be fixed. When a new version of DuckDB is released, this also marks a new release for all extension versions. However,
there are two important sidenotes to make here.
Firstly, some DuckDB extensions may be updated within a DuckDB release in case of bugs. These are considered ”hotfixes” and should not introduce compatibility-breaking changes. This means that when running into issues with an extension, it makes sense to double-check that you are on the latest version of that extension by running FORCE INSTALL extension.
Secondly, in the (near) future DuckDB aims to untie extension versions from DuckDB versions by adding version tags to extensions, the ability to inspect which versions are installed, and the ability to install specific extension versions. Keep in mind that this is likely to change, and that in the future extensions may introduce compatibility-breaking updates within a DuckDB release.
Currently, when extensions are compiled, they are tied to a specific version of DuckDB. What this means is that an extension binary compiled for v0.9.2 does not work for v0.10.0, for example. In most cases, this will not cause any issues and is fully transparent; DuckDB will automatically ensure it installs the correct binary for its version. For extension developers, this means that they must ensure that new binaries are created whenever a new version of DuckDB is released. However, note that DuckDB provides an extension template that makes this fairly simple.
Arrow Extension
The arrow extension implements features for using Apache Arrow, a cross‑language development platform for in‑memory analytics.
The arrow extension will be transparently autoloaded on first use from the official extension repository. If you would like to install and
load it manually, run:
INSTALL arrow;
LOAD arrow;
Functions
Function        Type                   Description
to_arrow_ipc    Table in-out function  Serializes a table into a stream of blobs containing Arrow IPC buffers
scan_arrow_ipc  Table function         Scan a list of pointers pointing to Arrow IPC buffers
AutoComplete Extension
The autocomplete extension adds support for autocomplete in the CLI client. The extension is shipped by default with the CLI client.
Behavior
For the behavior of the autocomplete extension, see the documentation of the CLI client.
Functions
sql_auto_complete(query_string) — attempts autocompletion on the given query_string.
Example
SELECT *
FROM sql_auto_complete('SEL');
Returns:
suggestion suggestion_start
SELECT 0
DELETE 0
INSERT 0
CALL 0
LOAD 0
CALL 0
ALTER 0
BEGIN 0
EXPORT 0
CREATE 0
PREPARE 0
EXECUTE 0
EXPLAIN 0
ROLLBACK 0
DESCRIBE 0
SUMMARIZE 0
CHECKPOINT 0
DEALLOCATE 0
UPDATE 0
DROP 0
GitHub
AWS Extension
The aws extension adds functionality (e.g., authentication) on top of the httpfs extension's S3 capabilities, using the AWS SDK.
INSTALL aws;
LOAD aws;
Features
Function              Type             Description
load_aws_credentials  PRAGMA function  Automatically loads the AWS credentials through the AWS Default Credentials Provider Chain
Usage
CALL load_aws_credentials();
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│ AKIAIOSFODNN7EXAMPLE │ <redacted> │ │ eu-west-1 │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘
CALL load_aws_credentials('minio-testing-2');
┌──────────────────────┬──────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────────┼──────────────────────┼───────────────┤
│ minio_duckdb_user_2 │ <redacted> │ │ eu-west-2 │
└──────────────────────┴──────────────────────────┴──────────────────────┴───────────────┘
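To skip setting the region and return the secret unredacted, the function's set_region and redact_secret parameters can be set to false; a sketch matching the output below:
CALL load_aws_credentials('minio-testing-2', set_region = false, redact_secret = false);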
┌──────────────────────┬──────────────────────────────┬──────────────────────┬───────────────┐
│ loaded_access_key_id │ loaded_secret_access_key │ loaded_session_token │ loaded_region │
│ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼──────────────────────────────┼──────────────────────┼───────────────┤
│ minio_duckdb_user_2 │ minio_duckdb_user_password_2 │ │ │
└──────────────────────┴──────────────────────────────┴──────────────────────┴───────────────┘
Related Extensions
aws depends on the httpfs extension's capabilities, and both will be autoloaded on the first call to load_aws_credentials. If autoinstall or autoload are disabled, you can always explicitly install and load them as follows:
INSTALL aws;
INSTALL httpfs;
LOAD aws;
LOAD httpfs;
Azure Extension
The azure extension is a loadable extension that adds a filesystem abstraction for the Azure Blob storage to DuckDB.
INSTALL azure;
LOAD azure;
Usage
Once the authentication is set up, the Azure Blob Storage can be queried as follows:
SELECT count(*)
FROM 'az://⟨my_container⟩/⟨my_file⟩.⟨parquet_or_csv⟩';
SELECT count(*)
FROM 'az://⟨my_storage_account⟩.blob.core.windows.net/⟨my_container⟩/⟨my_file⟩.⟨parquet_or_csv⟩';
SELECT *
FROM 'az://⟨my_container⟩/*.csv';
Alternatively:
SELECT *
FROM 'az://⟨my_storage_account⟩.blob.core.windows.net/⟨my_container⟩/*.csv';
Configuration
Use the following configuration options to control how the extension reads remote files:
Note. Setting azure_transport_option_type explicitly to curl will have the following effect:
• On Linux, this may solve certificate issues (Error: Invalid Error: Fail to get a new connection for: https://⟨storage account name⟩.blob.core.windows.net/. Problem with the SSL CA cert (path? access rights?)), because when this is specified, the extension will try to find the CA bundle certificate in various paths (which is not done by curl by default and might fail due to static linking).
• On Windows, this replaces the default adapter (WinHTTP), allowing you to use all curl capabilities (for example, using a SOCKS proxy).
• On all operating systems, it will honor the following environment variables:
– CURL_CA_INFO: Path to a PEM-encoded file containing the certificate authorities sent to libcurl. Note that this option is known to only work on Linux and might throw if set on other platforms.
– CURL_CA_PATH: Path to a directory which holds PEM-encoded files containing the certificate authorities sent to libcurl.
Example:
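A minimal example, using the option named in the note above:
SET azure_transport_option_type = 'curl';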
Authentication
The Azure extension has two ways to configure the authentication. The preferred way is to use Secrets.
Authentication with Secret Multiple Secret Providers are available for the Azure extension:
Note.
• If you need to define different secrets for different storage accounts you can use the SCOPE configuration.
• If you use a fully qualified path, then the ACCOUNT_NAME attribute is optional.
CONFIG Provider The default provider, CONFIG (i.e., user‑configured), allows access to the storage account using a connection string
or anonymously. For example:
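A sketch (the connection string value is elided):
CREATE SECRET azure_config_secret (
    TYPE AZURE,
    CONNECTION_STRING '⟨value⟩'
);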
If you do not use authentication, you still need to specify the storage account name. For example:
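A sketch:
CREATE SECRET azure_anonymous_secret (
    TYPE AZURE,
    PROVIDER CONFIG,
    ACCOUNT_NAME '⟨storage account name⟩'
);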
CREDENTIAL_CHAIN Provider The CREDENTIAL_CHAIN provider allows connecting using credentials automatically fetched by the Azure SDK via the Azure credential chain. By default, the DefaultAzureCredential chain is used, which tries credentials according to the order specified by the Azure documentation. For example:
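A sketch:
CREATE SECRET azure_chain_secret (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    ACCOUNT_NAME '⟨storage account name⟩'
);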
DuckDB also allows specifying a specific chain using the CHAIN keyword. This takes a ; separated list of providers that will be tried in order.
For example:
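A sketch trying the Azure CLI first, then environment variables:
CREATE SECRET azure_cli_secret (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'cli;env',
    ACCOUNT_NAME '⟨storage account name⟩'
);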
The possible values for the chain are cli, managed_identity, env, and default.
SERVICE_PRINCIPAL Provider The SERVICE_PRINCIPAL provider allows connecting using an Azure Service Principal (SPN). For example:
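A sketch with a client secret (all values elided):
CREATE SECRET azure_spn_secret (
    TYPE AZURE,
    PROVIDER SERVICE_PRINCIPAL,
    TENANT_ID '⟨tenant id⟩',
    CLIENT_ID '⟨client id⟩',
    CLIENT_SECRET '⟨client secret⟩',
    ACCOUNT_NAME '⟨storage account name⟩'
);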
Or with a certificate:
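Again a sketch, with values elided:
CREATE SECRET azure_spn_cert_secret (
    TYPE AZURE,
    PROVIDER SERVICE_PRINCIPAL,
    TENANT_ID '⟨tenant id⟩',
    CLIENT_ID '⟨client id⟩',
    CLIENT_CERTIFICATE_PATH '⟨client certificate path⟩',
    ACCOUNT_NAME '⟨storage account name⟩'
);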
Configuring a Proxy To configure proxy information when using secrets, you can add HTTP_PROXY, PROXY_USER_NAME, and PROXY_
PASSWORD in the secret definition. For example:
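A sketch (proxy values illustrative):
CREATE SECRET azure_proxy_secret (
    TYPE AZURE,
    CONNECTION_STRING '⟨value⟩',
    HTTP_PROXY 'http://localhost:3128',
    PROXY_USER_NAME '⟨user⟩',
    PROXY_PASSWORD '⟨password⟩'
);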
Note.
• When using secrets, the HTTP_PROXY environment variable will still be honored, unless you provide an explicit value for it.
• When using secrets, the SET variables of the variable-based authentication scheme are ignored.
• For the Azure CREDENTIAL_CHAIN provider, the actual token is fetched at query time, not at the time of creating the secret.
Excel Extension
This extension, contrary to its name, does not provide support for reading Excel files. It instead provides a function that wraps the number
formatting functionality of the i18npool library, which formats numbers per Excel's formatting rules.
Excel files can be handled through the spatial extension: see the Excel Import and Excel Export pages for instructions.
The excel extension will be transparently autoloaded on first use from the official extension repository. If you would like to install and
load it manually, run:
INSTALL excel;
LOAD excel;
Usage
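A sketch based on the text function documented below, with a format string matching the output shown:
SELECT text(1234567.897, 'h:mm AM/PM') AS timestamp;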
┌───────────┐
│ timestamp │
│ varchar │
├───────────┤
│ 9:31 PM │
└───────────┘
Functions
Function                     Description                                                      Example                       Result
text(number, format_string)  Format the given number per the rules given in format_string     text(1234567.897, 'h AM/PM')  9 PM
GitHub
Full-Text Search Extension
Full-Text Search is an extension to DuckDB that allows for search through strings, similar to SQLite's FTS5 extension.
The fts extension will be transparently autoloaded on first use from the official extension repository. If you would like to install and load
it manually, run:
INSTALL fts;
LOAD fts;
Usage
The extension adds two PRAGMA statements to DuckDB: one to create, and one to drop an index. Additionally, a scalar macro stem is
added, which is used internally by the extension.
PRAGMA create_fts_index
create_fts_index(input_table, input_id, *input_values, stemmer = 'porter',
stopwords = 'english', ignore = '(\\.|[^a-z])+',
strip_accents = 1, lower = 1, overwrite = 0)
This PRAGMA builds the index under a newly created schema. The schema will be named after the input table: if an index is created on
table 'main.table_name', then the schema will be named 'fts_main_table_name'.
PRAGMA drop_fts_index
drop_fts_index(input_table)
match_bm25 Function
match_bm25(input_id, query_string, fields := NULL, k := 1.2, b := 0.75, conjunctive := 0)
When an index is built, this retrieval macro is created that can be used to search the index.
stem Function
stem(input_string, stemmer)
stemmer VARCHAR The type of stemmer to be used. One of 'arabic', 'basque', 'catalan',
'danish', 'dutch', 'english', 'finnish', 'french', 'german',
'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian',
'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese',
'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil',
'turkish', or 'none' if no stemming is to be used.
Example Usage
Build the index, and make both the text_content and author columns searchable.
PRAGMA create_fts_index(
'documents', 'document_identifier', 'text_content', 'author'
);
Search the author field index for documents that are authored by ”Muhleisen”. This retrieves ”doc1”:
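A sketch of such a query; the match_bm25 macro lives in the index's schema (fts_main_documents here):
SELECT document_identifier, text_content, score
FROM (
    SELECT *, fts_main_documents.match_bm25(
        document_identifier, 'Muhleisen', fields := 'author'
    ) AS score
    FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;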
┌─────────────────────┬──────────────────────────────────────────────────────────────────────┬────────┐
│ document_identifier │ text_content │ score │
│ varchar │ varchar │ double │
├─────────────────────┼──────────────────────────────────────────────────────────────────────┼────────┤
│ doc1 │ The mallard is a dabbling duck that breeds throughout the temperate. │ 0.0 │
└─────────────────────┴──────────────────────────────────────────────────────────────────────┴────────┘
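Search for documents about ”small cats”; this retrieves ”doc2” (the same sketch without the fields restriction):
SELECT document_identifier, text_content, score
FROM (
    SELECT *, fts_main_documents.match_bm25(
        document_identifier, 'small cats'
    ) AS score
    FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;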
┌─────────────────────┬────────────────────────────────────────────────────────────┬────────┐
│ document_identifier │ text_content │ score │
│ varchar │ varchar │ double │
├─────────────────────┼────────────────────────────────────────────────────────────┼────────┤
│ doc2 │ The cat is a domestic species of small carnivorous mammal. │ 0.0 │
└─────────────────────┴────────────────────────────────────────────────────────────┴────────┘
Note. Warning The FTS index will not update automatically when the input table changes. A workaround for this limitation is to recreate the index to refresh it.
GitHub
httpfs Extension
The httpfs extension is an autoloadable extension implementing a file system that allows reading and writing remote files. For plain HTTP(S), only file reading is supported. For object storage using the S3 API, the httpfs extension supports reading/writing/globbing files.
The httpfs extension will be, by default, autoloaded on first use of any functionality exposed by this extension.
INSTALL httpfs;
LOAD httpfs;
HTTP(S)
S3 API
GitHub
HTTP(S) Support
With the httpfs extension, it is possible to directly query files over the HTTP(S) protocol. This works for all files supported by DuckDB or
its various extensions, and provides read‑only access.
SELECT *
FROM 'https://domain.tld/file.extension';
For CSV files, files will be downloaded entirely in most cases, due to the row‑based nature of the format. For Parquet files, DuckDB can use
a combination of the Parquet metadata and HTTP range requests to only download the parts of the file that are actually required by the
query. For example, the following query will only read the Parquet metadata and the data for the column_a column:
SELECT column_a
FROM 'https://domain.tld/file.parquet';
In some cases, no actual data needs to be read at all, as some queries only require reading the metadata:
SELECT count(*)
FROM 'https://domain.tld/file.parquet';
Scanning multiple files over HTTP(S) is also supported:
SELECT *
FROM read_parquet([
'https://domain.tld/file1.parquet',
'https://domain.tld/file2.parquet'
]);
S3 API Support
The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read
and write to remote files (while regular HTTP servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, which is now common among industry storage providers.
Platforms
The httpfs filesystem is tested with AWS S3, Minio, Google Cloud, and lakeFS. Other services that implement the S3 API (such as Cloudflare
R2) should also work, but not all features may be supported.
The following table shows which parts of the S3 API are required for each httpfs feature.
The preferred way to configure and authenticate to S3 endpoints is to use secrets. Multiple secret providers are available.
Note. Deprecated Prior to version 0.10.0, DuckDB did not have a Secrets manager. Hence, the configuration of and authentication
to S3 endpoints was handled via variables. See the legacy authentication scheme for the S3 API.
CONFIG Provider The default provider, CONFIG (i.e., user‑configured), allows access to the S3 bucket by manually providing a key. For
example:
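A sketch (all values elided):
CREATE SECRET s3_config_secret (
    TYPE S3,
    KEY_ID '⟨key id⟩',
    SECRET '⟨secret⟩',
    REGION '⟨region⟩'
);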
Note. Tip If you get an IO Error (Connection error for HTTP HEAD), configure the endpoint explicitly via ENDPOINT 's3.⟨your-region⟩.amazonaws.com'.
Now, to query using the above secret, simply query any s3:// prefixed file:
SELECT *
FROM 's3://my-bucket/file.parquet';
CREDENTIAL_CHAIN Provider The CREDENTIAL_CHAIN provider allows automatically fetching credentials using mechanisms provided by the AWS SDK. For example, to use the AWS SDK default provider:
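A sketch:
CREATE SECRET s3_chain_secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);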
Again, to query a file using the above secret, simply query any s3:// prefixed file.
DuckDB also allows specifying a specific chain using the CHAIN keyword. This takes a ;-separated list of providers that will be tried in order. The possible values are the following:
• config
• sts
• sso
• env
• instance
• process
• task_role
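For example, a sketch trying environment variables first, then the config files:
CREATE SECRET s3_env_secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env;config'
);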
The CREDENTIAL_CHAIN provider also allows overriding the automatically fetched config. For example, to automatically load credentials, and then override the region, run:
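A sketch (the region value is illustrative):
CREATE SECRET s3_region_secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'config',
    REGION '⟨eu-west-1⟩'
);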
Overview of S3 Secret Parameters Below is a complete list of the supported parameters that can be used for both the CONFIG and
CREDENTIAL_CHAIN providers:
R2 Secrets While Cloudflare R2 uses the regular S3 API, DuckDB has a special Secret type, R2, to make configuring it a bit simpler:
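A sketch (all values elided):
CREATE SECRET r2_secret (
    TYPE R2,
    KEY_ID '⟨key id⟩',
    SECRET '⟨secret⟩',
    ACCOUNT_ID '⟨account id⟩'
);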
Note the addition of ACCOUNT_ID, which is used to generate the correct endpoint URL for you. Also note that R2 secrets can use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, R2 secrets are only available when using URLs starting with r2://, for example:
SELECT *
FROM read_parquet('r2://some/file/that/uses/r2/secret/file.parquet');
GCS Secrets While Google Cloud Storage is accessed by DuckDB using the S3 API, DuckDB has a special Secret type, GCS, to make configuring it a bit simpler:
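A sketch (HMAC key values elided):
CREATE SECRET gcs_secret (
    TYPE GCS,
    KEY_ID '⟨key id⟩',
    SECRET '⟨secret⟩'
);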
Note that the above secret will automatically have the correct Google Cloud Storage endpoint configured. Also note that GCS secrets can use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, GCS secrets are only available when using URLs starting with gcs:// or gs://, for example:
SELECT *
FROM read_parquet('gcs://some/file/that/uses/gcs/secret/file.parquet');
Reading
Reading files from S3 works just like reading local files:
SELECT *
FROM 's3://bucket/file.extension';
Multiple files are also supported:
SELECT *
FROM read_parquet([
's3://bucket/file1.parquet',
's3://bucket/file2.parquet'
]);
Glob File globbing is implemented using the ListObjectsV2 API call and allows the use of filesystem-like glob patterns to match multiple files, for example:
SELECT *
FROM read_parquet('s3://bucket/*.parquet');
This query matches all files in the root of the bucket with the Parquet extension.
Several features for matching are supported, such as * to match any number of any character, ? for any single character or [0-9] for a
single character in a range of characters:
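For instance (an illustrative pattern combining all three):
SELECT count(*)
FROM read_parquet('s3://bucket/folder*/100?/t[0-9].parquet');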
A useful feature when using globs is the filename option, which adds a column named filename that encodes the file that a particular
row originated from:
SELECT *
FROM read_parquet('s3://bucket/*.parquet', filename = true);
1  examplevalue1  s3://bucket/file1.parquet
2  examplevalue2  s3://bucket/file2.parquet
Hive Partitioning DuckDB also offers support for the Hive partitioning scheme, which is available when using HTTP(S) and S3 endpoints.
Writing
Writing to S3 uses the multipart upload API. This allows DuckDB to robustly upload files at high speed. Writing to S3 works for both CSV
and Parquet:
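For example (table and bucket names illustrative):
COPY table_name TO 's3://bucket/file.parquet';
COPY table_name TO 's3://bucket/file.csv';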
An automatic check is performed for existing files/directories, which is currently quite conservative (and on S3 will add a bit of latency). To
disable this check and force writing, an OVERWRITE_OR_IGNORE flag is added:
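A sketch of a partitioned write with the flag (names illustrative):
COPY table_name TO 's3://bucket/partitioned'
    (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b), OVERWRITE_OR_IGNORE true);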
Configuration Some additional configuration options exist for the S3 upload, though the default values should suffice for most use
cases.
Legacy Authentication Scheme for the S3 API
Prior to version 0.10.0, DuckDB did not have a Secrets manager. Hence, the configuration of and authentication to S3 endpoints was handled via variables. This page documents the legacy authentication scheme for the S3 API.
Note. The recommended way to configure and authenticate to S3 endpoints is to use secrets.
To be able to read or write from S3, the correct region should be set:
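For example (region illustrative):
SET s3_region = 'us-east-1';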
Optionally, the endpoint can be configured in case a non‑AWS object storage server is used:
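A sketch:
SET s3_endpoint = '⟨domain⟩.com:⟨port⟩';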
However, note that this may also require updating the endpoint. For example, for AWS S3 it is required to change the endpoint to s3.⟨region⟩.amazonaws.com.
After configuring the correct endpoint and region, public files can be read. To also read private files, authentication credentials can be
added:
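A sketch (values elided):
SET s3_access_key_id = '⟨aws access key id⟩';
SET s3_secret_access_key = '⟨aws secret access key⟩';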
Alternatively, session tokens are also supported and can be used instead:
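A sketch (value elided):
SET s3_session_token = '⟨aws session token⟩';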
Per‑Request Configuration
Aside from the global S3 configuration described above, specific configuration values can be used on a per‑request basis. This allows for
use of multiple sets of credentials, regions, etc. These are used by including them on the S3 URI as query parameters. All the individual
configuration values listed above can be set as query parameters. For instance:
SELECT *
FROM 's3://bucket/file.parquet?s3_access_key_id=accessKey&s3_secret_access_key=secretKey';
SELECT *
FROM 's3://bucket/file.parquet?s3_region=region&s3_session_token=session_token' t1
INNER JOIN 's3://bucket/file.csv?s3_access_key_id=accessKey&s3_secret_access_key=secretKey' t2;
Configuration
Some additional configuration options exist for the S3 upload, though the default values should suffice for most use cases.
Additionally, most of the configuration options can be set via environment variables:
Iceberg Extension
The iceberg extension is a loadable extension that implements support for the Apache Iceberg format.
INSTALL iceberg;
LOAD iceberg;
Usage
To test the examples, download the iceberg_data.zip file and unzip it.
SELECT count(*)
FROM iceberg_scan('data/iceberg/lineitem_iceberg', allow_moved_paths = true);

51793
Note. The allow_moved_paths option ensures that some path resolution is performed, which allows scanning Iceberg tables
that are moved.
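The metadata of an Iceberg table can be inspected with the iceberg_metadata function; a sketch matching the output below:
SELECT *
FROM iceberg_metadata('data/iceberg/lineitem_iceberg', allow_moved_paths = true);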
manifest_path                                          manifest_sequence_number  manifest_content  status   content   file_path                                                                            file_format  record_count
lineitem_iceberg/metadata/10eaca8a-1e1c-421e-ad6d-b2…  2                         DATA              ADDED    EXISTING  lineitem_iceberg/data/00041-414-f3c73457-bbd6-4b92-9c15-17b241171b16-00001.parquet  PARQUET      51793
lineitem_iceberg/metadata/10eaca8a-1e1c-421e-ad6d-b2…  2                         DATA              DELETED  EXISTING  lineitem_iceberg/data/00000-411-0792dcfe-4e25-4ca3-8ada-175286069a47-00001.parquet  PARQUET      60175
Visualizing Snapshots
SELECT *
FROM iceberg_snapshots('data/iceberg/lineitem_iceberg');
sequence_number  snapshot_id          timestamp_ms             manifest_list
1                3776207205136740581  2023-02-15 15:07:54.504  lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro
2                7635660646343998149  2023-02-15 15:08:14.73   lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro
Limitations
Writing (i.e., exporting to) Iceberg tables is currently not supported.
ICU Extension
The icu extension contains an easy‑to‑use version of the collation/timezone part of the ICU library.
INSTALL icu;
LOAD icu;
Features
• region‑dependent collations
• time zones, used for timestamp data types and timestamp functions
GitHub
inet Extension
The inet extension defines the INET data type for storing IPv4 network addresses. It supports the CIDR notation for subnet masks (e.g.,
198.51.100.0/22).
INSTALL inet;
LOAD inet;
Examples
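A sketch of a statement that produces the output below:
SELECT '127.0.0.1'::INET AS addr;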
┌───────────┐
│ addr │
│ inet │
├───────────┤
│ 127.0.0.1 │
└───────────┘
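The type can also be used in tables; a sketch that reproduces the output below (values, including the repeated id 2, as shown):
CREATE TABLE tbl (id INTEGER, ip INET);
INSERT INTO tbl VALUES (1, '192.168.0.0/16'), (2, '127.0.0.1'), (2, '8.8.8.8');
SELECT * FROM tbl;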
┌───────┬────────────────┐
│ id │ ip │
│ int32 │ inet │
├───────┼────────────────┤
│ 1 │ 192.168.0.0/16 │
│ 2 │ 127.0.0.1 │
│ 2 │ 8.8.8.8 │
└───────┴────────────────┘
GitHub
jemalloc Extension
The jemalloc extension replaces the system's memory allocator with jemalloc. Unlike other DuckDB extensions, the jemalloc exten‑
sion is statically linked and cannot be installed or loaded during runtime.
Linux The Linux version of DuckDB ships with the jemalloc extension by default.
To disable the jemalloc extension, build DuckDB from source and set the SKIP_EXTENSIONS flag as follows:
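For example:
GEN=ninja SKIP_EXTENSIONS="jemalloc" make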
macOS The macOS version of DuckDB does not ship with the jemalloc extension but can be built from source to include it:
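For example:
GEN=ninja BUILD_JEMALLOC=1 make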
GitHub
JSON Extension
The json extension is a loadable extension that implements SQL functions that are useful for reading values from existing JSON, and
creating new JSON data.
The json extension is shipped by default in DuckDB builds, otherwise it will be transparently autoloaded on first use. If you would like to
install and load it manually, run:
INSTALL json;
LOAD json;
Example Uses
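For instance, reading a JSON file with automatic schema detection (the file name is illustrative):
SELECT * FROM read_json_auto('todos.json');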
JSON Type
The JSON extension makes use of the JSON logical type. The JSON logical type is interpreted as JSON, i.e., parsed, in JSON functions rather
than interpreted as VARCHAR, i.e., a regular string. All JSON creation functions return values of this type.
We also allow any of our types to be casted to JSON, and JSON to be casted back to any of our types, for example:
-- cast a STRUCT to JSON:
SELECT {duck: 42}::JSON;
-- {"duck":42}
-- and back:
SELECT '{"duck": 42}'::JSON::STRUCT(duck INTEGER);
-- {'duck': 42}
This works for our nested types as shown in the example, but also for non‑nested types:
SELECT '2023-05-12'::DATE::JSON;
-- "2023-05-12"
The only exception to this behavior is the cast from VARCHAR to JSON, which does not alter the data, but instead parses and validates the
contents of the VARCHAR as JSON.
read_json_objects(filename) — Read a JSON object from filename, where filename can also be a list of files or a glob pattern.
read_ndjson_objects(filename) — Alias for read_json_objects with parameter format set to 'newline_delimited'.
read_json_objects_auto(filename) — Alias for read_json_objects with parameter format set to 'auto'.

These functions take the following parameters:

compression — The compression type for the file. By default this will be detected automatically from the file extension (e.g., t.json.gz will use gzip, t.json will use none). Options are 'none', 'gzip', 'zstd', and 'auto'. Type: VARCHAR. Default: 'auto'.
filename — Whether or not an extra filename column should be included in the result. Type: BOOL. Default: false.
format — Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. Type: VARCHAR. Default: 'array'.
hive_partitioning — Whether or not to interpret the path as a Hive partitioned path. Type: BOOL. Default: false.
ignore_errors — Whether to ignore parse errors (only possible when format is 'newline_delimited'). Type: BOOL. Default: false.
The format parameter specifies how to read the JSON from a file. With 'unstructured', the top‑level JSON is read, e.g.:
{
"duck": 42
}
{
"goose": [1, 2, 3]
}
With 'newline_delimited', NDJSON is read, where each JSON is separated by a newline (\n), e.g.:
{"duck": 42}
{"goose": [1, 2, 3]}
With 'array', a JSON array of objects is read, e.g.:
[
{
"duck": 42
},
{
"goose": [1, 2, 3]
}
]
Example usage:
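A sketch (the file name is illustrative):
SELECT *
FROM read_json_objects('my_file1.json', format = 'array');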
DuckDB also supports reading JSON as a table, using the following functions:
read_json(filename) — Read JSON from filename, where filename can also be a list of files, or a glob pattern.
read_json_auto(filename) — Alias for read_json with all auto-detection enabled.
read_ndjson(filename) — Alias for read_json with parameter format set to 'newline_delimited'.
read_ndjson_auto(filename) — Alias for read_json_auto with parameter format set to 'newline_delimited'.
Besides the maximum_object_size, format, ignore_errors and compression, these functions have additional parameters:
auto_detect — Whether to auto-detect the names of the keys and data types of the values. Type: BOOL. Default: false.
columns — A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. Type: STRUCT. Default: (empty).
dateformat — Specifies the date format to use when parsing dates. See Date Format. Type: VARCHAR. Default: 'iso'.
maximum_depth — Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. Type: BIGINT. Default: -1.
records — Can be one of ['auto', 'true', 'false']. Type: VARCHAR. Default: 'records'.
sample_size — Option to define the number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. Type: UBIGINT. Default: 20480.
timestampformat — Specifies the format to use when parsing timestamps. See Date Format. Type: VARCHAR. Default: 'iso'.
union_by_name — Whether the schemas of multiple JSON files should be unified. Type: BOOL. Default: false.
Example usage:
SELECT * FROM read_json('my_file1.json', columns = {duck: 'INTEGER'});

duck
42
DuckDB can convert JSON arrays directly to its internal LIST type, and missing keys become NULL.
SELECT *
FROM read_json(['my_file1.json', 'my_file2.json'],
columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'});
duck  goose      swan
42    [1, 2, 3]  NULL
43    [4, 5, 6]  3.3
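The columns parameter can also be omitted; read_json_auto then detects the keys and types automatically. A sketch that produces the table below (column order as shown):
SELECT goose, duck
FROM read_json_auto(['my_file1.json', 'my_file2.json']);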
goose      duck
[1, 2, 3]  42
[4, 5, 6]  43
DuckDB can read (and auto‑detect) a variety of formats, specified with the format parameter. Querying a JSON file that contains an
'array', e.g.:
[
{
"duck": 42,
"goose": 4.2
},
{
"duck": 43,
"goose": 4.3
}
]
Can be queried exactly the same as a JSON file that contains 'unstructured' JSON, e.g.:
{
"duck": 42,
"goose": 4.2
}
{
"duck": 43,
"goose": 4.3
}
Both result in the same table:

duck  goose
42    4.2
43    4.3
If your JSON file does not contain 'records', i.e., any other type of JSON than objects, DuckDB can still read it. This is specified with the
records parameter. The records parameter specifies whether the JSON contains records that should be unpacked into individual
columns, i.e., reading the following file with records:

{"duck": 42, "goose": [1, 2, 3]}
{"duck": 42, "goose": [4, 5, 6]}

duck  goose
42    [1,2,3]
42    [4,5,6]
You can read the same file with records set to 'false', to get a single column, which is a STRUCT containing the data:
json
{'duck': 42, 'goose': [1, 2, 3]}
{'duck': 42, 'goose': [4, 5, 6]}
For additional examples reading more complex data, please see the Shredding Deeply Nested JSON, One Vector at a Time blog post.
JSON Import/Export
When the JSON extension is installed, FORMAT JSON is supported for COPY FROM, COPY TO, EXPORT DATABASE and IMPORT
DATABASE. See Copy and Import/Export.
By default, COPY expects newline‑delimited JSON. If you prefer copying data to/from a JSON array, you can specify ARRAY true, e.g.,
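For example (the file name is illustrative):
COPY (SELECT * FROM range(5)) TO 'output.json' (ARRAY true);
The resulting file contains: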
[
{"range":0},
{"range":1},
{"range":2},
{"range":3},
{"range":4}
]
The following scalar JSON functions can be used to gain information about the stored JSON values. With the exception of json_valid(json), all JSON functions produce an error when invalid JSON is supplied.
We support two kinds of notations to describe locations within JSON: JSON Pointer and JSONPath.
json_array_length(json [, path]) — Return the number of elements in the JSON array json, or 0 if it is not a JSON array. If path is specified, return the number of elements in the JSON array at the given path. If path is a LIST, the result will be a LIST of array lengths.
json_contains(json_haystack, json_needle) — Returns true if json_needle is contained in json_haystack. Both parameters are of JSON type, but json_needle can also be a numeric value or a string; however, the string must be wrapped in double quotes.
json_keys(json [, path]) — Returns the keys of json as a LIST of VARCHAR, if json is a JSON object. If path is specified, return the keys of the JSON object at the given path. If path is a LIST, the result will be a LIST of LIST of VARCHAR.
json_structure(json) — Return the structure of json. Defaults to JSON if the structure is inconsistent (e.g., incompatible types in an array).
json_type(json [, path]) — Return the type of the supplied json, which is one of OBJECT, ARRAY, BIGINT, UBIGINT, VARCHAR, BOOLEAN, NULL. If path is specified, return the type of the element at the given path. If path is a LIST, the result will be a LIST of types.
json_valid(json) — Return whether json is valid JSON.
json(json) — Parse and minify json.
The JSONPointer syntax separates each field with a /. For example, to extract the first element of the array with key "duck", you can do:
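A sketch on an inline value:
SELECT '{"duck": [1, 2, 3]}'::JSON -> '/duck/0';
-- 1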
The JSONPath syntax separates fields with a ., and accesses array elements with [i], and always starts with $. Using the same example,
we can do the following:
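A sketch:
SELECT '{"duck": [1, 2, 3]}'::JSON -> '$.duck[0]';
-- 1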
JSONPath is more expressive, and can also access from the back of lists:
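A sketch:
SELECT '{"duck": [1, 2, 3]}'::JSON -> '$.duck[#-1]';
-- 3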
The examples below use a table example holding a single JSON document:
CREATE TABLE example (j JSON);
INSERT INTO example VALUES
    ('{"family": "anatidae", "species": ["duck", "goose", "swan", null]}');

SELECT json_array_length(j, ['$.species']) FROM example;
-- [4]
SELECT json_type(j) FROM example;
-- OBJECT
SELECT json_keys(j) FROM example;
-- [family, species]
SELECT json_structure(j) FROM example;
-- {"family":"VARCHAR","species":["VARCHAR"]}
SELECT json_structure('["duck", {"family": "anatidae"}]');
-- ["JSON"]
SELECT json_contains('{"key": "value"}', '"value"');
-- true
SELECT json_contains('{"key": 1}', '1');
-- true
SELECT json_contains('{"top_key": {"key": "value"}}', '{"key": "value"}');
-- true
There are two extraction functions, which have their respective operators. The operators can only be used if the string is stored as the JSON logical type. These functions support the same two location notations as the previous functions.
Note that the equality comparison operator (=) has a higher precedence than the -> JSON extract operator. Therefore, surround the uses
of the -> operator with parentheses when making equality comparisons. For example:
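A sketch:
SELECT ((JSON '{"field": 42}')->'field') = 42;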
Examples:
SELECT j->>'$.family' FROM example;
-- anatidae
SELECT j->>'$.species[0]' FROM example;
-- duck
SELECT j->'species'->>0 FROM example;
-- duck
SELECT j->'species'->>['0','1'] FROM example;
-- [duck, goose]
If multiple values need to be extracted from the same JSON, it is more efficient to extract a list of paths:
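A sketch using the example table from above:
SELECT json_extract(j, ['family', 'species']) FROM example;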
to_json(any) — Create JSON from a value of any type. Our LIST is converted to a JSON array, and our STRUCT and MAP are converted to a JSON object.
json_quote(any) — Alias for to_json.
array_to_json(list) — Alias for to_json that only accepts LIST.
row_to_json(list) — Alias for to_json that only accepts STRUCT.
json_array([any, ...]) — Create a JSON array from any number of values.
json_object([key, value, ...]) — Create a JSON object from any number of key, value pairs.
json_merge_patch(json, json) — Merge two JSON documents together.
Examples:
SELECT to_json('duck');
-- "duck"
SELECT to_json([1, 2, 3]);
-- [1,2,3]
SELECT to_json({duck : 42});
-- {"duck":42}
SELECT to_json(map(['duck'],[42]));
-- {"duck":42}
SELECT json_array(42, 'duck', NULL);
-- [42,"duck",null]
SELECT json_object('duck', 42);
-- {"duck":42}
SELECT json_merge_patch('{"duck": 42}', '{"goose": 123}');
-- {"goose":123,"duck":42}
json_group_array(any) — Return a JSON array with all values of any in the aggregation.
json_group_object(key, value) — Return a JSON object with all key, value pairs in the aggregation.
json_group_structure(json) — Return the combined json_structure of all json in the aggregation.
Examples:
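A sketch with a hypothetical table example2:
CREATE TABLE example2 (k VARCHAR, v INTEGER);
INSERT INTO example2 VALUES ('duck', 42), ('goose', 7);

SELECT json_group_array(v) FROM example2;
-- [42,7]
SELECT json_group_object(k, v) FROM example2;
-- {"duck":42,"goose":7}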
Transforming JSON
In many cases, it is inefficient to extract values from JSON one‑by‑one. Instead, we can ”extract” all values at once, transforming JSON to
the nested types LIST and STRUCT.
json_transform(json, structure) — Transform json according to the specified structure.
from_json(json, structure) — Alias for json_transform.
json_transform_strict(json, structure) — Same as json_transform, but throws an error when type casting fails.
from_json_strict(json, structure) — Alias for json_transform_strict.
The structure argument is JSON of the same form as returned by json_structure. The structure argument can be modified to
transform the JSON into the desired structure and types. It is possible to extract fewer key/value pairs than are present in the JSON, and it
is also possible to extract more: missing keys become NULL.
Examples:
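A sketch using the example table from above; the added coolness key is missing from the JSON and therefore becomes NULL:
SELECT json_transform(j, '{"family": "VARCHAR", "coolness": "DOUBLE"}') FROM example;
-- {'family': anatidae, 'coolness': NULL}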
The JSON extension also provides functions to serialize and deserialize SELECT statements between SQL and JSON, as well as executing
JSON serialized statements.
json_deserialize_sql(json) — Scalar — Deserialize one or many JSON-serialized statements back to an equivalent SQL string.
json_execute_serialized_sql(varchar) — Table — Execute JSON-serialized statements and return the resulting rows. Only one statement at a time is supported for now.
json_serialize_sql(varchar, skip_empty := boolean, skip_null := boolean, format := boolean) — Scalar — Serialize a set of ;-separated SELECT statements to an equivalent list of JSON-serialized statements.
PRAGMA json_execute_serialized_sql(varchar) — Pragma — Pragma version of the json_execute_serialized_sql function.
The json_serialize_sql(varchar) function takes three optional parameters, skip_empty, skip_null, and format that can
be used to control the output of the serialized statements.
If you run the json_execute_serialized_sql(varchar) table function inside of a transaction, the serialized statements will not be able to see any transaction-local changes. This is because the statements are executed in a separate query context. You can use the PRAGMA json_execute_serialized_sql(varchar) pragma version to execute the statements in the same query context as the pragma, although with the limitation that the serialized JSON must be provided as a constant string, i.e., you cannot do PRAGMA json_execute_serialized_sql(json_serialize_sql(...)).
Note that these functions do not preserve syntactic sugar such as FROM * SELECT ..., so a statement round‑tripped through json_
deserialize_sql(json_serialize_sql(...)) may not be identical to the original statement, but should always be semanti‑
cally equivalent and produce the same output.
Examples:
-- Simple example
SELECT json_serialize_sql('SELECT 2');
-- '{"error":false,"statements":[{"node":{"type":"SELECT_NODE","modifiers":[],"cte_
map":{"map":[]},"select_list":[{"class":"CONSTANT","type":"VALUE_
CONSTANT","alias":"","value":{"type":{"id":"INTEGER","type_info":null},"is_
null":false,"value":2}}],"from_table":{"type":"EMPTY","alias":"","sample":null},"where_
clause":null,"group_expressions":[],"group_sets":[],"aggregate_handling":"STANDARD_
HANDLING","having":null,"sample":null,"qualify":null}}]}'
-- '{"error":false,"statements":[{"node":{"type":"SELECT_NODE","select_
list":[{"class":"FUNCTION","type":"FUNCTION","function_
name":"+","children":[{"class":"CONSTANT","type":"VALUE_CONSTANT","value":{"type":{"id":"INTEGER"},"is_
null":false,"value":1}},{"class":"CONSTANT","type":"VALUE_
CONSTANT","value":{"type":{"id":"INTEGER"},"is_null":false,"value":2}}],"order_bys":{"type":"ORDER_
MODIFIER"},"distinct":false,"is_operator":true,"export_state":false}],"from_
table":{"type":"EMPTY"},"aggregate_handling":"STANDARD_HANDLING"}},{"node":{"type":"SELECT_
NODE","select_list":[{"class":"FUNCTION","type":"FUNCTION","function_
name":"+","children":[{"class":"COLUMN_REF","type":"COLUMN_REF","column_names":["a"]},{"class":"COLUMN_
REF","type":"COLUMN_REF","column_names":["b"]}],"order_bys":{"type":"ORDER_
MODIFIER"},"distinct":false,"is_operator":true,"export_state":false}],"from_table":{"type":"BASE_
TABLE","table_name":"tbl1"},"aggregate_handling":"STANDARD_HANDLING"}}]}'
Indexing
Note. Warning Following PostgreSQL's conventions, DuckDB uses 1‑based indexing for arrays and lists but 0‑based indexing for the
JSON data type.
GitHub
MySQL Extension
The mysql extension allows DuckDB to directly read and write data from/to a running MySQL instance. The data can be queried directly
from the underlying MySQL database. Data can be loaded from MySQL tables into DuckDB tables, or vice versa.
INSTALL mysql;
The extension is loaded automatically upon first use. If you prefer to load it manually, run:
LOAD mysql;
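To make a MySQL instance accessible, use ATTACH; a sketch (connection parameters illustrative, the mysql_db alias matches later examples):
ATTACH 'host=localhost user=root port=0 database=mysql' AS mysql_db (TYPE MYSQL);
USE mysql_db;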
The connection string determines the parameters for how to connect to MySQL as a set of key=value pairs. Any options not provided are
replaced by their default values, as per the table below.
Setting Default
database NULL
host localhost
password
port 0
socket NULL
user current user
The tables in the MySQL database can be read as if they were normal DuckDB tables, but the underlying data is read directly from MySQL
at query time.
SHOW TABLES;
┌───────────────────────────────────────┐
│ name │
│ varchar │
├───────────────────────────────────────┤
│ signed_integers │
└───────────────────────────────────────┘
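For example, reading the table listed above (a sketch):
SELECT * FROM signed_integers;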
┌──────┬────────┬──────────┬─────────────┬──────────────────────┐
│ t │ s │ m │ i │ b │
│ int8 │ int16 │ int32 │ int32 │ int64 │
├──────┼────────┼──────────┼─────────────┼──────────────────────┤
│ -128 │ -32768 │ -8388608 │ -2147483648 │ -9223372036854775808 │
│ 127 │ 32767 │ 8388607 │ 2147483647 │ 9223372036854775807 │
│ NULL │ NULL │ NULL │ NULL │ NULL │
└──────┴────────┴──────────┴─────────────┴──────────────────────┘
It might be desirable to create a copy of the MySQL databases in DuckDB to prevent the system from re‑reading the tables from MySQL
continuously, particularly for large tables.
Data can be copied over from MySQL to DuckDB using standard SQL, for example:
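For example (table names illustrative):
CREATE TABLE duckdb_table AS FROM mysql_db.mysql_table;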
In addition to reading data from MySQL, the extension allows you to create tables, ingest data into MySQL, and make other modifications to a MySQL database using standard SQL queries.
This allows you to use DuckDB to, for example, export data that is stored in a MySQL database to Parquet, or read data from a Parquet file
into MySQL.
Below is a brief example of how to create a new table in MySQL and load data into it.
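A sketch (the tbl table is reused in the operations below):
CREATE TABLE mysql_db.tbl (id INTEGER, name VARCHAR);
INSERT INTO mysql_db.tbl VALUES (42, 'DuckDB');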
Many operations on MySQL tables are supported. All these operations directly modify the MySQL database, and the result of subsequent
operations can then be read using MySQL. Note that if modifications are not desired, ATTACH can be run with the READ_ONLY property
which prevents making modifications to the underlying database. For example:
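A sketch (connection parameters illustrative):
ATTACH 'host=localhost user=root port=0 database=mysql' AS mysql_db (TYPE MYSQL, READ_ONLY);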
Supported Operations
CREATE TABLE
INSERT INTO
SELECT
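For example, querying the table created above:
SELECT * FROM mysql_db.tbl;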
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 42 │ DuckDB │
└───────┴─────────┘
COPY
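For example, on the tbl table from above (file name illustrative):
COPY mysql_db.tbl TO 'data.parquet';
COPY mysql_db.tbl FROM 'data.parquet';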
UPDATE
UPDATE mysql_db.tbl
SET name = 'Woohoo'
WHERE id = 42;
DELETE
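For example:
DELETE FROM mysql_db.tbl WHERE id = 42;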
ALTER TABLE
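For example:
ALTER TABLE mysql_db.tbl ADD COLUMN k INTEGER;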
DROP TABLE
DROP TABLE mysql_db.tbl;
CREATE VIEW
CREATE VIEW mysql_db.v1 AS SELECT 42;
┌───────┐
│ i │
│ int32 │
├───────┤
│ 42 │
└───────┘
Transactions
CREATE TABLE mysql_db.tmp (i INTEGER);
BEGIN;
INSERT INTO mysql_db.tmp VALUES (42);
SELECT * FROM mysql_db.tmp;
┌───────┐
│ i │
│ int64 │
├───────┤
│ 42 │
└───────┘
ROLLBACK;
SELECT * FROM mysql_db.tmp;
┌────────┐
│ i │
│ int64 │
├────────┤
│ 0 rows │
└────────┘
Settings
Schema Cache
To avoid having to continuously fetch schema data from MySQL, DuckDB keeps schema information ‑ such as the names of tables, their
columns, etc ‑ cached. If changes are made to the schema through a different connection to the MySQL instance, such as new columns being
added to a table, the cached schema information might be outdated. In this case, the function mysql_clear_cache can be executed
to clear the internal caches.
CALL mysql_clear_cache();
PostgreSQL Extension
The postgres extension allows DuckDB to directly read and write data from a running Postgres database instance. The data can be
queried directly from the underlying Postgres database. Data can be loaded from Postgres tables into DuckDB tables, or vice versa. See the official announcement for implementation details and background.
INSTALL postgres;
The extension is loaded automatically upon first use. If you prefer to load it manually, run:
LOAD postgres;
Connecting
The ATTACH command takes as input either a libpq connection string or a PostgreSQL URI.
Below are some example connection strings and commonly used parameters. A full list of available parameters can be found in the Postgres
documentation.
dbname=postgresscanner
host=localhost port=5432 dbname=mydb connect_timeout=10
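Such a connection string can be passed directly to ATTACH, for example:
ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES);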
Postgres connection information can also be specified with environment variables. This can be useful in a production environment where
the connection information is managed externally and passed in to the environment.
export PGPASSWORD="secret"
export PGHOST=localhost
export PGUSER=owner
export PGDATABASE=mydatabase
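These variables are picked up when attaching with an empty connection string, for example:
ATTACH '' AS postgres_db (TYPE POSTGRES);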
Usage
The tables in the PostgreSQL database can be read as if they were normal DuckDB tables, but the underlying data is read directly from
Postgres at query time.
SHOW TABLES;
┌───────────────────────────────────────┐
│ name │
│ varchar │
├───────────────────────────────────────┤
│ uuids │
└───────────────────────────────────────┘
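The table can then be queried, for example:
SELECT * FROM postgres_db.uuids;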
┌──────────────────────────────────────┐
│ u │
│ uuid │
├──────────────────────────────────────┤
│ 6d3d2541-710b-4bde-b3af-4711738636bf │
│ NULL │
│ 00000000-0000-0000-0000-000000000001 │
│ ffffffff-ffff-ffff-ffff-ffffffffffff │
└──────────────────────────────────────┘
It might be desirable to create a copy of the Postgres databases in DuckDB to prevent the system from re‑reading the tables from Postgres
continuously, particularly for large tables.
Data can be copied over from Postgres to DuckDB using standard SQL, for example:
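-- duckdb_table and postgres_tbl are placeholder table names
CREATE TABLE duckdb_table AS FROM postgres_db.postgres_tbl;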
In addition to reading data from Postgres, the extension allows you to create tables, ingest data into Postgres and make other modifications
to a Postgres database using standard SQL queries.
This allows you to use DuckDB to, for example, export data that is stored in a Postgres database to Parquet, or read data from a Parquet file
into Postgres.
Below is a brief example of how to create a new table in Postgres and load data into it.
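CREATE TABLE postgres_db.tbl (id INTEGER, name VARCHAR);
INSERT INTO postgres_db.tbl VALUES (42, 'DuckDB');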
Many operations on Postgres tables are supported. All these operations directly modify the Postgres database, and the result of subsequent
operations can then be read using Postgres. Note that if modifications are not desired, ATTACH can be run with the READ_ONLY property
which prevents making modifications to the underlying database. For example:
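ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES, READ_ONLY);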
CREATE TABLE
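CREATE TABLE postgres_db.tbl (id INTEGER, name VARCHAR);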
INSERT INTO
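INSERT INTO postgres_db.tbl VALUES (42, 'DuckDB');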
SELECT
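SELECT * FROM postgres_db.tbl;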
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 42 │ DuckDB │
└───────┴─────────┘
COPY
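COPY postgres_db.tbl TO 'data.parquet';
COPY postgres_db.tbl FROM 'data.parquet';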
UPDATE
UPDATE postgres_db.tbl
SET name = 'Woohoo'
WHERE id = 42;
DELETE
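DELETE FROM postgres_db.tbl WHERE id = 42;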
ALTER TABLE
ALTER TABLE postgres_db.tbl
ADD COLUMN k INTEGER;
DROP TABLE
DROP TABLE postgres_db.tbl;
CREATE VIEW
CREATE VIEW postgres_db.v1 AS SELECT 42;
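SELECT * FROM postgres_db.v1;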
┌───────┐
│ i │
│ int32 │
├───────┤
│ 42 │
└───────┘
Transactions
CREATE TABLE postgres_db.tmp (i INTEGER);
BEGIN;
INSERT INTO postgres_db.tmp VALUES (42);
SELECT * FROM postgres_db.tmp;
┌───────┐
│ i │
│ int64 │
├───────┤
│ 42 │
└───────┘
ROLLBACK;
SELECT * FROM postgres_db.tmp;
┌────────┐
│ i │
│ int64 │
├────────┤
│ 0 rows │
└────────┘
The postgres_query function allows you to run arbitrary SQL within an attached database. postgres_query takes the name of the
attached Postgres database to execute the query in, as well as the SQL query to execute. The result of the query is returned. Single‑quote
strings are escaped by repeating the single quote twice.
postgres_query(attached_database::VARCHAR, query::VARCHAR)
Example
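-- the cars table and its rows are illustrative
SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM cars');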
┌──────────────┬───────────┬─────────┐
│ brand │ model │ color │
│ varchar │ varchar │ varchar │
├──────────────┼───────────┼─────────┤
│ ferari │ testarosa │ red │
│ aston martin │ db2 │ blue │
│ bentley │ mulsanne │ gray │
└──────────────┴───────────┴─────────┘
Settings
Name                             Description                                                                   Default
pg_array_as_varchar              Read Postgres arrays as varchar - enables reading mixed dimensional arrays   false
pg_connection_cache              Whether or not to use the connection cache                                    true
pg_connection_limit              The maximum amount of concurrent Postgres connections                         64
pg_debug_show_queries            DEBUG SETTING: print all queries sent to Postgres to stdout                   false
pg_experimental_filter_pushdown  Whether or not to use filter pushdown (currently experimental)                false
pg_pages_per_task                The amount of pages per task                                                  1000
pg_use_binary_copy               Whether or not to use BINARY copy to read data                                true
Schema Cache
To avoid having to continuously fetch schema data from Postgres, DuckDB keeps schema information, such as the names of tables and their
columns, cached. If changes are made to the schema through a different connection to the Postgres instance, such as new columns
being added to a table, the cached schema information might be outdated. In this case, the function pg_clear_cache can be executed
to clear the internal caches.
CALL pg_clear_cache();
Note. Deprecated The old postgres_attach function is deprecated. It is recommended to switch over to the new ATTACH
syntax.
Spatial Extension
The spatial extension provides support for geospatial data processing in DuckDB. For an overview of the extension, see our blog post.
INSTALL spatial;
LOAD spatial;
GEOMETRY Type
The core of the spatial extension is the GEOMETRY type. If you're unfamiliar with geospatial data and GIS tooling, this type probably works
very differently from what you'd expect.
In short, while the GEOMETRY type is a binary representation of ”geometry” data made up of sets of vertices (pairs of X and Y double
precision floats), it actually stores one of several geometry subtypes. These are POINT, LINESTRING, POLYGON, as well as their
”collection” equivalents, MULTIPOINT, MULTILINESTRING and MULTIPOLYGON. Lastly there is GEOMETRYCOLLECTION, which can
contain any of the other subtypes, as well as other GEOMETRYCOLLECTIONs recursively.
This may seem strange at first, since DuckDB already has types like LIST, STRUCT and UNION which could be used in a similar way, but
the design and behaviour of the GEOMETRY type is actually based on the Simple Features geometry model, which is a standard used by
many other databases and GIS software.
That said, the spatial extension also includes a couple of experimental non-standard explicit geometry types, such as POINT_2D,
LINESTRING_2D, POLYGON_2D and BOX_2D, that are based on DuckDB's native nested types, such as structs and lists. In theory it
should be possible to optimize a lot of operations for these types much better than for the GEOMETRY type (which is just a binary blob),
but only a couple of functions are implemented so far.
All of these are implicitly castable to GEOMETRY, but with a conversion cost, so the GEOMETRY type is still the recommended type to use
for now if you are planning to work with a lot of different spatial functions.
GEOMETRY is not currently capable of storing additional geometry types, Z/M coordinates, or SRID information. These features may be
added in the future.
The spatial extension implements a large number of scalar functions and overloads. Most of these are implemented using the GEOS
library, but we'd like to implement more of them natively in this extension to better utilize DuckDB's vectorized execution and memory
management. The following labels are used to indicate which implementation is used:
• DuckDB - functions that are implemented natively in this extension and are capable of operating directly on the DuckDB types
• CAST(GEOMETRY) - functions that are supported by implicitly casting to GEOMETRY and then using the GEOMETRY implementation
The currently implemented spatial functions can roughly be categorized into the following groups:
Geometry construction - construct new geometries from other geometries or other data:
GEOMETRY ST_Point(DOUBLE, DOUBLE)
GEOMETRY ST_ConvexHull(GEOMETRY)
GEOMETRY ST_Boundary(GEOMETRY)
GEOMETRY ST_Buffer(GEOMETRY)
GEOMETRY ST_Centroid(GEOMETRY)
GEOMETRY ST_Collect(GEOMETRY[])
GEOMETRY ST_Normalize(GEOMETRY)
GEOMETRY ST_SimplifyPreserveTopology(GEOMETRY, DOUBLE)
GEOMETRY ST_Simplify(GEOMETRY, DOUBLE)
GEOMETRY ST_Union(GEOMETRY, GEOMETRY)
GEOMETRY ST_Intersection(GEOMETRY, GEOMETRY)
GEOMETRY ST_MakeLine(GEOMETRY[])
GEOMETRY ST_Envelope(GEOMETRY)
GEOMETRY ST_FlipCoordinates(GEOMETRY)
GEOMETRY ST_Transform(GEOMETRY, VARCHAR, VARCHAR)
BOX_2D ST_Extent(GEOMETRY)
GEOMETRY ST_PointN(GEOMETRY, INTEGER)
GEOMETRY ST_StartPoint(GEOMETRY)
GEOMETRY ST_EndPoint(GEOMETRY)
GEOMETRY ST_ExteriorRing(GEOMETRY)
GEOMETRY ST_Reverse(GEOMETRY)
GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY)
GEOMETRY ST_RemoveRepeatedPoints(GEOMETRY, DOUBLE)
GEOMETRY ST_ReducePrecision(GEOMETRY, DOUBLE)
GEOMETRY ST_PointOnSurface(GEOMETRY)
GEOMETRY ST_CollectionExtract(GEOMETRY)
GEOMETRY ST_CollectionExtract(GEOMETRY, INTEGER)
Geometry properties - compute scalar properties of geometries:
DOUBLE ST_Area(GEOMETRY)
BOOLEAN ST_IsClosed(GEOMETRY)
BOOLEAN ST_IsEmpty(GEOMETRY)
BOOLEAN ST_IsRing(GEOMETRY)
Aggregate functions:
GEOMETRY ST_Envelope_Agg(GEOMETRY)
GEOMETRY ST_Union_Agg(GEOMETRY)
GEOMETRY ST_Intersection_Agg(GEOMETRY)
ST_Read() ‑ Read Spatial Data from Files The spatial extension provides a ST_Read table function based on the GDAL translator
library to read spatial data from a variety of geospatial vector file formats as if they were DuckDB tables. For example to create a new table
from a GeoJSON file, you can use the following query:
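-- hypothetical file path
CREATE TABLE new_tbl AS SELECT * FROM ST_Read('some/file/path/filename.json');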
ST_Read can take a number of optional arguments; the full signature is:
ST_Read(
VARCHAR,
sequential_layer_scan : BOOLEAN,
spatial_filter : WKB_BLOB,
open_options : VARCHAR[],
layer : VARCHAR,
allowed_drivers : VARCHAR[],
sibling_files : VARCHAR[],
spatial_filter_box : BOX_2D,
keep_wkb : BOOLEAN
)
• sequential_layer_scan (default: false): If set to true, the table function will scan through all layers sequentially and return
the first layer that matches the given layer name. This is required for some drivers to work properly, e.g., the OSM driver.
• spatial_filter (default: NULL): If set to a WKB blob, the table function will only return rows that intersect with the given WKB
geometry. Some drivers may support efficient spatial filtering natively, in which case it will be pushed down. Otherwise the filtering
is done by GDAL which may be much slower.
• open_options (default: []): A list of key‑value pairs that are passed to the GDAL driver to control the opening of the file. E.g., the
GeoJSON driver supports a FLATTEN_NESTED_ATTRIBUTES=YES option to flatten nested attributes.
• layer (default: NULL): The name of the layer to read from the file. If NULL, the first layer is returned. Can also be a layer index
(starting at 0).
• allowed_drivers (default: []): A list of GDAL driver names that are allowed to be used to open the file. If empty, all drivers are
allowed.
• sibling_files (default: []): A list of sibling files that are required to open the file. E.g., the ESRI Shapefile driver requires
a .shx file to be present, although most of the time these can be discovered automatically.
• spatial_filter_box (default: NULL): If set to a BOX_2D, the table function will only return rows that intersect with the given
bounding box. Similar to spatial_filter.
• keep_wkb (default: false): If set, the table function will return geometries in a wkb_geometry column with the type WKB_BLOB
(which can be cast to BLOB) instead of GEOMETRY. This is useful if you want to use DuckDB with more exotic geometry subtypes that
DuckDB spatial doesn't support representing in the GEOMETRY type yet.
Note that GDAL is single‑threaded, so this table function will not be able to make full use of parallelism. We're planning to implement
support for the most common vector formats natively in this extension with additional table functions in the future.
We currently support over 50 different formats. You can generate the table of supported GDAL drivers, including their capabilities
(short_name, long_name, can_create, can_copy, can_open, help_url), yourself by executing:
SELECT * FROM ST_Drivers();
Note that far from all of these drivers have been tested properly, and some may require additional options to be passed to work as expected.
If you run into any issues please first consult the GDAL docs.
ST_ReadOsm() ‑ Read Compressed OSM Data The spatial extension also provides an experimental ST_ReadOsm() table function to
read compressed OSM data directly from a .osm.pbf file.
This will use multithreading and zero-copy protobuf parsing, which makes it a lot faster than using the ST_Read() OSM driver, but it only
outputs the raw OSM data (nodes, ways, relations), without constructing any geometries. For node entities you can trivially construct
POINT geometries, but it is also possible to construct LINESTRING and POLYGON geometries by manually joining refs and nodes together in SQL.
Example usage:
SELECT *
FROM st_readosm('tmp/data/germany.osm.pbf')
WHERE tags['highway'] != []
LIMIT 5;
┌──────────────────────┬────────┬──────────────────────┬─────────┬────────────────────┬────────────┬───────────┬───────────────────────┐
│         kind         │   id   │         tags         │  refs   │        lat         │    lon     │ ref_roles │       ref_types       │
│ enum('node', 'way'…  │ int64  │ map(varchar, varch…  │ int64[] │       double       │   double   │ varchar[] │ enum('node', 'way', … │
├──────────────────────┼────────┼──────────────────────┼─────────┼────────────────────┼────────────┼───────────┼───────────────────────┤
│ node                 │ 122351 │ {bicycle=yes, butt…} │         │         53.5492951 │   9.977553 │           │                       │
│ node                 │ 122397 │ {crossing=no, high…} │         │ 53.520990100000006 │ 10.0156924 │           │                       │
│ node                 │ 122493 │ {TMC:cid_58:tabcd_…} │         │ 53.129614600000004 │  8.1970173 │           │                       │
│ node                 │ 123566 │ {highway=traffic_s…} │         │ 54.617268200000005 │  8.9718171 │           │                       │
│ node                 │ 125801 │ {TMC:cid_58:tabcd_…} │         │ 53.070685000000005 │  8.7819939 │           │                       │
└──────────────────────┴────────┴──────────────────────┴─────────┴────────────────────┴────────────┴───────────┴───────────────────────┘
The spatial extension also provides ”replacement scans” for common geospatial file formats, allowing you to query files of these formats
as if they were tables.
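For example (hypothetical path):
SELECT * FROM './path/to/some/shapefile/dataset.shp';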
In practice this is just syntactic sugar for calling ST_Read, so there is no difference in performance. If you want to pass additional options,
you should use the ST_Read table function directly.
Much like the ST_Read table function, the spatial extension provides a GDAL-based COPY function to export DuckDB tables to different
geospatial vector formats. For example, to export a table to a GeoJSON file with generated bounding boxes, you can use the following
query:
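-- tbl is a placeholder table name
COPY tbl TO 'some/file/path/filename.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');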
Available options:
• FORMAT: is the only required option and must be set to GDAL to use the GDAL based copy function.
• DRIVER: is the GDAL driver to use for the export. See the table above for a list of available drivers.
• LAYER_CREATION_OPTIONS: list of options to pass to the GDAL driver. See the GDAL docs for the driver you are using for a list of
available options.
• SRS: Set a spatial reference system as metadata to use for the export. This can be a WKT string, an EPSG code or a proj‑string, basically
anything you would normally be able to pass to GDAL/OGR. This will not perform any reprojection of the input geometry though, it
just sets the metadata if the target driver supports it.
Limitations
Raster types are not supported and there is currently no plan to add them to the extension.
SQLite Extension
The SQLite extension allows DuckDB to directly read and write data from a SQLite database file. The data can be queried directly from the
underlying SQLite tables. Data can be loaded from SQLite tables into DuckDB tables, or vice versa.
INSTALL sqlite;
The extension is loaded automatically upon first use. If you prefer to load it manually, run:
LOAD sqlite;
Usage
To make a SQLite file accessible to DuckDB, use the ATTACH statement, which supports read & write.
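For example, with the sakila example database used below:
ATTACH 'data/db/sakila.db' (TYPE SQLITE);
USE sakila;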
The tables in the file can be read as if they were normal DuckDB tables, but the underlying data is read directly from the SQLite tables in
the file at query time.
SHOW TABLES;
┌────────────────────────┐
│ name │
├────────────────────────┤
│ actor │
│ address │
│ category │
│ city │
│ country │
│ customer │
│ customer_list │
│ film │
│ film_actor │
│ film_category │
│ film_list │
│ film_text │
│ inventory │
│ language │
│ payment │
│ rental │
│ sales_by_film_category │
│ sales_by_store │
│ staff │
│ staff_list │
│ store │
└────────────────────────┘
You can query the tables using SQL, e.g., using the example queries from sakila-examples.sql:
SELECT
cat.name AS category_name,
sum(ifnull(pay.amount, 0)) AS revenue
FROM category cat
LEFT JOIN film_category flm_cat
ON cat.category_id = flm_cat.category_id
LEFT JOIN film fil
ON flm_cat.film_id = fil.film_id
LEFT JOIN inventory inv
ON fil.film_id = inv.film_id
LEFT JOIN rental ren
ON inv.inventory_id = ren.inventory_id
LEFT JOIN payment pay
ON ren.rental_id = pay.rental_id
GROUP BY cat.name
ORDER BY revenue DESC
LIMIT 5;
Data Types
SQLite is a weakly typed database system. As such, when storing data in a SQLite table, types are not enforced. The following is valid SQL
in SQLite:
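CREATE TABLE numbers (i INTEGER);
INSERT INTO numbers VALUES ('hello');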
DuckDB is a strongly typed database system; as such, it requires all columns to have defined types and the system rigorously checks data
for correctness.
When querying SQLite, DuckDB must deduce a specific column type mapping. DuckDB follows SQLite's type affinity rules with a few
extensions:
1. If the declared type contains the string INT then it is translated into the type BIGINT
2. If the declared type of the column contains any of the strings CHAR, CLOB, or TEXT then it is translated into VARCHAR.
3. If the declared type for a column contains the string BLOB or if no type is specified then it is translated into BLOB.
4. If the declared type for a column contains any of the strings REAL, FLOA, DOUB, DEC or NUM then it is translated into DOUBLE.
5. If the declared type is DATE, then it is translated into DATE.
6. If the declared type contains the string TIME, then it is translated into TIMESTAMP.
7. If none of the above apply, then it is translated into VARCHAR.
As DuckDB enforces the corresponding columns to contain only correctly typed values, we cannot load the string ”hello” into a column of
type BIGINT. As such, an error is thrown when reading from the ”numbers” table above:
Error: Mismatch Type Error: Invalid type in column "i": column was declared as integer, found "hello" of
type "text" instead.
The sqlite_all_varchar setting can be used to override the type conversion rules described above and instead always convert the SQLite columns into VARCHAR
columns. Note that this setting must be set before sqlite_attach is called.
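SET GLOBAL sqlite_all_varchar = true;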
SQLite databases can also be opened directly and can be used transparently instead of a DuckDB database file. In any client, when
connecting, a path to a SQLite database file can be provided and the SQLite database will be opened instead.
duckdb data/db/sakila.db
SHOW tables;
┌────────────┐
│ name │
│ varchar │
├────────────┤
│ actor │
│ address │
│ category │
│ · │
│ staff_list │
│ store │
├────────────┤
│ 21 rows │
│ (5 shown) │
└────────────┘
In addition to reading data from SQLite, the extension also allows you to create new SQLite database files, create tables, ingest data into
SQLite and make other modifications to SQLite database files using standard SQL queries.
This allows you to use DuckDB to, for example, export data that is stored in a SQLite database to Parquet, or read data from a Parquet file
into SQLite.
Below is a brief example of how to create a new SQLite database and load data into it.
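ATTACH 'new_sqlite_database.db' AS sqlite_db (TYPE SQLITE);
CREATE TABLE sqlite_db.tbl (id INTEGER, name VARCHAR);
INSERT INTO sqlite_db.tbl VALUES (42, 'DuckDB');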
The resulting SQLite database can then be opened and read from SQLite itself:
sqlite3 new_sqlite_database.db
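SELECT * FROM tbl;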
id name
-- ------
42 DuckDB
Many operations on SQLite tables are supported. All these operations directly modify the SQLite database, and the result of subsequent
operations can then be read using SQLite.
CREATE TABLE
CREATE TABLE sqlite_db.tbl (id INTEGER, name VARCHAR);
INSERT INTO
INSERT INTO sqlite_db.tbl VALUES (42, 'DuckDB');
SELECT
SELECT * FROM sqlite_db.tbl;
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 42 │ DuckDB │
└───────┴─────────┘
COPY
COPY sqlite_db.tbl TO 'data.parquet';
COPY sqlite_db.tbl FROM 'data.parquet';
UPDATE
UPDATE sqlite_db.tbl SET name = 'Woohoo' WHERE id = 42;
DELETE
DELETE FROM sqlite_db.tbl WHERE id = 42;
ALTER TABLE
ALTER TABLE sqlite_db.tbl ADD COLUMN k INTEGER;
DROP TABLE
DROP TABLE sqlite_db.tbl;
CREATE VIEW
CREATE VIEW sqlite_db.v1 AS SELECT 42;
Transactions
CREATE TABLE sqlite_db.tmp (i INTEGER);
BEGIN;
INSERT INTO sqlite_db.tmp VALUES (42);
SELECT * FROM sqlite_db.tmp;
┌───────┐
│ i │
│ int64 │
├───────┤
│ 42 │
└───────┘
ROLLBACK;
SELECT * FROM sqlite_db.tmp;
┌────────┐
│ i │
│ int64 │
├────────┤
│ 0 rows │
└────────┘
Note. Deprecated The old sqlite_attach function is deprecated. It is recommended to switch over to the new ATTACH syntax.
Substrait Extension
The main goal of the substrait extension is to support both production and consumption of Substrait query plans in DuckDB.
This extension is mainly exposed via 3 different APIs ‑ the SQL API, the Python API, and the R API. Here we depict how to consume and
produce Substrait query plans in each API.
Note. The Substrait integration is currently experimental. Support is currently only available on request. If you have not asked for
permission to ask for support, contact us prior to opening an issue. If you open an issue without doing so, we will close it without
further review.
The Substrait extension is an autoloadable extension, meaning that it will be loaded at runtime whenever one of the substrait functions
is called. To explicitly install and load the released version of the Substrait extension, you can also use the following SQL commands:
INSTALL substrait;
LOAD substrait;
SQL
In the SQL API, users can generate Substrait plans (into a BLOB or a JSON) and consume Substrait plans.
BLOB Generation To generate a Substrait BLOB the get_substrait(sql) function must be called with a valid SQL select query.
.mode line
CALL get_substrait('SELECT count(exercise) AS exercise FROM crossfit WHERE difficulty_level <= 5');
JSON Generation To generate a JSON representing the Substrait plan the get_substrait_json(sql) function must be called with
a valid SQL select query.
CALL get_substrait_json('SELECT count(exercise) AS exercise FROM crossfit WHERE difficulty_level <= 5');
Json =
{"extensions":[{"extensionFunction":{"functionAnchor":1,"name":"lte"}},{"extensionFunction":{"functionAnchor":2,"name":"is_not_null"}},{"extensionFunction":{"functionAnchor":3,"name":"and"}},{"extensionFunction":{"functionAnchor":4,"name":"count"}}],…}
BLOB Consumption To consume a Substrait BLOB the from_substrait(blob) function must be called with a valid Substrait BLOB
plan.
CALL from_substrait('\x12\x09\x1A\x07\x10\x01\x1A\x03lte\x12\x11\x1A\x0F\x10\x02\x1A\x0Bis_not_
null\x12\x09\x1A\x07\x10\x03\x1A\x03and\x12\x0B\x1A\x09\x10\x04\x1A\x05count\x1A\xC8\x01\x12\xC5\x01\x0A\xB8\x01:
level\x12\x11\x0A\x07\xB2\x01\x04\x08\x0D\x18\x01\x0A\x04*\x02\x10\x01\x18\x02\x1AJ\x1AH\x08\x03\x1A\x04\x0A\x02\
\x1A\x1E\x08\x01\x1A\x04*\x02\x10\x01\x22\x0C\x1A\x0A\x12\x08\x0A\x04\x12\x02\x08\x01\x22\x00\x22\x06\x1A\x04\x0A
exercise = 2
Python
The Substrait extension is autoloadable, but if you prefer to install and load it explicitly, you can use the relevant Python syntax within a connection:
import duckdb
con = duckdb.connect()
con.install_extension("substrait")
con.load_extension("substrait")
BLOB Generation To generate a Substrait BLOB the get_substrait(sql) function must be called, from a connection, with a valid
SQL select query.
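# a sketch; the query and the crossfit table mirror the SQL examples above,
# and the first column of the result holds the serialized plan
proto_bytes = con.get_substrait("SELECT count(exercise) FROM crossfit WHERE difficulty_level <= 5").fetchone()[0]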
JSON Generation To generate a JSON representing the Substrait plan the get_substrait_json(sql) function, from a connection,
must be called with a valid SQL select query.
BLOB Consumption To consume a Substrait BLOB the from_substrait(blob) function must be called, from the connection, with
a valid Substrait BLOB plan.
query_result = con.from_substrait(proto=proto_bytes)
R
By default the extension will be autoloaded on first use. To explicitly install and load this extension in R, use the following commands:
library("duckdb")
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL substrait")
dbExecute(con, "LOAD substrait")
BLOB Generation To generate a Substrait BLOB the duckdb_get_substrait(con, sql) function must be called, with a connec‑
tion and a valid SQL select query.
JSON Generation To generate a JSON representing the Substrait plan, the duckdb_get_substrait_json(con, sql) function must be called, with
a connection and a valid SQL select query.
BLOB Consumption To consume a Substrait BLOB the duckdb_prepare_substrait(con, blob) function must be called, with
a connection and a valid Substrait BLOB plan.
TPC‑DS Extension
The tpcds extension implements the data generator and queries for the TPC‑DS benchmark.
The tpcds extension will be transparently autoloaded on first use from the official extension repository. If you would like to install and
load it manually, run:
INSTALL tpcds;
LOAD tpcds;
Usage
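To generate data for scale factor 1, run:
CALL dsdgen(sf = 1);
Queries can then be run via the tpcds pragma; e.g., to run query 8: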
PRAGMA tpcds(8);
┌──────────────┬────────────────────┐
│ s_store_name │ sum(ss_net_profit) │
│ varchar │ decimal(38,2) │
├──────────────┼────────────────────┤
│ able │ -10354620.18 │
│ ation │ -10576395.52 │
│ bar │ -10625236.01 │
│ ese │ -10076698.16 │
│ ought │ -10994052.78 │
└──────────────┴────────────────────┘
Limitations
The tpcds({query_id}) function runs a fixed TPC-DS query with pre-defined bind parameters (a.k.a. substitution parameters). It is
not possible to change the query parameters using the tpcds extension.
GitHub
TPC‑H Extension
The tpch extension implements the data generator and queries for the TPC‑H benchmark.
The tpch extension is shipped by default in some DuckDB builds, otherwise it will be transparently autoloaded on first use. If you would
like to install and load it manually, run:
INSTALL tpch;
LOAD tpch;
Usage
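To generate data for scale factor 1, run:
CALL dbgen(sf = 1);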
Calling dbgen does not clean up existing TPC-H tables. To clean up existing tables, run DROP TABLE before dbgen. To run a query, e.g., query 4, use the tpch pragma:
PRAGMA tpch(4);
┌─────────────────┬─────────────┐
│ o_orderpriority │ order_count │
│ varchar │ int64 │
├─────────────────┼─────────────┤
│ 1-URGENT │ 21188 │
│ 2-HIGH │ 20952 │
│ 3-MEDIUM │ 20820 │
│ 4-NOT SPECIFIED │ 21112 │
│ 5-LOW │ 20974 │
└─────────────────┴─────────────┘
Listing Queries To list the available queries, run:
FROM tpch_queries();
Listing Expected Answers To produce the expected results for all queries on scale factors 0.01, 0.1, and 1, run:
FROM tpch_answers();
This function returns a table with columns query_nr, scale_factor, and answer.
To generate data sets for large scale factors, which yield larger-than-memory data sets, run the dbgen function in steps. For example, you
may generate SF300 in 10 steps:
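A sketch of the stepwise generation (the children and step parameters split the work into partitions):
CALL dbgen(sf = 300, children = 10, step = 0);
CALL dbgen(sf = 300, children = 10, step = 1);
-- ... repeat for steps 2 through 8 ...
CALL dbgen(sf = 300, children = 10, step = 9);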
Limitations
• The data generator function dbgen is single‑threaded and does not support concurrency. Running multiple steps to parallelize over
different partitions is also not supported at the moment.
• The tpch({query_id}) function runs a fixed TPC‑H query with pre‑defined bind parameters (a.k.a. substitution parameters). It
is not possible to change the query parameters using the tpch extension.
GitHub
Guides
Data Import & Export
When importing data from other systems to DuckDB, there are several considerations to take into account. This page documents the key
approaches recommended to bulk import data to DuckDB.
1. For systems which are supported by a DuckDB scanner extension, it's preferable to use the scanner. DuckDB currently offers scanners
for MySQL, PostgreSQL, and SQLite.
2. If there is a bulk export feature in the data source system, export the data to Parquet or CSV format, then load it using DuckDB's
Parquet or CSV loader.
3. If the approaches above are not applicable, consider using the DuckDB appender, currently available in the C, C++, Go, Java, and
Rust APIs.
4. If the data source system supports Apache Arrow and the data transfer is a recurring task, consider using the DuckDB Arrow extension.
Methods to Avoid
If possible, avoid looping row‑by‑row (tuple‑at‑a‑time) in favor of bulk operations. Performing row‑by‑row inserts (even with prepared
statements) is detrimental to performance and will result in slow load times.
Note. Bestpractice Unless your data is small (<100k rows), avoid using inserts in loops.
CSV Import
To read data from a CSV file, use the read_csv function in the FROM clause of a query.
To create a new table using the result from a query, use CREATE TABLE AS from a SELECT statement.
To load data into an existing table from a query, use INSERT INTO from a SELECT statement.
Alternatively, the COPY statement can also be used to load data from a CSV file into an existing table.
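For example (with hypothetical file and table names):
SELECT * FROM read_csv('input.csv');
CREATE TABLE new_tbl AS SELECT * FROM read_csv('input.csv');
INSERT INTO tbl SELECT * FROM read_csv('input.csv');
COPY tbl FROM 'input.csv';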
For additional options, see the CSV Import reference and the COPY statement documentation.
CSV Export
To export the data from a table to a CSV file, use the COPY statement.
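For example (with hypothetical names):
COPY tbl TO 'output.csv' (HEADER, DELIMITER ',');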
Parquet Import
To read data from a Parquet file, use the read_parquet function in the FROM clause of a query.
To create a new table using the result from a query, use CREATE TABLE AS from a SELECT statement.
To load data into an existing table from a query, use INSERT INTO from a SELECT statement.
Alternatively, the COPY statement can also be used to load data from a Parquet file into an existing table.
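For example (with hypothetical file and table names):
SELECT * FROM read_parquet('input.parquet');
CREATE TABLE new_tbl AS SELECT * FROM read_parquet('input.parquet');
INSERT INTO tbl SELECT * FROM read_parquet('input.parquet');
COPY tbl FROM 'input.parquet';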
Parquet Export
To export the data from a table to a Parquet file, use the COPY statement.
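For example (with hypothetical names):
COPY tbl TO 'output.parquet' (FORMAT PARQUET);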
The flags for setting compression, row group size, etc. are listed in the Reading and Writing Parquet files page.
To run a query directly on a Parquet file, use the read_parquet function in the FROM clause of a query.
The Parquet file will be processed in parallel. Filters will be automatically pushed down into the Parquet scan, and only the relevant columns
will be read automatically.
For more information see the blog post ”Querying Parquet with Precision using DuckDB”.
To load a Parquet file over HTTP(S), the httpfs extension is required. This can be installed using the INSTALL SQL command. This only
needs to be run once.
INSTALL httpfs;
To load the httpfs extension for usage, use the LOAD SQL command:
LOAD httpfs;
After the httpfs extension is set up, Parquet files can be read over http(s):
For example, the read_parquet function can be called with the URL directly; if the URL ends with .parquet, the function call can also be
omitted entirely thanks to DuckDB's replacement scan mechanism. Both variants are sketched below, with a hypothetical URL:
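SELECT * FROM read_parquet('https://some.url/some_file.parquet');
SELECT * FROM 'https://some.url/some_file.parquet';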
S3 Parquet Import
Prerequisites
To load a Parquet file from S3, the httpfs extension is required. This can be installed using the INSTALL SQL command. This only needs
to be run once.
INSTALL httpfs;
To load the httpfs extension for usage, use the LOAD SQL command:
LOAD httpfs;
After loading the httpfs extension, set up the credentials and S3 region to read data:
CREATE SECRET (
TYPE S3,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
REGION 'us-east-1'
);
Note. Tip If you get an IO Error (Connection error for HTTP HEAD), configure the endpoint explicitly via ENDPOINT
's3.⟨your-region⟩.amazonaws.com'.
Alternatively, the credentials can be picked up automatically via the CREDENTIAL_CHAIN provider:
CREATE SECRET (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN
);
Querying
After the httpfs extension is set up and the S3 configuration is set correctly, Parquet files can be read from S3 using the following
command:
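SELECT * FROM 's3://⟨bucket⟩/⟨file⟩.parquet';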
DuckDB can also handle Google Cloud Storage (GCS) and Cloudflare R2 via the S3 API. See the relevant guides for details.
S3 Parquet Export
To write a Parquet file to S3, the httpfs extension is required. This can be installed using the INSTALL SQL command. This only needs to
be run once.
INSTALL httpfs;
To load the httpfs extension for usage, use the LOAD SQL command:
LOAD httpfs;
After loading the httpfs extension, set up the credentials to write data. Note that the region parameter should match the region of the
bucket you want to access.
CREATE SECRET (
TYPE S3,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
REGION 'us-east-1'
);
Note. Tip If you get an IO Error (Connection error for HTTP HEAD), configure the endpoint explicitly via ENDPOINT
's3.⟨your-region⟩.amazonaws.com'.
Alternatively, the credentials can be picked up automatically via the CREDENTIAL_CHAIN provider:
CREATE SECRET (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN
);
After the httpfs extension is set up and the S3 credentials are correctly configured, Parquet files can be written to S3 using the following
command:
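COPY ⟨table_name⟩ TO 's3://⟨bucket⟩/⟨file⟩.parquet';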
Similarly, Google Cloud Storage (GCS) is supported through the Interoperability API. You need to create HMAC keys and provide the
credentials as follows:
CREATE SECRET (
TYPE GCS,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
);
S3 Iceberg Import
Prerequisites
To load an Iceberg file from S3, both the httpfs and iceberg extensions are required. They can be installed using the INSTALL SQL
command. The extensions only need to be installed once.
INSTALL httpfs;
INSTALL iceberg;
LOAD httpfs;
LOAD iceberg;
Credentials
After loading the extensions, set up the credentials and S3 region to read data. You may either use an access key and secret, or a token.
CREATE SECRET (
TYPE S3,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
REGION 'us-east-1'
);
Alternatively, the credentials can be picked up automatically via the CREDENTIAL_CHAIN provider:
CREATE SECRET (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN
);
After the extensions are set up and the S3 credentials are correctly configured, Iceberg tables can be read from S3 using the following
command:
SELECT *
FROM iceberg_scan('s3://⟨bucket⟩/⟨iceberg-table-folder⟩/metadata/⟨id⟩.metadata.json');
Note that you need to link directly to the manifest file; otherwise, you will get an error.
S3 Express One
In late 2023, AWS announced the S3 Express One Zone, a high‑speed variant of traditional S3 buckets. DuckDB can read S3 Express One
buckets using the httpfs extension.
The configuration of S3 Express One buckets is similar to regular S3 buckets with one exception: we have to specify the endpoint according
to the following pattern:
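s3express-⟨availability zone⟩.⟨region⟩.amazonaws.com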
where the ⟨availability zone⟩ (e.g., use1-az5) can be obtained from the S3 Express One bucket's configuration page and the
⟨region⟩ is the AWS region (e.g., us-east-1).
For example, to allow DuckDB to use an S3 Express One bucket, configure the Secrets manager as follows:
CREATE SECRET (
TYPE S3,
REGION 'us-east-1',
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
ENDPOINT 's3express-use1-az5.us-east-1.amazonaws.com'
);
Querying
You can query the S3 Express One bucket as any other S3 bucket:
SELECT *
FROM 's3://express-bucket-name--use1-az5--x-s3/my-file.parquet';
Performance
We ran two experiments on a c7gd.12xlarge instance using the LDBC SF300 Comments creationDate Parquet file (also used in
the microbenchmarks of the performance guide).
The ”loading only” variant runs the load as part of an EXPLAIN ANALYZE statement to measure the runtime without the cost of
creating a local table, while the ”creating local table” variant uses CREATE TABLE ... AS to create a persistent table on the local
disk.
GCS Import
Prerequisites
For Google Cloud Storage (GCS), the Interoperability API enables you to access it like an S3 connection, using the httpfs extension.
This can be installed using the INSTALL SQL command; this only needs to be run once. You then need to create HMAC keys and provide
the credentials in a secret:
CREATE SECRET (
TYPE GCS,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
);
Querying
After setting up the GCS credentials, you can query the GCS data using:
SELECT *
FROM read_parquet('gs://⟨gcs_bucket⟩/⟨file⟩');
Cloudflare R2 Import
Prerequisites
For Cloudflare R2, the S3 Compatibility API allows you to use DuckDB's S3 support to read and write from R2 buckets. This requires the
httpfs extension, which can be installed using the INSTALL SQL command. This only needs to be run once.
You will need to generate an S3 auth token and create an R2 secret in DuckDB:
CREATE SECRET (
TYPE R2,
KEY_ID 'AKIAIOSFODNN7EXAMPLE',
SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
ACCOUNT_ID 'my_account_id'
);
Querying
After setting up the R2 credentials, you can query the R2 data using:
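-- a sketch with placeholder bucket and file names
SELECT * FROM read_parquet('r2://⟨r2_bucket_name⟩/⟨file⟩');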
JSON Import
To read data from a JSON file, use the read_json_auto function in the FROM clause of a query.
To create a new table using the result from a query, use CREATE TABLE AS from a SELECT statement.
To load data into an existing table from a query, use INSERT INTO from a SELECT statement.
Alternatively, the COPY statement can also be used to load data from a JSON file into an existing table.
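For example (with hypothetical file and table names):
SELECT * FROM read_json_auto('input.json');
CREATE TABLE new_tbl AS SELECT * FROM read_json_auto('input.json');
INSERT INTO tbl SELECT * FROM read_json_auto('input.json');
COPY tbl FROM 'input.json';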
For additional options, see the JSON Loading reference and the COPY statement documentation.
JSON Export
To export the data from a table to a JSON file, use the COPY statement.
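For example (with hypothetical names):
COPY tbl TO 'output.json';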
Excel Import
To read data from an Excel file, install and load the spatial extension. This is only needed once per DuckDB connection.
INSTALL spatial;
LOAD spatial;
The layer parameter allows specifying the name of the Excel worksheet.
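SELECT * FROM st_read('test_excel.xlsx', layer = 'Sheet1');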
Creating a New Table To create a new table using the result from a query, use CREATE TABLE ... AS from a SELECT statement.
Loading to an Existing Table To load data into an existing table from a query, use INSERT INTO from a SELECT statement.
Options Several configuration options are also available for the underlying GDAL library that is doing the XLSX parsing. You can pass
them via the open_options parameter of the st_read function as a list of 'KEY=VALUE' strings.
Importing a Sheet with/without a Header The option HEADERS has three possible values: FORCE (treat the first row as a header), DISABLE (treat the first row as data), and AUTO (attempt to detect automatically). For example, to force the first row to be interpreted as a header:
SELECT *
FROM st_read(
'test_excel.xlsx',
layer = 'Sheet1',
open_options = ['HEADERS=FORCE']
);
Detecting Types The option FIELD_TYPES defines how field types should be treated: AUTO attempts to auto-detect the types, while STRING loads all fields as strings.
For example, to treat the first row as a header and use auto-detection for types, run:
SELECT *
FROM st_read(
'test_excel.xlsx',
layer = 'Sheet1',
open_options = ['HEADERS=FORCE', 'FIELD_TYPES=AUTO']
);
To load all fields as strings:
SELECT *
FROM st_read(
'test_excel.xlsx',
layer = 'Sheet1',
open_options = ['FIELD_TYPES=STRING']
);
See Also
DuckDB can also export Excel files. For additional details on Excel support, see the spatial extension page, the GDAL XLSX driver page, and
the GDAL configuration options page.
Excel Export
To export the data from a table to an Excel file, install and load the spatial extension. This is only needed once per DuckDB connection.
INSTALL spatial;
LOAD spatial;
Then use the COPY statement. The file will contain one worksheet with the same name as the file, but without the .xlsx extension.
COPY (SELECT * FROM tbl) TO 'output.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');
Note. Dates and timestamps are currently not supported by the xlsx writer. Cast columns of those types to VARCHAR prior to
creating the xlsx file.
See Also
DuckDB can also import Excel files. For additional details, see the spatial extension page and the GDAL XLSX driver page.
MySQL Import
To run a query directly on a running MySQL database, the mysql extension is required.
The extension can be installed using the INSTALL SQL command. This only needs to be run once.
INSTALL mysql;
To load the mysql extension for usage, use the LOAD SQL command:
LOAD mysql;
Usage
After the mysql extension is installed, you can attach to a MySQL database using the following command:
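-- the connection string below is a placeholder; adjust it to your server
ATTACH 'host=localhost user=root port=0 database=mysqlscanner' AS mysql_db (TYPE MYSQL);
USE mysql_db;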
The string used by ATTACH is a PostgreSQL‑style connection string (not a MySQL connection string!). It is a list of connection arguments
provided in {key}={value} format. Below is a list of valid arguments. Any options not provided are replaced by their default values.
Setting Default
database NULL
host localhost
password
port 0
socket NULL
user current user
PostgreSQL Import
To run a query directly on a running PostgreSQL database, the postgres extension is required.
The extension can be installed using the INSTALL SQL command. This only needs to be run once.
INSTALL postgres;
To load the postgres extension for usage, use the LOAD SQL command:
LOAD postgres;
Usage
After the postgres extension is installed, tables can be queried from PostgreSQL using the postgres_scan function:
-- scan the table "mytable" from the schema "public" in the database "mydb"
SELECT * FROM postgres_scan('host=localhost port=5432 dbname=mydb', 'public', 'mytable');
The first parameter to the postgres_scan function is the PostgreSQL connection string, a list of connection arguments provided in
{key}={value} format (see the Postgres documentation for the full list of valid arguments).
Alternatively, the entire database can be attached using the ATTACH command. This allows you to query all tables stored within the
PostgreSQL database as if it were a regular database:
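ATTACH 'dbname=mydb' AS postgres_db (TYPE POSTGRES);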
SQLite Import
To run a query directly on a SQLite database file, the sqlite extension is required. The extension can be installed using the INSTALL SQL command. This only needs to be run once.
INSTALL sqlite;
To load the sqlite extension for usage, use the LOAD SQL command:
LOAD sqlite;
Usage
After the SQLite extension is installed, tables can be queried from SQLite using the sqlite_scan function:
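-- scan the table "tbl" from the SQLite file "test.db" (placeholder names)
SELECT * FROM sqlite_scan('test.db', 'tbl');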
Alternatively, the entire file can be attached using the ATTACH command. This allows you to query all tables stored within a SQLite database
file as if it were a regular database:
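-- hypothetical file name
ATTACH 'test.db' (TYPE SQLITE);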
Directly Reading Files
DuckDB allows directly reading files via the read_text and read_blob functions. These functions accept a filename, a list of filenames
or a glob pattern, and output the content of each file as a VARCHAR or BLOB, respectively, as well as additional metadata such as the file
size and last modified time.
read_text
The read_text table function reads from the selected source(s) to a VARCHAR.
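For example, reading the test files shown in the result below:
SELECT size, parse_path(filename), content
FROM read_text('test/sql/table_function/files/*.txt');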
┌───────┬───────────────────────────────────────────────┬──────────────┐
│ size │ parse_path(filename) │ content │
│ int64 │ varchar[] │ varchar │
├───────┼───────────────────────────────────────────────┼──────────────┤
│ 12 │ [test, sql, table_function, files, one.txt] │ Hello World! │
│ 2 │ [test, sql, table_function, files, three.txt] │ 42 │
│ 10 │ [test, sql, table_function, files, two.txt] │ Föö Bär │
└───────┴───────────────────────────────────────────────┴──────────────┘
The file content is first validated to be valid UTF-8. If read_text attempts to read a file with invalid UTF-8, an error is thrown suggesting
the use of read_blob instead.
read_blob
The read_blob table function reads from the selected source(s) to a BLOB.
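For example, reading the test files shown in the result below:
SELECT size, content, filename
FROM read_blob('test/sql/table_function/files/*');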
┌───────┬─────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────┐
│ size  │                                             content                                             │                 filename                 │
│ int64 │                                              blob                                               │                 varchar                  │
├───────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────┤
│   178 │ PK\x03\x04\x0A\x00\x00\x00\x00\x00\xACi=X\x14t\xCE\xC7\x0A\x00\x00\x00\x0A\x00\x00\x00\x09\x00… │ test/sql/table_function/files/four.blob  │
│    12 │ Hello World!                                                                                    │ test/sql/table_function/files/one.txt    │
│     2 │ 42                                                                                              │ test/sql/table_function/files/three.txt  │
│    10 │ F\xC3\xB6\xC3\xB6 B\xC3\xA4r                                                                    │ test/sql/table_function/files/two.txt    │
└───────┴─────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────┘
Schema
The schemas of the tables returned by read_text and read_blob are identical:
┌───────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├───────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ filename │ VARCHAR │ YES │ │ │ │
│ content │ VARCHAR │ YES │ │ │ │
│ size │ BIGINT │ YES │ │ │ │
│ last_modified │ TIMESTAMP │ YES │ │ │ │
└───────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
In cases where the underlying filesystem is unable to provide some of this data (e.g., because HTTPFS can't always return a valid
timestamp), the cell is set to NULL instead.
The table functions also utilize projection pushdown to avoid computing properties unnecessarily. For example, you can glob a
directory full of huge files to get the file size in the size column: as long as you omit the content column, the data won't be read into DuckDB.
Performance
Performance Guide
DuckDB aims to automatically achieve high performance by using well-chosen default configurations and having a forgiving architecture.
Of course, there are still opportunities for tuning the system for specific workloads. The Performance Guide's pages contain guidelines and
tips for achieving good performance when loading and processing data with DuckDB.
The guides include several microbenchmarks. You may find details about these on the Benchmarks page.
Schema
Types
It is important to use the correct type for encoding columns (e.g., BIGINT, DATE, DATETIME). While it is always possible to use string types
(VARCHAR, etc.) to encode more specific values, this is not recommended. Strings use more space and are slower to process in operations
such as filtering, join, and aggregation.
When loading CSV files, you may leverage the CSV reader's auto‑detection mechanism to get the correct types for CSV inputs.
If you run in a memory‑constrained environment, using smaller data types (e.g., TINYINT) can reduce the amount of memory and disk
space required to complete a query. DuckDB’s bitpacking compression means small values stored in larger data types will not take up larger
sizes on disk, but they will take up more memory during processing.
Note. Bestpractice Use the most restrictive types possible when creating columns. Avoid using strings for encoding more specific
data items.
Microbenchmark: Using Timestamps We illustrate the difference in aggregation speed using the creationDate column of the LDBC
Comment table on scale factor 300. This table has approx. 554 million unordered timestamp values. We run a simple aggregation query
that returns the average day‑of‑the month from the timestamps in two configurations.
First, we use a DATETIME to encode the values and run the query using the extract datetime function:
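A sketch of the two query variants (assuming the creationDate column in a Comment table; the VARCHAR variant parses the day out of the string):
SELECT avg(extract('day' FROM creationDate)) FROM Comment;
Second, we encode the values as VARCHAR and compute the day using string slicing:
SELECT avg(CAST(creationDate[9:10] AS INTEGER)) FROM Comment;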
The results show that using the DATETIME value yields smaller storage sizes and faster processing.
Microbenchmark: Joining on Strings We illustrate the difference caused by joining on different types by computing a self‑join on the
LDBC Comment table at scale factor 100. The table has 64‑bit integer identifiers used as the id attribute of each row. We perform the
following join operation:
SELECT count(*) AS count
FROM Comment c1
JOIN Comment c2 ON c1.ParentCommentId = c2.id;
In the first experiment, we use the correct (most restrictive) types, i.e., both the id and the ParentCommentId columns are defined as
BIGINT. In the second experiment, we define all columns with the VARCHAR type. While the results of the queries are the same for
both experiments, their runtimes vary significantly: joining on BIGINT columns is approx. 1.8× faster than
performing the same join on VARCHAR-typed columns encoding the same values.
Note. Bestpractice Avoid representing numeric values as strings, especially if you intend to perform e.g. join operations on them.
Constraints
DuckDB allows defining constraints such as UNIQUE, PRIMARY KEY, and FOREIGN KEY. These constraints can be beneficial for ensuring
data integrity, but they have a negative effect on load performance, as they necessitate building indexes and performing checks. Moreover,
they very rarely improve the performance of queries, as DuckDB does not rely on these indexes for join and aggregation operators (see
indexing for more details).
Note. Bestpractice Do not define constraints unless your goal is to ensure data integrity.
We illustrate the effect of using primary keys with the LDBC Comment table at scale factor 300. This table has approx. 554 million entries.
We first create the schema without a primary key, then load the data. In the second experiment, we create the schema with a primary key,
then load the data. In both cases, we take the data from .csv.gz files, and measure the time required to perform the loading.
Note. Bestpractice For best bulk load performance, avoid defining primary key constraints if possible.
Indexing
Zonemaps
DuckDB automatically creates zonemaps (also known as min-max indexes) for the columns of all general-purpose data types. These
indexes are used for predicate pushdown into scan operators and computing aggregations. This means that if a filter criterion (like WHERE
column1 = 123) is in use, DuckDB can skip any row group whose min‑max range does not contain that filter value (e.g., a block with a
min‑max range of 1000 to 2000 will be omitted when comparing for = 123 or < 400).
The Effect of Ordering on Zonemaps The more ordered the data within a column, the more useful the zonemap indexes will be. For
example, in the worst case, a column could contain a random number on every row. DuckDB will be unlikely to be able to skip any row
groups. The best case of ordered data commonly arises with DATETIME columns. If specific columns will be queried with selective filters,
it is best to pre‑order data by those columns when inserting it. Even an imperfect ordering will still be helpful.
Microbenchmark: The Effect of Ordering For an example, let's repeat the microbenchmark for timestamps with a timestamp column
that is sorted in ascending order vs. an unordered one.
The results show that simply keeping the column order allows for improved compression, yielding a 2.5x smaller storage size. It also allows
the computation to be 1.5x faster.
Ordered Integers Another practical way to exploit ordering is to use the INTEGER type with automatic increments rather than UUID for
columns that will be queried using selective filters. UUIDs will likely be inserted in a random order, so many row groups in the table will
need to be scanned to find a specific UUID value, while an ordered INTEGER column will allow all row groups to be skipped except the
one that contains the value.
ART Indexes
DuckDB allows defining Adaptive Radix Tree (ART) indexes in two ways. First, such an index is created implicitly for columns with PRIMARY
KEY, FOREIGN KEY, and UNIQUE constraints. Second, explicitly running the CREATE INDEX statement creates an ART index on the
target column(s).
The effects of an ART index on write performance are as follows:
1. It enables efficient constraint checking upon changes (inserts, updates, and deletes) for non-bulky changes.
2. Having an ART index makes changes to the affected column(s) slower compared to non-indexed performance. That is because of
index maintenance for these operations.
Regarding query performance, an ART index:
1. Speeds up point queries and other highly selective queries using the indexed column(s), where the filtering condition returns approx.
0.1% of all rows or fewer. When in doubt, use EXPLAIN to verify that your query plan uses the index scan.
2. Has no effect on the performance of join, aggregation, and sorting queries.
Indexes are serialized to disk and deserialized lazily, i.e., when the database is reopened, operations using the index will only load the
required parts of the index. Therefore, having an index will not cause any slowdowns when opening an existing database.
Note. Bestpractice We recommend following these guidelines:
• Only use primary keys, foreign keys, or unique constraints, if these are necessary for enforcing constraints on your data.
• Do not define explicit indexes unless you have highly selective queries.
• If you define an ART index, do so after bulk loading the data to the table. Adding an index prior to loading, either explicitly or
via primary/foreign keys, is detrimental to load performance.
Environment
The environment where DuckDB is run has an obvious impact on performance. This page focuses on the effects of the hardware
configuration and the operating system used.
Hardware Configuration
CPU and Memory As a rule of thumb, aggregation‑heavy workloads require approx. 5 GB memory per CPU core and join‑heavy workloads
require approximately 10 GB memory per core for best performance. In AWS EC2, the former are available as general‑purpose instances
(e.g., M7g) and the latter as memory‑optimized instances (e.g., R7g).
Disk DuckDB is capable of operating both as an in‑memory and as a disk‑based database system. In the latter case, it can spill to disk
to process larger‑than‑memory workloads (a.k.a. out‑of‑core processing). In these cases, a fast disk is highly beneficial. However, if the
workload fits in memory, the disk speed only has a limited effect on performance.
In general, network‑based storage will result in slower DuckDB workloads than using local disks. This includes network disks such as NFS,
network drives such as SMB and Samba, and network‑backed cloud disks such as AWS EBS. However, different network disks can have
vastly varying IO performance, ranging from very slow to almost as fast as local. Therefore, for optimal performance, only use network
disks that can provide high IO performance.
Note. Bestpractice Fast disks are important if your workload is larger than memory and/or fast data loading is important. Only use
network‑backed disks if they guarantee high IO.
Operating System
We recommend using the latest stable version of operating systems: macOS, Windows, and Linux are all well-tested, and DuckDB can run
on them with high performance. Among Linux distributions, we recommend using Ubuntu Linux LTS due to its stability and the fact that
most of DuckDB's Linux test suite jobs run on Ubuntu workers.
File Formats
DuckDB has advanced support for Parquet files, which includes directly querying Parquet files. When deciding on whether to query these
files directly or to first load them to the database, you need to consider several factors.
Reasons for Querying Parquet Files Availability of basic statistics: Parquet files use a columnar storage format and contain basic
statistics such as zonemaps. Thanks to these features, DuckDB can leverage optimizations such as projection and filter pushdown on
Parquet files. Therefore, workloads that combine projection, filtering, and aggregation tend to perform quite well when run on Parquet
files.
Storage considerations: Loading the data from Parquet files will require approximately the same amount of space for the DuckDB database
file. Therefore, if the available disk space is constrained, it is worth running the queries directly on Parquet files.
Reasons against Querying Parquet Files Lack of advanced statistics: The DuckDB database format has HyperLogLog statistics that
Parquet files do not have. These improve the accuracy of cardinality estimates, and are especially important if the queries contain a large
number of join operators.
Tip. If you find that DuckDB produces a suboptimal join order on Parquet files, try loading the Parquet files to DuckDB tables. The improved
statistics likely help obtain a better join order.
Repeated queries: If you plan to run multiple queries on the same data set, it is worth loading the data into DuckDB. The queries will
always be somewhat faster, which over time amortizes the initial load time.
High decompression times: Some Parquet files are compressed using heavyweight compression algorithms such as gzip. In these cases,
querying the Parquet files will necessitate an expensive decompression time every time the file is accessed. Meanwhile, lightweight
compression methods like snappy, lz4, and zstd are faster to decompress. You may use the parquet_metadata function to find out the
compression algorithm used.
Microbenchmark: Running TPC‑H on a DuckDB Database vs. Parquet The queries on the TPC‑H benchmark run approximately 1.1‑
5.0x slower on Parquet files than on a DuckDB database.
Note. Best practice: If you have the storage space available, and have a join-heavy workload and/or plan to run many queries on the same dataset, load the Parquet files into the database first. The compression algorithm and the row group sizes in the Parquet files have a large effect on performance: study these using the parquet_metadata function.
The Effect of Row Group Sizes DuckDB works best on Parquet files with row groups of 100K‑1M rows each. The reason for this is that
DuckDB can only parallelize over row groups – so if a Parquet file has a single giant row group it can only be processed by a single thread.
You can use the parquet_metadata function to figure out how many row groups a Parquet file has. When writing Parquet files, use the
row_group_size option.
Microbenchmark: Running Aggregation Query at Different Row Group Sizes We run a simple aggregation query over Parquet files using different row group sizes, ranging from 960 to 1,966,080. The results are as follows.

Row group size | Execution time
960            | 8.77 s
1920           | 8.95 s
3840           | 4.33 s
7680           | 2.35 s
15360          | 1.58 s
30720          | 1.17 s
61440          | 0.94 s
122880         | 0.87 s
245760         | 0.93 s
491520         | 0.95 s
983040         | 0.97 s
1966080        | 0.88 s
The results show that row group sizes below 5,000 have a strongly detrimental effect, making runtimes 5-10x larger than those of ideally-sized row groups, while row group sizes between 5,000 and 20,000 are still 1.5-2.5x off from the best performance. Above a row group size of 100,000, the differences are small: the gap between the best and the worst runtime is about 10%.
Parquet File Sizes DuckDB can also parallelize across multiple Parquet files. It is advisable to have at least as many total row groups across all files as there are CPU threads. For example, on a machine with 10 threads, both 10 files with 1 row group each and 1 file with 10 row groups achieve full parallelism. It is also beneficial to keep the size of individual Parquet files moderate.
Note. Best practice: The ideal range is between 100 MB and 10 GB per individual Parquet file.
Hive Partitioning for Filter Pushdown When querying many files with filter conditions, performance can be improved by using a Hive‑
format folder structure to partition the data along the columns used in the filter condition. DuckDB will only need to read the folders and
files that meet the filter criteria. This can be especially helpful when querying remote files.
More Tips on Reading and Writing Parquet Files For tips on reading and writing Parquet files, see the Parquet Tips page.
CSV files are often distributed in a compressed format, such as gzip archives (.csv.gz). DuckDB can decompress these files on the fly. In fact, this is typically faster than decompressing the files first and then loading them, due to reduced IO.
Loading Many Small CSV Files The CSV reader runs the CSV sniffer on all files. For many small files, this may cause an unnecessarily high overhead. A potential optimization to speed this up is to turn the sniffer off. Assuming that all files have the same CSV dialect and column names/types, get the sniffer options as follows:
.mode line
SELECT Prompt FROM sniff_csv('part-0001.csv');
Then, you can adjust the read_csv command, e.g., by applying filename expansion (globbing), and run it with the rest of the options detected by the sniffer:
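A sketch of such a call (the dialect options and column names here are illustrative, not actual sniffer output):

SELECT *
FROM read_csv('part-*.csv',
    header = true,
    delim = ',',
    quote = '"',
    escape = '"',
    skip = 0,
    columns = {'id': 'BIGINT', 'name': 'VARCHAR'},
    auto_detect = false);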
Tuning Workloads
The Effect of Row Groups on Parallelism DuckDB parallelizes the workload based on row groups, i.e., groups of rows that are stored
together at the storage level. A row group in DuckDB's database format consists of max. 122,880 rows. Parallelism starts at the level of row
groups, therefore, for a query to run on k threads, it needs to scan at least k * 122,880 rows.
Too Many Threads Note that in certain cases DuckDB may launch too many threads (e.g., due to HyperThreading), which can lead to
slowdowns. In these cases, it’s worth manually limiting the number of threads using SET threads = X.
A key strength of DuckDB is support for larger‑than‑memory workloads, i.e., it is able to process data sets that are larger than the available
system memory (also known as out‑of‑core processing). It can also run queries where the intermediate results cannot fit into memory. This
section explains the prerequisites, scope, and known limitations of larger‑than‑memory processing in DuckDB.
Spilling to Disk Larger-than-memory workloads are supported by spilling to disk. If DuckDB is connected to a persistent database file, DuckDB will create a temporary directory named ⟨database_file_name⟩.tmp when the available memory is no longer sufficient to continue processing.
If DuckDB is running in in‑memory mode, it cannot use disk to offload data if it does not fit into main memory. To enable offloading in the
absence of a persistent database file, use the SET temp_directory statement:
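SET temp_directory = '/path/to/temp_dir.tmp/';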
Operators Some operators cannot output a single row until the last row of their input has been seen. These are called blocking operators as they require their entire input to be buffered, and they are the most memory-intensive operators in relational database systems. The main blocking operators are typically sorting (ORDER BY), grouping (GROUP BY), joins (e.g., hash joins), and windowing (OVER).
Limitations DuckDB strives to always complete workloads even if they are larger‑than‑memory. That said, there are some limitations at
the moment:
• If multiple blocking operators appear in the same query, DuckDB may still throw an out‑of‑memory exception due to the complex
interplay of these operators.
• Some aggregate functions, such as list() and string_agg(), do not support offloading to disk.
• Aggregate functions that use sorting are holistic, i.e., they need all inputs before the aggregation can start. As DuckDB cannot yet
offload some complex intermediate aggregate states to disk, these functions can cause an out‑of‑memory exception when run on
large data sets.
• The PIVOT operation internally uses the list() function, therefore it is subject to the same limitation.
Profiling
If your queries are not performing as well as expected, it’s worth studying their query plans:
• Use EXPLAIN to print the physical query plan without running the query.
• Use EXPLAIN ANALYZE to run and profile the query. This will show the CPU time that each step in the query takes. Note that, due to multi-threading, the sum of the individual operator times will be larger than the total query processing time.
Query plans can point to the root of performance issues; the following sections give a few general directions.
Prepared Statements
Prepared statements can improve performance when running the same query many times, but with different parameters. When a state‑
ment is prepared, it completes several of the initial portions of the query execution process (parsing, planning, etc.) and caches their
output. When it is executed, those steps can be skipped, improving performance. This is beneficial mostly for repeatedly running small
queries (with a runtime of < 100ms) with different sets of parameters.
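A minimal sketch (the table and filter column are illustrative):

PREPARE filtered_count AS
    SELECT count(*) FROM tbl WHERE col_a > $1;
EXECUTE filtered_count(42);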
Note that it is not a primary design goal for DuckDB to quickly execute many small queries concurrently. Rather, it is optimized for running
larger, less frequent queries.
DuckDB uses synchronous IO when reading remote files. This means that each DuckDB thread can make at most one HTTP request at a
time. If a query must make many small requests over the network, increasing DuckDB's threads setting to larger than the total number
of CPU cores (approx. 2‑5 times CPU cores) can improve parallelism and performance.
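For example (the value is illustrative; tune it to your workload and core count):

SET threads = 32;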
Avoid Reading Unnecessary Data The main bottleneck in workloads reading remote files is likely to be IO. This means that minimizing the amount of unnecessarily read data can be highly beneficial.
• Avoid SELECT *. Instead, only select columns that are actually used. DuckDB will try to only download the data it actually needs.
• Apply filters on remote Parquet files when possible. DuckDB can use these filters to reduce the amount of data that is scanned.
• Either sort or partition data by columns that are regularly used for filters: this increases the effectiveness of the filters in reducing IO.
To inspect how much remote data is transferred for a query, EXPLAIN ANALYZE can be used to print out the total number of requests
and total data transferred for queries on remote files.
Avoid Reading Data More Than Once DuckDB does not cache data from remote files automatically. This means that running a query on a remote file twice will download the required data twice. So if data needs to be accessed multiple times, storing it locally can make sense. To illustrate this, let's look at an example:
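A sketch of such a pair of queries (the aggregate functions are illustrative):

SELECT avg(col_a) FROM 's3://bucket/file.parquet' WHERE col_a > 10;
SELECT avg(col_b) FROM 's3://bucket/file.parquet' WHERE col_a > 10;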
These queries download the columns col_a and col_b from s3://bucket/file.parquet twice. Now consider the following
queries:
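A sketch of the rewritten variant (the local table name is illustrative):

CREATE TABLE local_copy_of_file AS
    SELECT col_a, col_b FROM 's3://bucket/file.parquet' WHERE col_a > 10;
SELECT avg(col_a) FROM local_copy_of_file;
SELECT avg(col_b) FROM local_copy_of_file;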
Here DuckDB will first copy col_a and col_b from s3://bucket/file.parquet into a local table, and then query the local in‑
memory columns twice. Note also that the filter WHERE col_a > 10 is also now applied only once.
An important side note: the first two queries are fully streaming, with only a small memory footprint, whereas the second approach requires full materialization of the columns col_a and col_b. This means that in some rare cases (e.g., with a high-speed network but very limited memory available), it can actually be beneficial to download the data twice.
DuckDB will perform best when reusing the same database connection many times. Disconnecting and reconnecting on every query will
incur some overhead, which can reduce performance when running many small queries. DuckDB also caches some data and metadata in
memory, and that cache is lost when the last open connection is closed. Frequently, a single connection will work best, but a connection
pool may also be used.
Using multiple connections can parallelize some operations, although it is typically not necessary. DuckDB does attempt to parallelize as
much as possible within each individual query, but it is not possible to parallelize in all cases. Making multiple connections can process
more operations concurrently. This can be more helpful if DuckDB is not CPU limited, but instead bottlenecked by another resource like
network transfer speed.
When importing or exporting data sets (from/to the Parquet or CSV formats) which are much larger than the available memory, an out-of-memory error may occur:
Error: Out of Memory Error: failed to allocate data of size ... (.../... used)
In these cases, consider setting the preserve_insertion_order configuration option to false. This allows the system to re-order any results that do not contain ORDER BY clauses, potentially reducing memory usage.
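For example:

SET preserve_insertion_order = false;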
My Workload Is Slow
If you find that your workload in DuckDB is slow, we recommend performing the following checks. More detailed instructions are linked for
each point.
1. Do you have enough memory? DuckDB works best if you have 5-10 GB of memory per CPU core.
2. Are you using a fast disk? Network-attached disks can cause the workload to slow down, especially for larger-than-memory workloads.
3. Are you using indexes or constraints (primary key, unique, etc.)? If possible, try disabling them, which boosts load and update performance.
4. Are you using the correct types? For example, use TIMESTAMP to encode datetime values.
5. Are you reading from Parquet files? If so, do they have row group sizes between 100k and 1M rows and file sizes between 100 MB and 10 GB?
6. Does the query plan look right? Study it with EXPLAIN.
7. Is the workload running in parallel? Use htop or the operating system's task manager to observe this.
8. Is DuckDB using too many threads? Try limiting the number of threads.
Benchmarks
For several of the recommendations in our performance guide, we use microbenchmarks to back up our claims. For these benchmarks, we
use data sets from the TPC‑H benchmark and the LDBC Social Network Benchmark’s BI workload.
Data Sets
Some of the microbenchmarks use the LDBC BI SF300 data set's Comment table (20 GB .tar.zst archive, 21 GB when decompressed into .csv.gz files), while others use the same table's creationDate column (4 GB .parquet file).
The TPC data sets used in the benchmark are generated with the DuckDB tpch extension.
A Note on Benchmarks
Running fair benchmarks is difficult, especially when performing system-to-system comparisons. When running benchmarks on DuckDB, please make sure you are using the latest version (preferably the nightly build). If in doubt about your benchmark results, feel free to contact us at [email protected].
Disclaimer on Benchmarks
Note that the benchmark results presented in this guide do not constitute official TPC or LDBC benchmark results. Instead, they merely use
the data sets of and some queries provided by the TPC‑H and the LDBC BI benchmark frameworks, and omit other parts of the workloads
such as updates.
Meta Queries
Describe
Describing a Table
In order to view the schema of a table, use DESCRIBE or SHOW followed by the table name.
Describing a Query
In order to view the schema of the result of a query, prepend DESCRIBE to a query.
Note that there are subtle differences: compared to the result when describing a table, nullability (null) and key information (key) are
lost.
DESCRIBE can be used as a subquery. This allows creating a table from the description, for example:
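A sketch, assuming a table tbl exists:

CREATE TABLE tbl_description AS SELECT * FROM (DESCRIBE tbl);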
It is possible to describe remote tables via the httpfs extension using the DESCRIBE TABLE statement. For example:
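For example, describing a remote CSV file over HTTPS (the URL is illustrative):

DESCRIBE TABLE 'https://blobs.duckdb.org/data/Star_Trek-Season_1.csv';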
┌─────────────────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ season_num │ BIGINT │ YES │ │ │ │
│ episode_num │ BIGINT │ YES │ │ │ │
│ aired_date │ DATE │ YES │ │ │ │
│ ... │ ... │ ... │ │ │ │
├─────────────────────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows 6 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────┘
EXPLAIN
By default, only the final physical plan is shown. In order to see the unoptimized and optimized logical plans, change the explain_output setting:
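SET explain_output = 'all';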
Below is an example of running EXPLAIN on Q13 of the TPC‑H benchmark on the scale factor 1 data set.
EXPLAIN
SELECT
c_count,
count(*) AS custdist
FROM (
SELECT
c_custkey,
count(o_orderkey)
FROM
customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%special%requests%'
GROUP BY
c_custkey) AS c_orders (c_custkey,
c_count)
GROUP BY
c_count
ORDER BY
custdist DESC,
c_count DESC;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ count_star() DESC │
│ c_orders.c_count DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ c_count │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ count(o_orderkey) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count(#1) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ c_custkey │
│ o_orderkey │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ RIGHT │
│ o_custkey = c_custkey ├──────────────┐
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ EC: 300000 │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ FILTER ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (o_comment !~~ '%special ││ customer │
│ %requests%') ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ c_custkey │
│ EC: 300000 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ EC: 150000 │
└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ orders │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ o_custkey │
│ o_comment │
│ o_orderkey │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1500000 │
└───────────────────────────┘
EXPLAIN ANALYZE
The query plan will be pretty-printed to the screen, including timings for every operator.
Note that the cumulative wall‑clock time that is spent on every operator is shown. When multiple threads are processing the query in
parallel, the total processing time of the query may be lower than the sum of all the times spent on the individual operators.
Below is an example of running EXPLAIN ANALYZE on Q13 of the TPC‑H benchmark on the scale factor 1 data set.
EXPLAIN ANALYZE
SELECT
c_count,
count(*) AS custdist
FROM (
SELECT
c_custkey,
count(o_orderkey)
FROM
customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%special%requests%'
GROUP BY
c_custkey) AS c_orders (c_custkey,
c_count)
GROUP BY
c_count
ORDER BY
custdist DESC,
c_count DESC;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.0487s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ RESULT_COLLECTOR │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ ORDER_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ORDERS: │
│ count_star() DESC │
│ c_orders.c_count DESC │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 42 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count_star() │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 42 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ c_count │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 150000 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ count(o_orderkey) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 150000 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count(#1) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 150000 │
│ (0.09s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ c_custkey │
│ o_orderkey │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1534302 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ RIGHT │
│ o_custkey = c_custkey │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ├──────────────┐
│ EC: 300000 │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ 1534302 │ │
│ (0.08s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ FILTER ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (o_comment !~~ '%special ││ customer │
│ %requests%') ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ c_custkey │
│ EC: 300000 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ EC: 150000 │
│ 1484298 ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ (0.10s) ││ 150000 │
│ ││ (0.00s) │
└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ orders │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ o_custkey │
│ o_comment │
│ o_orderkey │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1500000 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1500000 │
│ (0.01s) │
└───────────────────────────┘
List Tables
The SHOW TABLES command can be used to obtain a list of all tables within the selected schema. For example, after creating a table tbl:

SHOW TABLES;

┌──────┐
│ name │
├──────┤
│ tbl  │
└──────┘
SHOW or SHOW ALL TABLES can be used to obtain a list of all tables within all attached databases and schemas.
See Also
The SQL‑standard information_schema views are also defined. Moreover, DuckDB defines sqlite_master and many PostgreSQL
system catalog tables for compatibility with SQLite and PostgreSQL respectively.
Summarize
The SUMMARIZE command can be used to easily compute a number of aggregates over a table or a query. The SUMMARIZE command launches a query that computes a number of aggregates over all columns (min, max, approx_unique, avg, std, q25, q50, q75, count), and returns these along with the column name, column type, and the percentage of NULL values in the column.
Usage
In order to summarize the contents of a table, use SUMMARIZE followed by the table name.
SUMMARIZE tbl;
Example
Below is an example of SUMMARIZE on the lineitem table of the TPC-H SF1 data set, generated using the tpch extension.
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);
SUMMARIZE lineitem;
column_name     | column_type   | min         | max                 | approx_unique | avg                 | std                  | q25   | q50   | q75   | count   | null_percentage
----------------|---------------|-------------|---------------------|---------------|---------------------|----------------------|-------|-------|-------|---------|----------------
l_extendedprice | DECIMAL(15,2) | 901.00      | 104949.50           | 923139        | 38255.138484656854  | 23300.43871096221    | 18756 | 36724 | 55159 | 6001215 | 0.0%
l_discount      | DECIMAL(15,2) | 0.00        | 0.10                | 11            | 0.04999943011540163 | 0.03161985510812596  | 0     | 0     | 0     | 6001215 | 0.0%
l_tax           | DECIMAL(15,2) | 0.00        | 0.08                | 9             | 0.04001350893110812 | 0.025816551798842728 | 0     | 0     | 0     | 6001215 | 0.0%
l_returnflag    | VARCHAR       | A           | R                   | 3             | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_linestatus    | VARCHAR       | F           | O                   | 2             | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_shipdate      | DATE          | 1992-01-02  | 1998-12-01          | 2516          | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_commitdate    | DATE          | 1992-01-31  | 1998-10-31          | 2460          | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_receiptdate   | DATE          | 1992-01-04  | 1998-12-31          | 2549          | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_shipinstruct  | VARCHAR       | COLLECT COD | TAKE BACK RETURN    | 4             | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_shipmode      | VARCHAR       | AIR         | TRUCK               | 7             | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
l_comment       | VARCHAR       | Tiresias    | zzle? furiously iro | 3558599       | NULL                | NULL                 | NULL  | NULL  | NULL  | 6001215 | 0.0%
SUMMARIZE can be used as a subquery. This allows creating a table from the summary, for example:
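A sketch, reusing the lineitem table from above:

CREATE TABLE lineitem_summary AS SELECT * FROM (SUMMARIZE lineitem);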
It is possible to summarize remote tables via the httpfs extension using the SUMMARIZE TABLE statement. For example:
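For example (the URL is illustrative):

SUMMARIZE TABLE 'https://example.org/data.csv';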
DuckDB Environment
DuckDB provides a number of functions and PRAGMA options to retrieve information on the running DuckDB instance and its environ‑
ment.
Version
SELECT version();
┌───────────┐
│ version() │
│ varchar │
├───────────┤
│ v0.10.0 │
└───────────┘
Using a PRAGMA:
PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id │
│ varchar │ varchar │
├─────────────────┼────────────┤
│ v0.10.0 │ 20b1486d11 │
└─────────────────┴────────────┘
Platform
The platform information consists of the operating system, system architecture, and, optionally, the compiler. The platform is used when installing extensions. To retrieve the platform, use the following PRAGMA:
PRAGMA platform;
┌───────────┐
│ platform │
│ varchar │
├───────────┤
│ osx_arm64 │
└───────────┘
On Windows, running on an AMD64 architecture, the platform is windows_amd64. On CentOS 7, running on the AMD64 architecture, the
platform is linux_amd64_gcc4. On Ubuntu 22.04, running on the ARM64 architecture, the platform is linux_arm64.
Extensions
To get a list of DuckDB extensions and their status (e.g., loaded, installed), use the duckdb_extensions() function:
SELECT *
FROM duckdb_extensions();
DuckDB has the following built‑in table functions to obtain metadata about available catalog objects:
• duckdb_columns(): columns
• duckdb_constraints(): constraints
• duckdb_databases(): lists the databases that are accessible from within the current DuckDB process
• duckdb_dependencies(): dependencies between objects
• duckdb_extensions(): extensions
• duckdb_functions(): functions
• duckdb_indexes(): secondary indexes
• duckdb_keywords(): DuckDB's keywords and reserved words
• duckdb_optimizers(): the available optimization rules in the DuckDB instance
• duckdb_schemas(): schemas
• duckdb_sequences(): sequences
• duckdb_settings(): settings
ODBC
• What is ODBC?
• General Concepts
• Setting up an Application
• Sample Application
What is ODBC?
ODBC, which stands for Open Database Connectivity, is a standard that allows different programs to talk to different databases including, of course, DuckDB. This makes it easier to build programs that work with many different databases, which saves time as developers don't have to write custom code to connect to each database. Instead, they can use the standardized ODBC interface, which reduces development time and costs, and programs are easier to maintain. However, ODBC can be slower than other methods of connecting to a database, such as using a native driver, as it adds an extra layer of abstraction between the application and the database. Furthermore, because DuckDB is column-based and ODBC is row-based, there can be some inefficiencies when using ODBC with DuckDB.
Note. There are links throughout this page to the official Microsoft ODBC documentation, which is a great resource for learning
more about ODBC.
General Concepts
• Handles
• Connecting
• Error Handling and Diagnostics
• Buffers and Binding
Handles A handle is a pointer to a specific ODBC object which is used to interact with the database. There are several different types of handles, each with a different purpose: the environment handle, the connection handle, the statement handle, and the descriptor handle. Handles are allocated using SQLAllocHandle, which takes as input the type of handle to allocate and a pointer to the handle; the driver then creates a new handle of the specified type and returns it to the application.
Handle Types
Environment (SQL_HANDLE_ENV): manages the environment settings for ODBC operations and provides a global context in which to access data. It is used for initializing ODBC, managing driver behavior, and resource allocation, and must be allocated once per application upon starting and freed at the end.
Connecting The first step is to connect to the data source so that the application can perform database operations. First the application
must allocate an environment handle, and then a connection handle. The connection handle is then used to connect to the data source.
There are two functions which can be used to connect to a data source, SQLDriverConnect and SQLConnect. The former is used to
connect to a data source using a connection string, while the latter is used to connect to a data source using a DSN.
Connection String A connection string is a string which contains the information needed to connect to a data source. It is formatted as a semicolon-separated list of key-value pairs; however, DuckDB currently only utilizes the DSN and ignores the rest of the parameters.
DSN A DSN (Data Source Name) is a string that identifies a database. It can be a file path, URL, or a database name. For example: C:\Users\me\duckdb.db and DuckDB are both valid DSNs. More information on DSNs can be found on the "Choosing a Data Source or Driver" page of the SQL Server documentation.
Error Handling and Diagnostics All functions in ODBC return a code which represents the success or failure of the function. This allows for easy error handling, as the application can simply check the return code of each function call to determine if it was successful. When unsuccessful, the application can then use the SQLGetDiagRec function to retrieve the error information. The standard ODBC return codes are:
• SQL_SUCCESS: the function completed successfully.
• SQL_SUCCESS_WITH_INFO: the function completed, possibly with a non-fatal warning.
• SQL_NO_DATA: no (more) data was available.
• SQL_ERROR: the function failed.
• SQL_INVALID_HANDLE: an invalid handle was supplied.
• SQL_NEED_DATA: the driver needs more data from the application (e.g., data-at-execution parameters).
• SQL_STILL_EXECUTING: an asynchronously executing function is still running.
Buffers and Binding A buffer is a block of memory used to store data. Buffers are used to store data retrieved from the database, or to
send data to the database. Buffers are allocated by the application, and then bound to a column in a result set, or a parameter in a query,
using the SQLBindCol and SQLBindParameter functions. When the application fetches a row from the result set, or executes a query,
the data is stored in the buffer. When the application sends a query to the database, the data in the buffer is sent to the database.
Setting up an Application
The following is a step‑by‑step guide to setting up an application that uses ODBC to connect to a database, execute a query, and fetch the
results in C++.
Note. To install the driver, as well as anything else you will need, follow these instructions.
1. Include the SQL Header Files The first step is to include the SQL header files:
#include <sql.h>
#include <sqlext.h>
These files contain the definitions of the ODBC functions, as well as the data types used by ODBC. In order to be able to use these header files, you have to have the unixodbc package installed (e.g., via brew install unixodbc on macOS, or the unixodbc-dev package on Ubuntu). Then add its include path to your compiler flags:
For MAKEFILE:
CFLAGS=-I/usr/local/include
# or
CFLAGS=-I/opt/homebrew/Cellar/unixodbc/2.3.11/include
For CMAKE:
include_directories(/usr/local/include)
# or
include_directories(/opt/homebrew/Cellar/unixodbc/2.3.11/include)
You also have to link the library in your CMAKE or MAKEFILE.
For CMAKE:
target_link_libraries(ODBC_application /path/to/duckdb_odbc/libduckdb_odbc.dylib)
For MAKEFILE:
LDLIBS=-L/path/to/duckdb_odbc/libduckdb_odbc.dylib
2.a. Connecting with SQLConnect Then set up the ODBC handles, allocate them, and connect to the database. First the environment
handle is allocated, then the environment is set to ODBC version 3, then the connection handle is allocated, and finally the connection is
made to the database. The following code snippet shows how to do this:
SQLHANDLE env;
SQLHANDLE dbc;
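For example, allocating both handles, setting the ODBC version, and connecting to a DSN named DuckDB (the DSN is illustrative):

SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
SQLConnect(dbc, (SQLCHAR*)"DuckDB", SQL_NTS, nullptr, 0, nullptr, 0);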
2.b. Connecting with SQLDriverConnect Alternatively, you can connect to the ODBC driver using SQLDriverConnect.
SQLDriverConnect accepts a connection string in which you can configure the database using any of the available DuckDB con‑
figuration options.
SQLHANDLE env;
SQLHANDLE dbc;
SQLCHAR str[1024];
SQLSMALLINT strl;
std::string dsn = "DSN=DuckDB;allow_unsigned_extensions=true;access_mode=READ_ONLY";
SQLDriverConnect(dbc, nullptr, (SQLCHAR*)dsn.c_str(), SQL_NTS, str, sizeof(str), &strl, SQL_DRIVER_COMPLETE);
3. Adding a Query Now that the application is set up, we can add a query to it. First, we need to allocate a statement handle:
SQLHANDLE stmt;
SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
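Next, execute a query using SQLExecDirect (the query text is illustrative):

SQLExecDirect(stmt, (SQLCHAR*)"SELECT 42", SQL_NTS);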
4. Fetching Results Now that we have executed a query, we can fetch the results. First, we need to bind the columns in the result set to
buffers:
SQLLEN int_val;
SQLLEN null_val;
SQLBindCol(stmt, 1, SQL_C_SLONG, &int_val, 0, &null_val);
SQLFetch(stmt);
5. Go Wild Now that we have the results, we can do whatever we want with them. For example, we can print them:
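std::cout << int_val << std::endl;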
or do any other processing we want, as well as executing more queries and doing anything else we want to do with the database, such as inserting, updating, or deleting data.
6. Free the Handles and Disconnecting Finally, we need to free the handles and disconnect from the database. First, we need to free
the statement handle:
SQLFreeHandle(SQL_HANDLE_STMT, stmt);
SQLDisconnect(dbc);
And finally, we need to free the connection handle and the environment handle:
SQLFreeHandle(SQL_HANDLE_DBC, dbc);
SQLFreeHandle(SQL_HANDLE_ENV, env);
Freeing the connection and environment handles can only be done after the connection to the database has been closed. Trying to free
them before disconnecting from the database will result in an error.
Sample Application
The following is a sample application consisting of a .cpp file that connects to the database, executes a query, fetches the results, and prints them. It also disconnects from the database and frees the handles, and includes a function to check the return value of ODBC functions. Finally, it includes a CMakeLists.txt file that can be used to build the application.
#include <cstdlib>
#include <iostream>
#include <string>
#include <sql.h>
#include <sqlext.h>

// Abort if an ODBC call did not succeed
void check_ret(SQLRETURN ret, const std::string &msg) {
    if (ret != SQL_SUCCESS && ret != SQL_SUCCESS_WITH_INFO) {
        std::cerr << msg << " failed" << std::endl;
        std::exit(1);
    }
}

int main() {
    SQLHANDLE env;
    SQLHANDLE dbc;
    SQLHANDLE stmt;
    // Allocate the environment and connection handles, then connect via the DSN
    check_ret(SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env), "SQLAllocHandle(env)");
    check_ret(SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0), "SQLSetEnvAttr");
    check_ret(SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc), "SQLAllocHandle(dbc)");
    check_ret(SQLConnect(dbc, (SQLCHAR*)"DuckDB", SQL_NTS, nullptr, 0, nullptr, 0), "SQLConnect");
    // Execute a query and bind the first result column
    check_ret(SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt), "SQLAllocHandle(stmt)");
    check_ret(SQLExecDirect(stmt, (SQLCHAR*)"SELECT 42", SQL_NTS), "SQLExecDirect");
    SQLLEN int_val;
    SQLLEN null_val;
    check_ret(SQLBindCol(stmt, 1, SQL_C_SLONG, &int_val, 0, &null_val), "SQLBindCol");
    check_ret(SQLFetch(stmt), "SQLFetch");
    std::cout << int_val << std::endl;
    // Free the statement handle, disconnect, then free the remaining handles
    SQLFreeHandle(SQL_HANDLE_STMT, stmt);
    check_ret(SQLDisconnect(dbc), "SQLDisconnect");
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}
cmake_minimum_required(VERSION 3.25)
project(ODBC_Tester_App)

set(CMAKE_CXX_STANDARD 17)
include_directories(/opt/homebrew/Cellar/unixodbc/2.3.11/include)
add_executable(ODBC_Tester_App main.cpp)
target_link_libraries(ODBC_Tester_App /duckdb_odbc/libduckdb_odbc.dylib)
Python
The latest release of the Python client can be installed using pip.
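pip install duckdb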
The latest Python client can be installed from source from the tools/pythonpkg directory in the DuckDB GitHub repository.
For detailed instructions on how to compile DuckDB from source, see the Building guide.
import duckdb
duckdb.sql("SELECT 42").show()
By default this will create a relation object. The result can be converted to various formats using the result conversion functions. For
example, the fetchall method can be used to convert the result to Python objects.
Several other result objects exist. For example, you can use df to convert the result to a Pandas DataFrame.
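For instance:

import duckdb

duckdb.sql("SELECT 42").fetchall()  # returns [(42,)]
duckdb.sql("SELECT 42").df()        # returns a Pandas DataFrame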
By default, a global in-memory connection will be used. Any data stored in memory will be lost after shutting down the program. A connection to a persistent database can be created using the connect function.
After connecting, SQL queries can be executed using the sql command.
con = duckdb.connect("file.db")
con.sql("CREATE TABLE integers (i INTEGER)")
con.sql("INSERT INTO integers VALUES (42)")
con.sql("SELECT * FROM integers").show()
Jupyter Notebooks
DuckDB's Python client can be used directly in Jupyter notebooks with no additional configuration if desired. However, additional libraries
can be used to simplify SQL query development. This guide will describe how to utilize those additional libraries. See other guides in the
Python section for how to use DuckDB and Python together.
Library Installation
1. jupysql
2. Pandas
3. matplotlib
# Run these pip install commands from the command line if Jupyter Notebook is not yet installed.
# Otherwise, see the Google Colab link above for an in-notebook example
pip install duckdb

# Install Jupyter Notebook (Note: you can also install JupyterLab: pip install jupyterlab)
pip install notebook

# Install the supporting libraries listed above (jupysql, Pandas, matplotlib),
# plus duckdb-engine for the SQLAlchemy connection shown later
pip install jupysql pandas matplotlib duckdb-engine
import duckdb
import pandas as pd
%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb
Connecting to DuckDB via SQLAlchemy Using duckdb_engine Alternatively, you can connect to DuckDB via SQLAlchemy using
duckdb_engine. See the performance and feature differences.
import duckdb
import pandas as pd
# No need to import duckdb_engine
# jupysql will auto-detect the driver needed based on the connection string!
Set configurations on jupysql to directly output data to Pandas and to simplify the output that is printed to the notebook.
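A typical configuration (these SqlMagic options come from jupysql):

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False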
Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Either connect to a new in-memory DuckDB, the default connection, or a file-backed database:
%sql duckdb:///:default:
%sql duckdb:///:memory:
%sql duckdb:///path/to/file.db
Note. The %sql command and duckdb.sql share the same default connection if you provide duckdb:///:default: as the
SQLAlchemy connection string.
Querying DuckDB
Single-line SQL queries can be run using %sql at the start of a line. Query results will be displayed as a Pandas DataFrame.
An entire Jupyter cell can be used as a SQL cell by placing %%sql at the start of the cell. Query results will be displayed as a Pandas DataFrame.
%%sql
SELECT
schema_name,
function_name
FROM duckdb_functions()
ORDER BY ALL DESC
LIMIT 5
To store the query results in a Python variable, use << as an assignment operator. This can be used with both the %sql and %%sql Jupyter
magics.
If the %config SqlMagic.autopandas = True option is set, the variable is a Pandas dataframe, otherwise, it is a ResultSet that
can be converted to Pandas with the DataFrame() function.
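For example:

%sql res << SELECT 42 AS i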
DuckDB is able to find and query any dataframe stored as a variable in the Jupyter notebook.
The dataframe being queried can be specified just like any other table in the FROM clause.
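For example (the DataFrame name is illustrative):

import pandas as pd
my_df = pd.DataFrame({"n": [1, 2, 3]})

%sql SELECT sum(n) AS total FROM my_df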
The most common way to plot datasets in Python is to load them using Pandas and then use matplotlib or seaborn for plotting. This
approach requires loading all data into memory which is highly inefficient. The plotting module in JupySQL runs computations in the SQL
engine. This delegates memory management to the engine and ensures that intermediate computations do not keep eating up memory,
efficiently plotting massive datasets.
Install and Load DuckDB httpfs Extension DuckDB's httpfs extension allows Parquet and CSV files to be queried remotely over HTTP(S).
These examples query a Parquet file that contains historical taxi data from NYC. Using the Parquet format allows DuckDB to only pull the
rows and columns into memory that are needed rather than downloading the entire file. DuckDB can be used to process local Parquet files
as well, which may be desirable if querying the entire Parquet file, or running multiple queries that require large subsets of the file.
%%sql
INSTALL httpfs;
LOAD httpfs;
Boxplot & Histogram To create a boxplot, call %sqlplot boxplot, passing the name of the table and the column to plot. In this case,
the name of the table is the URL of the remotely stored Parquet file.
Now, create a query that filters by the 90th percentile. Note the use of the --save and --no-execute options. This tells JupySQL to store the query but skip execution. It will be referenced in the next plotting call.
To create a histogram, call %sqlplot histogram and pass the name of the table, the column to plot, and the number of bins. This uses
--with short-trips so JupySQL uses the query defined previously and therefore only plots a subset of the data.
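A sketch of these three steps, with an illustrative file URL and column names (consult the JupySQL documentation for the exact flags):

%sqlplot boxplot --table 'https://example.org/taxi.parquet' --column trip_distance

%%sql --save short-trips --no-execute
SELECT *
FROM 'https://example.org/taxi.parquet'
WHERE trip_distance < 6.3

%sqlplot histogram --table short-trips --column trip_distance --bins 10 --with short-trips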
Summary
You now have the ability to alternate between SQL and Pandas in a simple and highly performant way! You can plot massive datasets
directly through the engine (avoiding both the download of the entire file and loading all of it into Pandas in memory). Dataframes can be
read as tables in SQL, and SQL results can be output into Dataframes. Happy analyzing!
SQL on Pandas
Pandas DataFrames stored in local variables can be queried as if they are regular tables within DuckDB.
import duckdb
import pandas
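# a minimal sketch (the DataFrame name and contents are illustrative)
my_df = pandas.DataFrame.from_dict({"a": [42]})
results = duckdb.sql("SELECT * FROM my_df").df()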
The seamless integration of Pandas DataFrames to DuckDB SQL queries is allowed by replacement scans, which replace instances of ac‑
cessing the my_df table (which does not exist in DuckDB) with a table function that reads the my_df dataframe.
CREATE TABLE AS and INSERT INTO can be used to create a table from any query. We can then create tables or insert into existing tables by referring to the Pandas DataFrame in the query.
import duckdb
import pandas
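# a minimal sketch (the names are illustrative)
test_df = pandas.DataFrame.from_dict({"i": [1, 2, 3]})
duckdb.sql("CREATE TABLE my_table AS SELECT * FROM test_df")
duckdb.sql("INSERT INTO my_table SELECT * FROM test_df")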
Export to Pandas
The result of a query can be converted to a Pandas DataFrame using the df() function.
import duckdb
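# a minimal sketch: run a query and convert the result to a Pandas DataFrame
df = duckdb.sql("SELECT 42 AS i").df()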
Arrow Tables stored in local variables can be queried as if they are regular tables within DuckDB.
import duckdb
import pyarrow as pa

con = duckdb.connect()
my_arrow_table = pa.Table.from_pydict({"i": [1, 2, 3, 4]})  # illustrative data

# query the Apache Arrow Table "my_arrow_table" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_table WHERE i = 2").arrow()
Arrow Datasets stored as variables can also be queried as if they were regular tables. Datasets are useful to point towards directories of
Parquet files to analyze large datasets. DuckDB will push column selections and row filters down into the dataset scan operation so that
only the necessary data is pulled into memory.
import duckdb
import pyarrow as pa
import tempfile
import pathlib
import pyarrow.parquet as pq
import pyarrow.dataset as ds

con = duckdb.connect()

# write an example Parquet file into a temporary directory
# and point an Arrow Dataset at that directory (illustrative data and paths)
temp_dir = tempfile.TemporaryDirectory()
pq.write_table(pa.Table.from_pydict({"i": [1, 2, 3, 4]}),
               str(pathlib.Path(temp_dir.name) / "data.parquet"))
my_arrow_dataset = ds.dataset(temp_dir.name)

# query the Apache Arrow Dataset "my_arrow_dataset" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_dataset WHERE i = 2").arrow()
Arrow Scanners stored as variables can also be queried as if they were regular tables. Scanners read over a dataset and select specific
columns or apply row‑wise filtering. This is similar to how DuckDB pushes column selections and filters down into an Arrow Dataset, but
using Arrow compute operations instead. Arrow can use asynchronous IO to quickly access files.
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc

con = duckdb.connect()

# build a scanner with a pushed-down filter over an in-memory table (illustrative data)
arrow_table = pa.Table.from_pydict({"i": [1, 2, 3, 4]})
arrow_scanner = ds.Scanner.from_dataset(ds.dataset(arrow_table), filter=pc.field("i") != 3)

# query the Apache Arrow scanner "arrow_scanner" and return as an Arrow Table
results = con.execute("SELECT * FROM arrow_scanner").arrow()
Arrow RecordBatchReaders are a reader for Arrow's streaming binary format and can also be queried directly as if they were tables. This
streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language run‑
times.
import duckdb
import pyarrow as pa

con = duckdb.connect()
tbl = pa.Table.from_pydict({"i": [1, 2, 3, 4]})  # illustrative data
my_recordbatchreader = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())

# query the Apache Arrow RecordBatchReader "my_recordbatchreader" and return as an Arrow Table
results = con.execute("SELECT * FROM my_recordbatchreader WHERE i = 2").arrow()
CREATE TABLE AS and INSERT INTO can be used to create a table from any query. We can then create tables or insert into existing tables by referring to the Apache Arrow object in the query. This example imports from an Arrow Table, but DuckDB can query different Apache Arrow formats as seen in the SQL on Arrow guide.
import duckdb
import pyarrow as pa
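# a minimal sketch (the names are illustrative)
my_arrow_table = pa.Table.from_pydict({"i": [1, 2, 3]})
duckdb.sql("CREATE TABLE my_table AS SELECT * FROM my_arrow_table")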
All results of a query can be exported to an Apache Arrow Table using the arrow function. Alternatively, results can be returned as a
RecordBatchReader using the fetch_record_batch function and results can be read one batch at a time. In addition, relations built
using DuckDB's Relational API can also be exported.
import duckdb
import pyarrow as pa

my_arrow_table = pa.Table.from_pydict({"i": [1, 2, 3, 4]})  # illustrative data

# query the Apache Arrow Table "my_arrow_table" and return as an Arrow Table
results = duckdb.sql("SELECT * FROM my_arrow_table").arrow()
Export as a RecordBatchReader
import duckdb
import pyarrow as pa

# an example Arrow Table (illustrative data)
my_arrow_table = pa.Table.from_pydict({"i": [1, 2, 3, 4]})
# query the Apache Arrow Table "my_arrow_table" and return as an Arrow RecordBatchReader
chunk_size = 1_000_000
results = duckdb.sql("SELECT * FROM my_arrow_table").fetch_record_batch(chunk_size)
# Loop through the results. A StopIteration exception is thrown when the RecordBatchReader is empty
while True:
    try:
        # Process a single chunk here (just printing as an example)
        print(results.read_next_batch().to_pandas())
    except StopIteration:
        print('Already fetched all batches')
        break
Arrow objects can also be exported from the Relational API. A relation can be converted to an Arrow table using the arrow or to_arrow_
table functions, or a record batch using record_batch. A result can be exported to an Arrow table with arrow or the alias fetch_
arrow_table, or to a RecordBatchReader using fetch_arrow_reader.
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE integers AS SELECT * FROM range(10) t(i)")  # illustrative setup

# Create a relation from the table and export the entire relation as Arrow
rel = con.table("integers")
relation_as_arrow = rel.arrow() # or .to_arrow_table()

# Or, calculate a result using that relation and export that result to Arrow
res = rel.aggregate("sum(i)").execute()
result_as_arrow = res.arrow() # or fetch_arrow_table()
DuckDB offers a relational API that can be used to chain together query operations. These are lazily evaluated so that DuckDB can optimize
their execution. These operators can act on Pandas DataFrames, DuckDB tables or views (which can point to any underlying storage format
that DuckDB can read, such as CSV or Parquet files, etc.). Here we show a simple example of reading from a Pandas DataFrame and returning
a DataFrame.
import duckdb
import pandas

input_df = pandas.DataFrame({"i": [3, 1, 2, 4], "j": ["a", "b", "c", "d"]})  # illustrative data
rel = duckdb.from_df(input_df)

# chain together relational operators (this is a lazy operation, so the operations are not yet executed)
# equivalent to: SELECT i, j, i*2 AS two_i FROM input_df WHERE i >= 2 ORDER BY i DESC LIMIT 2
transformed_rel = rel.filter('i >= 2').project('i, j, i*2 as two_i').order('i desc').limit(2)
result_df = transformed_rel.df()  # materialize the result as a Pandas DataFrame
Relational operators can also be used to group rows, aggregate, find distinct combinations of values, join, union, and more. They are also
able to directly insert results into a DuckDB table or write to a CSV.
Please see these additional examples and the available relational methods on the DuckDBPyRelation class.
This page demonstrates how to simultaneously insert into and read from a DuckDB database across multiple Python threads. This could
be useful in scenarios where new data is flowing in and an analysis should be periodically re‑run. Note that this is all within a single Python
process (see the FAQ for details on DuckDB concurrency). Feel free to follow along in this Google Colab notebook.
Setup
First, import DuckDB and several modules from the Python standard library. Note: if using Pandas, add import pandas at the top of the
script as well (as it must be imported prior to the multi‑threading). Then connect to a file‑backed DuckDB database and create an example
table to store inserted data. This table will track the name of the thread that completed the insert and automatically insert the timestamp
when that insert occurred using the DEFAULT expression.
import duckdb
from threading import Thread, current_thread
import random
duckdb_con = duckdb.connect('my_persistent_db.duckdb')
# Use connect without parameters for an in-memory database
# duckdb_con = duckdb.connect()
duckdb_con.execute("""
CREATE OR REPLACE TABLE my_inserts (
thread_name VARCHAR,
insert_time TIMESTAMP DEFAULT current_timestamp
)
""")
Next, define functions to be executed by the writer and reader threads. Each thread must use the .cursor() method to create a thread‑
local connection to the same DuckDB file based on the original connection. This approach also works with in‑memory DuckDB databases.
def write_from_thread(duckdb_con):
    # Create a DuckDB connection specifically for this thread
    local_con = duckdb_con.cursor()
    # Insert a row with the name of the thread. insert_time is auto-generated.
    thread_name = str(current_thread().name)
    result = local_con.execute("""
        INSERT INTO my_inserts (thread_name)
        VALUES (?)
    """, (thread_name,)).fetchall()
def read_from_thread(duckdb_con):
    # Create a DuckDB connection specifically for this thread
    local_con = duckdb_con.cursor()
    # Query the current row count
    thread_name = str(current_thread().name)
    results = local_con.execute("""
        SELECT
            ? AS thread_name,
            count(*) AS row_counter,
            current_timestamp
        FROM my_inserts
    """, (thread_name,)).fetchall()
    print(results)
Create Threads
We define how many writers and readers to use, and define a list to track all of the Threads that will be created. Then, we first create the writer Threads and then the reader Threads. Next, we shuffle them so that they will be kicked off in a random order to simulate simultaneous writers and readers. Note that the Threads have not yet been executed, only defined.
write_thread_count = 50
read_thread_count = 5
threads = []
# Create multiple writer and reader threads (in the same process)
# Pass in the same connection as an argument
for i in range(write_thread_count):
    threads.append(Thread(target = write_from_thread,
                          args = (duckdb_con,),
                          name = 'write_thread_' + str(i)))

for j in range(read_thread_count):
    threads.append(Thread(target = read_from_thread,
                          args = (duckdb_con,),
                          name = 'read_thread_' + str(j)))

# Shuffle the threads to simulate a random mix of writers and readers
random.shuffle(threads)
Now, kick off all threads to run in parallel, then wait for all of them to finish before printing out the results. Note that the timestamps of
readers and writers are interspersed as expected due to the randomization.
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

print(duckdb_con.execute("""
    SELECT *
    FROM my_inserts
    ORDER BY insert_time
""").df())
Ibis is a Python dataframe library that supports 15+ backends, with DuckDB as the default. Ibis with DuckDB provides a Pythonic interface
for SQL with great performance.
Installation

Install Ibis together with the DuckDB backend, e.g., via pip (the package name and extras follow the Ibis documentation):

pip install 'ibis-framework[duckdb,examples]'

or use conda:

conda install -c conda-forge ibis-framework

or use mamba:

mamba install -c conda-forge ibis-framework
Ibis can work with several file types, but at its core, it connects to existing databases and interacts with the data there. You can get started
with your own DuckDB databases or create a new one with example data.
import ibis
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
"penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite = True
)
# Output:
DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
You can now see the example dataset copied over to the database:
con.list_tables()
# Output:
['penguins']
There's one table, called penguins. We can ask Ibis to give us an object that we can interact with.
penguins = con.table("penguins")
penguins
# Output:
DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
Ibis is lazily evaluated, so instead of seeing the data, we see the schema of the table. To peek at the data, we can call head and then to_
pandas to get the first few rows of the table as a pandas DataFrame.
penguins.head().to_pandas()
to_pandas takes the existing lazy table expression and evaluates it. If we leave it off, you'll see the Ibis representation of the table expres‑
sion that to_pandas will evaluate (when you're ready!).
penguins.head()
# Output:
r0 := DatabaseTable: penguins
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
Limit[r0, n=5]
Ibis returns results as a pandas DataFrame using to_pandas, but isn't using pandas to perform any of the computation. The query is
executed by DuckDB. Only when to_pandas is called does Ibis then pull back the results and convert them into a DataFrame.
Interactive Mode
For the rest of this intro, we'll turn on interactive mode, which partially executes queries to give users a preview of the results. There is a
small difference in the way the output is formatted, but otherwise this is the same as calling to_pandas on the table expression with a
limit of 10 result rows returned.
ibis.options.interactive = True
penguins.head()
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │
│ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │
│ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │
│ Adelie │ Torgersen │ nan │ nan │ NULL │ NULL │ NULL │ 2007 │
│ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Common Operations
Ibis has a collection of useful table methods to manipulate and query the data in a table.
filter filter allows you to select rows based on a condition or set of conditions.
penguins.filter(penguins.species == "Gentoo")
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Gentoo │ Biscoe │ 46.1 │ 13.2 │ 211 │ 4500 │ female │ 2007 │
│ Gentoo │ Biscoe │ 50.0 │ 16.3 │ 230 │ 5700 │ male │ 2007 │
│ Gentoo │ Biscoe │ 48.7 │ 14.1 │ 210 │ 4450 │ female │ 2007 │
│ Gentoo │ Biscoe │ 50.0 │ 15.2 │ 218 │ 5700 │ male │ 2007 │
│ Gentoo │ Biscoe │ 47.6 │ 14.5 │ 215 │ 5400 │ male │ 2007 │
│ Gentoo │ Biscoe │ 46.5 │ 13.5 │ 210 │ 4550 │ female │ 2007 │
│ Gentoo │ Biscoe │ 45.4 │ 14.6 │ 211 │ 4800 │ female │ 2007 │
│ Gentoo │ Biscoe │ 46.7 │ 15.3 │ 219 │ 5200 │ male │ 2007 │
│ Gentoo │ Biscoe │ 43.3 │ 13.4 │ 209 │ 4400 │ female │ 2007 │
│ Gentoo │ Biscoe │ 46.8 │ 15.4 │ 215 │ 5150 │ male │ 2007 │
│ … │ … │ … │ … │ … │ … │ … │ … │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Or filter for Gentoo penguins that have a body mass larger than 6 kg.
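For example, combining both conditions:

penguins.filter((penguins.species == "Gentoo") & (penguins.body_mass_g > 6000))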
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex ┃ year ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string │ string │ float64 │ float64 │ int64 │ int64 │ string │ int64 │
├─────────┼────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Gentoo │ Biscoe │ 49.2 │ 15.2 │ 221 │ 6300 │ male │ 2007 │
│ Gentoo │ Biscoe │ 59.6 │ 17.0 │ 230 │ 6050 │ male │ 2007 │
└─────────┴────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
You can use any boolean comparison in a filter (although if you try to do something like use < on a string, Ibis will yell at you).
select Your data analysis might not require all the columns present in a given table. select lets you pick out only those columns that
you want to work with.
To select a column you can use the name of the column as a string:
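penguins.select("species", "island", "year")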
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species ┃ island ┃ year ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string │ string │ int64 │
├─────────┼───────────┼───────┤
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ … │ … │ … │
└─────────┴───────────┴───────┘
Or you can use column objects directly (this can be convenient when paired with tab‑completion):
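penguins.select(penguins.species, penguins.island, penguins.year)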
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species ┃ island ┃ year ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string │ string │ int64 │
├─────────┼───────────┼───────┤
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ … │ … │ … │
└─────────┴───────────┴───────┘
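You can also mix and match strings and column objects:

penguins.select("species", "island", penguins.year)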
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species ┃ island ┃ year ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string │ string │ int64 │
├─────────┼───────────┼───────┤
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ Adelie │ Torgersen │ 2007 │
│ … │ … │ … │
└─────────┴───────────┴───────┘
mutate mutate lets you add new columns to your table, derived from the values of existing columns.
penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃ bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │ float64        │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie  │ Torgersen │ 39.1           │ 18.7          │ 181               │ 3750        │ male   │ 2007  │ 3.91           │
│ Adelie  │ Torgersen │ 39.5           │ 17.4          │ 186               │ 3800        │ female │ 2007  │ 3.95           │
│ Adelie  │ Torgersen │ 40.3           │ 18.0          │ 195               │ 3250        │ female │ 2007  │ 4.03           │
│ …       │ …         │ …              │ …             │ …                 │ …           │ …      │ …     │ …              │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘
Notice that the table is a little too wide to display all the columns now (depending on your screen‑size). bill_length is now present in
millimeters and centimeters. Use a select to trim down the number of columns we're looking at.
penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10).select(
"species",
"island",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g",
"sex",
"year",
"bill_length_cm",
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃ bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ string │ string │ float64 │ int64 │ int64 │ string │ int64 │ float64 │
├─────────┼───────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie │ Torgersen │ 18.7 │ 181 │ 3750 │ male │ 2007 │ 3.91 │
│ Adelie │ Torgersen │ 17.4 │ 186 │ 3800 │ female │ 2007 │ 3.95 │
│ Adelie │ Torgersen │ 18.0 │ 195 │ 3250 │ female │ 2007 │ 4.03 │
│ Adelie │ Torgersen │ nan │ NULL │ NULL │ NULL │ 2007 │ nan │
│ Adelie │ Torgersen │ 19.3 │ 193 │ 3450 │ female │ 2007 │ 3.67 │
│ Adelie │ Torgersen │ 20.6 │ 190 │ 3650 │ male │ 2007 │ 3.93 │
│ Adelie │ Torgersen │ 17.8 │ 181 │ 3625 │ female │ 2007 │ 3.89 │
│ Adelie │ Torgersen │ 19.6 │ 195 │ 4675 │ male │ 2007 │ 3.92 │
│ Adelie │ Torgersen │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ 3.41 │
│ Adelie │ Torgersen │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ 4.20 │
│ … │ … │ … │ … │ … │ … │ … │ … │
└─────────┴───────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘
selectors Typing out all of the column names except one is a little annoying. Instead of doing that again, we can use a selector to
quickly select or deselect groups of columns.
import ibis.selectors as s
penguins.mutate(bill_length_cm=penguins.bill_length_mm / 10).select(
~s.matches("bill_length_mm")
)
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃ bill_length_cm ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ string  │ string    │ float64       │ int64             │ int64       │ string │ int64 │ float64        │
├─────────┼───────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┼────────────────┤
│ Adelie │ Torgersen │ 18.7 │ 181 │ 3750 │ male │ 2007 │ 3.91 │
│ Adelie │ Torgersen │ 17.4 │ 186 │ 3800 │ female │ 2007 │ 3.95 │
│ Adelie │ Torgersen │ 18.0 │ 195 │ 3250 │ female │ 2007 │ 4.03 │
│ Adelie │ Torgersen │ nan │ NULL │ NULL │ NULL │ 2007 │ nan │
│ Adelie │ Torgersen │ 19.3 │ 193 │ 3450 │ female │ 2007 │ 3.67 │
│ Adelie │ Torgersen │ 20.6 │ 190 │ 3650 │ male │ 2007 │ 3.93 │
│ Adelie │ Torgersen │ 17.8 │ 181 │ 3625 │ female │ 2007 │ 3.89 │
│ Adelie │ Torgersen │ 19.6 │ 195 │ 4675 │ male │ 2007 │ 3.92 │
│ Adelie │ Torgersen │ 18.1 │ 193 │ 3475 │ NULL │ 2007 │ 3.41 │
│ Adelie │ Torgersen │ 20.2 │ 190 │ 4250 │ NULL │ 2007 │ 4.20 │
│ … │ … │ … │ … │ … │ … │ … │ … │
└─────────┴───────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┴────────────────┘
penguins.select("island", s.numeric())
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┓
┃ island ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ year ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━┩
│ string │ float64 │ float64 │ int64 │ int64 │ int64 │
├───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼───────┤
│ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ 2007 │
│ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ 2007 │
│ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ 2007 │
│ Torgersen │ nan │ nan │ NULL │ NULL │ 2007 │
│ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ 2007 │
│ Torgersen │ 39.3 │ 20.6 │ 190 │ 3650 │ 2007 │
│ Torgersen │ 38.9 │ 17.8 │ 181 │ 3625 │ 2007 │
│ Torgersen │ 39.2 │ 19.6 │ 195 │ 4675 │ 2007 │
│ Torgersen │ 34.1 │ 18.1 │ 193 │ 3475 │ 2007 │
│ Torgersen │ 42.0 │ 20.2 │ 190 │ 4250 │ 2007 │
│ … │ … │ … │ … │ … │ … │
└───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴───────┘
order_by order_by arranges the values of one or more columns in ascending or descending order.
penguins.order_by(penguins.flipper_length_mm).select(
"species", "island", "flipper_length_mm"
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ flipper_length_mm ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ int64 │
├───────────┼───────────┼───────────────────┤
│ Adelie    │ Biscoe    │ 172               │
│ …         │ …         │ …                 │
└───────────┴───────────┴───────────────────┘
You can sort in descending order using the desc method of a column:
penguins.order_by(penguins.flipper_length_mm.desc()).select(
"species", "island", "flipper_length_mm"
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ flipper_length_mm ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ int64 │
├─────────┼────────┼───────────────────┤
│ Gentoo │ Biscoe │ 231 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 229 │
│ Gentoo │ Biscoe │ 229 │
│ … │ … │ … │
└─────────┴────────┴───────────────────┘
penguins.order_by(ibis.desc("flipper_length_mm")).select(
"species", "island", "flipper_length_mm"
)
┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ flipper_length_mm ┃
┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ int64 │
├─────────┼────────┼───────────────────┤
│ Gentoo │ Biscoe │ 231 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 230 │
│ Gentoo │ Biscoe │ 229 │
│ Gentoo │ Biscoe │ 229 │
│ … │ … │ … │
└─────────┴────────┴───────────────────┘
aggregate Ibis has several aggregate functions available to help summarize data.
penguins.flipper_length_mm.mean()
# Output:
200.91520467836258
You can compute multiple aggregates at once using the aggregate method:
penguins.aggregate([penguins.flipper_length_mm.mean(), penguins.bill_depth_mm.max()])
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Mean(flipper_length_mm) ┃ Max(bill_depth_mm) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ float64 │ float64 │
├─────────────────────────┼────────────────────┤
│ 200.915205 │ 21.5 │
└─────────────────────────┴────────────────────┘
group_by group_by creates groupings of rows that have the same value for one or more columns.
But it doesn't do much on its own ‑‑ you can pair it with aggregate to get a result.
penguins.group_by("species").aggregate()
┏━━━━━━━━━━━┓
┃ species ┃
┡━━━━━━━━━━━┩
│ string │
├───────────┤
│ Adelie │
│ Gentoo │
│ Chinstrap │
└───────────┘
We grouped by the species column and handed it an "empty" aggregate command. The result of that is a column of the unique values in the species column.
If we add a second column to the group_by, we'll get each unique pairing of the values in those columns.
penguins.group_by(["species", "island"]).aggregate()
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ species ┃ island ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string │ string │
├───────────┼───────────┤
│ Adelie │ Torgersen │
│ Adelie │ Biscoe │
│ Adelie │ Dream │
│ Gentoo │ Biscoe │
│ Chinstrap │ Dream │
└───────────┴───────────┘
Now, if we add an aggregation function to that, we start to really open things up.
penguins.group_by(["species", "island"]).aggregate(penguins.bill_length_mm.mean())
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ Mean(bill_length_mm) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ float64 │
├───────────┼───────────┼──────────────────────┤
│ Adelie │ Torgersen │ 38.950980 │
│ Adelie │ Biscoe │ 38.975000 │
│ Adelie │ Dream │ 38.501786 │
│ Gentoo │ Biscoe │ 47.504878 │
│ Chinstrap │ Dream │ 48.833824 │
└───────────┴───────────┴──────────────────────┘
By adding that mean to the aggregate, we now have a concise way to calculate aggregates over each of the distinct groups in the group_by. And we can calculate as many aggregates as we need.
penguins.group_by(["species", "island"]).aggregate(
[penguins.bill_length_mm.mean(), penguins.flipper_length_mm.max()]
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ Mean(bill_length_mm) ┃ Max(flipper_length_mm) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ float64 │ int64 │
├───────────┼───────────┼──────────────────────┼────────────────────────┤
│ Adelie │ Torgersen │ 38.950980 │ 210 │
│ Adelie │ Biscoe │ 38.975000 │ 203 │
│ Adelie │ Dream │ 38.501786 │ 208 │
│ Gentoo │ Biscoe │ 47.504878 │ 231 │
│ Chinstrap │ Dream │ 48.833824 │ 212 │
└───────────┴───────────┴──────────────────────┴────────────────────────┘
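The table below presumably adds sex to the grouping keys, along these lines:

penguins.group_by(["species", "island", "sex"]).aggregate(
[penguins.bill_length_mm.mean(), penguins.flipper_length_mm.max()]
)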
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island ┃ sex ┃ Mean(bill_length_mm) ┃ Max(flipper_length_mm) ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string │ string │ string │ float64 │ int64 │
├─────────┼───────────┼────────┼──────────────────────┼────────────────────────┤
│ Adelie │ Torgersen │ male │ 40.586957 │ 210 │
│ Adelie │ Torgersen │ female │ 37.554167 │ 196 │
│ Adelie │ Torgersen │ NULL │ 37.925000 │ 193 │
│ Adelie │ Biscoe │ female │ 37.359091 │ 199 │
│ Adelie │ Biscoe │ male │ 40.590909 │ 203 │
│ Adelie │ Dream │ female │ 36.911111 │ 202 │
│ Adelie │ Dream │ male │ 40.071429 │ 208 │
│ Adelie │ Dream │ NULL │ 37.500000 │ 179 │
│ Gentoo │ Biscoe │ female │ 45.563793 │ 222 │
│ Gentoo │ Biscoe │ male │ 49.473770 │ 231 │
│ … │ … │ … │ … │ … │
└─────────┴───────────┴────────┴──────────────────────┴────────────────────────┘
We've already chained some Ibis calls together. We used mutate to create a new column and then select to only view a subset of the
new table. We were just chaining group_by with aggregate.
There's nothing stopping us from putting all of these concepts together to ask questions of the data.
How about:
• What was the largest female penguin (by body mass) on each island in the year 2008?
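One way to express this is a sketch combining filter, group_by, and aggregate:

penguins.filter((penguins.sex == "female") & (penguins.year == 2008)).group_by(
"island"
).aggregate(penguins.body_mass_g.max())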
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ island ┃ Max(body_mass_g) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ string │ int64 │
├───────────┼──────────────────┤
│ Biscoe │ 5200 │
│ Torgersen │ 3800 │
│ Dream │ 3900 │
└───────────┴──────────────────┘
• What about the largest male penguin (by body mass) on each island for each year of data collection?
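A sketch that matches the output below, including the named max_body_mass aggregate and the ordering:

penguins.filter(penguins.sex == "male").group_by(["island", "year"]).aggregate(
max_body_mass=penguins.body_mass_g.max()
).order_by(["year", "max_body_mass"])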
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ island ┃ year ┃ max_body_mass ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
│ string │ int64 │ int64 │
├───────────┼───────┼───────────────┤
│ Dream │ 2007 │ 4650 │
│ Torgersen │ 2007 │ 4675 │
│ Biscoe │ 2007 │ 6300 │
│ Torgersen │ 2008 │ 4700 │
│ Dream │ 2008 │ 4800 │
│ Biscoe │ 2008 │ 6000 │
│ Torgersen │ 2009 │ 4300 │
│ Dream │ 2009 │ 4475 │
│ Biscoe │ 2009 │ 6000 │
└───────────┴───────┴───────────────┘
Learn More
That's all for this quick‑start guide. If you want to learn more, check out the Ibis documentation.
Polars
Polars is a DataFrames library built in Rust with bindings for Python and Node.js. It uses Apache Arrow's columnar format as its memory model. DuckDB can read Polars DataFrames and convert query results to Polars DataFrames. It does this internally using the efficient Apache Arrow integration. Note that the pyarrow library must be installed for the integration to work.
Installation
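The installation presumably only needs the two libraries plus pyarrow, e.g.:

pip install -U duckdb 'polars[pyarrow]'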
Polars to DuckDB
DuckDB can natively query Polars DataFrames by referring to their variable names as they exist in the current scope.
import duckdb
import polars as pl
df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
duckdb.sql("SELECT * FROM df").show()
DuckDB to Polars
DuckDB can output results as Polars DataFrames using the .pl() result‑conversion method.
df = duckdb.sql("""
SELECT 1 AS id, 'banana' AS fruit
UNION ALL
SELECT 2, 'apple'
UNION ALL
SELECT 3, 'mango'"""
).pl()
print(df)
shape: (3, 2)
┌─────┬────────┐
│ id ┆ fruit │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪════════╡
│ 1 ┆ banana │
│ 2 ┆ apple │
│ 3 ┆ mango │
└─────┴────────┘
To learn more about Polars, feel free to explore their Python API Reference.
DuckDB's support for fsspec filesystems allows it to query data in filesystems that the httpfs extension does not support. fsspec has a large number of built-in filesystems, and there are also many external implementations. This capability is only available in DuckDB's Python client, because fsspec is a Python library, while the httpfs extension is available in many DuckDB clients.
Example
The following is an example of using fsspec to query a file in Google Cloud Storage (instead of using their S3‑compatible API).
Firstly, you must install duckdb and fsspec, plus a filesystem implementation of your choice (e.g., gcsfs for Google Cloud Storage).
import duckdb
from fsspec import filesystem
# this line will throw an exception if the appropriate filesystem interface is not installed
duckdb.register_filesystem(filesystem('gcs'))
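Once registered, the filesystem can be used in queries through its protocol prefix; a sketch with a hypothetical bucket and file name:

duckdb.sql("SELECT * FROM read_csv_auto('gcs://my-bucket/my-file.csv')")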
Note. These filesystems are not implemented in C++; hence, their performance may not be comparable to that of the httpfs extension. It is also worth noting that, as they are third-party libraries, they may contain bugs that are beyond our control.
SQL Features
Friendly SQL
DuckDB offers several advanced SQL features as well as extensions to the SQL syntax. We colloquially call these "friendly SQL".
Note. Several of these features are also supported in other systems, while some are (currently) exclusive to DuckDB.
Clauses
Query Features
Data Types
• Formatters: format() function with the fmt syntax and the printf() function
• List comprehensions
• List slicing
• String slicing
• STRUCT.* notation
• Simple LIST and STRUCT creation
Join Types
• ASOF joins
• LATERAL joins
• POSITIONAL joins
Trailing Commas
DuckDB allows trailing commas, both when listing entities (e.g., column and table names) and when constructing LIST items. For example,
the following query works:
SELECT
42 AS x,
['a', 'b', 'c',] AS y,
'hello world' AS z,
;
See Also
For more on these features, see the "Friendlier SQL with DuckDB" posts on the DuckDB blog.
AsOf Join
Time series data is not always perfectly aligned. Clocks may be slightly off, or there may be a delay between cause and effect. This can
make connecting two sets of ordered data challenging. AsOf joins are a tool for solving this and other similar problems.
One of the problems that AsOf joins are used to solve is finding the value of a varying property at a specific point in time. This use case is so common that it is where the name came from: give me the value of the property as of this time.
More generally, however, AsOf joins embody some common temporal analytic semantics, which can be cumbersome and slow to implement in standard SQL.
Let's start with a concrete example. Suppose we have a table of stock prices with timestamps:
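For illustration, assume a prices table with a ticker, a timestamp, and a price, and a holdings table recording how many shares of each ticker are held at a given time (the column names here are assumptions for the sketches below):

CREATE TABLE prices (ticker VARCHAR, "when" TIMESTAMP, price DECIMAL(10, 2));
CREATE TABLE holdings (ticker VARCHAR, "when" TIMESTAMP, shares DECIMAL(10, 2));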
We can compute the value of each holding at that point in time by finding the most recent price before the holding's timestamp by using
an AsOf Join:
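A sketch of such a query, using the tables assumed above (the inequality pairs each holding with the latest price at or before its timestamp):

SELECT h.ticker, h."when", price * shares AS value
FROM holdings h
ASOF JOIN prices p
       ON h.ticker = p.ticker
      AND h."when" >= p."when";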
This attaches the value of the holding at that time to each row:
It essentially executes a function defined by looking up nearby values in the prices table. Note also that missing ticker values do not
have a match and don't appear in the output.
Because AsOf produces at most one match from the right hand side, the left side table will not grow as a result of the join, but it could shrink
if there are missing times on the right. To handle this situation, you can use an outer AsOf Join:
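A sketch of the outer variant, again using the assumed tables:

SELECT h.ticker, h."when", price * shares AS value
FROM holdings h
ASOF LEFT JOIN prices p
            ON h.ticker = p.ticker
           AND h."when" >= p."when";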
As you might expect, this will produce NULL prices and values instead of dropping left side rows when there is no ticker or the time is before
the prices begin.
So far we have been explicit about specifying the conditions for AsOf, but SQL also has a simplified join condition syntax for the common
case where the column names are the same in both tables. This syntax uses the USING keyword to list the fields that should be compared
for equality. AsOf also supports this syntax, but with two restrictions:
• The last attribute in the USING list must be the ordering field, and it is compared with an inequality rather than an equality.
• The inequality used for the ordering field is >= (the only one supported).
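A sketch of the USING form, where "when" comes last because it is the ordering field:

SELECT ticker, h."when", price * shares AS value
FROM holdings h
ASOF JOIN prices p USING (ticker, "when");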
Be aware that if you don't explicitly list the columns in the SELECT, the ordering field value will be the probe value, not the build value.
For a natural join, this is not an issue because all the conditions are equalities, but for AsOf, one side has to be chosen. Since AsOf can be
viewed as a lookup function, it is more natural to return the ”function arguments” than the function internals.
See Also
For implementation details, see the blog post ”DuckDB's AsOf joins: Fuzzy Temporal Lookups”.
Full‑Text Search
DuckDB supports full‑text search via the fts extension. A full‑text index allows for a query to quickly search for all occurrences of individual
words within longer text strings.
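The walkthrough below assumes a corpus table with one row per line of Shakespeare's plays; a hypothetical load might look like:

CREATE TABLE corpus AS
    SELECT * FROM 'shakespeare_lines.parquet';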
DESCRIBE corpus;
┌─────────────┬─────────────┬─────────┐
│ column_name │ column_type │ null │
├─────────────┼─────────────┼─────────┤
│ line_id │ VARCHAR │ YES │
│ play_name │ VARCHAR │ YES │
│ line_number │ VARCHAR │ YES │
│ speaker │ VARCHAR │ YES │
│ text_entry │ VARCHAR │ YES │
└─────────────┴─────────────┴─────────┘
The text of each line is in text_entry, and a unique key for each line is in line_id.
First, we create the index, specifying the table name, the unique id column, and the column(s) to index. We will just index the single column
text_entry, which contains the text of the lines in the play.
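With the fts extension loaded, the index creation looks like this:

PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');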
The table is now ready to query using the Okapi BM25 ranking function. Rows with no match return a null score.
SELECT
fts_main_corpus.match_bm25(line_id, 'butter') AS score,
line_id, play_name, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
ORDER BY score DESC;
┌────────────────────┬─────────────┬──────────────────────────┬──────────────┬────────────────────────────────────────────────────┐
│       score        │   line_id   │        play_name         │   speaker    │                     text_entry                     │
│       double       │   varchar   │         varchar          │   varchar    │                      varchar                       │
├────────────────────┼─────────────┼──────────────────────────┼──────────────┼────────────────────────────────────────────────────┤
│  4.427313429798464 │ H4/2.4.494  │ Henry IV                 │ Carrier      │ As fat as butter.                                  │
│  3.836270302568675 │ H4/1.2.21   │ Henry IV                 │ FALSTAFF     │ prologue to an egg and butter.                     │
│  3.836270302568675 │ H4/2.1.55   │ Henry IV                 │ Chamberlain  │ They are up already, and call for eggs and butter; │
│ 3.3844488405497115 │ H4/4.2.21   │ Henry IV                 │ FALSTAFF     │ toasts-and-butter, with hearts in their bellies no │
│ 3.3844488405497115 │ H4/4.2.62   │ Henry IV                 │ PRINCE HENRY │ already made thee butter. But tell me, Jack, whose │
│ 3.3844488405497115 │ AWW/4.1.40  │ Alls well that ends well │ PAROLLES     │ butter-womans mouth and buy myself another of      │
│ 3.3844488405497115 │ AYLI/3.2.93 │ As you like it           │ TOUCHSTONE   │ right butter-womens rank to market.                │
│ 3.3844488405497115 │ KL/2.4.132  │ King Lear                │ Fool         │ kindness to his horse, buttered his hay.           │
│ 3.0278411214953107 │ AWW/5.2.9   │ Alls well that ends well │ Clown        │ henceforth eat no fish of fortunes buttering.      │
│ 3.0278411214953107 │ MWW/2.2.260 │ Merry Wives of Windsor   │ FALSTAFF     │ Hang him, mechanical salt-butter rogue! I will     │
│ 3.0278411214953107 │ MWW/2.2.284 │ Merry Wives of Windsor   │ FORD         │ rather trust a Fleming with my butter, Parson Hugh │
│ 3.0278411214953107 │ MWW/3.5.7   │ Merry Wives of Windsor   │ FALSTAFF     │ Ill have my brains taen out and buttered, and give │
│ 3.0278411214953107 │ MWW/3.5.102 │ Merry Wives of Windsor   │ FALSTAFF     │ to heat as butter; a man of continual dissolution  │
│  2.739219044070792 │ H4/2.4.115  │ Henry IV                 │ PRINCE HENRY │ Didst thou never see Titan kiss a dish of butter?  │
├────────────────────┴─────────────┴──────────────────────────┴──────────────┴────────────────────────────────────────────────────┤
│ 14 rows                                                                                                                5 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Unlike standard indexes, full-text indexes don't auto-update as the underlying data is changed, so you need to run PRAGMA drop_fts_index(my_fts_index) and recreate the index when appropriate.
For more details, see the "Generating a Shakespeare corpus for full-text searching from JSON" blog post.
SQL Editors
DBeaver is a powerful and popular desktop SQL editor and integrated development environment (IDE). It has both an open-source and an enterprise version. It is useful for visually inspecting the available tables in DuckDB and for quickly building complex queries. DuckDB's JDBC connector allows DBeaver to query DuckDB files, and by extension, any other files that DuckDB can access (like Parquet files).
Installing DBeaver
1. Install DBeaver using the download links and instructions found at their download page.
2. Open DBeaver and create a new connection. Either click on the ”New Database Connection” button or go to Database > New Database
Connection in the menu bar.
3. Select DuckDB from the list of database drivers and click Next.
4. Enter the path or browse to the DuckDB database file you wish to query. To use an in-memory DuckDB (useful primarily if just interested in querying Parquet files, or for testing) enter :memory: as the path.
5. Click ”Test Connection”. This will then prompt you to install the DuckDB JDBC driver. If you are not prompted, see alternative driver
installation instructions below.
6. Click ”Download” to download DuckDB's JDBC driver from Maven. Once download is complete, click ”OK”, then click ”Finish”.
• Note: If you are in a corporate environment or behind a firewall, before clicking download, click the ”Download Configuration” link
to configure your proxy settings.
7. You should now see a database connection to your DuckDB database in the left hand "Database Navigator" pane. Expand it to see the tables and views in your database. Right click on that connection and create a new SQL script.
Alternative Driver Installation
1. If not prompted to install the DuckDB driver when testing your connection, return to the "Connect to a database" dialog and click "Edit Driver Settings".
2. (Alternate) You may also access the driver settings menu by returning to the main DBeaver window and clicking Database > Driver
Manager in the menu bar. Then select DuckDB, then click Edit.
3. Go to the ”Libraries” tab, then click on the DuckDB driver and click ”Download/Update”. If you do not see the DuckDB driver, first
click on ”Reset to Defaults”.
4. Click ”Download” to download DuckDB's JDBC driver from Maven. Once download is complete, click ”OK”, then return to the main
DBeaver window and continue with step 7 above.
• Note: If you are in a corporate environment or behind a firewall, before clicking download, click the ”Download Configuration”
link to configure your proxy settings.
Data Viewers
Tableau is a popular commercial data visualization tool. In addition to a large number of built in connectors, it also provides generic
database connectivity via ODBC and JDBC connectors.
• For Desktop, connecting to a DuckDB database is similar to working in an embedded environment like Python.
• For Online, since DuckDB is in‑process, the data needs to be either on the server itself or in a remote data bucket that is accessible
from the server.
Database Creation
When using a DuckDB database file the data sets do not actually need to be imported into DuckDB tables; it suffices to create views of the
data. For example, this will create a view of the h2oai Parquet test file in the current DuckDB code base:
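A sketch, with a hypothetical full path standing in for the file's actual location:

CREATE VIEW h2oai AS
    SELECT *
    FROM read_parquet('/full/path/to/duckdb/data/parquet-testing/h2oai.parquet');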
Note that you should use full path names to local files so that they can be found from inside Tableau. Also note that you will need to use a version of the driver that is compatible with (i.e., from the same release as) the database format used by the DuckDB tool (e.g., Python module, command line) that was used to create the file.
Tableau provides documentation on how to install a JDBC driver for Tableau to use.
Note. Tableau (both Desktop and Server versions) need to be restarted any time you add or modify drivers.
Driver Links The link here is for a recent version of the JDBC driver that is compatible with Tableau. If you wish to connect to a database
file, you will need to make sure the file was created with a file‑compatible version of DuckDB. Also, check that there is only one version of
the driver installed as there are multiple filenames in use.
If you just want to do something simple, you can try connecting directly to the JDBC driver and using the Tableau-provided PostgreSQL dialect.
1. Download the DuckDB JDBC driver and copy it into Tableau's driver directory (e.g., ~/Library/Tableau/Drivers on macOS).
2. Launch Tableau
3. Under Connect > To a Server > More… click on “Other Databases (JDBC)” This will bring up the connection dialogue box. For the URL,
enter jdbc:duckdb:/Users/username/path/to/database.db. For the Dialect, choose PostgreSQL. The rest of the fields
can be ignored:
However, some functionality will be missing, such as the median and percentile aggregate functions. To make the data source connection more compatible with the PostgreSQL dialect, please use the DuckDB taco connector as described below.
While it is possible to use the Tableau‑provided PostgreSQL dialect to communicate with the DuckDB JDBC driver, we strongly recommend
using the DuckDB ”taco” connector. This connector has been fully tested against the Tableau dialect generator and is more compatible
than the provided PostgreSQL dialect.
The documentation on how to install and use the connector is in its repository, but essentially you will need the duckdb_jdbc.taco
file. The current version of the Taco is not signed, so you will need to launch Tableau with signature validation disabled. (Despite what the
Tableau documentation says, the real security risk is in the JDBC driver code, not the small amount of JavaScript in the Taco.)
Server (Online) On Linux, copy the Taco file to /opt/tableau/connectors. On Windows, copy the Taco file to C:\Program
Files\Tableau\Connectors. Then issue these commands to disable signature validation:
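The commands are presumably along these lines, using Tableau's tsm administration tool:

tsm configuration set -k native_api.disable_verify_connector_plugin_signature -v true
tsm pending-changes apply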
The last command will restart the server with the new settings.
macOS Copy the Taco file to the /Users/[User]/Documents/My Tableau Repository/Connectors folder. Then launch
Tableau Desktop from the Terminal with the command line argument to disable signature validation:
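A sketch (substitute the version of your installed Tableau Desktop):

/Applications/Tableau\ Desktop\ 2023.2.app/Contents/MacOS/Tableau -DDisableVerifyConnectorPluginSignature=true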
You can also package this up with AppleScript by using the following script:
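A sketch of such a script (again, adjust the application name and version to match your install):

do shell script "open /Applications/'Tableau Desktop 2023.2.app' --args -DDisableVerifyConnectorPluginSignature=true"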
Create this file with the Script Editor (located in /Applications/Utilities) and save it as a packaged application.
You can then double‑click it to launch Tableau. You will need to change the application name in the script when you get upgrades.
Windows Desktop Copy the Taco file to the C:\Users\[Windows User]\Documents\My Tableau Repository\Connectors
directory. Then launch Tableau Desktop from a shell with the -DDisableVerifyConnectorPluginSignature=true argument
to disable signature validation.
Output
Once loaded, you can run queries against your data! Here is the result of the first H2O.ai benchmark query from the Parquet test file.
DuckDB can be used with CLI graphing tools to quickly pipe input to stdout to graph your data in one line.
YouPlot is a Ruby‑based CLI tool for drawing visually pleasing plots on the terminal. It can accept input from other programs by piping
data from stdin. It takes tab‑separated (or delimiter of your choice) data and can easily generate various types of plots including bar, line,
histogram and scatter.
With DuckDB, you can write query results to the console (stdout) by using COPY ... TO '/dev/stdout'. You can also write comma-separated values with a header by using WITH (FORMAT 'csv', HEADER).
Installing YouPlot
Installation instructions for YouPlot can be found on the main YouPlot repository. If you're on a Mac, you can use:
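brew install youplot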
By combining COPY ... TO with a CSV output, data can be read from any format supported by DuckDB and piped to YouPlot. There are three important steps to doing this.
1. First, write a SQL query that selects the data to plot.
2. Second, format the output as CSV with a header, using WITH (FORMAT 'csv', HEADER).
3. Finally, wrap the SELECT in the COPY ... TO function with an output location of /dev/stdout.
The full DuckDB command below outputs the query in CSV format with a header:
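A sketch of such a command (the input file and column names follow the description below):

duckdb -s "COPY (SELECT date, sum(purchases) AS total_purchases FROM read_json_auto('input.json') GROUP BY date ORDER BY total_purchases DESC LIMIT 10) TO '/dev/stdout' WITH (FORMAT 'csv', HEADER)"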
Finally, the data can now be piped to YouPlot! Let's assume we have an input.json file with dates and the number of purchases made by somebody on each date. Using the query above, we'll pipe the data to the uplot command to draw a plot of the Top 10 Purchase Dates:
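Sketched with the query abbreviated:

duckdb -s "COPY (...) TO '/dev/stdout' WITH (FORMAT 'csv', HEADER)" \
    | uplot bar -d, -H -t "Top 10 Purchase Dates"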
This tells uplot to draw a bar plot, use a comma-separated delimiter (-d,), that the data has a header (-H), and give the plot a title (-t).
Maybe you're piping some data through jq. Maybe you're downloading a JSON file from somewhere. You can also tell DuckDB to read the
data from another process by changing the filename to /dev/stdin.
Let's combine this with a quick curl from GitHub to see what a certain user has been up to lately.
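A sketch, using a hypothetical GitHub username:

curl -sL "https://api.github.com/users/octocat/events" \
    | duckdb -s "COPY (SELECT type, count(*) AS events FROM read_json_auto('/dev/stdin') GROUP BY type ORDER BY events DESC) TO '/dev/stdout' WITH (FORMAT 'csv', HEADER)" \
    | uplot bar -d, -H -t "GitHub Events"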
Internals
Parser
The parser converts a query string into the following tokens:
• SQLStatement
• QueryNode
• TableRef
• ParsedExpression
The parser is not aware of the catalog or any other aspect of the database. It will not throw errors if tables do not exist, and will not resolve
any types of columns yet. It only transforms a query string into a set of tokens as specified.
ParsedExpression The ParsedExpression represents an expression within a SQL statement. This can be e.g. a reference to a column, an
addition operator or a constant value. The type of the ParsedExpression indicates what it represents, e.g. a comparison is represented as
a ComparisonExpression.
ParsedExpressions do not have types, except for nodes with explicit types such as CAST statements. The types for expressions are resolved
in the Binder, not in the Parser.
TableRef The TableRef represents any table source. This can be a reference to a base table, but it can also be a join, a table‑producing
function or a subquery.
QueryNode The QueryNode represents either (1) a SELECT statement, or (2) a set operation (i.e., UNION, INTERSECT or DIFFERENCE).
SQL Statement The SQLStatement represents a complete SQL statement. The type of the SQL Statement represents what kind of state‑
ment it is (e.g. StatementType::SELECT represents a SELECT statement). A single SQL string can be transformed into multiple SQL
statements in case the original query string contains multiple queries.
Binder
The binder converts all nodes into their bound equivalents. In the binder phase, tables and columns are resolved using the catalog, and the types of expressions are determined.
Logical Planner
The logical planner creates LogicalOperator nodes from the bound statements. In this phase, the actual logical query tree is created.
Optimizer
After the logical planner has created the logical query tree, the optimizers are run over that query tree to create an optimized query plan; examples include filter pushdown and join-order optimization.
Column Binding Resolver The column binding resolver converts logical BoundColumnRefExpression nodes that refer to a column of a specific table into BoundReferenceExpression nodes that refer to a specific index into the DataChunks that are passed around in the execution engine.
Physical Plan Generator The physical plan generator converts the resulting logical operator tree into a PhysicalOperator tree.
Execution
In the execution phase, the physical operators are executed to produce the query result. The execution model is a vectorized volcano
model, where DataChunks are pulled from the root node of the physical operator tree. Each PhysicalOperator itself defines how it grants
its result. A PhysicalTableScan node will pull the chunk from the base tables on disk, whereas a PhysicalHashJoin will perform
a hash join between the output obtained from its child nodes.
Storage
Compatibility
Backward Compatibility Backward compatibility refers to the ability of a newer DuckDB version to read storage files created by an older
DuckDB version. Version 0.10 is the first release of DuckDB that supports backward compatibility in the storage format. DuckDB v0.10 can
read and operate on files created by the previous DuckDB version – DuckDB v0.9.
For future DuckDB versions, our goal is to ensure that any DuckDB version released after this one can read files created by previous versions, starting from this release. We want to ensure that the file format is fully backward compatible. This allows you to keep data stored in DuckDB files
around and guarantees that you will be able to read the files without having to worry about which version the file was written with or having
to convert files between versions.
Forward Compatibility Forward compatibility refers to the ability of an older DuckDB version to read storage files produced by a newer
DuckDB version. DuckDB v0.9 is partially forward compatible with DuckDB v0.10. Certain files created by DuckDB v0.10 can be read by
DuckDB v0.9.
Forward compatibility is provided on a best effort basis. While stability of the storage format is important – there are still many improve‑
ments and innovations that we want to make to the storage format in the future. As such, forward compatibility may be (partially) broken
on occasion.
When you update DuckDB and open an old database file, you might encounter an error message about incompatible storage formats, pointing to this page. To move your database(s) to a newer format, you only need the older and the newer DuckDB executables.
Open your database file with the older DuckDB and run the SQL statement EXPORT DATABASE 'tmp'. This saves the whole state of the current database into the folder tmp. The content of the tmp folder will be overwritten, so choose an empty or not-yet-existing location. Then, start the newer DuckDB and execute IMPORT DATABASE 'tmp' (pointing to the previously populated folder) to load the database, which can then be saved to the file you pointed DuckDB to.
A bash two‑liner (to be adapted with the file names and executable locations) is:
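A sketch, assuming the two executables live at the given paths:

/older/duckdb mydata.db "EXPORT DATABASE 'tmp'"
/newer/duckdb mydata.new.db "IMPORT DATABASE 'tmp'"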
After this, mydata.db will be untouched in the old format, mydata.new.db will contain the same data in a format accessible to more recent DuckDB versions, and the folder tmp will hold the same data in a universal format as separate files.
Storage Header
DuckDB files start with a uint64_t which contains a checksum for the main header, followed by four magic bytes (DUCK), followed by the
storage version number in a uint64_t.
$ hexdump -n 20 -C mydata.db
00000000 01 d0 e2 63 9c 13 39 3e 44 55 43 4b 2b 00 00 00 |...c..9>DUCK+...|
00000010 00 00 00 00 |....|
00000014
The storage version can be read from the header with a few lines of Python (a sketch based on the layout above; substitute your own file name):

import struct
pattern = struct.Struct('<8x4sQ')
with open('mydata.db', 'rb') as fh:
    print(pattern.unpack(fh.read(pattern.size)))
For changes in each given release, check out the change log on GitHub. To see the commits that changed each storage version, see the
commit log.
Storage version    DuckDB version(s)
43 v0.7.0, v0.7.1
39 v0.6.0, v0.6.1
38 v0.5.0, v0.5.1
33 v0.3.3, v0.3.4, v0.4.0
31 v0.3.2
27 v0.3.1
25 v0.3.0
21 v0.2.9
18 v0.2.8
17 v0.2.7
15 v0.2.6
13 v0.2.5
11 v0.2.4
6 v0.2.3
4 v0.2.2
1 v0.2.1 and prior
Compression
DuckDB uses lightweight compression. Note that compression is only applied to persistent databases and is not applied to in‑memory
instances.
Compression Algorithms The compression algorithms supported by DuckDB include the following:
• Constant Encoding
• Run‑Length Encoding (RLE)
• Bit Packing
• Frame of Reference (FOR)
• Dictionary Encoding
• Fast Static Symbol Table (FSST) – VLDB 2020 paper
• Adaptive Lossless Floating‑Point Compression (ALP) – SIGMOD 2024 paper
• Chimp – VLDB 2022 paper
• Patas
Disk Usage
The disk usage of DuckDB's format depends on a number of factors, including the data type and the data distribution, the compression
methods used, etc. As a rough approximation, loading 100 GB of uncompressed CSV files into a DuckDB database file will require 25 GB of
disk space, while loading 100 GB of Parquet files will require 120 GB of disk space.
Row Groups
DuckDB's storage format stores the data in row groups, i.e., horizontal partitions of the data. This concept is equivalent to Parquet's row groups. Several features in DuckDB, including parallelism and compression, are based on row groups.
Troubleshooting
Error Message When Opening an Incompatible Database File When opening a database file that has been written by a different DuckDB
version from the one you are using, the following error message may occur:
Error: unable to open database "...": Serialization Error: Failed to deserialize: ...
The message implies that the database file was created with a newer DuckDB version and uses features that are backward incompatible
with the DuckDB version used to read the file.
Execution Format
Vector is the container format used to store in‑memory data during execution.
DataChunk is a collection of Vectors, used for instance to represent a column list in a PhysicalProjection operator.
Data Flow
Vector Format
Vectors logically represent arrays that contain data of a single type. DuckDB supports different vector formats, which allow the system to store the same logical data with a different physical representation. This allows for a more compressed representation, and potentially allows for compressed execution throughout the system. The supported vector formats are described below.
Flat Vectors Flat vectors are physically stored as a contiguous array; this is the standard uncompressed vector format. For flat vectors, the logical and physical representations are identical.
Constant Vectors Constant vectors are physically stored as a single constant value.
Constant vectors are useful when data elements are repeated ‑ for example, when representing the result of a constant expression in a
function call, the constant vector allows us to only store the value once.
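A hypothetical query illustrating this:

SELECT 'duckdb' AS word FROM range(1000);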
Since duckdb is a string literal, the value of the literal is the same for every row. In a flat vector, we would have to duplicate the literal
'duckdb' once for every row. The constant vector allows us to only store the literal once.
Constant vectors are also emitted by the storage when decompressing from constant compression.
Dictionary Vectors Dictionary vectors are physically stored as a child vector, and a selection vector that contains indexes into the child
vector.
Just like constant vectors, dictionary vectors are also emitted by the storage: when deserializing a dictionary-compressed column segment, we store the data in a dictionary vector so we can keep it compressed during query execution.
Sequence Vectors Sequence vectors are physically stored as an offset and an increment value.
Sequence vectors are useful for efficiently storing incremental sequences. They are generally emitted for row identifiers.
Unified Vector Format The properties of the different vector formats are great for optimization purposes: for example, when all the parameters to a function are constant, we can compute the result once and emit a constant vector. But writing specialized code for every combination of vector types for every function is infeasible due to the combinatorial explosion of possibilities.
Instead of doing this, whenever you want to generically use a vector regardless of the type, the UnifiedVectorFormat can be used.
This format essentially acts as a generic view over the contents of the Vector. Every type of Vector can convert to this format.
Complex Types
String Vectors To efficiently store strings, we make use of our string_t class.
struct string_t {
union {
struct {
uint32_t length;
char prefix[4];
char *ptr;
} pointer;
struct {
uint32_t length;
char inlined[12];
} inlined;
} value;
};
Short strings (<= 12 bytes) are inlined into the structure, while larger strings are stored with a pointer to the data in the auxiliary string
buffer. The length is used throughout the functions to avoid having to call strlen and having to continuously check for null‑pointers. The
prefix is used for comparisons as an early out (when the prefix does not match, we know the strings are not equal and don't need to chase
any pointers).
List Vectors List vectors are stored as a series of list entries together with a child Vector. The child vector contains the values that are
present in the list, and the list entries specify how each individual list is constructed.
struct list_entry_t {
idx_t offset;
idx_t length;
};
The offset refers to the start row in the child Vector, the length keeps track of the size of the list of this row.
List vectors can be stored recursively. For nested list vectors, the child of a list vector is again a list vector. For example, a vector of type BIGINT[][] can be sketched as follows:
{
"type": "list",
"data": "list_entry_t",
"child": {
"type": "list",
"data": "list_entry_t",
"child": {
"type": "bigint",
"data": "int64_t"
}
}
}
Struct Vectors Struct vectors store a list of child vectors. The number and types of the child vectors is defined by the schema of the
struct.
Map Vectors Internally map vectors are stored as a LIST[STRUCT(key KEY_TYPE, value VALUE_TYPE)].
Union Vectors Internally UNION utilizes the same structure as a STRUCT. The first ”child” is always occupied by the Tag Vector of the
UNION, which records for each row which of the UNION's types apply to that row.
Developer Guides
Prerequisites
DuckDB needs CMake and a C++11‑compliant compiler (e.g., GCC, Apple‑Clang, MSVC). Additionally, we recommend using the Ninja build
system, which automatically parallelizes the build process.
Linux Packages Install the required packages with the package manager of your distribution.
Alpine Linux:
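# a typical package set; consult the CI workflows for the authoritative list
apk add g++ git make cmake ninja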
macOS Install Xcode and Homebrew. Then, install the required packages with:
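# assuming Homebrew is already set up
brew install cmake ninja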
Windows Consult the Windows CI workflow for a list of packages used to build DuckDB on Windows.
The DuckDB Python package requires the Microsoft Visual C++ Redistributable package to be built and to run.
Building DuckDB
To build DuckDB we use a Makefile which in turn calls into CMake. We also advise using Ninja as the generator for CMake.
GEN=ninja make
It is not advised to directly call CMake, as the Makefile sets certain variables that are crucial to properly building the package.
Build Type DuckDB can be built in many different configurations; most of these correspond directly to CMake build types, but not all.
release This build is stripped of all assertions, debug symbols, and debug code, and is optimized for performance.
debug This build runs with all the debug information, including symbols, assertions, and #ifdef DEBUG blocks. Note, however, that the special debug defines are not automatically set for this build.
relassert This build does not trigger the #ifdef DEBUG code blocks, but still has debug symbols that make it possible to step
through the execution with line number information and D_ASSERT lines are still checked in this build.
reldebug This build is similar to relassert in many ways, except that assertions are stripped as well.
tidy-check This creates a build and then runs Clang‑Tidy to check for issues or style violations through static analysis.
The CI will also run this check, causing it to fail if this check fails.
format-fix | format-changes | format-main These don't actually create a build, but run format checkers (such as clang-format for the source code and cmake-format for the CMakeLists.txt files) to check for style issues.
The CI will also run this check, causing it to fail if this check fails.
Package Flags For every package that is maintained by core DuckDB, there exists a flag in the Makefile to enable building the package.
These can be enabled either by setting them in the current environment, through setup files like .bashrc or .zshrc, or by setting them before the call to make, for example:
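BUILD_BENCHMARK=1 BUILD_TPCH=1 make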
BUILD_SHELL When this flag is set, the CLI is built; this is usually enabled by default.
BUILD_BENCHMARK When this flag is set, our in‑house Benchmark testing suite is built.
More information about this can be found here.
Extension Flags For every in‑tree extension that is maintained by core DuckDB there exists a flag to enable building and statically linking
the extension into the build.
BUILD_TPCH When this flag is set, the tpch extension is built; this enables TPC-H data generation and query support using dbgen.
BUILD_TPCDS When this flag is set, the tpcds extension is built; this enables TPC-DS data generation and query support using dsdgen.
BUILD_TPCE When this flag is set, the TPC-E extension is built. Unlike TPC-H and TPC-DS, this does not enable data generation and query support, but it does enable tests for TPC-E through our test suite.
BUILD_FTS When this flag is set, the fts (full text search) extension is built.
Debug Flags
CRASH_ON_ASSERT D_ASSERT(condition) is used throughout the code; these assertions throw an InternalException in debug builds. With this flag enabled, a triggered assertion will instead directly cause a crash.
DISABLE_STRING_INLINE In our execution format, string_t has the feature to "inline" strings that are under a certain length (12 bytes), meaning they don't require a separate allocation. When this flag is set, we disable this and don't inline small strings.
DISABLE_MEMORY_SAFETY Our data structures that are used extensively throughout the non-performance-critical code have extra checks to ensure memory safety; these checks include, for example, null-pointer checks and bounds checks on index accesses. With this flag enabled, we remove these checks; this is mostly done to verify that the performance hit of these checks is negligible.
DESTROY_UNPINNED_BLOCKS When previously pinned blocks in the BufferManager are unpinned, with this flag enabled we destroy
them instantly to make sure that there aren't situations where this memory is still being used, despite not being pinned.
DEBUG_STACKTRACE When a crash or assertion hit occurs in a test, print a stack trace.
This is useful when debugging a crash that is hard to pinpoint with a debugger attached.
Miscellaneous Flags
DISABLE_UNITY To improve compilation time, we use Unity Builds to combine translation units. This can, however, hide include bugs; this flag disables the unity build so that such errors can be detected.
DISABLE_SANITIZER In some situations, running an executable that has been built with sanitizers enabled is not supported or can cause problems. Julia is an example of this. With this flag enabled, the sanitizers are disabled for the build.
Extensions can be built from source and installed from the resulting local binary.
Using Build Flags Set the corresponding BUILD_[EXTENSION_NAME] extension flag when running the build, then use the INSTALL
command.
For example, to install the httpfs extension, run the following script:
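A sketch (the paths reflect a default release build):

BUILD_HTTPFS=1 GEN=ninja make
build/release/duckdb -s "INSTALL 'build/release/extension/httpfs/httpfs.duckdb_extension';"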
Using a CMake Configuration File Create an extension configuration file named extension_config.cmake with e.g. the following
content:
duckdb_extension_load(autocomplete)
duckdb_extension_load(fts)
duckdb_extension_load(httpfs)
duckdb_extension_load(inet)
duckdb_extension_load(icu)
duckdb_extension_load(json)
duckdb_extension_load(parquet)
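Then pass the configuration file to the build, e.g.:

EXTENSION_CONFIGS='extension_config.cmake' make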
Troubleshooting
Building the R Package is Slow By default, R compiles packages using a single thread. To parallelize the compilation, create or edit the
~/.R/Makevars file, and add the following content:
MAKEFLAGS = -j8
Building the R Package on Linux AArch64 Building the R package on Linux running on an ARM64 architecture (AArch64) may result in an error message.
Building the httpfs Extension and Python Package on macOS Problem: The build fails on macOS when both the httpfs extension
and the Python package are included:
Solution: In general, we recommend using the nightly builds, available under GitHub main on the installation page. If you would like to build DuckDB from source, avoid using the BUILD_PYTHON=1 flag unless you are actively developing the Python library. Instead, first build the httpfs extension (if required), then build and install the Python package separately using pip:
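BUILD_HTTPFS=1 GEN=ninja make
python3 -m pip install tools/pythonpkg --use-pep517 --user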
If the pip step complains about pybind11 being missing, or --use-pep517 not being supported, make sure you're using a modern version of pip and setuptools. The python3-pip on your OS may not be modern, so you may need to run python3 -m pip install pip -U first.
Building the httpfs Extension on Linux Problem: When building the httpfs extension on Linux, the build may fail with an error. A common cause is missing OpenSSL development headers; installing your distribution's OpenSSL development package (e.g., libssl-dev on Debian/Ubuntu) typically resolves this.
Profiling
Profiling is important to help understand why certain queries exhibit specific performance characteristics. DuckDB contains several built‑in
features to enable query profiling that will be explained on this page.
For the examples on this page we will use the following example data set:
CREATE TABLE students (sid INTEGER PRIMARY KEY, name VARCHAR);
CREATE TABLE exams (cid INTEGER, sid INTEGER, grade INTEGER, PRIMARY KEY (cid, sid));

INSERT INTO students VALUES (1, 'Mark'), (2, 'Hannes'), (3, 'Pedro');
INSERT INTO exams VALUES (1, 1, 8), (1, 2, 8), (1, 3, 7), (2, 1, 9), (2, 2, 10);
EXPLAIN Statement
The first step to profiling a database engine is figuring out what execution plan the engine is using. The EXPLAIN statement allows you to
peek into the query plan and see what is going on under the hood.
The EXPLAIN statement displays the physical plan, i.e., the query plan that will get executed.
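For example, prefixing the example query used throughout this page yields the plan below:

EXPLAIN
SELECT name
FROM students
JOIN exams USING (sid)
WHERE name LIKE 'Ma%';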
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ name │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ INNER │
│ sid = sid ├──────────────┐
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ EC: 1 │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ exams ││ prefix(name, 'Ma') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ sid ││ EC: 1 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ EC: 3 ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ students │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ sid │
│ name │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: name>=Ma AND name│
│ <Mb AND name IS NOT NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1 │
└───────────────────────────┘
Note that the query is not actually executed – therefore, we can only see the estimated cardinality (EC) for each operator, which is calculated
by using the statistics of the base tables and applying heuristics for each operator.
The query plan helps understand the performance characteristics of the system. However, often it is also necessary to look at the perfor‑
mance numbers of individual operators and the cardinalities that pass through them. For this, you can create a query‑profile graph.
To create the query graphs it is first necessary to gather the necessary data by running the query. In order to do that, we must first enable
the run‑time profiling. This can be done by prefixing the query with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT name
FROM students
JOIN exams USING (sid)
WHERE name LIKE 'Ma%';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.0008s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ name │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 2 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ INNER │
│ sid = sid │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ├──────────────┐
│ EC: 1 │ │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │
│ 2 │ │
│ (0.00s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ exams ││ prefix(name, 'Ma') │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ sid ││ EC: 1 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3 ││ 2 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ (0.00s) │
│ 3 ││ │
│ (0.00s) ││ │
└───────────────────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ students │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ sid │
│ name │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Filters: name>=Ma AND name│
│ <Mb AND name IS NOT NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 1 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 2 │
│ (0.00s) │
└───────────────────────────┘
The output of EXPLAIN ANALYZE contains the estimated cardinality (EC), the actual cardinality, and the execution time for each opera‑
tor.
It is also possible to save the query plan to a file, e.g., in JSON format:
Note. This file is overwritten with each query that is issued. If you want to store the profile output for later it should be copied to a
different file.
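A sketch of the relevant settings (the output path is an example):

PRAGMA enable_profiling = 'json';
PRAGMA profiling_output = 'out.json';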
SELECT name
FROM students
JOIN exams USING (sid)
WHERE name LIKE 'Ma%';
After the query is completed, the JSON file containing the profiling output has been written to the specified file. We can then render the query graph using a Python script bundled with DuckDB, provided we have the duckdb Python module installed. This script will generate an HTML file and open it in your web browser.
In query plans, the hash join operators adhere to the following convention: the probe side of the join is the left operand, while the build
side is the right operand.
Join operators in the query plan show the join type used (e.g., INNER for inner joins).
Testing
Testing is vital to make sure that DuckDB works properly and keeps working properly. For that reason, we put a large emphasis on thorough
and frequent testing:
• We run a batch of small tests on every commit using GitHub Actions, and run a more exhaustive batch of tests on pull requests and
commits in the master branch.
• We use a fuzzer, which automatically reports issues found through fuzzing DuckDB.
• Writing Tests
• sqllogictest
• Debugging
• Result Verification
• Persistent Testing
• Loops
• Multiple Connections
• Catch
sqllogictest
For testing plain SQL, we use an extended version of the SQL logic test suite, adopted from SQLite. Every test is a single self‑contained file lo‑
cated in the test/sql directory. To run tests located outside of the default test directory, specify --test-dir <root_directory>
and make sure provided test file paths are relative to that root directory.
The test describes a series of SQL statements, together with either the expected result, a statement ok indicator, or a statement
error indicator. An example of a test file is shown below:
# name: test/sql/projection/test_simple_projection.test
# group: [projection]
# create table
statement ok
CREATE TABLE a (i INTEGER, j INTEGER);

statement ok
INSERT INTO a VALUES (42, 84);
query II
SELECT * FROM a;
----
42 84
In this example, three statements are executed. The first two statements are expected to succeed (prefixed by statement ok). The third
statement is expected to return a single row with two columns (indicated by query II). The values of the row are expected to be 42 and
84 (separated by a tab character). For more information on query result verification, see the result verification section.
The top of every file should contain a comment describing the name and group of the test. The name of the test is always the relative file
path of the file. The group is the folder that the file is in. The name and group of the test are relevant because they can be used to execute
only that test or its group using the unittest program. For example, if we wanted to execute only the above test, we would run the command unittest
test/sql/projection/test_simple_projection.test. If we wanted to run all tests in a specific directory, we would run the
command unittest "[projection]".
Any tests that are placed in the test directory are automatically added to the test suite. Note that the extension of the test is significant.
The sqllogictests should either use the .test extension, or the .test_slow extension. The .test_slow extension indicates that the
test takes a while to run, and will only be run when all tests are explicitly run using unittest *. Tests with the extension .test will be
included in the fast set of tests.
Query Verification
Many simple tests start by enabling query verification. This can be done through the following PRAGMA statement:
statement ok
PRAGMA enable_verification
Query verification performs extra validation to ensure that the underlying code runs correctly. The most important part of that is that it
verifies that optimizers do not cause bugs in the query. It does this by running both an unoptimized and optimized version of the query,
and verifying that the results of these queries are identical.
Query verification is very useful because it not only discovers bugs in optimizers, but also finds bugs in e.g. join implementations. This
is because the unoptimized version will typically run using cross products instead. Because of this, query verification can be very slow to
do when working with larger data sets. It is therefore recommended to turn on query verification for all unit tests, except those involving
larger data sets (more than roughly 10-100 rows).
The sqllogictests are not exactly an industry standard, but several other systems have adopted them as well. Parsing sqllogictests is intentionally simple. All statements have to be separated by empty lines. For that reason, writing a syntax highlighter is not extremely difficult.
A syntax highlighter exists for Visual Studio Code. We have also made a fork that supports the DuckDB dialect of the sqllogictests. You can
use the fork by installing the original, then copying the syntaxes/sqllogictest.tmLanguage.json into the installed extension
(on macOS this is located in ~/.vscode/extensions/benesch.sqllogictest-0.1.1).
A syntax highlighter is also available for CLion. It can be installed directly on the IDE by searching SQLTest on the marketplace. A GitHub
repository is also available, with extensions and bug reports being welcome.
Temporary Files For some tests (e.g., CSV/Parquet file format tests) it is necessary to create temporary files. Any temporary files should
be created in the temporary testing directory. This directory can be used by placing the string __TEST_DIR__ in a query. This string will
be replaced by the path of the temporary testing directory.
statement ok
COPY csv_data TO '__TEST_DIR__/output_file.csv.gz' (COMPRESSION GZIP);
Require & Extensions To avoid bloating the core system, certain functionality of DuckDB is available only as an extension. Tests can be written for those extensions by adding a require field to the test. If the extension is not loaded, any statements that occur after the require field will be skipped. Examples of this are require parquet or require icu.
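For example, a minimal sketch of a test that depends on the Parquet extension (reusing a data file referenced elsewhere in these docs); if the extension is not available, everything after the require field is skipped:

require parquet

statement ok
SELECT * FROM "data/parquet-testing/arrow/alltypes_plain.parquet";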
Another usage is to limit a test to a specific vector size. For example, adding require vector_size 512 to a test will prevent the test from being run unless the vector size is greater than or equal to 512. This is useful because certain functionality is not supported for low vector sizes, but we run tests using a vector size of 2 in our CI.
sqllogictest ‑ Debugging
The purpose of the tests is to figure out when things break. Inevitably changes made to the system will cause one of the tests to fail, and
when that happens the test needs to be debugged.
First, it is always recommended to run in debug mode. This can be done by compiling the system using the command make debug.
Second, it is recommended to only run the test that breaks. This can be done by passing the filename of the breaking test to the test suite as a
command line parameter (e.g., build/debug/test/unittest test/sql/projection/test_simple_projection.test).
For more options on running a subset of the tests see the Triggering which tests to run section.
After that, a debugger can be attached to the program and the test can be debugged. In the sqllogictests it is normally difficult to break on a specific query; however, we have expanded the test suite so that a function called query_break is called with the line number (line) as a parameter for every query that is run. This allows you to put a conditional breakpoint on a specific query. For example, if we want to break on line number 43 of the test file, we can create the following breakpoint:
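Using gdb, for example:

break query_break if line==43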
You can also skip certain queries from executing by placing mode skip in the file, followed by an optional mode unskip. Any queries
between the two statements will not be executed.
Triggering Which Tests to Run When running the unittest program, by default all the fast tests are run. A specific test can be run by adding the name of the test as an argument; for the sqllogictests, this is the relative path to the test file. All tests in a given directory can be executed by providing the directory as a parameter in square brackets. All tests, including the slow tests, can be run with an asterisk, as in the sketch below.
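For example, assuming a debug build:

# run a single test
build/debug/test/unittest test/sql/projection/test_simple_projection.test

# run all tests in the 'projection' group
build/debug/test/unittest "[projection]"

# run all tests, including the slow tests
build/debug/test/unittest "*"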
We can run a subset of the tests using the --start-offset and --end-offset parameters:
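As a sketch (the offset values are illustrative):

# run tests 200 through 250
build/debug/test/unittest --start-offset=200 --end-offset=250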
The set of tests to run can also be loaded from a file containing one test name per line, and loaded using the -f command.
$ cat test.list
test/sql/join/full_outer/test_full_outer_join_issue_4252.test
test/sql/join/full_outer/full_outer_join_cache.test
test/sql/join/full_outer/test_full_outer_join.test
# run only the tests labeled in the file
$ build/debug/test/unittest -f test.list
sqllogictest ‑ Result Verification
The standard way of verifying results of queries is using the query statement, followed by the letter I repeated once for every column that is expected in the result. After the query, four dashes (----) are expected, followed by the result values separated by tabs. For example:
query II
SELECT 42, 84 UNION ALL SELECT 10, 20;
----
42 84
10 20
For legacy reasons the letters R and T are also accepted to denote columns.
Empty lines have special significance for the sqllogictest runner: they signify the end of the current statement or query. For that reason, empty strings and NULL values have special syntax that must be used in result verification. NULL values should use the string NULL, and empty strings should use the string (empty), e.g.:
query II
SELECT NULL, ''
----
NULL
(empty)
Error Verification
In order to signify that an error is expected, the statement error indicator can be used. The statement error also takes an
optional expected result ‑ which is interpreted as the expected error message. Similar to query, the expected error should be placed after
the four dashes (----) following the query. The test passes if the error message contains the text under statement error ‑ the entire
error message does not need to be provided. It is recommended that you only use a subset of the error message, so that the test does not
break unnecessarily if the formatting of error messages is changed.
statement error
SELECT * FROM non_existent_table;
----
Table with name non_existent_table does not exist!
Regex
In certain cases result values might be very large or complex, and we might only be interested in whether or not the result contains a snippet of text. In that case, we can use the <REGEX>: modifier followed by a regex. If the result value matches the regex, the test passes. This is primarily used for query plan analysis.
query II
EXPLAIN SELECT tbl.a FROM "data/parquet-testing/arrow/alltypes_plain.parquet" tbl(a) WHERE a=1 OR a=2
----
physical_plan <REGEX>:.*PARQUET_SCAN.*Filters: a=1 OR a=2.*
If we instead want the result not to contain a snippet of text, we can use the <!REGEX>: modifier.
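As a sketch of the syntax, reusing the query above (the operator name in the pattern is illustrative), the following test passes only if the physical plan does not match the given regex, i.e., if no HASH_JOIN operator appears in the plan of this single‑table query:

query II
EXPLAIN SELECT tbl.a FROM "data/parquet-testing/arrow/alltypes_plain.parquet" tbl(a) WHERE a=1 OR a=2
----
physical_plan <!REGEX>:.*HASH_JOIN.*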
File
As results can grow quite large, and we might want to reuse results over multiple files, it is also possible to read expected results from files using the <FILE> command. The expected result is read from the given file. By convention, the file path should be provided relative to the root of the GitHub repository.
query I
PRAGMA tpch(1)
----
<FILE>:extension/tpch/dbgen/answers/sf1/q01.csv
The result values of a query can be supplied either in row‑wise order, with the individual values separated by tabs, or in value‑wise order. In value‑wise order, the individual values of the query must appear in row, column order, each on an individual line. Consider the following example in both row‑wise and value‑wise order:
# row-wise
query II
SELECT 42, 84 UNION ALL SELECT 10, 20;
----
42 84
10 20
# value-wise
query II
SELECT 42, 84 UNION ALL SELECT 10, 20;
----
42
84
10
20
Besides direct result verification, the sqllogictest suite also has the option of using MD5 hashes for value comparisons. A test using hashes for result verification looks like this:
query I
SELECT g, string_agg(x,',') FROM strings GROUP BY g
----
200 values hashing to b8126ea73f21372cdb3f2dc483106a12
This approach is useful for reducing the size of tests when results have many output rows. However, it should be used sparingly, as hash
values make the tests more difficult to debug if they do break.
After it is ensured that the system outputs the correct result, hashes of the queries in a test file can be computed by adding mode output_hash to the test file. For example:
mode output_hash
query II
SELECT 42, 84 UNION ALL SELECT 10, 20;
----
42 84
10 20
The expected output hashes for every query in the test file will then be printed to the terminal, as follows:
================================================================================
SQL Query
SELECT 42, 84 UNION ALL SELECT 10, 20;
================================================================================
4 values hashing to 498c69da8f30c24da3bd5b322a2fd455
================================================================================
In a similar manner, mode output_result can be used in order to force the program to print the result to the terminal for every query
run in the test file.
Result Sorting
Queries can have an optional field that indicates that the result should be sorted in a specific manner. This field goes in the same location
as the connection label. Because of that, connection labels and result sorting cannot be mixed.
The possible values of this field are nosort, rowsort and valuesort. An example of how this might be used is given below:
query I rowsort
SELECT 'world' UNION ALL SELECT 'hello'
----
hello
world
In general, we prefer not to use this field and rely on ORDER BY in the query to generate deterministic query answers. However, existing
sqllogictests use this field extensively, hence it is important to know of its existence.
Query Labels
Another feature that can be used for result verification are query labels. These can be used to verify that different queries provide the
same result. This is useful for comparing queries that are logically equivalent, but formulated differently. Query labels are provided after
the connection label or sorting specifier.
Queries that have a query label do not need to have a result provided. Instead, the results of each of the queries with the same label are
compared to each other. For example, the following script verifies that the queries SELECT 42+1 and SELECT 44-1 provide the same
result:
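A minimal sketch (nosort is the sorting specifier, and r43 is an arbitrary label):

query I nosort r43
SELECT 42+1;
----

query I nosort r43
SELECT 44-1;
----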
sqllogictest ‑ Persistent Testing
By default, all tests are run in in‑memory mode (unless --force-storage is enabled). In certain cases, we want to force the usage of a persistent database. We can initiate a persistent database using the load command, and trigger a reload of the database using the restart command.
# load a persistent database (the file name is illustrative)
load __TEST_DIR__/persistent_storage.db

statement ok
CREATE TABLE test (a INTEGER);

statement ok
INSERT INTO test VALUES (11), (12), (13), (14), (15), (NULL)

# ...

restart

query I
SELECT * FROM test ORDER BY a
----
11
12
13
14
15
NULL
Note that by default the tests run with SET wal_autocheckpoint='0KB' ‑ meaning a checkpoint is triggered after every statement.
WAL tests typically run with the following settings to disable this behavior:
statement ok
PRAGMA disable_checkpoint_on_shutdown

statement ok
PRAGMA wal_autocheckpoint='1TB';
sqllogictest ‑ Loops
Loops can be used in sqllogictests when it is required to execute the same query many times but with slight modifications in constant
values. For example, suppose we want to fire off 100 queries that check for the presence of the values 0..100 in a table:
# create the table 'integers' with the values 0..100
statement ok
CREATE TABLE integers AS SELECT * FROM range(0, 100, 1) t1(i);

# verify individually that all values are present
loop i 0 100

query I
SELECT count(*) FROM integers WHERE i = ${i};
----
1

# end the loop (note that multiple statements can be part of a loop)
endloop

Similarly, foreach can be used to iterate over a set of values:
foreach partcode millennium century decade year quarter month day hour minute second millisecond microsecond epoch

query III
SELECT i, DATE_PART('${partcode}', i) AS p, DATE_PART(['${partcode}'], i) AS st
FROM intervals
WHERE p <> st['${partcode}'];
----

endloop
foreach also has a number of preset combinations that should be used when required. In this manner, when new combinations are
added to the preset, old tests will automatically pick up these new combinations.
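As a sketch, assuming a <signed> preset that expands to the signed integer types (the preset name here is only an example; the available presets are defined by the test runner):

foreach type <signed>

statement ok
SELECT 1::${type};

endloop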
Note. Use large loops sparingly. Executing hundreds of thousands of SQL statements will slow down tests unnecessarily. Do not
use loops for inserting data.
Loops should be used sparingly. While it might be tempting to use loops for inserting data using INSERT statements, this will considerably slow down the test cases. Instead, it is better to generate data using the built‑in range and repeat functions.
-- create the table integers with the values [0, 1, .., 98, 99]
CREATE TABLE integers AS SELECT * FROM range(0, 100, 1) t1(i);
Using these two functions, together with clever use of cross products and other expressions, many different types of datasets can be efficiently generated. The RANDOM() function can also be used to generate random data.
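For instance, a minimal sketch that generates 100 pseudo‑random integers (table and column names are illustrative):

-- create a table with 100 random integers in the range [0, 100)
CREATE TABLE random_integers AS
    SELECT floor(RANDOM() * 100)::INTEGER AS i FROM range(0, 100, 1) t1(x);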
An alternative option is to read data from an existing CSV or Parquet file. There are several large CSV files that can be loaded from the
directory test/sql/copy/csv/data/real using a COPY INTO statement or the read_csv_auto function.
The TPC‑H and TPC‑DS extensions can also be used to generate synthetic data, using e.g. CALL dbgen(sf=1) or CALL dsdgen(sf=1).
sqllogictest ‑ Multiple Connections
For tests whose purpose is to verify that the transactional management or versioning of data works correctly, it is generally necessary to use multiple connections. For example, if we want to verify that the creation of tables is correctly transactional, we might want to start a transaction and create a table in con1, then fire a query in con2 that checks that the table is not accessible until the transaction is committed.
We can use multiple connections in the sqllogictests using connection labels. The connection label can be optionally appended to
any statement or query. All queries with the same connection label will be executed in the same connection. A test that would verify
the above property would look as follows:
statement ok con1
BEGIN TRANSACTION

statement ok con1
CREATE TABLE integers (i INTEGER);

statement error con2
SELECT * FROM integers;
Concurrent Connections
Using connection modifiers on the statement and queries will result in testing of multiple connections, but all the queries will still be run
sequentially on a single thread. If we want to run code from multiple connections concurrently over multiple threads, we can use the
concurrentloop construct. The queries in concurrentloop will be run concurrently on separate threads at the same time.
concurrentloop i 0 10

statement ok
CREATE TEMP TABLE t2 AS (SELECT 1);

statement ok
INSERT INTO t2 VALUES (42);

statement ok
DELETE FROM t2

endloop
One caveat with concurrentloop is that results are often unpredictable: as multiple clients can hammer the database at the same time, we might end up with (expected) transaction conflicts. statement maybe can be used to deal with these situations. statement maybe accepts both a success and a failure with a specific error message.
concurrentloop i 1 10

statement maybe
CREATE OR REPLACE TABLE t2 AS (SELECT -54124033386577348004002656426531535114 FROM t2 LIMIT 70%);
----
write-write conflict

endloop
Catch Tests
While we prefer the sqllogictests for testing most functionality, for certain tests SQL alone is not sufficient. This typically happens when you want to test the C++ API. When using pure SQL is really not an option, it might be necessary to make a C++ test using the Catch framework.
Catch tests reside in the test directory as well. Here is an example of a Catch test that tests the storage of the system:
#include "catch.hpp"
#include "test_helpers.hpp"
The test uses the TEST_CASE wrapper to create each test. The database is created and queried using the C++ API. Results are checked using either REQUIRE_NO_FAIL/REQUIRE_FAIL (corresponding to statement ok and statement error) or REQUIRE(CHECK_COLUMN(...)) (corresponding to query with a result check). Every test that is created in this way needs to be added to the corresponding CMakeLists.txt.
Acknowledgments
This document is built with Pandoc using the Eisvogel template. The scripts to build the document are available in the DuckDB‑Web repository.
The emojis used in this document are provided by Twemoji under the CC‑BY 4.0 license.