What Are DBT Sources


dbt Cloud features

dbt Cloud is the fastest and most reliable way to deploy dbt. Develop, test,
schedule, document, and investigate data models all in one browser-based UI.

In addition to providing a hosted architecture for running dbt across your
organization, dbt Cloud comes equipped with turnkey support for scheduling
jobs, CI/CD, hosting documentation, monitoring and alerting, and an integrated
development environment (IDE). It also lets you develop and run dbt
commands from your local command line interface (CLI) or code editor.

dbt Cloud's flexible plans and features make it well-suited for data teams of any
size — sign up for your free 14-day trial!

dbt Cloud CLI

Use the dbt Cloud CLI to develop, test, run, and version control dbt projects and
commands in your dbt Cloud development environment. Collaborate with team
members, directly from the command line.

dbt Cloud IDE

The IDE is the easiest and most efficient way to develop dbt models, allowing
you to build, test, run, and version control your dbt projects directly from your
browser.

Manage environments

Set up and manage separate production and development environments in dbt
Cloud to help engineers develop and test code more efficiently, without
impacting users or data.
Schedule and run dbt jobs

Create custom schedules to run your production jobs. Schedule jobs by day of
the week, time of day, or a recurring interval. Decrease operating costs by using
webhooks to trigger CI jobs and the API to start jobs.

Notifications

Set up and customize job notifications in dbt Cloud to receive email or Slack
alerts when a job run succeeds, fails, or is cancelled. Notifications alert the right
people when something goes wrong instead of waiting for a user to report it.

Run visibility

View the history of your runs and the model timing dashboard to help identify
where improvements can be made to the scheduled jobs.

Host & share documentation

dbt Cloud hosts and authorizes access to dbt project documentation, allowing
you to generate data documentation on a schedule for your project. Invite
teammates to dbt Cloud to collaborate and share your project's documentation.

Supports GitHub, GitLab, and Azure DevOps

Seamlessly connect your git account to dbt Cloud and provide another layer of
security to dbt Cloud. Import new repositories, trigger continuous integration,
clone repos using HTTPS, and more!
Enable Continuous Integration

Configure dbt Cloud to run your dbt projects in a temporary schema when new
commits are pushed to open pull requests. This build-on-PR functionality is a
great way to catch bugs before deploying to production, and an essential tool in
any analyst's belt.

Security

Manage risk with SOC-2 compliance, CI/CD deployment, RBAC, and ELT
architecture.

dbt Semantic Layer*

Use the dbt Semantic Layer to define metrics alongside your dbt models and
query them from any integrated analytics tool. Get the same answers
everywhere, every time.

Discovery API*

Enhance your workflow and run ad-hoc queries, browse schema, or query the
dbt Semantic Layer. dbt Cloud serves a GraphQL API, which supports arbitrary
queries.

dbt Explorer*

Learn about dbt Explorer and how to interact with it to understand, improve,
and leverage your data pipelines.

Using defer in dbt Cloud


Defer is a powerful feature that allows developers to build, run, and test only
the models they've edited, without having to first run and build all the models that
come before them (upstream parents). dbt powers this by using a production
manifest for comparison, and resolves the {{ ref() }} function with upstream
production artifacts.

Both the dbt Cloud IDE and the dbt Cloud CLI enable users to natively defer to
production metadata directly in their development workflows.

By default, dbt follows these rules:

 dbt uses the production locations of parent models to resolve {{ ref() }} functions, based on metadata from the production environment.
 If a development version of a deferred model exists, dbt preferentially uses the development database location when resolving the reference.
 Passing the --favor-state flag overrides the default behavior and always resolves refs using production metadata, regardless of the presence of a development relation.

For a clean slate, it's a good practice to drop the development schema at the
start and end of your development cycle.
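
As a concrete sketch of how these defaults interact with the flags (assuming the dbt Cloud CLI, where deferral to production is enabled by default; the selection criteria are illustrative):

# Build only the models you've edited; unchanged parents resolve to production
dbt build --select state:modified

# Always resolve refs from production metadata, even if a dev relation exists
dbt build --select state:modified --favor-state

# Opt out of deferral for this one invocation
dbt build --select state:modified --no-defer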

Required setup

 You must select the Production environment checkbox in the Environment Settings page.
o This can be set for one deployment environment per dbt Cloud project.
 You must have a successful job run first.

When using defer, dbt Cloud compares artifacts from the most recent successful
production job, excluding CI jobs.

Defer in the dbt Cloud IDE

To enable defer in the dbt Cloud IDE, toggle the Defer to production button on
the command bar. Once enabled, dbt Cloud will:

1. Pull down the most recent manifest from the Production environment for
comparison
2. Pass the --defer flag to the command (for any command that accepts the
flag)

For example, if you were to start developing on a new branch with nothing in
your development schema, edit a single model, and run dbt build -s
state:modified — only the edited model would run. Any {{ ref() }} functions
will point to the production location of the referenced models.

Select the 'Defer to production' toggle on the bottom right of the command bar to enable defer in
the dbt Cloud IDE.

Defer in dbt Cloud CLI

One key difference between the dbt Cloud CLI and the dbt Cloud IDE is
that --defer is automatically enabled in the dbt Cloud CLI for all invocations,
which are compared with production artifacts. You can disable it with
the --no-defer flag.

The dbt Cloud CLI offers additional flexibility by letting you choose the source
environment for deferral artifacts. You can set a defer-env-id key in either
your dbt_project.yml or dbt_cloud.yml file. If you do not provide a defer-env-id
setting, the dbt Cloud CLI will use artifacts from your dbt Cloud environment
marked "Production".
dbt_cloud.yml

defer-env-id: '123456'

dbt_project.yml

dbt-cloud:
  defer-env-id: '123456'

Install dbt Cloud CLI


PUBLIC PREVIEW FUNCTIONALITY

The dbt Cloud CLI is currently in public preview. Share feedback or request
features you'd like to see on the dbt community Slack.

dbt Cloud natively supports developing using a command line (CLI),
empowering team members to contribute with enhanced flexibility and
collaboration. The dbt Cloud CLI allows you to run dbt commands against your
dbt Cloud development environment from your local command line.

dbt commands are run against dbt Cloud's infrastructure and benefit from:

 Secure credential storage in the dbt Cloud platform.
 Automatic deferral of build artifacts to your Cloud project's production environment.
 Speedier, lower-cost builds.
 Support for dbt Mesh (cross-project ref).
 Significant platform improvements, to be released over the coming months.

Prerequisites

The dbt Cloud CLI is available in all deployment regions and for both multi-
tenant and single-tenant accounts (Azure single-tenant not supported at this
time).

 Ensure you are using dbt version 1.5 or higher. Refer to dbt Cloud
versions to upgrade.
 Note that SSH tunneling for Postgres and Redshift connections doesn't
support the dbt Cloud CLI yet.

Install dbt Cloud CLI

You can install the dbt Cloud CLI on the command line by using one of these
methods.

View a video tutorial for a step-by-step guide to installation.

 macOS (brew)
 Windows (native executable)
 Linux (native executable)
 Existing dbt Core users (pip)
Before you begin, make sure you have Homebrew installed and available from
your command line terminal. Refer to the FAQs if your operating system runs into
path conflicts.

1. Verify that you don't already have dbt Core installed:


which dbt

o If you see a "dbt not found" message, you're good to go. If the dbt help text
appears, use pip uninstall dbt to remove dbt Core from your
system.

2. Install the dbt Cloud CLI with Homebrew:


o First, remove the dbt-labs tap, the separate repository for
packages, from Homebrew. This prevents Homebrew from
installing packages from that repository:
brew untap dbt-labs/dbt

o Then, add and install the dbt Cloud CLI as a package:


brew tap dbt-labs/dbt-cli
brew install dbt

If you have multiple taps, use brew install dbt-labs/dbt-cli/dbt.

3. Verify your installation by running dbt --help in the command line. If you
see the following output, your installation is correct:
The dbt Cloud CLI - an ELT tool for running SQL transformations and
data models in dbt Cloud...

If you don't see this output, check that you've deactivated pyenv or venv
and don't have a global dbt version installed.

o Note that you no longer need to run the dbt deps command when
your environment starts. This step was previously required during
initialization. However, you should still run dbt deps if you make
any changes to your packages.yml file.
4. Clone your repository to your local computer using git clone. For
example, to clone a GitHub repo using HTTPS format, run git clone
https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.
5. After cloning your repo, configure the dbt Cloud CLI for your dbt Cloud
project. This lets you run dbt commands like dbt environment show to
view your dbt Cloud configuration or dbt compile to compile your project
and validate models and tests. You can also add, edit, and synchronize
files with your repo.

Update dbt Cloud CLI

The following instructions explain how to update the dbt Cloud CLI to the latest
version depending on your operating system.

During the public preview period, we recommend updating before filing a bug
report. This is because the API is subject to breaking changes.

 macOS (brew)
 Windows (executable)
 Linux (executable)
 Existing dbt Core users (pip)

To update the dbt Cloud CLI, run brew update and then brew upgrade dbt.
Using VS Code extensions

Visual Studio (VS) Code extensions enhance command-line tools by adding extra
functionality. The dbt Cloud CLI is fully compatible with dbt Core; however, it
doesn't support some dbt Core APIs required by certain tools, such as VS
Code extensions.

You can use extensions like dbt-power-user with the dbt Cloud CLI by following
these steps:

 Install the dbt Cloud CLI using Homebrew, alongside dbt Core.
 Create an alias to run the dbt Cloud CLI as dbt-cloud.

This setup allows dbt-power-user to continue to work with dbt Core in the
background, alongside the dbt Cloud CLI. For more, check the dbt Power
User documentation.
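
For reference, a minimal sketch of such an alias (the binary path is an assumption; point it to wherever Homebrew installed the dbt Cloud CLI on your machine):

# ~/.zshrc or ~/.bashrc -- keep `dbt` pointing at dbt Core, run the Cloud CLI as `dbt-cloud`
alias dbt-cloud="/opt/homebrew/bin/dbt"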

FAQs
What's the difference between the dbt Cloud CLI and dbt Core?
How do I run both the dbt Cloud CLI and dbt Core?
How to create an alias?
Why am I receiving a `Session occupied` error?

Configure and use the dbt Cloud CLI


PUBLIC PREVIEW FUNCTIONALITY

The dbt Cloud CLI is currently in public preview. Share feedback or request
features you'd like to see on the dbt community Slack.
Prerequisites

 You must set up a project in dbt Cloud.
o Note — If you're using the dbt Cloud CLI, you can connect to your data platform directly in the dbt Cloud interface and don't need a profiles.yml file.
 You must have your personal development credentials set for that project. The dbt Cloud CLI will use these credentials, stored securely in dbt Cloud, to communicate with your data platform.
 You must be on dbt version 1.5 or higher. Refer to dbt Cloud versions to upgrade.

Configure the dbt Cloud CLI

Once you install the dbt Cloud CLI, you need to configure it to connect to a dbt
Cloud project.

1. Ensure you meet the prerequisites above.

2. Download your credentials from dbt Cloud by clicking on the Try the dbt
Cloud CLI banner on the dbt Cloud homepage. Alternatively, if you're in
dbt Cloud, you can download the credentials from the links provided
based on your region:
o North America: https://cloud.getdbt.com/cloud-cli
o EMEA: https://emea.dbt.com/cloud-cli
o APAC: https://au.dbt.com/cloud-cli
o North American Cell 1: https://ACCOUNT_PREFIX.us1.dbt.com/cloud-cli
o Single-tenant: https://YOUR_ACCESS_URL/cloud-cli

3. Follow the banner instructions and download the config file to:
o Mac or Linux: ~/.dbt/dbt_cloud.yml
o Windows: C:\Users\yourusername\.dbt\dbt_cloud.yml

The config file looks like this:

version: "1"
context:
  active-project: "<project id from the list below>"
  active-host: "<active host from the list>"
  defer-env-id: "<optional defer environment id>"
projects:
- project-id: "<project-id>"
  account-host: "<account-host>"
  api-key: "<user-api-key>"

- project-id: "<project-id>"
  account-host: "<account-host>"
  api-key: "<user-api-key>"

4. After downloading the config file, navigate to a dbt project in your terminal:
cd ~/dbt-projects/jaffle_shop

5. In your dbt_project.yml file, ensure you have or include a dbt-cloud section
with a project-id field. The project-id field contains the dbt Cloud project ID
you want to use.
# dbt_project.yml
name:
version:
# Your project configs...

dbt-cloud:
  project-id: PROJECT_ID

o To find your project ID, select Develop in the dbt Cloud navigation
menu. You can use the URL to find the project ID. For example,
in https://cloud.getdbt.com/develop/26228/projects/123456, the
project ID is 123456.

6. You should now be able to use the dbt Cloud CLI and run dbt
commands like dbt environment show to view your dbt Cloud
configuration details or dbt compile to compile models in your dbt
project.
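
For example, a quick sanity check after configuration might look like this (a sketch; output depends on your project):

dbt environment show   # view which dbt Cloud project and environment the CLI is using
dbt compile            # compile the project and validate models and tests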

With your repo cloned and the CLI configured, you can add, edit, and sync files with your repo.
Set environment variables

To set environment variables in the dbt Cloud CLI for your dbt project:

1. Select the gear icon on the upper right of the page.
2. Then select Profile Settings, then Credentials.
3. Click on your project and scroll to the Environment Variables section.
4. Click Edit on the lower right and then set the user-level environment
variables.
o Note, when setting up the dbt Semantic Layer, using environment
variables like {{env_var('DBT_WAREHOUSE')}} is not supported. You
should use the actual credentials instead.
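
Outside of that Semantic Layer limitation, environment variables set in dbt Cloud can be referenced in your project files with the env_var() function. A minimal, hypothetical sketch (DBT_TARGET_SCHEMA is an invented variable name and the fallback value is illustrative):

# dbt_project.yml (sketch)
models:
  my_project:
    +schema: "{{ env_var('DBT_TARGET_SCHEMA', 'analytics') }}"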

Use the dbt Cloud CLI

 The dbt Cloud CLI uses the same set of dbt commands and MetricFlow
commands as dbt Core to execute the commands you provide. For
example, use the dbt environment command to view your dbt Cloud
configuration details.
 It allows you to automatically defer build artifacts to your Cloud project's
production environment.
 It also supports project dependencies, which allows you to depend on
another project using the metadata service in dbt Cloud.
o Project dependencies instantly connect to and reference (or ref)
public models defined in other projects. You don't need to execute
or analyze these upstream models yourself. Instead, you treat them
as an API that returns a dataset.
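
As an illustration of what such a cross-project reference looks like in practice, here is a sketch (the project and model names are invented):

-- models/staging/stg_payments.sql (hypothetical)
-- Reference a public model defined in another dbt Cloud project via dbt Mesh
select * from {{ ref('finance_platform', 'fct_payments') }}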

USE THE --help FLAG

As a tip, most command-line tools have a --help flag to show available
commands and arguments. Use the --help flag with dbt in two ways:

 dbt --help: Lists the commands available for dbt
 dbt run --help: Lists the flags available for the run command

About the dbt Cloud IDE


The dbt Cloud integrated development environment (IDE) is a single web-based
interface for building, testing, running, and version-controlling dbt projects. It
compiles dbt code into SQL and executes it directly on your database.
The dbt Cloud IDE offers several keyboard shortcuts and editing features for
faster and more efficient data platform development and governance:

 Syntax highlighting for SQL: Makes it easy to distinguish different parts of your code, reducing syntax errors and enhancing readability.
 Auto-completion: Suggests table names, arguments, and column names as you type, saving time and reducing typos.
 Code formatting and linting: Help standardize and fix your SQL code effortlessly.
 Navigation tools: Easily move around your code, jump to specific lines, find and replace text, and navigate between project files.
 Version control: Manage code versions with a few clicks.

These features create a powerful editing environment for efficient SQL coding,
suitable for both experienced and beginner developers.

The dbt Cloud IDE includes version control, files/folders, an editor, a command/console, and more.
Enable dark mode for a great viewing experience in low-light environments.

DISABLE AD BLOCKERS

To improve your experience using dbt Cloud, we suggest that you turn off ad
blockers. This is because some project file names, such as google_adwords.sql,
might resemble ad traffic and trigger ad blockers.
Prerequisites

 A dbt Cloud account and Developer seat license
 A git repository set up, with write access enabled in your git provider. See Connecting your GitHub Account or Importing a project by git URL for detailed setup instructions
 A dbt project connected to a data platform
 A development environment and development credentials set up
 The environment must be on dbt version 1.0 or higher

dbt Cloud IDE features

The dbt Cloud IDE comes with features that make it easier for you to develop,
build, compile, run, and test data models.
To understand how to navigate the IDE and its user interface elements, refer to
the IDE user interface page.

Feature — Info

Keyboard shortcuts — You can access a variety of commands and actions in the IDE by choosing the appropriate keyboard shortcut. Use the shortcuts for common tasks like building modified models or resuming builds from the last failure.

File state indicators — Ability to see when changes or actions have been made to the file. The indicators M, D, A, and • appear to the right of your file or folder name and indicate the actions performed:
- Unsaved (•) — The IDE detects unsaved changes to your file/folder
- Modification (M) — The IDE detects a modification of existing files/folders
- Added (A) — The IDE detects added files
- Deleted (D) — The IDE detects deleted files

IDE version control — The IDE version control section and git button allow you to apply the concept of version control to your project directly in the IDE.
- Create or change branches
- Commit or revert individual files by right-clicking the edited file
- Resolve merge conflicts
- Execute git commands using the git button
- Link to the repo directly by clicking the branch name

Project documentation — Generate and view documentation for your dbt project in real time. You can inspect and verify what your project's documentation will look like before you deploy your changes to production.

Preview and Compile button — You can compile or preview code, a snippet of dbt code, or one of your dbt models after editing and saving.

Build, test, and run button — Build, test, and run your project with a button click or by using the Cloud IDE command bar.

Command bar — You can enter and run commands from the command bar at the bottom of the IDE. Use the rich model selection syntax to execute dbt commands directly within dbt Cloud. You can also view the history, status, and logs of previous runs by clicking History on the left of the bar.

Drag and drop — Drag and drop files located in the file explorer, and use the file breadcrumb on the top of the IDE for quick, linear navigation. Access adjacent files in the same directory by right-clicking on the breadcrumb file.

Organize tabs and files —
- Move your tabs around to reorganize your work in the IDE
- Right-click on a tab to view and select a list of actions, including duplicating files
- Close multiple, unsaved tabs to batch save your work
- Double-click files to rename them

Find and replace —
- Press Command-F or Control-F to open the find-and-replace bar in the upper right corner of the current file in the IDE. The IDE highlights your search results in the current file and code outline.
- You can use the up and down arrows to see the match highlighted in the current file when there are multiple matches.
- Use the left arrow to replace the text with something else.

Multiple selections — You can make multiple selections for small and simultaneous edits. The commands below are a common way to add more cursors and allow you to insert cursors below or above with ease:
- Option-Command-Down arrow or Ctrl-Alt-Down arrow
- Option-Command-Up arrow or Ctrl-Alt-Up arrow
- Press Option and click on an area, or press Ctrl-Alt and click on an area

Lint and Format — Lint and format your files with a click of a button, powered by SQLFluff, sqlfmt, Prettier, and Black.

Git diff view — Ability to see what has been changed in a file before you make a pull request.

dbt autocomplete — New autocomplete features to help you develop faster:
- Use ref to autocomplete your model names
- Use source to autocomplete your source name + table name
- Use macro to autocomplete your arguments
- Use env var to autocomplete env vars
- Start typing a hyphen (-) to use in-line autocomplete in a YAML file

DAG in the IDE — You can see how models are used as building blocks from left to right to transform your data from raw sources into cleaned-up modular derived pieces and final outputs on the far right of the DAG. The default view is 2+model+2 (it displays 2 nodes away), however, you can change it to +model+ (full DAG). Note the --exclude flag isn't supported.

Status bar — This area provides you with useful information about your IDE and project status. You also have additional options like enabling light or dark mode, restarting the IDE, or recloning your repo.

Dark mode — From the status bar in the Cloud IDE, enable dark mode for a great viewing experience in low-light environments.

Start-up process

There are three start-up states when using or launching the Cloud IDE:

 Creation start — This is the state where you are starting the IDE for the
first time. You can also view this as a cold start (see below), and you can
expect this state to take longer because the git repository is being cloned.
 Cold start — This is the process of starting a new develop session, which
will be available for you for three hours. The environment automatically
turns off three hours after the last activity. This includes compile, preview,
or any dbt invocation, however, it does not include editing and saving a
file.
 Hot start — This is the state of resuming an existing or active develop
session within three hours of the last activity.

Work retention

The Cloud IDE needs explicit action to save your changes. There are three ways
your work is stored:

 Unsaved, local code — The browser stores your code only in its local
storage. In this state, you might need to commit any unsaved changes in
order to switch branches or browsers. If you have saved and committed
changes, you can access the "Change branch" option even if there are
unsaved changes. But if you attempt to switch branches without saving
changes, a warning message will appear, notifying you that you will lose
any unsaved changes.

If you attempt to switch branches without saving changes, a warning message will appear,
telling you that you will lose your changes.

 Saved but uncommitted code — When you save a file, the data gets
stored in durable, long-term storage, but isn't synced back to git. To
switch branches using the Change branch option, you must "Commit and
sync" or "Revert" changes. Changing branches isn't available for saved-
but-uncommitted code. This is to ensure your uncommitted changes
don't get lost.
 Committed code — This is stored in the branch with your git provider and
you can check out other (remote) branches.

Access the Cloud IDE


DISABLE AD BLOCKERS

To improve your experience using dbt Cloud, we suggest that you turn off ad
blockers. This is because some project file names, such as google_adwords.sql,
might resemble ad traffic and trigger ad blockers.

In order to start experiencing the great features of the Cloud IDE, you need to
first set up a dbt Cloud development environment. In the following steps, we
outline how to set up developer credentials and access the IDE. If you're
creating a new project, you will automatically configure this during the project
setup.

The IDE uses developer credentials to connect to your data platform. These
developer credentials should be specific to your user and they should not be
super user credentials or the same credentials that you use for your production
deployment of dbt.

Set up your developer credentials:

1. Navigate to your Credentials under Your Profile settings, which you can
access at https://YOUR_ACCESS_URL/settings/profile#credentials,
replacing YOUR_ACCESS_URL with the appropriate Access URL for your
region and plan.
2. Select the relevant project in the list.
3. Click Edit on the bottom right of the page.
4. Enter the details under Development Credentials.
5. Click Save.
Configure developer credentials in your Profile

6. Access the Cloud IDE by clicking Develop at the top of the page.
7. Initialize your project and familiarize yourself with the IDE and its
delightful features.

Nice job, you're ready to start developing and building models 🎉!

Build, compile, and run projects

You can build, compile, run, and test dbt projects using the command bar
or Build button. Use the Build button to quickly build, run, or test the model
you're working on. The Cloud IDE will update in real-time when you run models,
tests, seeds, and operations.

If a model or test fails, dbt Cloud makes it easy for you to view and download
the run logs for your dbt invocations to fix the issue.

Use dbt's rich model selection syntax to run dbt commands directly within dbt
Cloud.
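
A few common selection patterns, as a sketch (the model and tag names are placeholders):

dbt run --select my_model          # run a single model
dbt run --select +my_model         # the model plus its upstream parents
dbt build --select my_model+       # the model plus its downstream children
dbt test --select tag:nightly      # run everything tagged "nightly"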
Preview, compile, or build your dbt project. Use the lineage tab to see your DAG.

Build and view your project's docs

The dbt Cloud IDE makes it possible to build and view documentation for your
dbt project while your code is still in development. With this workflow, you can
inspect and verify what your project's generated documentation will look like
before your changes are released to production.

Related docs

 How we style our dbt projects
 User interface
 Version control basics
 dbt Commands

Related questions
How can I fix my .gitignore file?

A .gitignore file specifies which files git should intentionally ignore or 'untrack'.
dbt Cloud indicates untracked files in the project file explorer pane by putting
the file or folder name in italics.

If you encounter issues like problems reverting changes, checking out or
creating a new branch, or not being prompted to open a pull request after a
commit in the dbt Cloud IDE — this usually indicates a problem with
the .gitignore file. The file may be missing or lacks the required entries for dbt
Cloud to work correctly.

Fix in the dbt Cloud IDE

To resolve issues with your gitignore file, keep in mind that adding the correct
entries won't automatically remove (or 'untrack') files or folders that have already
been tracked by git. The updated gitignore will only prevent new files or folders from
being tracked. So you'll need to first fix the gitignore file, then perform some
additional git operations to untrack any incorrect files or folders.

1. Launch the Cloud IDE into the project that is being fixed, by
selecting Develop on the menu bar.
2. In your File Explorer, check to see if a .gitignore file exists at the root of
your dbt project folder. If it doesn't exist, create a new file.
3. Open the new or existing gitignore file, and add the following:

# ✅ Correct
target/
dbt_packages/
logs/
# legacy -- renamed to dbt_packages in dbt v1
dbt_modules/

 Note — You can place these lines anywhere in the file, as long as they're
on separate lines. The lines shown are wildcards that will include all
nested files and folders. Avoid adding a trailing '*' to the lines, such
as target/*.

For more info on gitignore syntax, refer to the Git docs.

4. Save the changes but don't commit.

5. Restart the IDE by clicking on the three dots next to the IDE Status
button on the lower right corner of the IDE screen and select Restart IDE.

Restart the IDE by clicking the three dots on the lower right or click on the Status bar

6. Once the IDE restarts, go to the File Explorer to delete the following files
or folders (if they exist). No data will be lost:

o target, dbt_modules, dbt_packages, logs

7. Save and then Commit and sync the changes.

8. Restart the IDE again using the same procedure as step 5.

9. Once the IDE restarts, use the Create a pull request (PR) button under
the Version Control menu to start the process of integrating the changes.

10.When the git provider's website opens to a page with the new PR, follow
the necessary steps to complete and merge the PR into the main branch
of that repository.

o Note — The 'main' branch might also be called 'master', 'dev', 'qa',
'prod', or something else depending on the organizational naming
conventions. The goal is to merge these changes into the root
branch that all other development branches are created from.

11.Return to the dbt Cloud IDE and use the Change Branch button, to switch
to the main branch of the project.

12.Once the branch has changed, click the Pull from remote button to pull
in all the changes.
13.Verify the changes by making sure the files/folders in the .gitignore file
are in italics.

A dbt project on the main branch that has properly configured gitignore folders (highlighted in
italics).

Fix in the git provider


Sometimes it's necessary to use the git provider's web interface to fix a
broken .gitignore file. Although the specific steps may vary across providers,
the general process remains the same.

There are two options for this approach: editing the main branch directly if
allowed, or creating a pull request to implement the changes if required:

 Edit in main branch

 Unable to edit main branch

When permissions allow it, it's possible to edit the `.gitignore` directly on the
main branch of your repo. Here are the following steps:

1. Go to your repository's web interface.
2. Switch to the main branch and the root directory of your dbt project.
3. Find the .gitignore file. Create a blank one if it doesn't exist.
4. Edit the file in the web interface, adding the following entries:
target/
dbt_packages/
logs/
# legacy -- renamed to dbt_packages in dbt v1
dbt_modules/

5. Commit (save) the file.
6. Delete the following folders from the dbt project root, if they exist. No
data or code will be lost:
o target, dbt_modules, dbt_packages, logs
7. Commit (save) the deletions to the main branch.
8. Switch to the dbt Cloud IDE, and open the project that you're fixing.
9. Reclone your repo in the IDE by clicking on the three dots next to the IDE
Status button on the lower right corner of the IDE screen, then
select Reclone Repo.
o Note — Any saved but uncommitted changes will be lost, so make
sure you copy any modified code that you want to keep in a
temporary location outside of dbt Cloud.
10.Once you reclone the repo, open the .gitignore file in the branch you're
working in. If the new changes aren't included, you'll need to merge the
latest commits from the main branch into your working branch.
11.Go to the File Explorer to verify the .gitignore file contains the correct
entries and make sure the untracked files/folders in the .gitignore file are
in italics.
12.Great job 🎉! You've configured the .gitignore correctly and can continue
with your development!

For more info, refer to this detailed video for additional guidance.
Is there a cost to using the Cloud IDE?
Not at all! You can use dbt Cloud when you sign up for the Free Developer plan,
which comes with one developer seat. If you’d like to access more features or
have more developer seats, you can upgrade your account to the Team or
Enterprise plan.

Refer to dbt pricing plans for more details.


Can I be a contributor to dbt Cloud?
As a proprietary product, dbt Cloud's source code isn't available for community
contributions. If you want to build something in the dbt ecosystem, we
encourage you to review [this article](/community/contributing/contributing-coding)
about contributing to a dbt package, a plugin, dbt-core, or this
documentation site. Participation in open source is a great way to level yourself
up as a developer and give back to the community.
What is the difference between developing on the dbt Cloud IDE, the dbt Cloud
CLI, and dbt Core?
You can develop dbt using the web-based IDE in dbt Cloud or on the command
line interface using the dbt Cloud CLI or open-source dbt Core, all of which
enable you to execute dbt commands. The key distinction between the dbt
Cloud CLI and dbt Core is the dbt Cloud CLI is tailored for dbt Cloud's
infrastructure and integrates with all its features:

 dbt Cloud IDE: dbt Cloud is a web-based application that allows you to
develop dbt projects with the IDE, includes a purpose-built scheduler,
and provides an easier way to share your dbt documentation with your
team. The IDE is a faster and more reliable way to deploy your dbt models
and provides a real-time editing and execution environment for your dbt
project.

 dbt Cloud CLI: The dbt Cloud CLI allows you to run dbt commands against
your dbt Cloud development environment from your local command line
or code editor. It supports cross-project ref, speedier, lower-cost builds,
automatic deferral of build artifacts, and more.

 dbt Core: dbt Core is open-source software that's freely available. You
can build your dbt project in a code editor and run dbt commands from
the command line.

IDE user interface


The dbt Cloud IDE is a tool for developers to effortlessly build, test, run, and version-
control their dbt projects, and enhance data governance — all from the convenience of
your browser. Use the Cloud IDE to compile dbt code into SQL and run it against your
database directly -- no command line required!

This page offers comprehensive definitions and terminology of user interface elements,
allowing you to navigate the IDE landscape with ease.

The Cloud IDE layout includes version control on the upper left, files/folders on the left, editor on
the right, and command/console at the bottom

Basic layout

The IDE streamlines your workflow, and features a popular user interface layout with files
and folders on the left, editor on the right, and command and console information at the
bottom.
The Git repo link, documentation site
button, Version Control menu, and File Explorer
1. Git repository link — Clicking the Git repository link, located on the upper left of
the IDE, takes you to your repository on the same active branch.
o Note: This feature is only available for GitHub or GitLab repositories on multi-
tenant dbt Cloud accounts.

2. Documentation site button — Clicking the Documentation site book icon, located
next to the Git repository link, leads to the dbt Documentation site. The site is
powered by the latest dbt artifacts generated in the IDE using the dbt docs
generate command from the Command bar.
3. Version Control — The IDE's powerful Version Control section contains all git-
related elements, including the Git actions button and the Changes section.
4. File Explorer — The File Explorer shows the filetree of your repository. You can:
o Click on any file in the filetree to open the file in the File Editor.
o Click and drag files between directories to move files.
o Right-click a file to access the sub-menu options like duplicate file, copy file name,
copy as ref, rename, delete.
o Note: To perform these actions, the user must not be in read-only mode, which
generally happens when the user is viewing the default branch.
o Use file indicators, located to the right of your files or folder name, to see when
changes or actions were made:
 Unsaved (•) — The IDE detects unsaved changes to your file/folder
 Modification (M) — The IDE detects a modification of existing files/folders
 Added (A) — The IDE detects added files
 Deleted (D) — The IDE detects deleted files.

Use the Command bar to write dbt commands, toggle 'Defer', and view the current IDE status

5. Command bar — The Command bar, located in the lower left of the IDE, is used to
invoke dbt commands. When a command is invoked, the associated logs are
shown in the Invocation History Drawer.
6. Defer to production — The Defer to production toggle allows developers to
build, run, and test only the models they've edited, without having to first run and build
all the models that come before them (upstream parents). Refer to Using defer in
dbt Cloud for more info.
7. Status button — The IDE Status button, located on the lower right of the IDE,
displays the current IDE status. If there is an error in the status or in the dbt code
that stops the project from parsing, the button will turn red and display "Error". If
there aren't any errors, the button will display a green "Ready" status. To access
the IDE Status modal, simply click on this button.
Editing features

The IDE features some delightful tools and layouts to make it easier for you to write dbt
code and collaborate with teammates.

Use the file editor, version control section, and save button during your development workflow

1. File Editor — The File Editor is where users edit code. Tabs break out the region for
each opened file, and unsaved files are marked with a blue dot icon in the tab view.
o Use intuitive keyboard shortcuts to help develop easier for you and your team.

2. Save button — The editor has a Save button that saves editable files. Pressing the
button or using the Command-S or Control-S shortcut saves the file contents. You
don't need to save to preview code results in the Console section, but it's
necessary before changes appear in a dbt invocation. The File Editor tab shows a
blue icon for unsaved changes.
3. Version Control — This menu contains all git-related elements, including the Git
actions button. The button updates relevant actions based on your editor's state,
such as prompting to pull remote changes, commit and sync when reverted
commit changes are present, or creating a merge/pull request when appropriate.
o The dropdown menu on the Git actions button allows users to revert changes,
refresh Git state, create merge/pull requests, and change branches.
 Keep in mind that although you can't delete local branches in the IDE using
this menu, you can reclone your repository, which deletes your local
branches and refreshes with the current remote branches, effectively
removing the deleted ones.
o You can also resolve merge conflicts and for more info on git, refer to Version
control basics.
o Version Control Options menu — The Changes section, under the Git actions
button, lists all file changes since the last commit. You can click on a change to
open the Git Diff View to see the inline changes. You can also right-click any file and
use the file-specific options in the Version Control Options menu.

Right-click edited files to access Version Control Options menu


Additional editing features

 Minimap — A Minimap (code outline) gives you a high-level overview of your
source code, which is useful for quick navigation and code understanding. A file's
minimap is displayed on the upper-right side of the editor. To quickly jump to
different sections of your file, click the shaded area.
Use the Minimap for quick navigation and code understanding

 dbt Editor Command Palette — The dbt Editor Command Palette displays text
editing actions and their associated keyboard shortcuts. This can be accessed by
pressing F1 or right-clicking in the text editing area and selecting Command
Palette.

Click F1 to access the dbt Editor Command Palette menu for editor shortcuts

 Git Diff View — Clicking on a file in the Changes section of the Version Control
Menu will open the changed file with Git Diff view. The editor will show the
previous version on the left and the in-line changes made on the right.
The Git Diff View displays the previous version on the left and the changes made on the
right of the Editor

 Markdown Preview console tab — The Markdown Preview console tab shows a
preview of your .md file's markdown code in your repository and updates it
automatically as you edit your code.

The Markdown Preview console tab renders markdown code below the Editor tab.

 CSV Preview console tab — The CSV Preview console tab displays the data from
your CSV file in a table, which updates automatically as you edit the file in your
seed directory.
View csv code in the CSV Preview console tab below the Editor tab.

Console section
The console section, located below the File editor, includes various console tabs and
buttons to help you with tasks such as previewing, compiling, building, and viewing
the DAG. Refer to the following sub-bullets for more details on the console tabs and
buttons.

The Console section is located below the File editor and has various tabs and buttons to help
execute tasks
1. Preview button — When you click on the Preview button, it runs the SQL in the active file
editor regardless of whether you have saved it or not and sends the results to
the Results console tab. You can preview a selected portion of saved or unsaved code by
highlighting it and then clicking the Preview button.

Row limits in IDE

2. Compile button — The Compile button compiles the saved or unsaved SQL code and
displays it in the Compiled Code tab.

Starting from dbt v1.6 or higher, when you save changes to a model, you can compile its
code with the model's specific context. This context is similar to what you'd have when
building the model and involves useful context variables
like {{ this }} or {{ is_incremental() }}.

3. Build button — The build button allows users to quickly access dbt commands
related to the active model in the File Editor. The available commands include dbt
build, dbt test, and dbt run, with options to include only the current resource, the
resource and its upstream dependencies, the resource, and its downstream
dependencies, or the resource with all dependencies. This menu is available for all
executable nodes.
4. Format button — The editor has a Format button that can reformat the contents
of your files. For SQL files, it uses either sqlfmt or sqlfluff, and for Python files, it
uses black.
5. Results tab — The Results console tab displays the most recent Preview results in
tabular format.

Preview results show up in the Results console tab

6. Compiled Code tab — The Compile button triggers a compile invocation that
generates compiled code, which is displayed in the Compiled Code tab.
Compile results show up in the Compiled Code tab

7. Lineage tab — The Lineage tab in the File Editor displays the active model's
lineage or DAG. By default, it shows two degrees of lineage in both directions
(2+model_name+2), however, you can change it to +model+ (full DAG).
o Double-click a node in the DAG to open that file in a new tab
o Expand or shrink the DAG using node selection syntax.
o Note, the --exclude flag isn't supported.

View resource lineage in the Lineage tab


Invocation history

The Invocation History Drawer stores information on dbt invocations in the IDE. When you
invoke a command, like executing a dbt command such as dbt run, the associated logs
are displayed in the Invocation History Drawer.

You can open the drawer in multiple ways:

 Clicking the ^ icon next to the Command bar on the lower left of the page
 Typing a dbt command and pressing enter
 Or pressing Control-backtick (or Ctrl + `)
The Invocation History Drawer returns a log and detail of all your dbt Cloud invocations.

1. Invocation History list — The left-hand panel of the Invocation History Drawer
displays a list of previous invocations in the IDE, including the command, branch
name, command status, and elapsed time.
2. Invocation Summary — The Invocation Summary, located above System Logs,
displays information about a selected command from the Invocation History list,
such as the command, its status (Running if it's still running), the git branch that
was active during the command, and the time the command was invoked.
3. System Logs toggle — The System Logs toggle, located under the Invocation
Summary, allows the user to see the full stdout and debug logs for the entirety of
the invoked command.
4. Command Control button — Use the Command Control button, located on the
right side, to control your invocation and cancel or rerun a selected run.

The Invocation History list displays a list of previous invocations in the IDE

5. Node Summary tab — Clicking on the Results Status Tabs will filter the Node
Status List based on their corresponding status. The available statuses are Pass
(successful invocation of a node), Warn (test executed with a warning), Error
(database error or test failure), Skip (nodes not run due to upstream error), and
Queued (nodes that have not executed yet).
6. Node result toggle — After running a dbt command, information about each
executed node can be found in a Node Result toggle, which includes a summary
and debug logs. The Node Results List lists every node that was invoked during the
command.
7. Node result list — The Node result list shows all the Node Results used in the dbt
run, and you can filter it by clicking on a Result Status tab.

Modals and Menus

Use menus and modals to interact with IDE and access useful options to help your
development workflow.

 Editor tab menu — To interact with open editor tabs, right-click any tab to access
the helpful options in the file tab menu.

Right-click a tab to view the Editor tab menu options

 File Search — You can easily search for and navigate between files using the File
Navigation menu, which can be accessed by pressing Command-O or Control-O or

clicking on the 🔍 icon in the File Explorer.


The Command History returns a log and detail of all your dbt Cloud invocations.

 Global Command Palette — The Global Command Palette provides helpful
shortcuts to interact with the IDE, such as git actions, specialized dbt commands,
and compile and preview actions, among others. To open the menu, use
Command-P or Control-P.

The Command History returns a log and detail of all your dbt Cloud invocations.

 IDE Status modal — The IDE Status modal shows the current error message and
debug logs for the server. This also contains an option to restart the IDE. Open this
by clicking on the IDE Status button.
The Command History returns a log and detail of all your dbt Cloud invocations.

 Commit Changes modal — The Commit Changes modal is accessible via the Git
Actions button to commit all changes or via the Version Control Options menu to
commit individual changes. Once you enter a commit message, you can use the
modal to commit and sync the selected changes.

The Commit Changes modal is how users commit changes to their branch.

 Change Branch modal — The Change Branch modal allows users to switch git
branches in the IDE. It can be accessed through the Change Branch link or the Git
Actions button in the Version Control menu.
The Commit Changes modal is how users change their branch.

 Revert Uncommitted Changes modal — The Revert Uncommitted Changes modal
is how users revert changes in the IDE. This is accessible via the Revert
File option above the Version Control Options menu, or via the Git Actions button
when there are saved, uncommitted changes in the IDE.

The Commit Changes modal is how users change their branch.

 IDE Options menu — The IDE Options menu can be accessed by clicking on the
three-dot menu located at the bottom right corner of the IDE. This menu contains
global options such as:
o Toggling between dark or light mode for a better viewing experience
o Restarting the IDE
o Fully recloning your repository to refresh your git state and view status details
o Viewing status details, including the IDE Status modal.
Access the IDE Options menu to switch to dark or light mode, restart the IDE, reclone your repo, or
view the IDE status

Lint and format your code


Enhance your development workflow by integrating with popular linters and
formatters like SQLFluff, sqlfmt, Black, and Prettier. Leverage these powerful
tools directly in the dbt Cloud IDE without interrupting your development flow.

What are linters and formatters?

In the dbt Cloud IDE, you can perform linting, auto-fix, and formatting on five
different file types:

 SQL — Lint and fix with SQLFluff, and format with sqlfmt
 YAML, Markdown, and JSON — Format with Prettier
 Python — Format with Black

Each file type has its own unique linting and formatting rules. You
can customize the linting process to add more flexibility and enhance problem
and style detection.

By default, the IDE uses sqlfmt rules to format your code, making it convenient
to use right away. However, if you have a file named .sqlfluff in the root
directory of your dbt project, the IDE will default to SQLFluff rules instead.
Use SQLFluff to lint/format your SQL code, and view code errors in the Code Quality tab.

Use sqlfmt to format your SQL code.


Format YAML, Markdown, and JSON files using Prettier.

Use the Config button to select your tool.


Customize linting by configuring your own linting code rules, including dbtonic linting/styling.

Lint

With the dbt Cloud IDE, you can seamlessly use SQLFluff, a configurable SQL
linter, to warn you of complex functions, syntax, formatting, and compilation
errors. This integration allows you to run checks, fix, and display any code errors
directly within the Cloud IDE:

 Works with Jinja and SQL.
 Comes with built-in linting rules. You can also customize your own linting rules.
 Empowers you to enable linting with options like Lint (displays linting errors and recommends actions) or Fix (auto-fixes errors in the IDE).
 Displays a Code Quality tab to view code errors, and provides code quality visibility and management.

EPHEMERAL MODELS NOT SUPPORTED

Linting doesn't support ephemeral models in dbt v1.5 and lower. Refer to
the FAQs for more info.

Enable linting

1. To enable linting, make sure you're on a development branch. Linting
isn't available on main or read-only branches.
2. Open a .sql file and click the Code Quality tab.
3. Click on the </> Config button on the bottom right side of the console
section, below the File editor.
4. In the code quality tool config pop-up, you have the option to
select sqlfluff or sqlfmt.
5. To lint your code, select the sqlfluff radio button. (Use sqlfmt
to format your code)
6. Once you've selected the sqlfluff radio button, go back to the console
section (below the File editor) to select the Lint or Fix dropdown button:
o Lint button — Displays linting issues in the IDE as wavy underlines
in the File editor. You can hover over an underlined issue to display
the details and actions, including a Quick Fix option to fix all or
specific issues. After linting, you'll see a message confirming the
outcome. Linting doesn't rerun after saving. Click Lint again to
rerun linting.
o Fix button — Automatically fixes linting errors in the File editor.
When fixing is complete, you'll see a message confirming the
outcome.
o Use the Code Quality tab to view and debug any code errors.

Use the Lint or Fix button in the console section to lint or auto-fix your code.

Customize linting

SQLFluff is a configurable SQL linter, which means you can configure your own
linting rules instead of using the default linting settings in the IDE. You can
exclude files and directories by using a standard .sqlfluffignore file. Learn
more about the syntax in the .sqlfluffignore syntax docs.

To configure your own linting rules:

1. Create a new file in the root project directory (the parent or top-level
directory for your files). Note: The root project directory is the directory
where your dbt_project.yml file resides.
2. Name the file .sqlfluff (make sure you add the . before sqlfluff).
3. Create and add your custom config code.
4. Save and commit your changes.
5. Restart the IDE.
6. Test it out and happy linting!
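
As a starting point, a minimal .sqlfluff sketch (the dialect and rule choices here are assumptions; swap in your warehouse's dialect and your own style rules):

[sqlfluff]
dialect = snowflake
templater = dbt
max_line_length = 80

[sqlfluff:indentation]
tab_space_size = 4

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = lower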

CONFIGURE DBTONIC LINTING RULES

Refer to the SQLFluff config file to add the dbt code (or dbtonic) rules we use for
our own projects:

dbtonic config code example provided by dbt Labs

For more info on styling best practices, refer to How we style our SQL.

Customize linting by configuring your own linting code rules, including dbtonic linting/styling.

Format

In the dbt Cloud IDE, you can format your code to match style guides with a click
of a button. The IDE integrates with formatters like sqlfmt, Prettier, and Black to
automatically format code on five different file types — SQL, YAML, Markdown,
Python, and JSON:

 SQL — Format with sqlfmt, which provides one way to format your dbt
SQL and Jinja.
 YAML, Markdown, and JSON — Format with Prettier.
 Python — Format with Black.

The Cloud IDE formatting integrations take care of manual tasks like code
formatting, enabling you to focus on creating quality data models,
collaborating, and driving impactful results.

Format SQL

To format your SQL code, dbt Cloud integrates with sqlfmt, which is an
uncompromising SQL query formatter that provides one way to format the SQL
query and Jinja.

By default, the IDE uses sqlfmt rules to format your code, making
the Format button available and convenient to use immediately. However, if
you have a file named .sqlfluff in the root directory of your dbt project, the IDE
will default to SQLFluff rules instead.

To enable sqlfmt:

1. Make sure you're on a development branch. Formatting isn't available on
main or read-only branches.
2. Open a .sql file and click on the Code Quality tab.
3. Click on the </> Config button on the right side of the console.
4. In the code quality tool config pop-up, you have the option to select
sqlfluff or sqlfmt.
5. To format your code, select the sqlfmt radio button. (Use sqlfluff
to lint your code).
6. Once you've selected the sqlfmt radio button, go to the console section
(located below the File editor) to select the Format button.
7. The Format button auto-formats your code in the File editor. Once
you've auto-formatted, you'll see a message confirming the outcome.
Use sqlfmt to format your SQL code.

Format YAML, Markdown, JSON

To format your YAML, Markdown, or JSON code, dbt Cloud integrates
with Prettier, which is an opinionated code formatter.

1. To enable formatting, make sure you're on a development branch.
Formatting isn't available on main or read-only branches.
2. Open a .yml, .md, or .json file.
3. In the console section (located below the File editor), select
the Format button to auto-format your code in the File editor. Use
the Code Quality tab to view code errors.
4. Once you've auto-formatted, you'll see a message confirming the
outcome.
Format YAML, Markdown, and JSON files using Prettier.

You can add a configuration file to customize formatting rules for YAML,
Markdown, or JSON files using Prettier. The IDE looks for the configuration file
based on an order of precedence. For example, it first checks for a "prettier" key
in your package.json file.

For more info on the order of precedence and how to configure files, refer
to Prettier's documentation. Please note, .prettierrc.json5, .prettierrc.js,
and .prettierrc.toml files aren't currently supported.
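
For instance, a sketch of the package.json approach mentioned above (the option values are illustrative; any standard Prettier options can go here):

{
  "prettier": {
    "tabWidth": 2,
    "proseWrap": "always"
  }
}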

Format Python

To format your Python code, dbt Cloud integrates with Black, which is an
uncompromising Python code formatter.

1. To enable formatting, make sure you're on a development branch.
Formatting isn't available on main or read-only branches.
2. Open a .py file.
3. In the console section (located below the File editor), select
the Format button to auto-format your code in the File editor.
4. Once you've auto-formatted, you'll see a message confirming the
outcome.
Format Python files using Black.
FAQs
When should I use SQLFluff and when should I use sqlfmt?
Can I nest `.sqlfluff` files?
Can I run SQLFluff commands from the terminal?
Why am I unable to see the Lint or Format button?
Why is there inconsistent SQLFluff behavior when running outside the dbt Cloud IDE?
What are some considerations when using dbt Cloud linting?
Related docs

About dbt projects


A dbt project informs dbt about the context of your project and how to
transform your data (build your data sets). By design, dbt enforces the top-level
structure of a dbt project such as the dbt_project.yml file, the models directory,
the snapshots directory, and so on. Within the directories of the top-level, you
can organize your project in any way that meets the needs of your organization
and data pipeline.

At a minimum, all a project needs is the dbt_project.yml project configuration file. dbt supports a number of different resources, so a project may also include:
models: Each model lives in a single file and contains logic that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation.

snapshots: A way to capture the state of your mutable tables so you can refer to it later.

seeds: CSV files with static data that you can load into your data platform with dbt.

data tests: SQL queries that you can write to test the models and resources in your project.

macros: Blocks of code that you can reuse multiple times.

docs: Docs for your project that you can build.

sources: A way to name and describe the data loaded into your warehouse by your Extract and Load tools.

exposures: A way to define and describe a downstream use of your project.

metrics: A way for you to define metrics for your project.

groups: Groups enable collaborative node organization in restricted collections.

analyses: A way to organize analytical SQL queries in your project, such as the general ledger from your QuickBooks.

When building out the structure of your project, you should consider these impacts on your organization's workflow:

- How would people run dbt commands: selecting a path
- How would people navigate within the project: whether as developers in the IDE or stakeholders from the docs
- How would people configure the models: some bulk configurations are easier done at the directory level so people don't have to remember to do everything in a config block with each new model
Project configuration

Every dbt project includes a project configuration file called dbt_project.yml. It defines the directory of the dbt project and other project configurations.

Edit dbt_project.yml to set up common project configurations such as:

name: Your project's name in snake case

version: Version of your project

require-dbt-version: Restrict your project to only work with a range of dbt Core versions

profile: The profile dbt uses to connect to your data platform

model-paths: Directories where your model and source files live

seed-paths: Directories where your seed files live

test-paths: Directories where your test files live

analysis-paths: Directories where your analyses live

macro-paths: Directories where your macros live

snapshot-paths: Directories where your snapshots live

docs-paths: Directories where your docs blocks live

vars: Project variables you want to use for data compilation

For complete details on project configurations, see dbt_project.yml.
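For illustration, a minimal dbt_project.yml might look like the following sketch (the project name, profile name, and variable are placeholders):

dbt_project.yml

name: jaffle_shop
version: '1.0.0'
profile: jaffle_shop
require-dbt-version: ">=1.5.0"

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

vars:
  start_date: '2019-01-01'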

Project subdirectories

You can use the Project subdirectory option in dbt Cloud to specify a
subdirectory in your git repository that dbt should use as the root directory for
your project. This is helpful when you have multiple dbt projects in one
repository or when you want to organize your dbt project files into
subdirectories for easier management.

To use the Project subdirectory option in dbt Cloud, follow these steps:

1. Click on the cog icon on the upper right side of the page and click
on Account Settings.
2. Under Projects, select the project you want to configure as a project
subdirectory.
3. Select Edit on the lower right-hand corner of the page.
4. In the Project subdirectory field, add the name of the subdirectory. For
example, if your dbt project files are located in a subdirectory
called <repository>/finance, you would enter finance as the subdirectory.
o You can also reference nested subdirectories. For example, if your
dbt project files are located in <repository>/teams/finance, you
would enter teams/finance as the subdirectory. Note: You do not
need a leading or trailing / in the Project subdirectory field.

5. Click Save when you've finished.

After configuring the Project subdirectory option, dbt Cloud will use it as the
root directory for your dbt project. This means that dbt commands, such as dbt
run or dbt test, will operate on files within the specified subdirectory. If there is
no dbt_project.yml file in the Project subdirectory, you will be prompted to
initialize the dbt project.

New projects

You can create new projects and share them with other people by making them
available on a hosted git repository like GitHub, GitLab, and BitBucket.

After you set up a connection with your data platform, you can initialize your
new project in dbt Cloud and start developing. Or, run dbt init from the
command line to set up your new project.

During project initialization, dbt creates sample model files in your project
directory to help you start developing quickly.

Sample projects

If you want to explore dbt projects more in-depth, you can clone dbt Labs' Jaffle Shop project on GitHub. It's a runnable project that contains sample configurations and helpful notes.

If you want to see what a mature, production project looks like, check out
the GitLab Data Team public repo.

About dbt models


dbt Core and dbt Cloud are composed of different moving parts working harmoniously. All of them are important to what dbt does: transforming data, the 'T' in ELT. When you execute dbt run, you are running a model that will transform your data without that data ever leaving your warehouse.

Models are where your developers spend most of their time within a dbt
environment. Models are primarily written as a select statement and saved as
a .sql file. While the definition is straightforward, the complexity of the
execution will vary from environment to environment. Models will be written
and rewritten as needs evolve and your organization finds new ways to
maximize efficiency.
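For instance, a minimal model is nothing more than a select statement saved under a descriptive file name; the table and column names below are placeholders:

models/stg_orders.sql

-- dbt builds the result of this query in your warehouse under the
-- model's file name (stg_orders); the file contains no DDL, only a select.
select
    id as order_id,
    customer_id,
    status,
    order_date
from raw.jaffle_shop.orders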

SQL is the language most dbt users will utilize, but it is not the only one for
building models. Starting in version 1.3, dbt Core and dbt Cloud support Python
models. Python models are useful for training or deploying data science models,
complex transformations, or where a specific Python package meets a
need — such as using the dateutil library to parse dates.

Models and modern workflows

The top level of a dbt workflow is the project. A project is a directory of a .yml file (the project configuration) and either .sql or .py files (the models). The project file tells dbt the project context, and the models let dbt know how to build a specific data set. For more details on projects, refer to About dbt projects.

Your organization may need only a few models, but more likely you'll need a complex structure of nested models to transform the required data. A model is a single file containing a final select statement, a project can have multiple models, and models can even reference each other. Add to that numerous projects, and the level of effort required to transform complex data sets drops drastically compared to older methods.

Learn more about models in SQL models and Python models pages. If you'd like
to begin with a bit of practice, visit our Getting Started Guide for instructions on
setting up the Jaffle_Shop sample data so you can get hands-on with the power
of dbt.

Add snapshots to your DAG


Related documentation

 Snapshot configurations
 Snapshot properties
 snapshot command

What are snapshots?

Analysts often need to "look back in time" at previous data states in their
mutable tables. While some source data systems are built in a way that makes
accessing historical data possible, this is not always the case. dbt provides a
mechanism, snapshots, which records changes to a mutable table over time.

Snapshots implement type-2 Slowly Changing Dimensions over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an orders table where the status field can be overwritten as the order is processed.

id  status   updated_at
1   pending  2019-01-01

Now, imagine that the order goes from "pending" to "shipped". That same
record will now look like:

id  status   updated_at
1   shipped  2019-01-02

This order is now in the "shipped" state, but we've lost the information about
when the order was last in the "pending" state. This makes it difficult (or
impossible) to analyze how long it took for an order to ship. dbt can "snapshot"
these changes to help you understand how values in a row change over time.
Here's an example of a snapshot table for the previous example:

id  status   updated_at  dbt_valid_from  dbt_valid_to
1   pending  2019-01-01  2019-01-01      2019-01-02
1   shipped  2019-01-02  2019-01-02      null

In dbt, snapshots are select statements, defined within a snapshot block in a .sql file (typically in your snapshots directory). You'll also need to configure your snapshot to tell dbt how to detect record changes.
snapshots/orders_snapshot.sql

{% snapshot orders_snapshot %}

{{
config(
target_database='analytics',
target_schema='snapshots',
unique_key='id',

strategy='timestamp',
updated_at='updated_at',
)
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

PREVIEW OR COMPILE SNAPSHOTS IN IDE

It is not possible to "preview data" or "compile sql" for snapshots in dbt Cloud. Instead, run the dbt snapshot command in the IDE.

When you run the dbt snapshot command:

 On the first run: dbt will create the initial snapshot table — this will be
the result set of your select statement, with additional columns
including dbt_valid_from and dbt_valid_to. All records will have
a dbt_valid_to = null.
 On subsequent runs: dbt will check which records have changed or if any
new records have been created:
o The dbt_valid_to column will be updated for any existing records
that have changed
o The updated record and any new records will be inserted into the snapshot table. These records will now have dbt_valid_to = null.

Snapshots can be referenced in downstream models the same way as referencing models, by using the ref function.

Example

To add a snapshot to your project:

1. Create a file in your snapshots directory with a .sql file extension, e.g. snapshots/orders.sql
2. Use a snapshot block to define the start and end of a snapshot:
snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{% endsnapshot %}

3. Write a select statement within the snapshot block (tips for writing a
good snapshot query are below). This select statement defines the results
that you want to snapshot over time. You can use sources and refs here.
snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

4. Check whether the result set of your query includes a reliable timestamp
column that indicates when a record was last updated. For our example,
the updated_at column reliably indicates record changes, so we can use
the timestamp strategy. If your query result set does not have a reliable
timestamp, you'll need to instead use the check strategy — more details
on this below.
5. Add configurations to your snapshot using a config block (more details
below). You can also configure your snapshot from
your dbt_project.yml file (docs).
snapshots/orders_snapshot.sql

{% snapshot orders_snapshot %}

{{
config(
target_database='analytics',
target_schema='snapshots',
unique_key='id',

strategy='timestamp',
updated_at='updated_at',
)
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

6. Run the dbt snapshot command. For our example, a new table will be created at analytics.snapshots.orders_snapshot. Changing the target_database configuration, the target_schema configuration, or the name of the snapshot (as defined in {% snapshot ... %}) will change how dbt names this table.
$ dbt snapshot
Running with dbt=0.16.0

15:07:36 | Concurrency: 8 threads (target='dev')
15:07:36 |
15:07:36 | 1 of 1 START snapshot snapshots.orders_snapshot...... [RUN]
15:07:36 | 1 of 1 OK snapshot snapshots.orders_snapshot.......... [SELECT 3 in 1.82s]
15:07:36 |
15:07:36 | Finished running 1 snapshots in 0.68s.

Completed successfully

Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1

7. Inspect the results by selecting from the table dbt created. After the first
run, you should see the results of your query, plus the snapshot meta
fields as described below.
8. Run the snapshot command again, and inspect the results. If any records
have been updated, the snapshot should reflect this.
9. Select from the snapshot in downstream models using the ref function.
models/changed_orders.sql

select * from {{ ref('orders_snapshot') }}

10. Schedule the snapshot command to run regularly; snapshots are only useful if you run them frequently.

Detecting row changes

Snapshot "strategies" define how dbt knows if a row has changed. There are
two strategies built-in to dbt — timestamp and check.
Timestamp strategy (recommended)

The timestamp strategy uses an updated_at field to determine if a row has changed. If the configured updated_at column for a row is more recent than the last time the snapshot ran, then dbt will invalidate the old record and record the new one. If the timestamps are unchanged, then dbt will not take any action.

The timestamp strategy requires the following configurations:

updated_at: A column which represents when the source row was last updated. Example: updated_at

Example usage:
snapshots/orders_snapshot_timestamp.sql

{% snapshot orders_snapshot_timestamp %}

{{
config(
target_schema='snapshots',
strategy='timestamp',
unique_key='id',
updated_at='updated_at',
)
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

Check strategy

The check strategy is useful for tables which do not have a reliable updated_at column. This strategy works by comparing a list of columns between their current and historical values. If any of these columns have changed, then dbt will invalidate the old record and record the new one. If the column values are identical, then dbt will not take any action.

The check strategy requires the following configurations:

check_cols: A list of columns to check for changes, or all to check all columns. Example: ["name", "email"]

check_cols = 'all'

The check snapshot strategy can be configured to track changes to all columns by supplying check_cols = 'all'. It is better to explicitly enumerate the columns that you want to check. Consider using a surrogate key to condense many columns into a single column.

Example Usage
snapshots/orders_snapshot_check.sql

{% snapshot orders_snapshot_check %}

{{
config(
target_schema='snapshots',
strategy='check',
unique_key='id',
check_cols=['status', 'is_cancelled'],
)
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}
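As a sketch of the surrogate key idea mentioned above, you could compute a single hashed column in the snapshot query and check only that column. The column list is illustrative, and dbt_utils.generate_surrogate_key assumes the dbt-utils package is installed:

snapshots/orders_snapshot_surrogate.sql

{% snapshot orders_snapshot_surrogate %}

{{
    config(
      target_schema='snapshots',
      strategy='check',
      unique_key='id',
      check_cols=['row_hash'],
    )
}}

select
    *,
    {{ dbt_utils.generate_surrogate_key(['status', 'is_cancelled', 'amount']) }} as row_hash
from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}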

Hard deletes (opt-in)

Rows that are deleted from the source query are not invalidated by default. With the config option invalidate_hard_deletes, dbt can track rows that no longer exist. This is done by left joining the snapshot table with the source table, and filtering the rows that are still valid at that point but can no longer be found in the source table. dbt_valid_to will be set to the current snapshot time.

This configuration is not a different strategy as described above, but an additional opt-in feature. It is not enabled by default since it alters the previous behavior.

For this configuration to work with the timestamp strategy, the configured updated_at column must be of timestamp type. Otherwise, queries will fail due to mixing data types.

Example Usage
snapshots/orders_snapshot_hard_delete.sql

{% snapshot orders_snapshot_hard_delete %}

{{
config(
target_schema='snapshots',
strategy='timestamp',
unique_key='id',
updated_at='updated_at',
invalidate_hard_deletes=True,
)
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}

Configuring snapshots

Snapshot configurations

There are a number of snapshot-specific configurations:

target_database: The database that dbt should render the snapshot table into. Required: No. Example: analytics

target_schema: The schema that dbt should render the snapshot table into. Required: Yes. Example: snapshots

strategy: The snapshot strategy to use, one of timestamp or check. Required: Yes. Example: timestamp

unique_key: A primary key column or expression for the record. Required: Yes. Example: id

check_cols: If using the check strategy, the columns to check. Required: only if using the check strategy. Example: ["status"]

updated_at: If using the timestamp strategy, the timestamp column to compare. Required: only if using the timestamp strategy. Example: updated_at

invalidate_hard_deletes: Find hard deleted records in source, and set dbt_valid_to to the current time if the record no longer exists. Required: No. Example: True

A number of other configurations are also supported (e.g. tags and post-hook); check out the full list here.

Snapshots can be configured from both your dbt_project.yml file and a config block; check out the configuration docs for more information.

Note: BigQuery users can use target_project and target_dataset as aliases for target_database and target_schema, respectively.

Configuration best practices

Use the timestamp strategy where possible

This strategy handles column additions and deletions better than the check strategy.

Ensure your unique key is really unique

The unique key is used by dbt to match rows up, so it's extremely important to
make sure this key is actually unique! If you're snapshotting a source, I'd
recommend adding a uniqueness test to your source (example).
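For example, a source-level uniqueness test on the jaffle_shop orders source used throughout this section might look like the following sketch:

models/sources.yml

version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null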

Use a target_schema that is separate to your analytics schema

Snapshots cannot be rebuilt. As such, it's a good idea to put snapshots in a separate schema so end users know they are special. From there, you may want to set different privileges on your snapshots compared to your models, and even run them as a different user (or role, depending on your warehouse) to make it very difficult to drop a snapshot unless you really want to.
Snapshot query best practices
Snapshot source data.

Your models should then select from these snapshots, treating them like regular data sources. As much as possible, snapshot your source data in its raw form and use downstream models to clean up the data.

Use the source function in your query.


This helps when understanding data lineage in your project.

Include as many columns as possible.

In fact, go for select * if performance permits! Even if a column doesn't feel useful at the moment, it might be better to snapshot it in case it becomes useful; after all, you won't be able to recreate the column later.

Avoid joins in your snapshot query.

Joins can make it difficult to build a reliable updated_at timestamp. Instead, snapshot the two tables separately, and join them in downstream models.

Limit the amount of transformation in your query.

If you apply business logic in a snapshot query, and this logic changes in the
future, it can be impossible (or, at least, very difficult) to apply the change in
logic to your snapshots.

Basically, keep your query as simple as possible! Some reasonable exceptions to these recommendations include:

- Selecting specific columns if the table is wide.
- Doing light transformation to get data into a reasonable shape, for example, unpacking a JSON blob to flatten your source data into columns.

Snapshot meta-fields
Snapshot tables will be created as a clone of your source dataset, plus some
additional meta-fields*.

dbt_valid_from: The timestamp when this snapshot row was first inserted. This column can be used to order the different "versions" of a record.

dbt_valid_to: The timestamp when this row became invalidated. The most recent snapshot record will have dbt_valid_to set to null.

dbt_scd_id: A unique key generated for each snapshotted record. This is used internally by dbt.

dbt_updated_at: The updated_at timestamp of the source record when this snapshot row was inserted. This is used internally by dbt.

*The timestamps used for each column are subtly different depending on the strategy you use:

For the timestamp strategy, the configured updated_at column is used to populate the dbt_valid_from, dbt_valid_to and dbt_updated_at columns.

For the check strategy, the current timestamp is used to populate each column. If configured, the check strategy uses the updated_at column instead, as with the timestamp strategy.

FAQs

- How do I run one snapshot at a time?
- How often should I run the snapshot command?
- What happens if I add new columns to my snapshot query?
- Do hooks run with snapshots?
- Why is there only one `target_schema` for snapshots?
- Can I store my snapshots in a directory other than the `snapshots` directory in my project?
By default, dbt expects your snapshot files to be located in
the snapshots subdirectory of your project.

To change this, update the snapshot-paths configuration in your dbt_project.yml file, like so:
dbt_project.yml

snapshot-paths: ["snapshots"]

Note that you cannot co-locate snapshots and models in the same directory.
- Debug "Snapshot target is not a snapshot table" errors

Add data tests to your DAG


Related reference docs

 Test command
 Data test properties
 Data test configurations
 Test selection examples

Overview

Data tests are assertions you make about your models and other resources in your dbt
project (e.g. sources, seeds and snapshots). When you run dbt test, dbt will tell you if
each test in your project passes or fails.

You can use data tests to improve the integrity of the SQL in each model by making
assertions about the results generated. Out of the box, you can test whether a specified
column in a model only contains non-null values, unique values, or values that have a
corresponding value in another model (for example, a customer_id for
an order corresponds to an id in the customers model), and values from a specified list.
You can extend data tests to suit business logic specific to your organization – any
assertion that you can make about your model in the form of a select query can be turned
into a data test.

Data tests return a set of failing records. Generic data tests (f.k.a. schema tests) are
defined using test blocks.

Like almost everything in dbt, data tests are SQL queries. In particular, they
are select statements that seek to grab "failing" records, ones that disprove your
assertion. If you assert that a column is unique in a model, the test query selects for
duplicates; if you assert that a column is never null, the test seeks after nulls. If the data
test returns zero failing rows, it passes, and your assertion has been validated.
There are two ways of defining data tests in dbt:

 A singular data test is testing in its simplest form: If you can write a SQL query that returns
failing rows, you can save that query in a .sql file within your test directory. It's now a
data test, and it will be executed by the dbt test command.
 A generic data test is a parameterized query that accepts arguments. The test query is
defined in a special test block (like a macro). Once defined, you can reference the generic
test by name throughout your .yml files—define it on models, columns, sources,
snapshots, and seeds. dbt ships with four generic data tests built in, and we think you
should use them!

Defining data tests is a great way to confirm that your outputs and inputs are as expected,
and helps prevent regressions when your code changes. Because you can use them over
and over again, making similar assertions with minor variations, generic data tests tend to
be much more common—they should make up the bulk of your dbt data testing suite.
That said, both ways of defining data tests have their time and place.

CREATING YOUR FIRST DATA TESTS

If you're new to dbt, we recommend that you check out our quickstart guide to build your
first dbt project with models and tests.
Singular data tests

The simplest way to define a data test is by writing the exact SQL that will return failing
records. We call these "singular" data tests, because they're one-off assertions usable for a
single purpose.

These tests are defined in .sql files, typically in your tests directory (as defined by
your test-paths config). You can use Jinja (including ref and source) in the test
definition, just like you can when creating models. Each .sql file contains
one select statement, and it defines one data test:

tests/assert_total_payment_amount_is_positive.sql

-- Refunds have a negative amount, so the total amount should always be >= 0.
-- Therefore return records where this isn't true to make the test fail
select
    order_id,
    sum(amount) as total_amount
from {{ ref('fct_payments') }}
group by 1
having not(total_amount >= 0)

The name of this test is the name of the file: assert_total_payment_amount_is_positive. Simple enough.
Singular data tests are easy to write—so easy that you may find yourself writing the same
basic structure over and over, only changing the name of a column or model. By that
point, the test isn't so singular! In that case, we recommend...

Generic data tests

Certain data tests are generic: they can be reused over and over again. A generic data test
is defined in a test block, which contains a parametrized query and accepts arguments. It
might look like:

{% test not_null(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} is null

{% endtest %}

You'll notice that there are two arguments, model and column_name, which are then
templated into the query. This is what makes the test "generic": it can be defined on as
many columns as you like, across as many models as you like, and dbt will pass the values
of model and column_name accordingly. Once that generic test has been defined, it can be
added as a property on any existing model (or source, seed, or snapshot). These properties
are added in .yml files in the same directory as your resource.

INFO

If this is your first time working with adding properties to a resource, check out the docs
on declaring properties.

Out of the box, dbt ships with four generic data tests already
defined: unique, not_null, accepted_values and relationships. Here's a full example
using those tests on an orders model:

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

In plain English, these data tests translate to:

- unique: the order_id column in the orders model should be unique
- not_null: the order_id column in the orders model should not contain null values
- accepted_values: the status column in the orders model should be one of 'placed', 'shipped', 'completed', or 'returned'
- relationships: each customer_id in the orders model exists as an id in the customers table (also known as referential integrity)

Behind the scenes, dbt constructs a select query for each data test, using the
parametrized query from the generic test block. These queries return the rows where your
assertion is not true; if the test returns zero rows, your assertion passes.

You can find more information about these data tests, and additional configurations
(including severity and tags) in the reference section.

More generic data tests

Those four tests are enough to get you started. You'll quickly find you want to use a wider
variety of tests—a good thing! You can also install generic data tests from a package, or
write your own, to use (and reuse) across your dbt project. Check out the guide on custom
generic tests for more information.

INFO

There are generic tests defined in some open source packages, such as dbt-utils and dbt-
expectations — skip ahead to the docs on packages to learn more!

Example

To add a generic (or "schema") test to your project:

1. Add a .yml file to your models directory, e.g. models/schema.yml, with the following content (you may need to adjust the name: values for an existing model)

models/schema.yml

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

2. Run the dbt test command:

$ dbt test

Found 3 models, 2 tests, 0 snapshots, 0 analyses, 130 macros, 0 operations, 0 seed files, 0 sources

17:31:05 | Concurrency: 1 threads (target='learn')
17:31:05 |
17:31:05 | 1 of 2 START test not_null_order_order_id..................... [RUN]
17:31:06 | 1 of 2 PASS not_null_order_order_id........................... [PASS in 0.99s]
17:31:06 | 2 of 2 START test unique_order_order_id....................... [RUN]
17:31:07 | 2 of 2 PASS unique_order_order_id............................. [PASS in 0.79s]
17:31:07 |
17:31:07 | Finished running 2 tests in 7.17s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

3. Check out the SQL dbt is running by either:
   - dbt Cloud: checking the Details tab.
   - dbt Core: checking the target/compiled directory

Unique test

 Compiled SQL
 Templated SQL

select *
from (

select
order_id

from analytics.orders
where order_id is not null
group by order_id
having count(*) > 1

) validation_errors

Not null test

 Compiled SQL
 Templated SQL

select *
from analytics.orders
where order_id is null

Storing test failures

Normally, a data test query will calculate failures as part of its execution. If you set the
optional --store-failures flag, the store_failures, or the store_failures_as configs,
dbt will first save the results of a test query to a table in the database, and then query that
table to calculate the number of failures.

This workflow allows you to query and examine failing records much more quickly in
development:

Store test failures in the database for faster development-time debugging.
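As a rough sketch (the model, column, and project names are placeholders), you can opt in per test with a config block in your properties file, or for many tests at once in dbt_project.yml:

models/schema.yml

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique:
              config:
                store_failures: true

dbt_project.yml

tests:
  +store_failures: true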

Note that, if you elect to store test failures:

- Test result tables are created in a schema suffixed or named dbt_test__audit, by default. It is possible to change this value by setting a schema config. (For more details on schema naming, see using custom schemas.)
- A test's results will always replace previous failures for the same test.

FAQs

- How do I test one model at a time?
- One of my tests failed, how can I debug it?
- What tests should I add to my project?
- When should I run my tests?
- Can I store my tests in a directory other than the `tests` directory in my project?
- How do I run tests on just my sources?
- Can I set test failure thresholds?

As of v0.20.0, you can use the error_if and warn_if configs to set custom failure thresholds in your tests. For more details, see the test configurations reference.

For dbt v0.19.0 and earlier, you could try these possible solutions:

- Setting the severity to warn, or:
- Writing a custom generic test that accepts a threshold argument (example)
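To illustrate the newer config-based approach, a minimal sketch (the model, column, and threshold values are placeholders) might look like:

models/schema.yml

version: 2

models:
  - name: orders
    columns:
      - name: status
        tests:
          - not_null:
              config:
                warn_if: ">10"
                error_if: ">100"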

Can I test the uniqueness of two columns?

Yes. There are a few different options.

Consider an orders table that contains records from multiple countries, and the
combination of ID and country code is unique:

order_id  country_code
1         AU
2         AU
...       ...
1         US
2         US
...       ...

Here are some approaches:

1. Create a unique key in the model and test that

models/orders.sql

select
    country_code || '-' || order_id as surrogate_key,
    ...

models/orders.yml

version: 2

models:
  - name: orders
    columns:
      - name: surrogate_key
        tests:
          - unique
2. Test an expression

models/orders.yml

version: 2

models:
  - name: orders
    tests:
      - unique:
          column_name: "(country_code || '-' || order_id)"

3. Use the dbt_utils.unique_combination_of_columns test

This is especially useful for large datasets since it is more performant. Check out the docs on packages for more information.

models/orders.yml

version: 2

models:
  - name: orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - country_code
            - order_id


Jinja and macros


Related reference docs

 Jinja Template Designer Documentation (external link)


 dbt Jinja context
 Macro properties

Overview

In dbt, you can combine SQL with Jinja, a templating language.

Using Jinja turns your dbt project into a programming environment for SQL,
giving you the ability to do things that aren't normally possible in SQL. For
example, with Jinja you can:

- Use control structures (e.g. if statements and for loops) in SQL
- Use environment variables in your dbt project for production deployments
- Change the way your project builds based on the current target
- Operate on the results of one query to generate another query, for example:
  - Return a list of payment methods, in order to create a subtotal column per payment method (pivot)
  - Return a list of columns in two relations, and select them in the same order to make it easier to union them together
- Abstract snippets of SQL into reusable macros; these are analogous to functions in most programming languages

In fact, if you've used the {{ ref() }} function, you're already using Jinja!

Jinja can be used in any SQL in a dbt project, including models, analyses, tests,
and even hooks.

READY TO GET STARTED WITH JINJA AND MACROS?

Check out the tutorial on using Jinja for a step-by-step example of using Jinja in
a model, and turning it into a macro!
Getting started

Jinja

Here's an example of a dbt model that leverages Jinja:


/models/order_payment_method_amounts.sql

{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}

select
order_id,
{% for payment_method in payment_methods %}
sum(case when payment_method = '{{payment_method}}' then amount end) as
{{payment_method}}_amount,
{% endfor %}
sum(amount) as total_amount
from app_data.payments
group by 1

This query will get compiled to:


/models/order_payment_method_amounts.sql

select
order_id,
sum(case when payment_method = 'bank_transfer' then amount end) as
bank_transfer_amount,
sum(case when payment_method = 'credit_card' then amount end) as
credit_card_amount,
sum(case when payment_method = 'gift_card' then amount end) as
gift_card_amount,
sum(amount) as total_amount
from app_data.payments
group by 1

You can recognize Jinja based on the delimiters the language uses, which we
refer to as "curlies":

 Expressions {{ ... }}: Expressions are used when you want to output a
string. You can use expressions to reference variables and call macros.
 Statements {% ... %}: Statements don't output a string. They are used
for control flow, for example, to set up for loops and if statements,
to set or modify variables, or to define macros.
- Comments {# ... #}: Jinja comments are used to prevent the text within the comment from executing or outputting a string.

When used in a dbt model, your Jinja needs to compile to a valid query. To
check what SQL your Jinja compiles to:

 Using dbt Cloud: Click the compile button to see the compiled SQL in the
Compiled SQL pane
 Using dbt Core: Run dbt compile from the command line. Then open the
compiled SQL file in the target/compiled/{project name}/ directory. Use a
split screen in your code editor to keep both files open at once.

Macros

Macros in Jinja are pieces of code that can be reused multiple times – they are
analogous to "functions" in other programming languages, and are extremely
useful if you find yourself repeating code across multiple models. Macros are
defined in .sql files, typically in your macros directory (docs).

Macro files can contain one or more macros — here's an example:


macros/cents_to_dollars.sql

{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}

A model which uses this macro might look like:
models/stg_payments.sql

select
id as payment_id,
{{ cents_to_dollars('amount') }} as amount_usd,
...
from app_data.payments

This would be compiled to:


target/compiled/models/stg_payments.sql

select
id as payment_id,
(amount / 100)::numeric(16, 2) as amount_usd,
...
from app_data.payments

Using a macro from a package

A number of useful macros have also been grouped together into packages —
our most popular package is dbt-utils.

After installing a package into your project, you can use any of the macros in
your own project — make sure you qualify the macro by prefixing it with
the package name:

select
field_1,
field_2,
field_3,
field_4,
field_5,
count(*)
from my_table
{{ dbt_utils.group_by(5) }}

You can also qualify a macro in your own project by prefixing it with
your package name (this is mainly useful for package authors).

FAQs

- What parts of Jinja are dbt-specific?
- Which docs should I use when writing Jinja or creating a macro?
- Why do I need to quote column names in Jinja?
- My compiled SQL has a lot of spaces and new lines, how can I get rid of it?
- How do I debug my Jinja?
- How do I document macros?
- Why does my dbt output have so many macros in it?
dbtonic Jinja

Just like well-written python is pythonic, well-written dbt code is dbtonic.

Favor readability over DRY-ness

Once you learn the power of Jinja, it's common to want to abstract every
repeated line into a macro! Remember that using Jinja can make your models
harder for other users to interpret — we recommend favoring readability when
mixing Jinja with SQL, even if it means repeating some lines of SQL in a few
places. If all your models are macros, it might be worth re-assessing.

Leverage package macros

Writing a macro for the first time? Check whether we've open sourced one
in dbt-utils that you can use, and save yourself some time!

Set variables at the top of a model

{% set ... %} can be used to create a new variable, or update an existing one. We recommend setting variables at the top of a model, rather than hardcoding them inline. This is a practice borrowed from many other coding languages, since it helps with readability, and comes in handy if you need to reference the variable in two places:

-- 🙅 This works, but can be hard to maintain as your code grows
{% for payment_method in ["bank_transfer", "credit_card", "gift_card"] %}
...
{% endfor %}

-- ✅ This is our preferred method of setting variables
{% set payment_methods = ["bank_transfer", "credit_card", "gift_card"] %}

{% for payment_method in payment_methods %}
...
{% endfor %}


Add sources to your DAG


Related reference docs

 Source properties
 Source configurations
 {{ source() }} jinja function
 source freshness command

Using sources

Sources make it possible to name and describe the data loaded into your
warehouse by your Extract and Load tools. By declaring these tables as sources
in dbt, you can then

- select from source tables in your models using the {{ source() }} function, helping define the lineage of your data
- test your assumptions about your source data
- calculate the freshness of your source data

Declaring a source

Sources are defined in .yml files nested under a sources: key.


models/<filename>.yml

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: orders
      - name: customers

  - name: stripe
    tables:
      - name: payments
*By default, schema will be the same as name. Add schema only if you want to use a
source name that differs from the existing schema.

If you're not already familiar with these files, be sure to check out the
documentation on schema.yml files before proceeding.

Selecting from a source

Once a source has been defined, it can be referenced from a model using the {{ source() }} function.
models/orders.sql

select
...

from {{ source('jaffle_shop', 'orders') }}

left join {{ source('jaffle_shop', 'customers') }} using (customer_id)

dbt will compile this to the full table name:


target/compiled/jaffle_shop/models/my_model.sql

select
...

from raw.jaffle_shop.orders

left join raw.jaffle_shop.customers using (customer_id)

Using the {{ source() }} function also creates a dependency between the model and the source table.

The source function tells dbt a model is dependent on a source.

Testing and documenting sources

You can also:

- Add data tests to sources
- Add descriptions to sources, that get rendered as part of your documentation site

These should be familiar concepts if you've already added tests and descriptions to your models (if not, check out the guides on testing and documentation).
models/<filename>.yml

version: 2

sources:
  - name: jaffle_shop
    description: This is a replica of the Postgres database used by our app
    tables:
      - name: orders
        description: >
          One record per order. Includes cancelled and deleted orders.
        columns:
          - name: id
            description: Primary key of the orders table
            tests:
              - unique
              - not_null
          - name: status
            description: Note that the status can change over time

      - name: ...

      - name: ...

You can find more details on the available properties for sources in
the reference section.

FAQs

- What if my source is in a poorly named schema or table?
- What if my source is in a different database to my target database?
- I need to use quotes to select from my source, what should I do?
- How do I run tests on just my sources?
- How do I run models downstream of one source?
Snapshotting source data freshness

With a couple of extra configs, dbt can optionally snapshot the "freshness" of
the data in your source tables. This is useful for understanding if your data
pipelines are in a healthy state, and is a critical component of defining SLAs for
your warehouse.

Declaring source freshness

To configure sources to snapshot freshness information, add a freshness block to your source and loaded_at_field to your table declaration:
models/<filename>.yml

version: 2

sources:
  - name: jaffle_shop
    database: raw
    freshness: # default freshness
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _etl_loaded_at

    tables:
      - name: orders
        freshness: # make this a little more strict
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}

      - name: customers # this will use the freshness defined above

      - name: product_skus
        freshness: null # do not check freshness for this table

In the freshness block, one or both of warn_after and error_after can be provided. If neither is provided, then dbt will not calculate freshness snapshots for the tables in this source.

Additionally, the loaded_at_field is required to calculate freshness for a table. If a loaded_at_field is not provided, then dbt will not calculate freshness for the table.

These configs are applied hierarchically, so freshness and loaded_at_field values specified for a source will flow through to all of the tables defined in that source. This is useful when all of the tables in a source have the same loaded_at_field, as the config can just be specified once in the top-level source definition.
Checking source freshness

To snapshot freshness information for your sources, use the dbt source
freshness command (reference docs):

$ dbt source freshness

Behind the scenes, dbt uses the freshness properties to construct a select query, shown below. You can find this query in the query logs.
select
max(_etl_loaded_at) as max_loaded_at,
convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders

The results of this query are used to determine whether the source is fresh or
not:

Uh oh! Not everything is as fresh as we'd like!

Filter

Some databases can have tables where a filter over certain columns is required to prevent a full scan of the table, which could be costly. To run a freshness check on such tables, a filter argument can be added to the configuration, e.g. filter: _etl_loaded_at >= date_sub(current_date(), interval 1 day). For the example above, the resulting query would look like:

select
max(_etl_loaded_at) as max_loaded_at,
convert_timezone('UTC', current_timestamp()) as snapshotted_at
from raw.jaffle_shop.orders
where _etl_loaded_at >= date_sub(current_date(), interval 1 day)
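In the source properties file, the filter is declared alongside the table's freshness settings; a minimal sketch, assuming the filter sits inside the table-level freshness block (check the source properties reference for the exact placement in your dbt version), might look like:

models/<filename>.yml

sources:
  - name: jaffle_shop
    database: raw
    loaded_at_field: _etl_loaded_at
    tables:
      - name: orders
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
          filter: _etl_loaded_at >= date_sub(current_date(), interval 1 day)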

FAQs

- How do I exclude a table from a freshness snapshot?
- How do I snapshot freshness for one source only?
- Are the results of freshness stored anywhere?

Yes!

The dbt source freshness command will output a pass/warning/error status for
each table selected in the freshness snapshot.

Additionally, dbt will write the freshness results to a file in the target/ directory called sources.json by default. You can override this destination by using the -o flag with the dbt source freshness command.

After enabling source freshness within a job, configure Artifacts in your Project
Details page, which you can find by clicking the gear icon and then
selecting Account settings. You can see the current status for source freshness
by clicking View Sources in the job page.

Add Exposures to your DAG


Exposures make it possible to define and describe a downstream use of your dbt
project, such as in a dashboard, application, or data science pipeline. By
defining exposures, you can then:

 run, test, and list resources that feed into your exposure
 populate a dedicated page in the auto-generated documentation site
with context relevant to data consumers

Declaring an exposure

Exposures are defined in .yml files nested under an exposures: key.


models/<filename>.yml
version: 2

exposures:
  - name: weekly_jaffle_metrics
    label: Jaffles by the Week
    type: dashboard
    maturity: high
    url: https://bi.tool/dashboards/1
    description: >
      Did someone say "exponential growth"?

    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
      - source('gsheets', 'goals')
      - metric('count_orders')

    owner:
      name: Callum McData
      email: [email protected]

Available properties

Required:

- name: a unique exposure name written in snake case
- type: one of dashboard, notebook, analysis, ml, application (used to organize in docs site)
- owner: name or email required; additional properties allowed

Expected:

- depends_on: list of refable nodes, including ref, source, and metric (While possible, it is highly unlikely you will ever need an exposure to depend on a source directly)

Optional:

- label: may contain spaces, capital letters, or special characters.
- url: enables the link to View this exposure in the upper right corner of the generated documentation site
- maturity: one of high, medium, low

General properties (optional)

 description
 tags
 meta

We plan to add more subtypes and optional properties in future releases.


Referencing exposures

Once an exposure is defined, you can run commands that reference it:
dbt run -s +exposure:weekly_jaffle_metrics
dbt test -s +exposure:weekly_jaffle_metrics

When we generate our documentation site, you'll see the exposure appear:

Dedicated page in dbt-docs for each exposure

Add groups to your DAG


A group is a collection of nodes within a dbt DAG. Groups are named, and every
group has an owner. They enable intentional collaboration within and across
teams by restricting access to private models.

Group members may include models, tests, seeds, snapshots, analyses, and
metrics. (Not included: sources and exposures.) Each node may belong to only
one group.

Declaring a group

Groups are defined in .yml files, nested under a groups: key.


models/marts/finance/finance.yml

groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: [email protected]
      slack: finance-data
      github: finance-data-team

Adding a model to a group

Use the group configuration to add one or more models to a group.

 Project-level
 Model-level
 In-file
dbt_project.yml

models:
  marts:
    finance:
      +group: finance

Referencing a model in a group

By default, all models within a group have the protected access modifier. This
means they can be referenced by downstream resources in any group in the
same project, using the ref function. If a grouped model's access property is set
to private, only resources within its group can reference it.
models/schema.yml

models:
  - name: finance_private_model
    access: private
    config:
      group: finance

  # in a different group!
  - name: marketing_model
    config:
      group: marketing

models/marketing_model.sql

select * from {{ ref('finance_private_model') }}

$ dbt run -s marketing_model
...
dbt.exceptions.DbtReferenceError: Parsing Error
  Node model.jaffle_shop.marketing_model attempted to reference node model.jaffle_shop.finance_private_model,
  which is not allowed because the referenced node is private to the finance group.


Analyses
Overview

dbt's notion of models makes it easy for data teams to version control and
collaborate on data transformations. Sometimes though, a certain SQL
statement doesn't quite fit into the mold of a dbt model. These more
"analytical" SQL files can be versioned inside of your dbt project using
the analysis functionality of dbt.

Any .sql files found in the analyses/ directory of a dbt project will be compiled,
but not executed. This means that analysts can use dbt functionality
like {{ ref(...) }} to select from models in an environment-agnostic way.

In practice, an analysis file might look like this (via the open source Quickbooks
models):
analyses/running_total_by_account.sql

-- analyses/running_total_by_account.sql

with journal_entries as (

    select *
    from {{ ref('quickbooks_adjusted_journal_entries') }}

), accounts as (

    select *
    from {{ ref('quickbooks_accounts_transformed') }}

)

select
    txn_date,
    account_id,
    adjusted_amount,
    description,
    account_name,
    sum(adjusted_amount) over (partition by account_id order by id rows unbounded preceding)
from journal_entries
order by account_id, id

To compile this analysis into runnable SQL, run:

dbt compile

Then, look for the compiled SQL file in target/compiled/{project name}/analyses/running_total_by_account.sql. This SQL can then be pasted into a data visualization tool, for instance. Note that no running_total_by_account relation will be materialized in the database as this is an analysis, not a model.

Data Build Tool (DBT) is a popular open-source tool used in the data
analytics and data engineering fields. DBT helps data professionals
transform, model, and prepare data for analysis. If you’re preparing
for an interview related to DBT, it’s important to be well-versed in
its concepts and functionalities. To help you prepare, here’s a list of
common interview questions and answers about DBT.

1. What is DBT?

Answer: DBT, short for Data Build Tool, is an open-source data transformation and modeling tool. It helps analysts and data engineers manage the transformation and preparation of data for analytics and reporting.

2. What are the primary use cases of DBT?

Answer: DBT is primarily used for data transformation, modeling, and preparing data for analysis and reporting. It is commonly used in data warehouses to create and maintain data pipelines.

3. How does DBT differ from traditional ETL tools?

Answer: Unlike traditional ETL tools, DBT focuses on transforming and modeling data within the data warehouse itself, making it more suitable for ELT (Extract, Load, Transform) workflows. DBT leverages the power and scalability of modern data warehouses and allows for version control and testing of data models.

4. What is a DBT model?

Answer: A DBT model is a SQL file that defines a transformation or a table within the data warehouse. Models can be simple SQL queries or complex transformations that create derived datasets.

5. Explain the difference between source and model in DBT.

Answer: A source in DBT refers to the raw or untransformed data that is ingested into the data warehouse. Models are the transformed and structured datasets created using DBT to support analytics.

6. What is a DBT project?

Answer: A DBT project is a directory containing all the files and configurations necessary to define data models, tests, and documentation. It is the primary unit of organization for DBT.

7. What is a DAG in the context of DBT?

Answer: DAG stands for Directed Acyclic Graph, and in the context
of DBT, it represents the dependencies between models. DBT uses a
DAG to determine the order in which models are built.

8. How do you write a DBT model to transform data?

Answer: To write a DBT model, you create a `.sql` file in the appropriate project directory, defining the SQL transformation necessary to generate the target dataset.

9. What are DBT macros, and how are they useful in transformations?

Answer: DBT macros are reusable SQL code snippets that can simplify and standardize common operations in your DBT models, such as filtering, aggregating, or renaming columns.

10. How can you perform testing and validation of DBT models?

Answer: You can perform testing in DBT by writing custom SQL tests to validate your data models. These tests can check for data quality, consistency, and other criteria to ensure your models are correct.

11. Explain the process of deploying DBT models to production.

Answer: Deploying DBT models to production typically involves using DBT Cloud, CI/CD pipelines, or other orchestration tools. You'll need to compile and build the models and then deploy them to your data warehouse environment.

12. How does DBT support version control and collaboration?

Answer: DBT integrates with version control systems like Git, allowing teams to collaborate on DBT projects and track changes to models over time. It provides a clear history of changes and enables collaboration in a multi-user environment.

13. What are some common performance optimization techniques for DBT models?

Answer: Performance optimization in DBT can be achieved by using techniques like materialized views, optimizing SQL queries, and using caching to reduce query execution times.

14. How do you monitor and troubleshoot issues in DBT?

Answer: DBT provides logs and diagnostics to help monitor and troubleshoot issues. You can also use data warehouse-specific monitoring tools to identify and address performance problems.

15. Can DBT work with different data sources and data warehouses?

Answer: Yes, DBT supports integration with a variety of data sources and data warehouses, including Snowflake, BigQuery, Redshift, and more. It's adaptable to different cloud and on-premises environments.

16. How does DBT handle incremental loading of data from source systems?

Answer: DBT handles incremental loading with incremental models. A model configured as incremental can be set up to only transform new or changed data from source systems since the last run, for example by filtering on an updated-at timestamp.

17. What security measures does DBT support for data access and transformation?

Answer: DBT supports the security features provided by your data warehouse, such as row-level security and access control policies. It's important to implement proper access controls at the database level.

18. How can you manage sensitive data in DBT models?

Answer: Sensitive data in DBT models should be handled according to your organization's data security policies. This can involve encryption, tokenization, or other data protection measures.

19. Types of Materialization?

Answer: DBT supports several types of materialization, as follows:

1) View (Default):

Purpose: Views are virtual tables that are not materialized. They
are essentially saved queries that are executed at runtime.
Use Case: Useful for simple transformations or when you want to
reference a SQL query in multiple models.

{{ config(
materialized='view'
) }}
SELECT
...
FROM ...

2) Table:

Purpose: Materializes the result of a SQL query as a physical table in your data warehouse. (dbt names the table after the model file, so the model is just a select statement; there is no INTO clause.)
Use Case: Suitable for intermediate or final tables that you want to persist in your data warehouse.

{{ config(
    materialized='table'
) }}
SELECT
...
FROM ...

3) Incremental:

Purpose: Materializes the result of a SQL query as a physical table, but is designed to be updated incrementally. It's typically used for incremental data loads.
Use Case: Ideal for situations where you want to update your table with only the new or changed data since the last run.

{{ config(
materialized='incremental'
) }}
SELECT
...
FROM ...
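In practice, an incremental model usually also filters for new rows on incremental runs using dbt's is_incremental() macro and {{ this }}; a minimal sketch (the table and column names are placeholders) might look like:

models/events_incremental.sql

{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_type,
    created_at
from raw.app_data.events

{% if is_incremental() %}
  -- only pick up rows that arrived since the last time this model was built
  where created_at > (select max(created_at) from {{ this }})
{% endif %}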

4) Incremental + Unique Key:

Purpose: Similar to the plain incremental materialization, but specifies a unique key that dbt can use to identify new or updated rows. (unique_key is a config of the incremental materialization; it has no effect on a plain table materialization.)

Use Case: Useful when dbt needs a way to identify changes in the data.

{{ config(
    materialized='incremental',
    unique_key='id'
) }}
SELECT
...
FROM ...

5) Snapshot:

Purpose: Captures a table in a way that retains a version history of the data, allowing you to query the data as it was at different points in time.

Use Case: Useful for slowly changing dimensions or situations where historical data is important.

Note that snapshots are not a materialized config on a model; they are defined in a snapshot block (typically in the snapshots directory), as described earlier in this document:

{% snapshot my_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

SELECT * FROM {{ source('my_source', 'my_table') }}

{% endsnapshot %}

20. Types of Tests in DBT?

Answer: Dbt provides several types of tests that you can use to
validate your data. Here are some common test types in dbt:

1)Unique Key Test (unique):

Verifies that a specified column contains unique values.

version: 2

models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique

2)Not Null Test (not_null):

Ensures that specified columns do not contain null values.

version: 2

models:
  - name: my_model
    columns:
      - name: name
        tests:
          - not_null
      - name: age
        tests:
          - not_null

3)Accepted Values Test (accepted_values):

Validates that the values in a column are among a specified list.

version: 2

models:
  - name: my_model
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
4)Relationships Test (relationships):

Verifies that the values in a foreign key column match primary key values in the referenced table.

version: 2

models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id

5)Referential Integrity:

Referential integrity between two tables is enforced with the built-in relationships test shown above; dbt does not ship a separate referential_integrity test. Apply relationships to every foreign key column whose values must exist in the parent table.

6)Custom SQL Test (singular test):

Allows you to define custom SQL to test specific conditions. A singular test is a .sql file in the tests directory that selects the rows that should fail; if the query returns zero rows, the test passes.

-- tests/assert_positive_values.sql
SELECT *
FROM {{ ref('my_model') }}
WHERE column_name <= 0

21.What is seed?

Answer: A "seed" is a CSV file in your dbt project (in the seeds, formerly data, directory) that dbt loads into your data warehouse as a table with the dbt seed command. Seeds are typically used for static or reference data that doesn't change often and doesn't require transformation during the ETL (Extract, Transform, Load) process.

Here are some key points about seeds in dbt:

1. Static Data: Seeds are used for static or reference data


that doesn’t change frequently. Examples include lookup
tables, reference data, or any data that serves as a fixed
input for analysis.
2. Initial Data Load: Seeds are often used to load initial
data into a data warehouse or data mart. This data is
typically loaded once and then used as a stable reference for
reporting and analysis.
3. Configuration: The CSV file itself defines the seed; optional configurations, such as column types, schema, or tags, are set in dbt_project.yml (or in a properties YAML file).
Here's an example of seed configuration in dbt_project.yml:

seeds:
  my_project:
    country_codes:
      column_types:
        country_code: varchar(2)
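Once loaded with dbt seed, a seed can be referenced from models with ref() just like any other model; the file, model, and column names below are illustrative assumptions:

$ dbt seed    # loads seeds such as country_codes.csv into the warehouse

-- models/stg_orders_enriched.sql
SELECT o.*, c.country_name
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('country_codes') }} AS c
  ON o.country_code = c.country_code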

22.What is Pre-hook and Post-hook?

Answer: Pre-hooks and Post-hooks are mechanisms to execute SQL commands or scripts before and after the execution of dbt models, respectively. dbt is an open-source tool that enables analytics engineers to transform data in their warehouse more effectively.

Here’s a brief explanation of pre-hooks and post-hooks:

1)Pre-hooks:

 A pre-hook is a SQL command or script that is executed before running dbt models.
 It allows you to perform setup tasks or run additional SQL commands before the main dbt modeling process.
 Common use cases for pre-hooks include tasks such as creating temporary tables, loading data into staging tables, or performing any other necessary setup before model execution.
Example of a pre-hook :

-- models/my_model.sql
{{ config(
pre_hook = "CREATE TEMP TABLE my_temp_table AS SELECT * FROM
my_source_table"
) }}
SELECT
column1,
column2
FROM
my_temp_table

2)Post-hooks:

 A post-hook is a SQL command or script that is executed after the successful completion of dbt models.
 It allows you to perform cleanup tasks, log information, or execute additional SQL commands after the models have been successfully executed.
 Common use cases for post-hooks include tasks such as updating metadata tables, logging information about the run, or deleting temporary tables created during the pre-hook.

Example of a post-hook :

-- models/my_model.sql
SELECT
column1,
column2
FROM
my_source_table

{{ config(
post_hook = "UPDATE metadata_table SET last_run_timestamp =
CURRENT_TIMESTAMP"
) }}

23.what is snapshots?

Answer: "snapshots" refer to a type of dbt model that is used to track changes over time in a table or view. Snapshots are particularly useful for building historical reporting or analytics, where you want to analyze how data has changed over different points in time.

Here’s how snapshots work in dbt:

1. Snapshot Tables: A snapshot table is a table that represents a historical state of another table. For example, if you have a table representing customer information, a snapshot table could be used to capture changes to that information over time.
2. Unique Identifiers: To track changes over time, dbt
relies on unique identifiers (primary keys) in the
underlying data. These identifiers are used to determine
which rows have changed, and dbt creates new records in
the snapshot table accordingly.
3. Timestamps: Snapshots also use timestamp columns to
determine when each historical version of a record was
valid. This allows you to query the data as it existed at a
specific point in time.
4. Configuring Snapshots: In dbt, you configure snapshots
in your project by creating a separate SQL file for each
snapshot table. This file defines the base table or view
you’re snapshotting, the primary key, and any other
necessary configurations.

Here's a simplified example:

-- snapshots/customer_snapshot.sql

{% snapshot customer_snapshot %}

{{ config(
target_database='analytics',
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at'
) }}

SELECT
customer_id,
name,
email,
address,
updated_at
FROM source.customer

{% endsnapshot %}

24.What is macros?

Answer: macros refer to reusable blocks of SQL code that can be defined and invoked within dbt models. dbt macros are similar to functions or procedures in other programming languages, allowing you to encapsulate and reuse SQL logic across multiple queries.

Here’s how dbt macros work:


1. Definition: A macro is defined in a separate file with
a .sql extension. It contains SQL code that can take
parameters, making it flexible and reusable.

-- my_macro.sql
{% macro my_macro(parameter1, parameter2) %}
SELECT
column1,
column2
FROM
my_table
WHERE
condition1 = {{ parameter1 }}
AND condition2 = {{ parameter2 }}
{% endmacro %}

2. Invocation: You can then use the macro in your dbt models by
referencing it.

-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}

When you run the dbt project, dbt replaces the macro invocation
with the actual SQL code defined in the macro.

3. Parameters: Macros can accept parameters, making them dynamic and reusable for different scenarios. In the example above, parameter1 and parameter2 are parameters that can be supplied when invoking the macro.

4. Code Organization: Macros help in organizing and modularizing your SQL code. They are particularly useful when you have common patterns or calculations that need to be repeated across multiple models.

-- my_model.sql
{{ my_macro(parameter1=1, parameter2='value') }}

-- another_model.sql
{{ my_macro(parameter1=2, parameter2='another_value') }}

25.what is project structure?

Answer: A project structure refers to the organization and layout of files and directories within a dbt project. dbt is a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively. The project structure in dbt is designed to be modular and organized, allowing users to manage and version control their analytics code easily.

A typical dbt project structure includes the following key components:

1. Models Directory:

This is where you store your SQL files containing dbt models. Each
model represents a logical transformation or aggregation of your
raw data. Models are defined using SQL syntax and are typically
organized into subdirectories based on the data source or business
logic.

2. Data (Seeds) Directory:
The data directory (called seeds in newer dbt versions) stores CSV seed files that dbt loads into your warehouse with the dbt seed command. This might include lookup tables, reference data, or any other supplemental data needed for your analytics.

3. Analysis Directory:

This directory contains SQL files that are used for ad-hoc querying
or exploratory analysis. These files are separate from the main
models and are not intended to be part of the core data
transformation process.

4. Tests Directory:

dbt allows you to write tests to ensure the quality of your data transformations. The tests directory stores singular tests: SQL files that select rows that should fail. Generic tests, such as uniqueness and not-null checks, are declared in the YAML property files alongside your models.

5. Snapshots Directory:

Snapshots are used for slowly changing dimensions or historical tracking of data changes. The snapshots directory is where you store SQL files defining the logic for these snapshots.

6. Macros Directory:
Macros in dbt are reusable pieces of SQL code. The macros directory
is where you store these macros, and they can be included in your
models for better modularity and maintainability.

7. Docs Directory:

This directory is used for storing documentation for your dbt project. Documentation is crucial for understanding the purpose and logic behind each model and transformation.

8. dbt_project.yml:

This YAML file is the configuration file for your dbt project. It includes settings such as the project name, model paths, default materializations, and other project-specific configurations (connection details live in profiles.yml).

9. Profiles.yml:

This file contains the connection details for your data warehouse. It specifies how to connect to your database, including the type of database, host, username, and password. It typically lives in your ~/.dbt/ directory rather than in the project itself.

10. Analysis and Custom Folders:

You may have additional directories for custom scripts, notebooks, or other artifacts related to your analytics workflow.
Having a well-organized project structure makes it easier to
collaborate with team members, maintain code, and manage version
control. It also ensures that your analytics code is modular, reusable,
and easy to understand.

my_project/
|-- analysis/
|   |-- my_analysis_file.sql
|-- data/
|   |-- my_seed_file.csv
|-- macros/
|   |-- my_macro_file.sql
|-- models/
|   |-- my_model_file.sql
|-- snapshots/
|   |-- my_snapshot_file.sql
|-- tests/
|   |-- my_test_file.sql
|-- dbt_project.yml

26. What is data refresh?

Answer: "data refresh" typically refers to the process of updating or reloading data in your data warehouse. dbt is a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively. It allows you to write modular SQL queries, called models, that define transformations on your raw data.

Here's a brief overview of the typical workflow involving data refresh in dbt:
1. Write Models: Analysts write SQL queries to transform
raw data into analysis-ready tables. These queries are
defined in dbt models.
2. Run dbt: Analysts run dbt to execute the SQL queries and
create or update the tables in the data warehouse. This
process is often referred to as a dbt run.
3. Data Refresh: After the initial run, you may need to
refresh your data regularly to keep it up to date. This
involves re-running dbt on a schedule or as needed to
reflect changes in the source data.
4. Incremental Models: To optimize performance, dbt
allows you to write incremental models. These models only
transform and refresh the data that has changed since the
last run, rather than reprocessing the entire dataset. This is
particularly useful for large datasets where a full refresh
may be time-consuming.
5. Dependency Management: Dbt also handles
dependency management. If a model depends on another
model, dbt ensures that the dependencies are run first,
maintaining a proper order of execution.

By using dbt for data refresh, you can streamline and automate the
process of transforming raw data into a clean, structured format for
analysis. This approach promotes repeatability, maintainability, and
collaboration in the data transformation process.
1. What is a model in dbt (data build tool)?
A model is a select statement. Models are defined in .sql files (typically in
your models directory):
Each .sql file contains one model / select statement
The name of the file is used as the model name
Models can be nested in subdirectories within the models directory
When you execute the dbt run command, dbt will build this model in your
data warehouse by wrapping it in a create view as or create table as
statement.
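For illustration, a minimal model might look like this (the file name, source table, and columns are assumptions); dbt run would then build it as a view (the default materialization) named customers in your target schema:

-- models/customers.sql
select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers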
2. What are the configurations in a model?
Configurations are “model settings” that can be set in your dbt_project.yml
file, and in your model file using a config block. Some example
configurations include:

Change the materialization that a model uses – a materialization determines the SQL that dbt uses to create the model in your warehouse.
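A sketch of the two places a configuration can be set; the project, folder, and model names are assumptions:

-- models/marts/orders.sql
{{ config(materialized='table') }}
select ...

# dbt_project.yml
models:
  my_project:
    staging:
      materialized: view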
3. Can I store my models in a directory other than the models directory in my project?
By default, dbt expects your model files to be located in the models subdirectory of your project.
To change this, update the source-paths configuration in your dbt_project.yml file, like so:
dbt_project.yml
source-paths: ["transformations"]

4. Can I split my models across multiple schemas?

Yes. Use the schema configuration in your dbt_project.yml file, or use a config block:
dbt_project.yml
name: jaffle_shop

models:
  jaffle_shop:
    marketing:
      schema: marketing

5. Do model names need to be unique?

Yes! To build dependencies between models, you need to use the ref function. The ref function only takes one argument — the model name (i.e. the filename). As a result, these model names need to be unique, even if they are in distinct folders.
6. How do I remove deleted models from my data warehouse?
If you delete a model from your dbt project, dbt does not automatically drop
the relation from your schema. This means that you can end up with extra
objects in schemas that dbt creates, which can be confusing to other users.
7. If models can only be 'select' statements, how do I insert records?
If you wish to use insert statements for performance reasons (i.e. to reduce the data that is processed), consider incremental models.
If you wish to use insert statements since your source data is constantly changing (e.g. to create "Type 2 Slowly Changing Dimensions"), consider snapshotting your source data, and building models on top of your snapshots.
8. What are the four types of materializations built into dbt ?
table
view
incremental
ephemeral
9. What are incremental models in dbt?
Incremental models are built as tables in your data warehouse – the first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the table that has already been built (the target table). Incremental models allow dbt to insert or update records into a table since the last time that dbt was run. You can significantly reduce the build time by just transforming new records. Incremental models require extra configuration and are an advanced usage of dbt.

10. What are ephemeral models in dbt?

Ephemeral models are not directly built into the database. Instead, dbt will interpolate the code from this model into dependent models as a common table expression (CTE). You can still write reusable logic. Ephemeral models can help keep your data warehouse clean by reducing clutter (also consider splitting your models across multiple schemas by using custom schemas). You cannot select directly from this model, and overuse of the ephemeral materialization can make queries harder to debug.
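A minimal sketch of an ephemeral model (the model name, source table, and columns are assumptions):

-- models/staging/stg_payments.sql
{{ config(materialized='ephemeral') }}
select
    id as payment_id,
    amount / 100.0 as amount_usd
from raw_payments

Any model that selects from {{ ref('stg_payments') }} will have this query injected as a common table expression instead of reading from a physical relation.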
11. How do I use the incremental materialization?
Incremental models are defined with select statements, with the materialization defined in a config block.
{{
config(
materialized='incremental'
)
}}
select ...
To use incremental models, you also need to tell dbt how to filter the rows on an incremental run, and the uniqueness constraint of the model (if any).
12. How do I filter rows on an incremental run?
To tell dbt which rows it should transform on an incremental run, wrap valid
SQL that filters for these rows in the is_incremental() macro. Often, you’ll
want to filter for “new” rows, as in, rows that have been created since the last
time dbt ran this model. The best way to find the timestamp of the most
recent run of this model is by checking the most recent timestamp in your
target table. dbt makes it easy to query your target table by using the
“{{ this }}” variable.
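For example (table and column names are assumptions):

select * from raw_app_data.events

{% if is_incremental() %}
  -- this filter is only applied on an incremental run
  where event_time > (select max(event_time) from {{ this }})
{% endif %}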
13. How do I rebuild an incremental model?
If your incremental model logic has changed, the transformations on your
new rows of data may diverge from the historical transformations, which are
stored in your target table. In this case, you should rebuild your incremental
model. To force dbt to rebuild the entire incremental model from scratch, use
the --full-refresh flag on the command line. This flag will cause dbt to drop
the existing target table in the database before rebuilding it for all-time.
$ dbt run --full-refresh --models my_incremental_model+
14. What is the is_incremental() macro ?
The is_incremental() macro will return True if:
the destination table already exists in the database
dbt is not running in full-refresh mode
the running model is configured with materialized='incremental'
15. What if the columns of my incremental model change?
If you add a column to your incremental model, and execute a dbt run, this column will not appear in your target table. Similarly, if you remove a column from your incremental model, and execute a dbt run, this column will not be removed from your target table. Instead, whenever the logic of your incremental model changes, execute a full-refresh run of both your incremental model and any downstream models.
16. What is an incremental_strategy?
incremental_strategy config controls the code that dbt uses to build
incremental models. Different approaches may vary by effectiveness
depending on the volume of data, the reliability of your unique_key, or
the availability of certain features.
Snowflake: merge (default), delete+insert (optional)
BigQuery: merge (default), insert_overwrite (optional)
Spark: insert_overwrite (default), merge (optional, Delta-only)
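For example, on Snowflake you might opt into delete+insert (the unique key is an assumption):

{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='delete+insert'
) }}
select ...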
17. What is aliases in dbt ?
When dbt runs a model, it will generally create a relation (either a table or a
view) in the database. By default, dbt uses the filename of the model as the
identifier for this relation in the database. This identifier can optionally be
overridden using the alias model configuration.
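A minimal sketch (names are assumptions):

-- models/sales_by_day.sql
{{ config(alias='daily_sales') }}
select ...

-- dbt builds this relation as <schema>.daily_sales instead of <schema>.sales_by_day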
18. What is a custom schema in dbt ?
By default, all dbt models are built in the schema specified in your target. In
dbt projects with lots of models, it may be useful to instead build some
models in schemas other than your target schema – this can help logically
group models together. You can use custom schemas in dbt to build models
in a schema other than your target schema. It’s important to note that by
default, dbt will generate the schema name for a model by concatenating the custom schema to the target schema, as in: <target_schema>_<custom_schema>.
19. How do I use custom schemas?
Use the schema configuration key to specify a custom schema for a model.
As with any configuration, you can either:
apply this configuration to a specific model by using a config block within a
model, or
apply it to a subdirectory of models by specifying it in your dbt_project.yml
file
{{ config(schema='marketing') }}
select ...
20. Which vars are available in generate_schema_name?
Globally-scoped variables and variables defined on the command line with --vars are accessible in the generate_schema_name context.
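For reference, this is roughly what the built-in generate_schema_name macro looks like; you can override it in your project's macros directory to change how the <target_schema>_<custom_schema> name is produced:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}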
21. What tests are available to use in dbt?
Out of the box, dbt ships with the following tests:
unique
not_null
accepted_values
relationships (i.e. referential integrity)
22. How do I build one seed at a time?
As of v0.16.0, you can use a --select option with the dbt seed command, like so:
$ dbt seed --select country_codes
There is also an --exclude option.

23. How can I see the SQL that dbt is running?


To check out the SQL that dbt is running, you can look in:
dbt Cloud:
Within the run output, click on a model name, and then select “Details”
dbt CLI:
The target/compiled/ directory for compiled select statements
The target/run/ directory for compiled create statements
The logs/dbt.log file for verbose logging.
24. What is the difference between dbt Core, the dbt CLI and dbt
Cloud?
dbt Core is the software that takes a dbt project (.sql and .yml files) and a
command and then creates tables/views in your warehouse. dbt Core includes
a command line interface (CLI) so that users can execute dbt commands
using a terminal program. dbt Core is open source and free to use.
dbt Cloud is an application that helps teams use dbt. dbt Cloud provides a
web-based IDE to develop dbt projects, a purpose-built scheduler, and a way
to share dbt documentation with your team. dbt Cloud offers a number of
features for free, as well as additional features in paid tiers.
25. Can I store my seeds in a directory other than the data directory in my project?
By default, dbt expects your seed files to be located in the data subdirectory of your project.
To change this, update the data-paths configuration in your dbt_project.yml file, like so:
dbt_project.yml
data-paths: ["seeds"]
26. Can I store my models in a directory other than the models directory in my project?
By default, dbt expects your model files to be located in the models subdirectory of your project.
To change this, update the source-paths configuration in your dbt_project.yml file, like so:
dbt_project.yml
source-paths: ["transformations"]
27. Can I connect my dbt project to two databases?
It depends on the warehouse used in your tech stack.
dbt projects connecting to warehouses like Snowflake or Bigquery—these
empower one set of credentials to draw from all datasets or ‘projects’
available to an account—are sometimes said to connect to more than one
database.
dbt projects connecting to warehouses like Redshift and Postgres—these tie
one set of credentials to one database—are said to connect to one database
only.

28. Do I need to create my target schema before running dbt?


Nope. dbt will check if the schema exists when it runs. If the schema does not
exist, dbt will create it for you.
29. How do I create dependencies between models?
When you use the ref function, dbt automatically infers the dependencies
between models.
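A minimal sketch (model names are assumptions):

-- models/customer_orders.sql
select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id

-- Because of the ref() call, dbt knows stg_orders must be built before customer_orders.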
30. How do I define a column type?
Your warehouse's SQL engine automatically assigns a datatype to every column, whether it's found in a source or model. To force SQL to treat a column as a certain datatype, use cast functions:
select
cast(order_id as integer),
cast(order_price as decimal(6,2)) -- a more generic way of doing type conversion
from {{ ref('stg_orders') }}
31. Do I need to add a yaml entry for column for it to appear in the docs
site?
No. dbt will introspect your warehouse to generate a list of columns in each relation, and match it with the list of columns in your .yml files.
32. Can I document things other than models, like sources, seeds, and
snapshots?
Yes! You can document almost everything in your project using the
description.
33. How to debug if any of the tests failed?
To debug a failing test, find the SQL that dbt ran by:
dbt Cloud:
Within the test output, click on the failed test, and then select “Details”
dbt CLI:
Open the file path returned as part of the error message.
Navigate to the target/compiled/schema_tests directory for all compiled test
queries
Copy the SQL into a query editor (in dbt Cloud, you can paste it into a new
Statement), and run the query to find the records that failed.
34. If the compiled SQL has a lot of spaces and new lines, how can I get
rid of it?
This is known as “whitespace control”.
Use a minus sign (-, e.g. {{- … -}}, {%- … %}, {#- … -#}) at the start or
end of a block to strip whitespace before or after the block
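A small, illustrative sketch:

{# without whitespace control, this block leaves blank lines in the compiled SQL #}
{% if true %}
select 1 as example_column
{% endif %}

{#- with whitespace control, the surrounding whitespace is stripped -#}
{%- if true -%}
select 1 as example_column
{%- endif -%}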
35. How do I preserve leading zeros in a seed?
If you need to preserve leading zeros (for example in a zipcode or mobile
number):
v0.16.0 onwards: Include leading zeros in your seed file, and use the
column_types configuration with a varchar datatype of the correct length.
Prior to v0.16.0: Use a downstream model to pad the leading zeros using SQL, for example: lpad(zipcode, 5, '0')
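A sketch of the column_types approach; the project, seed, and column names are assumptions (newer dbt versions write the key as +column_types):

# dbt_project.yml
seeds:
  my_project:
    zipcodes:
      column_types:
        zipcode: varchar(5)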

36. How do I run models downstream of a seed?

You can run models downstream of a seed using the model selection syntax, and treating the seed like a model.
$ dbt run --models country_codes+
37. How do I run one model at a time?
To run one model, use the --models flag (or -m flag), followed by the name of the model:
$ dbt run --models my_model
38. How do I run models downstream of one source?
To run models downstream of a source, use the source: selector:
$ dbt run --models source:jaffle_shop+
39. What happens if I add new columns to my snapshot query?
When the columns of your source query changes, dbt will attempt to
reconcile this change in the destination snapshot table. dbt does this by:
Creating new columns from the source query in the destination table
Expanding the size of string types where necessary (eg. varchars on Redshift)
dbt will not delete columns in the destination snapshot table if they are
removed from the source query. It will also not change the type of a column
beyond expanding the size of varchar columns. That is, if a string column is
changed to a date column in the snapshot source query, dbt will not attempt
to change the type of the column in the destination table.
40. How do I specify column types?
Simply cast the column to the correct type in your model:
select
id,
created::timestamp as created
from some_other_table
41. Do model names need to be unique?
Yes. To build dependencies between models, you need to use the ref
function. The ref function only takes one argument – the model name (i.e. the
filename). As a result, these model names need to be unique, even if they are
in distinct folders.


dbt (Data Build Tool) Overview: What is dbt and What Can It Do for My Data Pipeline?

There are many tools on the market to help your organization transform data and make it accessible for business users. One that we recommend and use often—dbt (data build tool)—focuses solely on making the process of transforming data simpler and faster. In this blog we will discuss what dbt is, how it can transform the way your organization curates its data for decision making, and how you can get started with using dbt (data build tool).
Data plays an instrumental role in decision making for
organizations. As the volume of data increases, so does the need to
make it accessible to everyone within your organization to use.
However, because there is a shortage of data engineers in the
marketplace, for most organizations there isn’t enough time or
resources available to curate data and make data analytics ready.

Disjointed sources, data quality issues, and inconsistent definitions for metrics and business attributes lead to confusion, redundant
efforts, and poor information being distributed for decision making.
Transforming your data allows you to integrate, clean, de-duplicate,
restructure, filter, aggregate, and join your data—enabling your
organization to develop valuable, trustworthy insights through
analytics and reporting. There are many tools on the market to help
you do this, but one in particular—dbt (data build tool)—simplifies
and speeds up the process of transforming data and building data
pipelines.

In this blog, we cover:

 What is dbt?

 How is dbt Different Than Other Tools?

 What Can dbt Do for My Data Pipeline?

 How Can I Get Started with dbt?

 Training To Learn How to Use dbt

What is dbt (data build tool)?

According to dbt, the tool is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.

dbt (data build tool) makes data engineering activities accessible to people with data analyst skills to transform the data in the warehouse using simple select statements, effectively creating your
warehouse using simple select statements, effectively creating your
entire transformation process with code. You can write custom
business logic using SQL, automate data quality testing, deploy the
code, and deliver trusted data with data documentation side-by-side
with the code. This is more important today than ever due to the
shortage of data engineering professionals in the marketplace.
Anyone who knows SQL can now build production-grade data
pipelines, reducing the barrier to entry that previously limited
staffing capabilities for legacy technologies.

In short, dbt (data build tool) turns your data analysts into engineers
and allows them to own the entire analytics engineering workflow.

How is dbt (Data Build Tool) Different Than Other Tools?
With dbt, anyone who knows how to write SQL SELECT statements
has the power to build models, write tests, and schedule jobs to
produce reliable, actionable datasets for analytics. The tool acts as
an orchestration layer on top of your data warehouse to improve
and accelerate your data transformation and integration process.
dbt works by pushing down your code—doing all the calculations at
the database level—making the entire transformation process
faster, more secure, and easier to maintain.

dbt (data build tool) is easy to use for anyone who knows SQL—you
don’t need to have a high-powered data engineering skillset to build
data pipelines anymore.

Hear why dbt is the iFit engineering team’s favorite tool and how it
helped them drive triple-digit growth for the company:

dbt's ELT methodology brings increased agility and speed to iFit's data pipeline. What would have taken months with traditional ETL tools, now takes weeks or days.

What Can dbt (Data Build Tool) Do for My Data Pipeline?
dbt (data build tool) has two core workflows: building data models
and testing data models. It fits nicely into the modern data stack
and is cloud agnostic—meaning it works within each of the major
cloud ecosystems: Azure, GCP, and AWS.

With dbt, data analysts take ownership of the entire analytics engineering workflow from writing data transformation code all the
way through to deployment and documentation—as well as to
becoming better able to promote a data-driven culture within the
organization. They can:

1. Quickly and easily provide clean, transformed data ready for analysis:
dbt enables data analysts to custom-write transformations through
SQL SELECT statements. There is no need to write boilerplate code.
This makes data transformation accessible for analysts that don’t
have extensive experience in other programming languages.

The dbt Cloud UI offers an attractive interface for individuals of all ranges of
experience to comfortably develop in.

2. Apply software engineering practices—such as modular code, version control, testing, and continuous integration/continuous deployment (CI/CD)—to analytics code:
Continuous integration means less time testing and quicker time to
development, especially with dbt Cloud. You don’t need to push an
entire repository when there are necessary changes to deploy, but
rather just the components that change. You can test all the
changes that have been made before deploying your code into
production. dbt Cloud also has integration with GitHub for
automation of your continuous integration pipelines, so you won’t
need to manage your own orchestration, which simplifies the
process.

While configuring a continuous integration job in the dbt Cloud UI, you can take
advantage of dbt’s sleek slim UI feature and even use webhooks to run jobs
automatically when a pull request is open.

3. Build reusable and modular code using Jinja.


dbt (data build tool) allows you to establish macros and integrate
other functions outside of SQL’s capabilities for advanced use
cases. Macros in Jinja are pieces of code that can be used multiple
times. Instead of starting at the raw data with every analysis,
analysts instead build up reusable data models that can be
referenced in subsequent work.
Instead of repeating code to create a hashed surrogate key, create a dynamic macro
with Jinja and SQL to consolidate the logic in one spot using dbt.
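As an illustration only (not the exact macro from the example above), a hashed surrogate key macro might look roughly like this; the hashing function and the column handling are assumptions and vary by warehouse:

{% macro surrogate_key(field_list) -%}
    md5(concat(
        {%- for field in field_list %}
        coalesce(cast({{ field }} as varchar), '')
        {%- if not loop.last %}, '-',{% endif -%}
        {%- endfor %}
    ))
{%- endmacro %}

It could then be invoked in a model as {{ surrogate_key(['order_id', 'line_item_number']) }}.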

4. Maintain data documentation and definitions within dbt as they build and develop lineage graphs:
Data documentation is accessible, easily updated, and allows you to
deliver trusted data across the organization. dbt (data build tool)
automatically generates documentation around descriptions,
models dependencies, model SQL, sources, and tests. dbt creates
lineage graphs of the data pipeline, providing transparency and
visibility into what the data is describing, how it was produced, as
well as how it maps to business logic.

Lineage is automatically generated for all your models in dbt. This has saved teams
numerous hours in manual documentation time.

5. Perform simplified data refreshes within dbt Cloud:


There is no need to host an orchestration tool when using dbt Cloud.
It includes a feature that provides full autonomy with scheduling
production refreshes at whatever cadence the business wants.

Scheduling is simplified in the dbt Cloud UI. Just give it directions on what time you
want a production job to run, and it will take it from there.

6. Perform automated testing:


dbt (data build tool) comes prebuilt with unique, not null, referential
integrity, and accepted value testing. Additionally, you can write
your own custom tests using a combination of Jinja and SQL. To
apply any test on a given column, you simply reference it under the
same YAML file used for documentation for a given table or schema.
This makes testing data integrity an almost effortless process.

Simple example of applying tests on the primary key for a table in a project.
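The screenshot itself is not reproduced here, but such a YAML file might look roughly like this (the model and column names are assumptions):

# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        description: "Primary key for the orders model"
        tests:
          - unique
          - not_null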



How Can I Get Started with dbt (Data Build Tool)?

Prerequisites to Getting Started with dbt (Data Build Tool)
Before learning dbt (data build tool), there are three pre-requisites
that we recommend:

1. SQL: Since dbt uses SQL as its core language to perform transformations, you must be proficient in using SQL SELECT
statements. There are plenty of courses online available if you don’t
have this experience, so make sure to find one that gives you the
necessary foundation to begin learning dbt.
2. Modeling: Like any other data transformation tool, you should have
some strategy when it comes to data modeling. This will be critical
for re-usability of code, drilling down, and performance optimization.
Don’t just adopt the model of your data sources, we recommend
transforming data into the language and structure of the business.
Modeling will be essential to structure your project and find lasting
success.
3. Git: If you are interested in learning how to use dbt Core, you will
need to be proficient in Git. We recommend finding any course that
covers the Git Workflow, Git Branching, and using Git in a team
setting. There are lots of great options available online, so explore
and find one that you like.

Training To Learn How to Use dbt (Data Build Tool)
There are many ways you can dive in and learn how to use dbt (data
build tool). Here are three tips on the best places to start:

1. The dbt Labs Free dbt Fundamentals Course: This course is a great starting point for any individual interested in learning the basics of using dbt (data build tool). This covers many critical concepts like setting up dbt, creating models and tests, generating documentation, deploying your project, and much more.
2. The “Getting Started Tutorial” from dbt Labs: Although there is
some overlap with concepts from the fundamentals course above,
the “getting started tutorial” is a comprehensive hands-on way to
learn as you go. There are video series offered for both using dbt
Core and dbt Cloud. If you really want to dive in, you can find a
sample dataset from online to model out as you go through the
videos. This is a great way to learn how to use dbt (data build tool)
in a way that will directly reflect how you would build out a project
for your organization.
3. Join the dbt Slack Community: This is an active community of
thousands of members that range from beginner to advanced. There
are channels like #learn-on-demand and #advice-dbt-for-beginners
that will be very helpful for a beginner to ask questions as they go
through the above resources.

dbt (data build tool) simplifies and speeds up the process of transforming data and building data pipelines. Now is the time to
dive in and learn how to use it to help your organization curate its
data for better decision making.

What is a Data Model?

A data model organizes different data elements and standardizes how they relate to one
another and real-world entity properties. So logically then, data modeling is the process of
creating those data models.

Data models are composed of entities, and entities are the objects and concepts whose data
we want to track. They, in turn, become tables found in a database. Customers, products,
manufacturers, and sellers are potential entities.

Each entity has attributes—details that the users want to track. For instance, a customer’s
name is an attribute.

With that out of the way, let’s check out those data modeling interview questions!
Basic Data Modeling Interview Questions

1. What Are the Three Types of Data Models?

The three types of data models:

 Physical data model - This is where the framework or schema describes how data is physically stored in the database.

 Conceptual data model - This model focuses on the high-level, user's view of the data in question.

 Logical data models - They straddle between physical and conceptual data models, allowing the logical representation of data to exist apart from the physical storage.

2. What is a Table?

A table consists of data stored in rows and columns. Columns, also known as fields, show data in vertical alignment. Rows, also called records or tuples, represent data's horizontal alignment.

3. What is Normalization?

Database normalization is the process of designing the database in such a way that it reduces
data redundancy without sacrificing integrity.

4. What Does a Data Modeler Use Normalization For?

The purposes of normalization are:

 Remove useless or redundant data

 Reduce data complexity

 Ensure relationships between the tables in addition to the data residing in the tables

 Ensure data dependencies and that the data is stored logically.


5. So, What is Denormalization, and What is its Purpose?

Denormalization is a technique where redundant data is added to an already normalized database. The procedure enhances read performance by sacrificing write performance.

6. What Does ERD Stand for, and What is it?

ERD stands for Entity Relationship Diagram and is a logical entity representation, defining
the relationships between the entities. Entities reside in boxes, and arrows symbolize
relationships.

7. What’s the Definition of a Surrogate Key?

A surrogate key is a system-generated (typically numerical) key that can act as the primary key in place of natural keys. Instead of using natural or composite primary keys, data modelers create the surrogate key, which is a valuable tool for identifying records, building SQL queries, and enhancing performance.

8. What Are the Critical Relationship Types Found in a Data Model? Describe
Them.

The main relationship types are:

 Identifying. A relationship line normally connects parent and child tables. But if a child table's reference column is part of the table's primary key, the tables are connected by a thick line, signifying an identifying relationship.

 Non-identifying. If a child table's reference column is NOT a part of the table's primary key, the tables are connected by a dotted line, signifying a non-identifying relationship.

 Self-recursive. A recursive relationship is a standalone column in a table connected to the primary key in the same table.

9. What is an Enterprise Data Model?

This is a data model that consists of all the entries required by an enterprise.

Intermediate Data Modeling Interview Questions

10. What Are the Most Common Errors You Can Potentially Face in Data
Modeling?

These are the errors most likely encountered during data modeling.
 Building overly broad data models: If a data model contains more than about 200 tables, it becomes increasingly complex, increasing the likelihood of failure

 Unnecessary surrogate keys: Surrogate keys must only be used when the natural key
cannot fulfill the role of a primary key

 The purpose is missing: Situations may arise where the user has no clue about the
business’s mission or goal. It’s difficult, if not impossible, to create a specific business
model if the data modeler doesn’t have a workable understanding of the company’s
business model

 Inappropriate denormalization: Users shouldn’t use this tactic unless there is an excellent
reason to do so. Denormalization improves read performance, but it creates redundant
data, which is a challenge to maintain.

11. Explain the Two Different Design Schemas.

The two design schemas are the Star schema and the Snowflake schema. The Star schema has a fact table at the center with multiple dimension tables surrounding it. A Snowflake schema is similar, except that the level of normalization is higher, which results in the schema looking like a snowflake.


12. What is a Slowly Changing Dimension?

These are dimensions used to manage both historical data and current data in data
warehousing. There are four different types of slowly changing dimensions: SCD Type 0
through SCD Type 3.

13. What is Data Mart?

A data mart is the most straightforward set of data warehousing and is used to focus on one
functional area of any given business. Data marts are a subset of data warehouses oriented to
a specific line of business or functional area of an organization (e.g., marketing, finance,
sales). Data enters data marts by an assortment of transactional systems, other data
warehouses, or even external sources.

14. What is Granularity?

Granularity represents the level of detail stored in a table and is described as high or low. High-granularity data contains detailed, transaction-level information; low-granularity data contains only summarized or aggregated information.

15. What is Data Sparsity, and How Does it Impact Aggregation?

Data sparsity defines how much data we have for a model’s specified dimension or entity. If
there is insufficient information stored in the dimensions, then more space is needed to store
these aggregations, resulting in an oversized, cumbersome database.

16. What Are Subtype and Supertype Entities?

Entities can be broken down into several sub-entities or grouped by specific features. Each
sub-entity has relevant attributes and is called a subtype entity. Attributes common to every
entity are placed in a higher or super level entity, which is why they are called supertype
entities.

17. In the Context of Data Modeling, What is the Importance of Metadata?

Metadata is defined as “data about data.” In the context of data modeling, it’s the data that
covers what types of data are in the system, what it’s used for, and who uses it.


Advanced Data Modeling Interview Questions

18. Should All Databases Be Rendered in 3NF?

No, it's not an absolute requirement. Normalizing to 3NF reduces redundancy, but denormalized databases can be easier to query and offer better read performance, so the appropriate level of normalization depends on the workload.

19. What's the Difference Between Forward and Reverse Engineering, in the Context of Data Models?

Forward engineering is a process where Data Definition Language (DDL) scripts are
generated from the data model itself. DDL scripts can be used to create databases. Reverse
Engineering creates data models from a database or scripts. Some data modeling tools have
options that connect with the database, allowing the user to engineer a database into a data
model.

20. What Are Recursive Relationships, and How Do You Rectify Them?

Recursive relationships happen when a relationship exists between an entity and itself. For
instance, a doctor could be in a health center’s database as a care provider, but if the doctor is
sick and goes in as a patient, this results in a recursive relationship. You would need to add a
foreign key to the health center’s number in each patient’s record.

21. What's a Conformed Dimension?

If a dimension is conformed, it's attached to at least two fact tables.


22. Why Are NoSQL Databases More Useful than Relational Databases?

NoSQL databases have the following advantages:

 They can store structured, semi-structured, or unstructured data

 They have a dynamic schema, which means they can evolve and change as quickly as
needed

 NoSQL databases have sharding, the process of splitting up and distributing data to
smaller databases for faster access

 They offer failover and better recovery options thanks to the replication

 It’s easily scalable, growing or shrinking as necessary

23. What’s a Junk Dimension?

This is a grouping of low-cardinality attributes like indicators and flags, removed from other
tables, and subsequently “junked” into an abstract dimension table. They are often used to
initiate Rapidly Changing Dimensions within data warehouses.

24. If a Unique Constraint Gets Applied to a Column, Will It Generate an Error If You Attempt to Place Two Nulls in It?

No, it won't, because null values are never considered equal to one another. You can put in numerous null values in a column and not generate an error.


Do You Want Data Modeling Training?

I hope these data modeling interview questions have given you an idea of the kind of questions that can be asked in an interview. So, if you're intrigued by what you've read about data modeling and want to know how to become a data modeler, then you will want to check the article that shows you how to become one.

But if you’re ready to accelerate your career in data science, then sign up for
Simplilearn’s Data Scientist Course. You will gain hands-on exposure to key technologies,
including R, SAS, Python, Tableau, Hadoop, and Spark. Experience world-class training by
an industry leader on the most in-demand Data Science and Machine learning skills.

The program boasts a half dozen courses, over 30 in-demand skills and tools, and more than
15 real-life projects. So check out Simplilearn’s resources and get that new data modeling
career off to a great start!
