Learning DBT
Learning dbt
Resources
What are dbt sources?
Raw data pipeline
Additional sources
Raw data discovery
Discovery resources
What are dbt models?
Materializations
What is dbt documentation?
Additional dbt features
Aliases
Analyses
Identifiers
Jinja and macros
Example macro: auto_stage_columns
Packages
Seeds
Snapshots
Pre-snapshot
Post-snapshot
Tests
Generic tests
Singular tests
Constraints
dbt Tips & Tricks
Compile your code
Execute a full refresh
Development schema
Reset your dev schema
Snowflake: sql script against information_schema
How to run the entire dbt project
Handling whitespace
Run upstream and downstream dbt models
Learning dbt
Resources
Command | Notes
dbt compile | the entire project must compile before you can run dbt run or dbt docs generate
CTRL + C | if you execute dbt run accidentally or on an unexpected model selection, use CTRL + C to abort the run
dbt test --select test_type:generic | runs all generic tests in the dbt project
dbt test --select one_specific_model | runs all tests on a single model in the dbt project
What are dbt sources?
Raw data pipeline
Data is currently loaded into the RAW database. As the name implies, RAW contains production data that’s loaded directly from the source. Little to no transformation is applied, with the exception of JSON parsing.
Additional sources
dbt provides a few additional “features”, or objects, that can be used as a source: seeds and snapshots. It’s important to note that these objects can be configured as sources. Additional details about these objects are provided later in the documentation, under the Seeds and Snapshots sections.
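For reference, a raw table is declared as a source in a YML file and then referenced in models with the source macro. A minimal sketch, using illustrative source, schema, and table names:

version: 2

sources:
  - name: raw_app          # illustrative source name
    database: RAW          # raw production database described above
    schema: PUBLIC         # illustrative schema
    tables:
      - name: api_orders   # raw table loaded directly from the source system

In a model, the table is then referenced as:

select * from {{ source('raw_app', 'api_orders') }}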
Discovery resources
● Use external documentation to understand the source architecture
● Access the application and explore custom reports, field labels, or formulas
● Review dbt docs to find and review existing business logic
● Use tpl_discovery.sql to analyze the source data
○ If field values are all null, exclude the field
○ If field values provide no analytical value, exclude the field
○ Document questions or concerns
■ Schedule a team discussion on useful fields
■ Does the table contain any personally identifiable information (PII)?
■ Does the table contain any sensitive information (workday, financials)?
What are dbt models?
Materializations
“The exact Data Definition Language (DDL) that dbt will use when creating the model’s equivalent in Snowflake. It's the manner in which the data is represented, and each of those options is defined either canonically (tables, views, incremental) or bespoke.” Execute dbt compile to view the SQL statements located within the target folder. Make sure to complete the Advanced Materializations training course prior to developing models and configuring materializations.
Tables
● Built as tables in the database
● Data is stored on disk
● Slower to build
● Faster to query
Views
● Default model configuration
● Built as views in the database
● Query is stored on disk
● Faster to build
● Slower to query
Ephemeral
● Does not exist in the database
● Imported as CTE into downstream models
● Increases build time of downstream models
● Cannot query directly
● Ephemeral Documentation
Incremental
● Built as table in the database
● On the first run, builds entire table
● On subsequent runs, only appends new records*
● Faster to build because you are only adding new records
● Does not capture 100% of the data all the time
○ IMPORTANT: incremental models do not support history tracking. Incremental tables can be periodically rebuilt from the source by executing a full refresh. Learn more about how to execute a full refresh in the tips and tricks section.
● Incremental Documentation
● Discourse post on Incrementality
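Materializations are set per model with a config block (or for a whole folder of models in dbt_project.yml). A minimal sketch of an incremental model, using illustrative names (stg_events, loaded_at):

-- swapping materialized to 'table', 'view', or 'ephemeral' is the only change needed for the other options
{{ config(materialized='incremental', unique_key='event_id') }}

select *
from {{ source('raw_app', 'events') }}

{% if is_incremental() %}
-- on incremental runs, only pull records newer than what is already in this table
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}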
What is dbt documentation?
Project documentation
● Documentation of models occurs in the YML files inside the models directory. Store the YML file in the
same subfolder as the models you are documenting.
● Doc blocks can render markdown in the generated documentation. Store the markdown files in the
documentation folder and reference the doc block from the YML file.
● Descriptions for models can happen on the model, source, or column level
● From the command line, an updated version of the documentation can be generated with the command dbt docs generate.
● We have configured persist_docs on our entire dbt project. This takes the documentation defined
within dbt and writes the metadata, like column descriptions and business logic, to Snowflake. More
information at this dbt resource.
● If a column is documented in the YML file but does not exist in the model, that column will still appear on the dbt docs site, so documentation for columns that do not exist needs to be removed. If a column on the docs site has no value in the Data Type column, this indicates the column does not exist in the model and its entry should be removed from the YML file.
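A minimal sketch of the YML-plus-doc-block pattern described above, with illustrative model, column, and doc block names:

models/staging/staging.yml:

version: 2

models:
  - name: stg_orders
    description: '{{ doc("stg_orders") }}'   # doc block rendered from a markdown file
    columns:
      - name: order_id
        description: Primary key for the orders staging model.

documentation/stg_orders.md:

{% docs stg_orders %}
One row per order placed in the application, cleaned and renamed from the raw source.
{% enddocs %}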
Technical documentation
● Doc block references are stored in the staging.yml or intermediate.yml file
● Information that can assist in development (i.e. ID fields, keys, etc.)
● Document keys for both staging and marts layers in a separate markdown file (i.e.
decisioning_keys.md) which helps promote reusability throughout the dbt project. Each id field will
have a primary key (suffixed by pk) and foreign key (suffixed by fk) doc block.
● Most marts layers/fields should have documentation. Business users will have access to this in a
hosted dbt documentation site generated through dbt docs generate.
● The activation layer may reuse fields from the marts layer, but additional information should be added
to describe aggregations or new business logic.
● If necessary, use “NEEDS CONFIRMATION” as a placeholder for documentation on fields pulled directly
from a source system that need the business or product team’s input to finalize a definition. This
helps us avoid a bottleneck waiting to hear back about documentation.
Lineage
Often referred to as the “DAG” (directed acyclic graph), the lineage graph is a powerful tool within dbt for visually representing the relationships between data models.
Helpful Tips
● From a YML file you can navigate directly to a doc block’s definition by right clicking on the doc block
name reference and selecting “Go to Definition”.
● This query can be run in Snowflake to generate the documentation structure for a table according to
our standard naming conventions. NAME_IDENTIFIER, DESCRIPTION_IDENTIFIER, and
DOC_BLOCK_IDENTIFIER return syntax for the YML and markdown files.
Additional dbt features
Aliases
The alias config allows you to set an alias on a given model in order to change the identifier/name associated with it. For example, you may have data coming from api_orders as a source and a staging model named api_orders.sql. If this is generally referred to as “orders”, you could use the alias config so the object is identified as “orders” instead of “api_orders”.
When referencing the model in other models, note that you will still need to call it by its model name and not its alias. Using the above example: {{ ref('api_orders') }}
Note: it is possible for multiple models to be aliased to the same value (e.g. braze_user_events = user_events and user_traffic_events = user_events). dbt will throw an error to alert you that the alias conflicts with another model set to the same identifier.
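A minimal sketch of the api_orders example (the source name is illustrative):

models/staging/api_orders.sql:

{{ config(alias='orders') }}

select * from {{ source('raw_app', 'api_orders') }}

A downstream model still references the file name, not the alias:

select * from {{ ref('api_orders') }}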
Analyses
Learn more about analyses within the “Analyses” section of the dbt developer documentation.
● Analyses are .sql files that live in the analyses folder.
● Analyses will not be run with dbt run like models. However, you can still compile these from Jinja-SQL
to pure SQL using dbt compile. These will compile to the target folder.
● Analyses are useful for training queries, one-off queries, and audits
● Use analyses for storing queries that you may use often to do a quick check
Identifiers
Identifiers are parameters in the source YAML used when the source table name referenced in dbt differs from the table name in the database. For example, data is loaded into an api_orders table while the official orders table is being worked on. In your stage model, you can point to orders as the source table name, and the code will compile using the identifier api_orders.
This allows you to build your complete model while referencing the table that will eventually hold your data, but point at the temp/test/alternate table that holds your initial dataset. If you don't use the identifier, then later on when you switch the source name, you will have to make changes to every file where the source name was previously used. Documentation: Identifier-dbt doc.
Example:
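A minimal sketch, using illustrative source, schema, and table names:

version: 2

sources:
  - name: raw_app
    database: RAW
    schema: PUBLIC
    tables:
      - name: orders             # the name referenced in models
        identifier: api_orders   # the physical table currently holding the data

In the stage model, select * from {{ source('raw_app', 'orders') }} compiles against RAW.PUBLIC.API_ORDERS; when the official orders table exists, only the identifier line needs to change.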
Jinja and macros
“Macros in Jinja are pieces of code that can be reused multiple times – they are analogous to "functions" in other programming languages, and are extremely useful if you find yourself repeating code across multiple models.” Learn more about Jinja and macros within the “Jinja and macros” section of the dbt developer documentation.
There are many useful macros built into our dbt project, and the list will continue to grow as we identify areas for automation and repeatable functions. Macros are documented in dbt docs with examples of how to run and execute them within the project.
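A macro definition and its use look like this; a minimal sketch using the classic cents_to_dollars example rather than one of the project's own macros:

macros/cents_to_dollars.sql:

{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}

Used in a model (illustrative column and source names):

select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as order_amount
from {{ source('raw_app', 'api_orders') }}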
Example macro: auto_stage_columns
1. except=[]
        i. user_id in the application model should stay as user_id, not application_user_id
    c. Do not prefix booleans (working on adding this to the macro)
    d. ex. except=["id"]
2. prefix=""
    a. sets a prefix on all data fields
    b. *** If not using the prefix option, either pass empty double quotes or remove it altogether
    c. ex. prefix="application" or ""
        i. Result: application_id, application_status, application_account_id
3. rename={}
    a. renames a data field
    b. ex. rename={"email": "user_email"}
4. lower={}
    a. lowercases all data in the field(s)
    b. ex. lower={"application_status"}
    c. great to use on short text fields, ex. "status_c"
5. replace_nulls=[]
    a. replaces nulls with 0 in numeric fields
    b. ex. replace_nulls=["amount"]
Packages
Packages are standalone dbt projects, with models and macros that tackle a specific problem area. The dbt
package hub consolidates a list of available packages built by users across the dbt community. These
packages are open source and not maintained by our team. Packages should be tested in a development
environment, properly documented, and reviewed by the team prior to getting merged into production.
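Packages are declared in packages.yml at the root of the project and installed with dbt deps. A minimal sketch (pin whatever version the team has reviewed):

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1   # illustrative pinned version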
Seeds
● Seeds are .csv files that live in the seeds folder.
● When executing dbt seed, seeds will be built in your Data Warehouse as tables. Seeds can be
referenced using the ref macro - just like models!
● ✅ Use seeds for data that does not change frequently.
● Do NOT use seeds for uploading data that changes frequently.
● Seeds are useful for loading holiday dates, country codes, employee emails, or employee account IDs
○ Note: If you have a rapidly growing or large company, an orchestrated loading solution may
better address this.
● Seeds are meant to give users who own, or need access to, the data values somewhere to make the updates themselves. Used this way, seeds provide some protection against inadvertent changes being made by the business to critical mapping files.
○ Any data owned by business users that needs to be altered/appended frequently should live in a gsheet that the DE team extracts from directly.
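A minimal sketch, assuming an illustrative seeds/country_codes.csv:

country_code,country_name
US,United States
CA,Canada

After dbt seed builds the table, reference it like any other model:

select * from {{ ref('country_codes') }}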
Snapshots
“Analysts often need to "look back in time" at previous data states in their mutable tables. While some source
data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt
provides a mechanism, snapshots, which records changes to a table over time.” Learn more about snapshots
within the “Snapshots” section of the dbt developer documentation. Make sure to complete the Advanced
Materializations training course prior to configuring snapshots.
● Snapshots are built as a table in the database, usually in a dedicated schema to minimize the chances
of the table being dropped:
○ RAW_TEST.DB.SNAPSHOTS
○ PRODUCTION.PUBLIC.SNAPSHOTS
● On the first run, dbt builds the entire table and adds four columns:
○ dbt_scd_id – a unique key generated for each snapshotted record. This is used internally by
dbt.
○ dbt_updated_at – the updated_at timestamp of the source record when this snapshot row
was inserted. This is used internally by dbt.
○ dbt_valid_from – the timestamp when this snapshot row was first inserted. This column can
be used to order the different "versions" of a record.
○ dbt_valid_to – the timestamp when this row became invalidated. IMPORTANT: The most
recent snapshot record will have dbt_valid_to set to null.
● In future runs, dbt will scan the underlying data and append new records based on the configuration.
● This allows you to capture historical data. IMPORTANT: Snapshots are built when history is not
tracked in the source system which makes them difficult or impossible to rebuild. Changes to
production snapshots may require coordination with the data engineering team to ensure data is
backed up.
● Snapshots are run using a specific command dbt snapshot, which adds complexity to the project
orchestration. Learn more about configuring models in the “dbt_project.yml” section of this
document.
Pre-snapshot
Use this option when you need to track historical changes on source data. A pre-snapshot DOES NOT include
any business logic transformations.
● Pre-snapshots are configured directly on top of the raw_dev or raw_prd table.
● All downstream models are dependent on the snapshot
○ Source >> Snapshot >> Stage >> Intermediate >> Mart >> Activation
● If the snapshot returns an error on run, all downstream models will be impacted. IMPORTANT:
Because of the added complexity, the decision to build a pre-snapshot table should be considered
carefully.
● Pre-snapshots need to be configured using a timestamp strategy. This ensures we can build all the
downstream models accurately.
Snapshot example.
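A minimal sketch of a pre-snapshot with a timestamp strategy, using illustrative names:

snapshots/api_orders_snapshot.sql:

{% snapshot api_orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw_app', 'api_orders') }}

{% endsnapshot %}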
Post-snapshot
Use this option when you need to track historical changes on the marts or activation layer. A post-snapshot
table INCLUDES business logic transformations.
● Post snapshots are configured directly on top of a marts or activation layer.
● Post snapshots are a safer option than pre snapshots because there are no downstream
dependencies
○ Source >> Stage >> Intermediate >> Mart >> Snapshot >> Activation
○ Source >> Stage >> Intermediate >> Mart >> Activation >> Snapshot
● Post-snapshots can be configured using a timestamp or check strategy. We recommend using a check strategy and selecting specific fields for the snapshot (rather than a select *), as shown in the sketch below. This keeps history tracking narrowly focused, which prevents new fields or unrelated changes to the business logic from adding noise to the snapshot table.
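A minimal sketch of a post-snapshot with a check strategy on specific fields, using illustrative names:

{% snapshot fct_orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='check',
        check_cols=['order_status', 'order_amount']
    )
}}

select order_id, order_status, order_amount
from {{ ref('fct_orders') }}

{% endsnapshot %}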
Tests
“Tests are assertions you make about your models and other resources in your dbt project (e.g. sources,
seeds and snapshots).” Learn more about tests within the “Tests” section of the dbt developer
documentation. When you run dbt test, dbt will tell you whether each test in your project passes or fails. Tests can be run against your current project using a range of commands:
● dbt test runs all tests in the dbt project
● dbt test --select test_type:generic runs all generic tests in the dbt project
● dbt test --select test_type:singular runs all singular tests in the dbt project
● dbt test --select one_specific_model runs all tests on a single model in the dbt project
Generic tests
Generic tests return the number of records that do not meet your assertions. These are run on specific
columns in a model. Generic tests are configured in a YAML file.
Defining too many tests can result in errors that prevent the model from generating. In order to prevent tests
from blocking the models, we have configured all tests with severity set to warn rather than error. The
severity can be set to error on an individual test within the YAML file which will override the default setting.
A unique test is required for each source table that will be used to create a reporting model. The unique test
helps determine if a natural key exists on the table. If the natural key is not unique, then a surrogate will need
to be created.
● A natural key is a key that is derived from the data itself, such as a customer ID or a date.
● A surrogate key is a key that is generated artificially, such as a sequential number, a GUID, or a hash.
The source.yml is configured with not_null and unique tests on primary ids (foreign keys should not have
unique or not_null tests).
Surrogate keys are generated using a dbt_utils macro. After a surrogate key is created, it’s important to test
your key for uniqueness to ensure that the correct fields were chosen to create the surrogate key.
{{- dbt_utils.generate_surrogate_key(['id', 'updated_at']) }} as account_history_id
The example below shows a unique test defined at the column level for a natural key. Nest a uniqueness test of two columns under the table, not under a column!
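A minimal sketch of both placements, using illustrative model and column names:

version: 2

models:
  - name: stg_accounts
    columns:
      - name: account_id            # natural key
        tests:
          - unique
          - not_null
  - name: stg_account_history
    tests:                          # multi-column uniqueness belongs at the table level
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - id
            - updated_at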
Singular tests
Singular tests are specific queries that you run against your models. These are run on the entire model.
Singular tests are stored as select statements in the tests folder.
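A minimal sketch, using an illustrative model name; any rows returned by the query are reported as failures:

tests/assert_no_negative_order_amounts.sql:

select *
from {{ ref('stg_orders') }}
where order_amount < 0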
Constraints
Execute dbt test in the terminal to create the constraints within your development schema. Example:
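One common way to create constraints from tests is the Snowflake-Labs dbt_constraints package; a minimal sketch, assuming that package is installed and using illustrative names:

models:
  - name: stg_accounts
    columns:
      - name: account_id
        tests:
          - dbt_constraints.primary_key    # assumes the dbt_constraints package; creates a PK constraint when the test runs
      - name: user_id
        tests:
          - dbt_constraints.foreign_key:
              pk_table_name: ref('stg_users')
              pk_column_name: user_id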
The following query can be used to view constraints in Snowflake; update the database and schema to view constraints within different environments.
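A minimal sketch of such a query, with illustrative database and schema names:

select constraint_name, table_name, constraint_type
from analytics_dev.information_schema.table_constraints   -- illustrative database
where constraint_schema = 'DBT_DEVELOPER'                  -- illustrative schema
order by table_name;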
dbt Tips & Tricks
Compile your code
Common Misconceptions
● dbt compile is not a pre-requisite of dbt run, or other building commands. Those commands will
handle compilation themselves.
● If you just want dbt to read and validate your project code, without connecting to the data
warehouse, use dbt parse instead of dbt compile.
Development schema
A full refresh can be executed on a development schema using the command dbt run --full-refresh.
Reset your dev schema
1. Open a Snowflake worksheet
2. Modify the SQL statement below to point to your dev schema and run the query
3. Copy the results and paste them back into a Snowflake worksheet
4. Review the results carefully and exclude from the list any objects you want to keep in the schema
5. Select all DROP statements and execute the query
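A minimal sketch of the kind of information_schema script the steps above refer to, with illustrative database and schema names:

select
    case
        when table_type = 'VIEW'
            then 'drop view ' || table_catalog || '.' || table_schema || '.' || table_name || ';'
        else 'drop table ' || table_catalog || '.' || table_schema || '.' || table_name || ';'
    end as drop_statement
from analytics_dev.information_schema.tables   -- illustrative database
where table_schema = 'DBT_DEVELOPER'           -- your development schema
order by table_name;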
How to run the entire dbt project
Note: We recommend selecting a specific model to run rather than executing the entire dbt project. Learn more about selecting or excluding models in the dbt syntax overview.
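A few common selection patterns (model names are illustrative):

dbt run                           # runs every model in the project
dbt run --select stg_orders       # runs a single model
dbt run --select models/staging   # runs every model under the staging folder
dbt run --exclude stg_orders      # runs everything except one model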
Handling whitespace
Jinja can sometimes compile unwanted whitespace into SQL statements, making the final deployment to the database less readable. If you put a minus sign (-) at the start or end of a block, Jinja removes the whitespace before or after that block:
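A minimal sketch with illustrative model and field names; the hyphens inside the Jinja delimiters strip the surrounding whitespace from the compiled SQL:

{%- set statuses = ['open', 'closed'] -%}

select
    order_id
    {%- for status in statuses %},
    sum(case when order_status = '{{ status }}' then 1 else 0 end) as {{ status }}_orders
    {%- endfor %}
from {{ ref('stg_orders') }}
group by order_id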
Run upstream and downstream dbt models
The upstream command will run all the models required to build the model specified in model_name, and the downstream command will run all the models dependent on it (see the sketch below).
Note: these commands do not execute seed files. Seeds will need to be run prior to the models using dbt seed.
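The graph operators look like this (fct_orders is an illustrative model name):

dbt run --select +fct_orders    # upstream: fct_orders plus every model it depends on
dbt run --select fct_orders+    # downstream: fct_orders plus every model that depends on it
dbt seed                        # build seeds first; the run commands above do not build them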