
Learning - dbt

Learning dbt
Resources
What are dbt sources?
Raw data pipeline
Additional sources
Raw data discovery
Discovery resources
What are dbt models?
Materializations
What is dbt documentation?
Additional dbt features
Aliases
Analyses
Identifiers
Jinja and macros
Example macro: auto_stage_columns
Packages
Seeds
Snapshots
Pre-snapshot
Post-snapshot
Tests
Generic tests
Singular tests
Constraints
dbt Tips & Tricks
Compile your code
Execute a full refresh
Development schema
Reset your dev schema
Snowflake: SQL script against information_schema
How to run the entire dbt project
Handling whitespace
Run upstream and downstream dbt models

Learning dbt

Resources

Online learning resources (dbt courses)


● dbt Fundamentals 🔖(essential)
● Analyses and Seeds
● Refactoring SQL for Modularity
● Advanced Materializations 🔖(essential)
● Jinja, Macros, Packages
● Data Modeling Tutorial: Star Schema (aka Kimball approach) 🔖(essential)
○ Data Modeling in the Modern Data Stack (watch video from 3:40 to 4:43, 5:46 to 10:05)
○ Kimball group resources (star schema)
○ Dimensional modeling docs
○ The analytics setup guidebook, Kimball data modeling

dbt commands reference (link)


Outlines the commands supported by dbt and their relevant flags. Note: We will be running commands
through a local terminal.

Command – Notes

dbt compile – the entire project must compile before you can run dbt run or dbt docs generate
CTRL + C – if you execute dbt run accidentally or on an unexpected model selection, use CTRL+C to abort the run
dbt seed – builds the seed data as a table in Snowflake
dbt snapshot – executes a snapshot defined in the project
dbt test – runs all tests in the dbt project
dbt test --select test_type:generic – runs all generic tests in the dbt project
dbt test --select one_specific_model – runs all tests on a single model in the dbt project
dbt run --select +fct_decisioning__rate_checks – runs the model and all of its upstream dependencies

dbt developer hub (link)


“Your home base for learning dbt, connecting with the community and contributing to the craft of analytics
engineering.”

Analytics engineering glossary (link)


“The Analytics Engineering Glossary is a living collection of terms & concepts commonly used in the data
industry.”

What are dbt sources?


Sources name and describe the data loaded into Snowflake by a data extraction platform. By declaring
these tables as sources in dbt, you can then
● select from source tables in your models using the {{ source() }} function (see the sketch below); setting up sources in
dbt and referring to them with the source function enables a few important tools
○ multiple tables from a single source can be configured in one place
○ sources are easily identified as green nodes in the lineage graph
● test your assumptions and add documentation about your source data
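
As an illustration, a minimal source declaration and a staging model that selects from it might look like the following (the source, table, and column names here are hypothetical, not the project's actual configuration):

In a sources YML file:
version: 2
sources:
  - name: alloy
    database: raw
    tables:
      - name: applications
        description: Raw application records loaded by the extraction platform.

In a staging model (.sql file):
select
    id as application_id,
    status as application_status
from {{ source('alloy', 'applications') }}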

Raw data pipeline


The Data & Analytics Engineering team uses the “ELT” (extract-load-transform) paradigm as opposed to the
more traditional “ETL” (extract-transform-load). dbt fits better in the more modern ELT paradigm. This means
data engineers load data in as raw a form as possible and let the analytics engineers transform it in dbt.

Data is currently loaded into the RAW database. As the name implies, RAW contains production data that's
loaded directly from the source. Little to no transformation is applied, with the exception of JSON parsing.

Additional sources
dbt provides a few additional objects that can be configured as sources: seeds and snapshots. Additional
details about these objects are provided later in this document, under the Seeds and Snapshots sections.

Raw data discovery


Resource links should be added to each source.yml to help consolidate documentation for developers and
business users. Common examples include:
● Developer documentation directly from the source platform (e.g. Alloy docs).
● Package documentation from the dbt community.

Discovery resources
● Use external documentation to understand the source architecture
● Access the application and explore custom reports, field labels, or formulas
● Review dbt docs to find and review existing business logic
● Use tpl_discovery.sql to analyze the source data
○ If a field's values are all null, exclude it
○ If a field's values do not provide analytical value, exclude it
○ Document questions or concerns
■ Schedule a team discussion on useful fields
■ Does the table contain any personally identifiable information (PII)?
■ Does the table contain any sensitive information (Workday, financials)?

What are dbt models?


“Models are where your developers spend most of their time within a dbt environment. Models are primarily
written as a select statement and saved as a .sql file. While the definition is straightforward, the complexity of
the execution will vary from environment to environment. Models will be written and rewritten as needs evolve
and your organization finds new ways to maximize efficiency.” Learn more about models within the “About dbt
models” section of the dbt developer documentation. Make sure to complete the dbt Fundamentals training
course prior to developing models.

Materializations
“The exact Data Definition Language (DDL) that dbt will use when creating the model’s equivalent in
Snowflake. It's the manner in which the data is represented, and each of those options is defined either
canonically (tables, views, incremental), or bespoke.” Execute dbt compile to view the compiled SQL statements located
within the target folder. Make sure to complete the Advanced Materializations training course prior to
developing models and configuring materializations.

Tables
● Built as tables in the database
● Data is stored on disk
● Slower to build
● Faster to query

Views
● Default model configuration
● Built as views in the database
● Query is stored on disk
● Faster to build
● Slower to query

Ephemeral
● Does not exist in the database
● Imported as CTE into downstream models
● Increases build time of downstream models

● Cannot query directly
● Ephemeral Documentation

Incremental
● Built as a table in the database
● On the first run, builds the entire table
● On subsequent runs, only appends new records
● Faster to build because you are only adding new records
● Does not capture 100% of the data all the time
○ IMPORTANT: incremental models do not support history tracking; incremental tables can be
periodically rebuilt from the source by executing a full refresh. Learn more about how to
execute a full refresh in the Tips & Tricks section.
● Incremental Documentation
● Discourse post on Incrementality
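
A minimal sketch of an incremental model, assuming hypothetical source, table, and column names (with no unique_key configured, new rows are appended on incremental runs):

{{ config(materialized='incremental') }}

select
    order_id,
    status,
    updated_at
from {{ source('app_db', 'orders') }}
{% if is_incremental() %}
-- only process records newer than what is already in this table
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}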

What is dbt documentation?

Project documentation
● Documentation of models occurs in the YML files inside the models directory. Store the YML file in the
same subfolder as the models you are documenting.
● Doc blocks can render markdown in the generated documentation. Store the markdown files in the
documentation folder and reference the doc block from the YML file.
● Descriptions for models can happen on the model, source, or column level
● In the command line section, an updated version of documentation can be generated through the
command dbt docs generate.
● We have configured persist_docs on our entire dbt project. This takes the documentation defined
within dbt and writes the metadata, like column descriptions and business logic, to Snowflake. More
information at this dbt resource.
● If a column is documented in the YML file but does not exist in the model, that column will still appear
on the dbt docs site, so documentation for columns that do not exist needs to be removed. If a column
on the docs site has no value in the Data Type column, this is an indication that the column does not exist
in the model and should be removed from the YML file.

Technical documentation
● Doc block references are stored in the staging.yml or intermediate.yml file
● Information that can assist in development (e.g. ID fields, keys, etc.)
● Document keys for both the staging and marts layers in a separate markdown file (e.g.
decisioning_keys.md), which helps promote reusability throughout the dbt project. Each ID field will
have a primary key (suffixed by pk) and foreign key (suffixed by fk) doc block (see the sketch below).
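
As an illustration (the doc block, model, and column names are hypothetical), a key doc block and its reference could look like this:

In a markdown file such as decisioning_keys.md:
{% docs application_id_pk %}
Primary key of the applications staging model. One record per application.
{% enddocs %}

In the corresponding YML file:
version: 2
models:
  - name: stg_alloy__applications
    columns:
      - name: application_id
        description: '{{ doc("application_id_pk") }}'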

Business logic documentation


● Doc block references are stored in marts.yml or activation.yml file
● Intended to help business users understand the data model and data definitions

● Most marts layers/fields should have documentation. Business users will have access to this in a
hosted dbt documentation site generated through dbt docs generate.
● The activation layer may reuse fields from the marts layer, but additional information should be added
to describe aggregations or new business logic.
● If necessary, use “NEEDS CONFIRMATION” as a placeholder for documentation on fields pulled directly
from a source system that need the business or product team’s input to finalize a definition. This
helps us avoid a bottleneck waiting to hear back about documentation.

Lineage
The lineage graph, often referred to as the “DAG” (directed acyclic graph), is a powerful tool within dbt to visually represent the
relationships between data models.

Helpful Tips
● From a YML file you can navigate directly to a doc block’s definition by right clicking on the doc block
name reference and selecting “Go to Definition”.
● This query can be run in Snowflake to generate the documentation structure for a table according to
our standard naming conventions. NAME_IDENTIFIER, DESCRIPTION_IDENTIFIER, and
DOC_BLOCK_IDENTIFIER return syntax for the YML and markdown files.

Additional dbt features

Aliases
The alias function allows you to config an alias to a given model in order to change the identifer/name
associated with a given model. For example, you may have data coming from api_orders as a source and the
staging model named api_orders.sql. If this is generally referred to as “orders”, you could use the alias
function to allow this to be identified just as “orders” instead of “api_orders”.

When referencing the models into other models, note that you will still need to call it by its model name and
not its alias. Using the above example, {{ ref(‘api_orders’) }}

Note*: There is the potential for multiple models to be aliased to the same value (ex. braze_user_events =
user_events and user_traffic_events = user_events). dbt will throw an ambiguous error to alert you that this
alias conflicts with another one that is set to the same identifier.
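
A minimal sketch of the example above (the source reference is hypothetical):

In models/staging/api_orders.sql:
{{ config(alias='orders') }}

select * from {{ source('api', 'orders') }}

In a downstream model, reference the file name, not the alias:
select * from {{ ref('api_orders') }}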

Documentation: Alias -dbt doc

Analyses
Learn more about analyses within the “Analyses” section of the dbt developer documentation.
● Analyses are .sql files that live in the analyses folder.
● Analyses will not be run with dbt run like models. However, you can still compile these from Jinja-SQL
to pure SQL using dbt compile. These will compile to the target folder.
● Analyses are useful for training queries, one-off queries, and audits
● Use analyses for storing queries that you may use often to do a quick check

Identifiers
Identifiers are parameters in the source YAML used to configure a source table name that differs from the
table name in the database. For example, data is loaded into an "api_orders" table while the official "orders"
table is being worked on. In your stage model, you can point to "orders" as the source table name, and the code will
compile using the identifier "api_orders".
This allows you to build your complete model while referencing the table that will eventually hold your data, while
pointing at the temp/test/alternate table that holds your initial dataset. If you don't use the identifier,
then later on when you switch the source name, you will have to make changes to every file where the source name
was previously used. Documentation: Identifier - dbt doc.

Example:
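
A minimal sketch based on the example above (the source name is hypothetical):

version: 2
sources:
  - name: app_db
    tables:
      - name: orders
        identifier: api_orders

The stage model selects from the logical name, and the compiled SQL points at api_orders:
select * from {{ source('app_db', 'orders') }}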

Jinja and macros


“Jinja is a fast, expressive, extensible templating engine. Special placeholders in the template allow writing
code similar to Python syntax. Then the template is passed data to render the final document.” Learn more
about the Jinja templating engine, on the “Jinja” documentation site.

“Macros in Jinja are pieces of code that can be reused multiple times – they are analogous to "functions" in
other programming languages, and are extremely useful if you find yourself repeating code across multiple
models.” Learn more about Jinja and macros within the “Jinja and macros” section of the dbt developer documentation.

There are many useful macros built into our dbt project, and the list will continue to grow as we identify areas
for automation and repeatable functions. Macros are documented in dbt docs with examples of how to run
and execute them within the project.
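
For instance, a small macro and its usage could look like the following sketch (this macro is purely illustrative and is not one of the project's macros):

{% macro cents_to_dollars(column_name, decimal_places=2) %}
    round({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}

Usage in a model:
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_orders') }}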

Example macro: auto_stage_columns


One of our most common macros, auto_stage_columns, is used as the standard for building the stage layer.
We will cover how to use this macro during training sessions. Make sure you understand how to use it
before starting development on new models. Its parameters are listed below, followed by an illustrative call.
1. except=[]
   a. Restrict specific fields from going through the auto stage function
   b. Do not prefix foreign keys (FKs) in a model. Do this by placing those fields in the except list.
      i. user_id in the application model should stay as user_id, not application_user_id
   c. Do not prefix booleans (we are working on adding this to the macro)
   d. ex. except=["id"]
2. prefix=""
   a. Sets a prefix on all data fields
   b. If you are not using the prefix argument, either pass empty double quotes or remove it altogether
   c. ex. prefix="application" or ""
      i. Result: application_id, application_status, application_account_id
3. rename={}
   a. Renames a data field
   b. ex. rename={"email": "user_email"}
4. lower={}
   a. Lowercases all data in the listed field(s)
   b. ex. lower={"application_status"}
   c. Great to use on short text fields, ex. "status_c"
5. replace_nulls=[]
   a. Replaces nulls with 0 for numeric text fields
   b. ex. replace_nulls=["amount"]
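
Purely as an illustration of the parameter syntax (the exact signature and call pattern of auto_stage_columns are documented in the project's dbt docs; the source reference and the placement inside a select list here are assumptions):

select
    {{ auto_stage_columns(
        except=["id"],
        prefix="application",
        rename={"email": "user_email"}
    ) }}
from {{ source('alloy', 'applications') }}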

Packages
Packages are standalone dbt projects, with models and macros that tackle a specific problem area. The dbt
package hub consolidates a list of available packages built by users across the dbt community. These
packages are open source and not maintained by our team. Packages should be tested in a development
environment, properly documented, and reviewed by the team prior to getting merged into production.
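
Packages are declared in the packages.yml file at the root of the project and installed with dbt deps. A minimal sketch (the package and version range shown are just an example):

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]

dbt deps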

Seeds
● Seeds are .csv files that live in the seeds folder.
● When executing dbt seed, seeds will be built in your Data Warehouse as tables. Seeds can be
referenced using the ref macro - just like models!
● ✅ Use seeds for data that does not change frequently.
● Do NOT use seeds for uploading data that changes frequently.
● Seeds are useful for loading holiday dates, country codes, employee emails, or employee account IDs
○ Note: If you have a rapidly growing or large company, an orchestrated loading solution may
better address this.
● Seeds are meant to give users who need access to or own the data values somewhere to make
updates themselves. Used this way, seeds provide some protection against inadvertent changes
being made by the business to critical mapping files.
○ Any data owned by business users that needs to be altered/appended frequently should live
in a gsheet that the DE team extracts from directly.
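
A minimal sketch (the file name and contents are hypothetical): a CSV placed in the seeds folder is built with dbt seed and can then be referenced like any other model.

seeds/country_codes.csv:
country_code,country_name
US,United States
CA,Canada

select * from {{ ref('country_codes') }}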

Snapshots
“Analysts often need to "look back in time" at previous data states in their mutable tables. While some source
data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt
provides a mechanism, snapshots, which records changes to a table over time.” Learn more about snapshots

within the “Snapshots” section of the dbt developer documentation. Make sure to complete the Advanced
Materializations training course prior to configuring snapshots.
● Snapshots are built as a table in the database, usually in a dedicated schema to minimize the chances
of the table being dropped:
○ RAW_TEST.DB.SNAPSHOTS
○ PRODUCTION.PUBLIC.SNAPSHOTS
● On the first run, dbt builds the entire table and adds four columns:
○ dbt_scd_id – a unique key generated for each snapshotted record. This is used internally by
dbt.
○ dbt_updated_at – the updated_at timestamp of the source record when this snapshot row
was inserted. This is used internally by dbt.
○ dbt_valid_from – the timestamp when this snapshot row was first inserted. This column can
be used to order the different "versions" of a record.
○ dbt_valid_to – the timestamp when this row became invalidated. IMPORTANT: The most
recent snapshot record will have dbt_valid_to set to null.
● In future runs, dbt will scan the underlying data and append new records based on the configuration.
● This allows you to capture historical data. IMPORTANT: Snapshots are built when history is not
tracked in the source system which makes them difficult or impossible to rebuild. Changes to
production snapshots may require coordination with the data engineering team to ensure data is
backed up.
● Snapshots are run using a specific command dbt snapshot, which adds complexity to the project
orchestration. Learn more about configuring models in the “dbt_project.yml” section of this
document.

We have defined two standards for when a snapshot should occur.

Pre-snapshot
Use this option when you need to track historical changes on source data. A pre-snapshot DOES NOT include
any business logic transformations.
● Pre-snapshots are configured directly on top of the raw_dev or raw_prd table.
● All downstream models are dependent on the snapshot
○ Source >> Snapshot >> Stage >> Intermediate >> Mart >> Activation
● If the snapshot returns an error on run, all downstream models will be impacted. IMPORTANT:
Because of the added complexity, the decision to build a pre-snapshot table should be considered
carefully.
● Pre-snapshots need to be configured using a timestamp strategy. This ensures we can build all the
downstream models accurately.

Snapshot example.
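
A minimal sketch of a pre-snapshot using the timestamp strategy (the snapshot, source, and column names are hypothetical):

{% snapshot applications_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('alloy', 'applications') }}

{% endsnapshot %}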

Post-snapshot
Use this option when you need to track historical changes on the marts or activation layer. A post-snapshot
table INCLUDES business logic transformations.
● Post snapshots are configured directly on top of a marts or activation layer.

● Post snapshots are a safer option than pre snapshots because there are no downstream
dependencies
○ Source >> Stage >> Intermediate >> Mart >> Snapshot >> Activation
○ Source >> Stage >> Intermediate >> Mart >> Activation >> Snapshot
● Post-snapshots can be configured using a timestamp or check strategy. We recommend using a
check strategy and selecting specific fields for the snapshot (rather than a select *). This gives a
narrower focus on history tracking, which prevents new fields or unrelated changes to the business
logic from adding noise to the snapshot table (see the sketch below).
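
As an illustration only (the snapshot, model, and column names are hypothetical), a post-snapshot using a check strategy on specific columns could look like this:

{% snapshot fct_orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='check',
        check_cols=['order_status', 'order_amount']
    )
}}

select order_id, order_status, order_amount from {{ ref('fct_orders') }}

{% endsnapshot %}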

Tests
“Tests are assertions you make about your models and other resources in your dbt project (e.g. sources,
seeds and snapshots).” Learn more about tests within the “Tests” section of the dbt developer
documentation. When you run a dbt test, dbt will tell you if each test in your project passes or fails. Tests can
be run against your current project using a range of commands:
● dbt test runs all tests in the dbt project
● dbt test --select test_type:generic runs all generic tests in the dbt project
● dbt test --select test_type:singular runs all singular tests in the dbt project
● dbt test --select one_specific_model runs all tests on a single model in the dbt project

Generic tests
Generic tests return the number of records that do not meet your assertions. These are run on specific
columns in a model. Generic tests are configured in a YAML file; a sketch follows the list of tests below.

There are four generic tests in dbt:


● Unique tests to see if every value in a column is unique
● Not_null tests to see if every value in a column is not null
● Accepted_values tests to make sure every value in a column is equal to a value in a provided list
● Relationships tests to ensure that every value in a column exists in a column in another model (see:
referential integrity).
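
A minimal sketch showing how the four generic tests are declared in YAML (the model, columns, and values are hypothetical):

version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id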

Defining too many tests can result in errors that prevent the model from generating. In order to prevent tests
from blocking the models, we have configured all tests with severity set to warn rather than error. The
severity can be set to error on an individual test within the YAML file which will override the default setting.

A unique test is required for each source table that will be used to create a reporting model. The unique test
helps determine if a natural key exists on the table. If the natural key is not unique, then a surrogate will need
to be created.
● A natural key is a key that is derived from the data itself, such as a customer ID or a date.
● A surrogate key is a key that is generated artificially, such as a sequential number, a GUID, or a hash.

The source.yml is configured with not_null and unique tests on primary ids (foreign keys should not have
unique or not_null tests).

Surrogate keys are generated using a dbt_utils macro. After a surrogate key is created, it’s important to test
your key for uniqueness to ensure that the correct fields were chosen to create the surrogate key.

{{- dbt_utils.generate_surrogate_key(['id', 'updated_at']) }} as account_history_id

A unique test for a natural key is defined at the column level in the YAML file.

Nest a uniqueness test of two columns under the table, not under a column!
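
For example, a two-column uniqueness test from dbt_utils is nested under the model itself (the model and column names are hypothetical):

models:
  - name: fct_account_history
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - id
            - updated_at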

Singular tests
Singular tests are specific queries that you run against your models. These are run on the entire model.
Singular tests are stored as select statements in the tests folder.
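
A minimal sketch of a singular test (the file name and logic are hypothetical); the test fails if the query returns any rows:

tests/assert_no_negative_amounts.sql:
select *
from {{ ref('fct_orders') }}
where order_amount < 0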

Constraints
Example:

Execute dbt test in the terminal to create the constraints within your development schema.

dbt test --select models/marts/twilio

A query against Snowflake's information schema can be used to view constraints. Update the database and schema to view constraints within different
environments (an illustrative example is shown below).
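
The maintained version is the SQL template referenced below; as an illustration only (the database and schema names are hypothetical), a query along these lines works:

select constraint_name, table_name, constraint_type
from development.information_schema.table_constraints
where table_schema = 'DBT_YOUR_DEV_SCHEMA'
order by table_name;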

Reference the SQL template within GitHub

dbt Tips & Tricks


Below is a quick reference guide containing best practices, common functions, recommendations, and
resources.

Compile your code


Run dbt compile to generate executable SQL from source, model, test, and analysis files. You can find
these compiled SQL files in the target/ directory of your dbt project.

The compile command is useful for:


1. Visually inspecting the compiled output of model files. This is useful for validating complex jinja logic
or macro usage.
2. Manually running compiled SQL. While debugging a model or schema test, it's often useful to execute
the underlying select statement to find the source of the bug.
3. Compiling analysis files.

Common Misconceptions
● dbt compile is not a prerequisite of dbt run or other build commands; those commands will
handle compilation themselves.
● If you just want dbt to read and validate your project code, without connecting to the data
warehouse, use dbt parse instead of dbt compile.

dbt compile #or dbt parse

Execute a full refresh


When a model's materialization is configured as incremental, the first run builds the entire table, and all
subsequent runs only append new records unless a full refresh is invoked. Learn more about full_refresh in
the dbt developer documentation.

Development schema
A full refresh can be executed on a development schema using the command dbt run --full-refresh.

dbt run --full-refresh --select models/data_team/lending_decisioning_service

Reset your dev schema


Once you finish a project, you might want to start fresh in your dev schema. Here are two options to quickly
remove all tables and views.

Snowflake: SQL script against information_schema


Use this option if you want to select a list of specific objects to drop.

1. Open a Snowflake worksheet
2. Modify the SQL statement below to point to your dev schema and run the query
3. Copy the results and paste them back into a Snowflake worksheet
4. Review the results carefully, exclude objects from the list you want to keep in the schema
5. Select all DROP statements and execute the query
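
As an illustration only (the database and schema names are hypothetical; the maintained version is the SQL template referenced below), the statement generates one DROP statement per object:

select
    'drop ' || case when table_type = 'VIEW' then 'view' else 'table' end
    || ' ' || table_catalog || '.' || table_schema || '.' || table_name || ';' as drop_statement
from development.information_schema.tables
where table_schema = 'DBT_YOUR_DEV_SCHEMA';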

Reference the SQL template within GitHub

How to run the entire dbt project

Note: We recommend selecting a specific model to run rather than executing the entire dbt project. Learn
more about selecting or excluding models in the dbt syntax overview.

Handling whitespace
Jinja can sometimes compile unwanted whitespace into SQL statements, making the final deployment to the
database less readable. If you put a minus sign (-) at the start or end of a block, you can remove the
whitespace after that block or before that block:

{% for item in seq -%}


{{ item }}
{%- endfor %}

Run upstream and downstream dbt models


dbt has the capability to run all downstream or upstream models in the DAG with a single command instead of
having to run each model individually.

● dbt run --select +model_name runs all upstream models


● dbt run --select model_name+ runs all downstream models

The upstream command will run all the models required to build the model specified in model_name. The
downstream command will run all the models dependent on the model specified in model_name.

Note that these commands do not execute seed files. Seeds will need to be built prior to the models using dbt
seed.
