Data Build Tool (DBT)
Zidi Chen
Examiner
Vladimir Vlassov
KTH Royal Institute of Technology
Supervisor
Jim Dowling
KTH Royal Institute of Technology
Sina Sheikholeslami
KTH Royal Institute of Technology
Abstract
Feature engineering at scale is always critical and challenging in the machine learning
pipeline. Modern data warehouses enable data analysts to do feature engineering by
transforming, validating and aggregating data in Structured Query Language (SQL).
To help data analysts do this work, Data Build Tool (DBT), an open-source tool, was
proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source scalable
feature store, would like to add support for DBT so that data scientists can do feature
engineering in Python, Spark, Flink, and SQL in a single platform. This project aims
to create a concept about how to build this support and then implement it. The
project checks the feasibility of the solution using a sample DBT project. According
to measurements, this working solution needs around 800 MB of space in the server
and it takes more time than executing DBT commands locally. However, it persistently
stores the results of each execution in HopsFS, which are available to users. By adding
this novel support for SQL using DBT, Hopsworks might be one of the most complete
platforms for feature engineering so far.
Keywords
Abstract
Feature engineering at scale is always critical and challenging in the machine learning
pipeline. Modern data warehouses enable data analysts to do feature engineering by
transforming, validating and aggregating data in Structured Query Language (SQL).
To help data analysts carry out this work, Data Build Tool (DBT), an open-source tool,
was proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source
scalable feature store, would like to add support for DBT so that data scientists can do
feature engineering in Python, Spark, Flink and SQL on a single platform. This project
aims to create a concept for how to build this support and then implement it. The
project checks the feasibility of the solution using a sample DBT project. According
to measurements, this working solution needs around 800 MB of space on the server
and it takes more time than executing DBT commands locally. However, it persistently
stores the results of each execution in HopsFS, which are available to users. By adding
this novel support for SQL using DBT, Hopsworks may be one of the most complete
platforms for feature engineering so far.
Keywords
Acknowledgements
I would like to extend my deepest gratitude to Jim Dowling and Hopsworks for giving me
the opportunity to do this project. Special thanks to Gibson Chikafa; without his help,
I could not have completed this project. I also want to thank my examiner Vladimir Vlassov and
my supervisor Sina Sheikholeslami. Many thanks to all the colleagues in Hopsworks
for their support and help.
Acronyms
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose and Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Hopsworks Feature Store . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 ELT and ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 SQL and other data engineering languages . . . . . . . . . . . . . . . . 7
2.5 DBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1 What is DBT? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 Features of DBT CLI . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.3 DBT projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.4 Great expectations DBT plugin . . . . . . . . . . . . . . . . . . . 11
2.5.5 DBT artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.1 What is Docker? . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.2 Docker advantages . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.3 Docker architecture . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Method 16
3.1 Research Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Validation Tests and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Validation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Work 20
4.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 DBT Jobs in Hopsworks . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 A sample DBT project . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Management of DBT jobs . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Execution management of DBT jobs . . . . . . . . . . . . . . . . 24
4.2.3 Start the execution of DBT jobs . . . . . . . . . . . . . . . . . . 25
4.2.4 Stop the execution of DBT jobs . . . . . . . . . . . . . . . . . . 27
4.2.5 The sample DBT project . . . . . . . . . . . . . . . . . . . . . . 27
References 49
Chapter 1
Introduction
This chapter gives a brief introduction to the thesis project. It starts with the motivation
of the project, and then it describes the problem to solve. The purposes and goals
are introduced to explain the research question and the expected results. In addition,
benefits, ethics, sustainability issues and methodology of the project are also included.
At the end of this chapter, an outline of the thesis is presented.
1.1 Motivation
Both data warehouses and traditional databases store data, but they differ
when used for machine learning purposes. Compared to
traditional databases, modern data warehouses do not perform as well for
transactional workloads, but they are designed for analysing data. Data warehouses can also handle the
situation of multiple data sources, which means that they can integrate data from
several sources such as several different databases.
Based on data warehouses, Extract, Load, Transform (ELT), which is a new kind
of data pipeline, is widely used to deal with raw data in a more efficient way. In
the ELT process, raw data are firstly extracted and loaded by data warehouses.
Transformations on data are then applied.
To transform data, modern data warehouses enable data engineers to write data
pipelines in SQL. In simple cases, a few SQL queries might complete the task.
However, when the tasks become complex, with multiple dependencies across the
queries, issues can arise. To solve this problem, DBT was proposed by Fishtown Analytics.
DBT helps to build models with dependencies and tests, using a combination of
SQL and Jinja [26]. DBT also offers other features, such as automatically generated
documentation, self-defined macros and so on. Data validation is also possible
by installing external plugins in the project. Fishtown Analytics provides the DBT
Command Line Interface (CLI) tool, an open-source implementation, to run DBT
projects. DBT is now a popular, and arguably dominant, tool for building pipelines of
SQL jobs.
This thesis aims to use DBT CLI to help data analysts orchestrate the SQL
transformation tasks for machine learning purposes.
1.2 Problem
Hopsworks is a scalable feature store that assists data engineers and data scientists
to do data transformations for feature engineering. This feature store makes creating
features and training datasets more automatic [18]. It also improves the reusability
and quality of the generated features and datasets.
In feature engineering, data scientists commonly use Spark, Flink and Python to
transform data as Resilient Distributed Datasets (RDDs), while data engineers and analysts
usually use SQL for this purpose. The Hopsworks platform now offers support for running Spark
and Flink jobs in the community version and Python jobs in the enterprise version
[18]. However, there is no support for executing and orchestrating SQL jobs.
This project investigates the capabilities of DBT as a tool in the feature engineering
stage of machine learning pipelines within the context of the open-sourced Hopsworks
platform and support for data provenance.
The research question of this project would be: How to apply DBT as a tool for
feature engineering in the Hopsworks platform to build data pipelines and support
data provenance?
1.3 Purpose and Goal
The goal of this project is to enable the support for the execution of DBT jobs, so that
users can do feature engineering using Flink, Spark, Python and SQL jobs within a
single platform. With this new support, data analysts can run and orchestrate SQL
jobs in the Hopsworks platform using DBT to create features, feature groups and
datasets.
The deliverables include the thesis, the code and the documentation to explain the
code.
1.4 Methodology
The methodology of this project can be considered as system research on Hopsworks.
The first step would be to figure out the components related to this project, and the
interactions between them. The next step is to build a concept of how to realize the
goals, and then to demonstrate whether the concept is feasible or not. In this case, it
is to generate a design about how to run a DBT project as a job in Hopsworks. Then, the
design should be implemented to check its feasibility. According to the situation, minor
changes could be applied to adjust the design in order to achieve a better result.
1.5 Outline
Chapter 2 introduces the necessary background for the project, including the
introduction of applied technologies. Chapter 3 presents the method and the
evaluation method of this project. Chapter 4 explains the implementation of the work
in detail. Chapter 5 shows the results and Chapter 6 gives the conclusion of the
project.
Chapter 2
Background
This section introduces the background required for this thesis project, including the
concept of feature engineering, the difference between using SQL and other data
engineering languages, and the comparison between ELT and Extract, Transform,
Load (ETL) process. It also presents the utilized technologies and frameworks in the
project, such as the Hopsworks platform, DBT CLI and Docker.
2.1 Feature Engineering
Intuitively, feature engineering processes the raw data from data sources, and the
results are input into machine learning models, as shown in Figure 2.1.2. These results
could be some data columns, which are transformed from the original data.
Feature engineering also has its own pipeline as shown in Figure 2.1.3. The first step is
to understand the data; then the data are transformed into formats that can be
evaluated and understood by machine learning models, for example, transforming
text into vectors and images into matrices [24]. Then, the obtained
structured features need to be improved for the machine learning models. We might
select some useful data columns and do necessary transformations such as one-hot
encoding. After applying several steps to the raw data, we can use a single model to
evaluate the designed feature engineering steps. According to the model performance,
we might do modifications to the designed transformation steps.
Data scientists and engineers usually do feature engineering to extract the features for
machine learning models to explore hidden patterns in the data. If the features, namely
data columns, are poorly formatted and not meaningful to the models, the training
results might not be worthwhile [24]. In contrast, well-prepared features could help data
scientists to achieve a faster training speed and a more accurate model. Feature
engineering aims to represent data as better features for models while preserving the
internal relationships of data.
2.2 Hopsworks Feature Store
A feature store is where data scientists and data engineers can manage and persist the
features. It can help them to do transformations of raw data and store the produced
features. The computed features can be reused for multiple machine learning models.
The popular feature stores now include Hopsworks, Feast, Featureform and Databricks
Feature Store.
Among those feature stores, Hopsworks distinguishes itself as a strong and scalable
platform with an open-source community version. The platform is project-based and
it supports dynamic roles [21]. As shown in Figure 2.2.1, Hopsworks provides an online
feature store with low latency and an offline feature store with high throughput, with
Application Programming Interface (API)s for Java, Scala and Python [17]. The online
feature store handles the latest feature values with high availability and provides them
to online applications. The offline feature store supports asynchronously transforming
and analysing big data based on HopsFS, which is a highly-scalable extension of
Hadoop Distributed File System (HDFS) [22]. The output of the offline feature store
can be large numbers of batches of features, which will be used by machine learning
models [17].
Figure 2.2.1: Online and offline features stores in Hopsworks (redrawn after [17])
Figure 2.2.2 shows the system architecture of the Hopsworks platform. Hopsworks can
provide service based on Amazon Web Services (AWS), Google Compute Engine (GCE)
and other platforms. HopsFS and RonDB are used in the storage layer. Users can run
Spark, Flink or Python jobs in the platform based on HopsYARN, which is the resource
management layer. In addition, Docker jobs can also be executed and managed by
Kubernetes in Hopsworks.
2.3 ELT and ETL
ELT is comparatively new compared to ETL. In practice, ETL is good for structured and
operational data with relational databases. If the data come from several different
sources, then we need to integrate and synchronize the data sources when transforming
them, which might not be very time-efficient. In contrast, ELT supports raw data and
is widely used for machine learning purposes. It offers quicker data processing speed
but less precision compared to ETL [20]. In Machine Learning Operations (MLOps),
the most common choice is ELT because it shows better performance in handling big
and real-time data [19].
2.5 DBT
DBT is a tool to help data engineers and scientists do the ”T” (transformation) in
the ELT process [26]. It can work with RedShift, BigQuery, Snowflake and many
other popular data warehouses or databases. This section introduces this popular data
processing tool including the features of DBT, DBT CLI, DBT project structure, DBT
artifacts and the great expectations plugin for DBT to validate data.
2.5.1 What is DBT?
DBT is a data processing tool that helps data analysts and engineers orchestrate data
engineering workflows. Fishtown Analytics provides the DBT CLI product to run DBT
projects, which is free and open source. The company also released DBT Cloud with an
Integrated Development Environment (IDE), which is more powerful than DBT CLI.
In this thesis, the DBT CLI tool is integrated.
Figure 2.5.1 shows that DBT works on transforming data in data warehouses and the
transformed data will be used by data consumers, such as machine learning models.
Data loader extracts raw data from sources, which is the ”E” in the ELT process. The
extracted raw data are then stored in data warehouses, which is the ”L”. Then DBT
helps to do the ”T” in the data warehouses. DBT completes its task by compiling the
given code into SQL code and executing it against the database. DBT code is a
combination of SQL and Jinja, which is a template engine for Python.
Figure 2.5.1: How DBT works in the data processing procedure (redrawn after [26])
1. Jinja documentation: https://jinja.palletsprojects.com/en/3.1.x/
2.5.2 Features of DBT CLI
DBT enables users to create or update SQL models with defined materialization by
writing SELECT queries, which means users do not write CREATE or UPDATE queries with
DBT [12]. The most commonly used materializations are views and tables. There are
also other predefined materializations, and custom materializations are
allowed as well.
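As a minimal sketch (the model, table and column names here are illustrative and not taken from the thesis project), a model file is just a SELECT statement, and the materialization can be set in a config block at the top of the file:

-- models/customers.sql (illustrative)
{{ config(materialized='table') }}  -- omit this line to use the default materialization (view)

select id, name, country
from raw_customers                  -- an existing table in the warehouse; illustrative name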
Another important feature of DBT is the ref function. DBT uses the ref function
to define the dependencies and orders between models. Figure 2.5.2 shows
my_second_dbt_model which is an example of using the ref function. It means to
select all the records whose id is 1 from the first DBT model. When we run the DBT
project, my_first_dbt_model will be executed before my_second_dbt_model, because
my_second_dbt_model depends on the result of my_first_dbt_model.
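Since Figure 2.5.2 is not reproduced here, the following sketch shows what such a pair of models looks like; it mirrors, in simplified form, the example models that dbt init generates, with the second model selecting from the first through the ref function:

-- models/example/my_first_dbt_model.sql (simplified sketch)
select 1 as id
union all
select null as id

-- models/example/my_second_dbt_model.sql
select *
from {{ ref('my_first_dbt_model') }}
where id = 1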
DBT can generate Directed Acyclic Graph (DAG)s according to the dependencies
between models. Figure 2.5.3 shows the DAG of my_first_dbt_model and
my_second_dbt_model.
DBT offers a built-in test framework. The tool allows users to write tests in a way that is
both convenient and highly customizable. The simplest method to test the DBT models
would be using predefined generic tests, such as not-null, unique and so on. In
addition, the models can also be tested by singular tests, which are also SQL SELECT
queries [11]. For example, if we want no values less than 0 in the age
column of the result table, we can write a singular test that selects all records
whose age is less than 0. If this query returns any rows, the singular test fails.
Furthermore, the most advanced way of testing in DBT is to parameterize the SELECT
queries of such tests [11]. This means that the value 0 in the previous example can be
replaced by any given value as an input parameter. In this way, DBT improves the
flexibility and reusability of the
code.
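As a hedged sketch of such a singular test (the model and column names are illustrative), the test is a SELECT query stored in the tests folder that returns the offending rows; dbt considers the test failed if the query returns any rows:

-- tests/assert_no_negative_ages.sql (illustrative file and model names)
-- The test fails if any row has a negative age.
select *
from {{ ref('customer_ages') }}
where age < 0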
Apart from these critical features, DBT CLI also provides other functionalities to help
users build their data transformation pipelines, such as DBT package manager and
macros. This tool assists data engineers and data analysts in integrating raw data and
converting them into the expected form by writing simple SQL SELECT queries with Jinja.
2.5.3 DBT projects
This section introduces the structure of a DBT project, the commands in DBT CLI to
run a project and the resulting artifacts.
Figure 2.5.4 shows an example of the directory structure of a DBT project whose name
is demo. This directory is automatically generated by dbt init command. In the
directory, dbt_project.yml stores the configuration of the project, including project
name, directory path, DBT version and so on. packages.yml is used to configure
necessary packages or plugins. The most important sub-folder in the project would
be the models folder, where all the models and schemas are defined. In addition, the
tests sub-folder contains the tests against the models and the target folder contains
the artifacts. Besides, to connect to the target data warehouses or databases, DBT users
need to configure their accounts in the file profiles.yml. This file is not located in the
project folder, but it can be found at ~/.dbt/ if the operating system is Linux.
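As an illustration (all values are placeholders, the exact set of fields follows the dbt-snowflake documentation, and the profile name must match the profile configured in dbt_project.yml), a Snowflake entry in ~/.dbt/profiles.yml might look like the following:

demo:                          # profile name referenced from dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <username>
      password: <password>
      role: <role>
      database: <database>
      warehouse: <warehouse>
      schema: PUBLIC
      threads: 1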
To run a DBT project, DBT CLI provides a set of commands. The commonly used ones
are dbt init to initialize a project, dbt deps to install the package dependencies, dbt
run to run the models and dbt test to run the tests [4].
2.5.4 Great expectations DBT plugin
Great Expectations is an open-source data validation tool that validates data pipelines
by declaring the expected data format [15]. It can be considered a kind of data unit
test using assertions, and it also offers reports and documentation of the tests.
The plugin dbt_expectations allows using great expectations in the DBT projects. To
use the plugin, the dbt_expectations plugin needs to be included in the packages.yml
file in the DBT project. After installing the package, we can write great expectations
tests as DBT tests.
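For illustration, a packages.yml along the following lines declares the plugin (the version range shown is only an example; the exact version used in the thesis project is not stated here), and dbt deps then installs it:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.5.0", "<0.6.0"]   # example version range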
2.5.5 DBT artifacts
DBT commands might result in some artifacts. As shown in Figure 2.5.5, the artifacts
are stored in the target directory under the DBT project folder. The three most
important artifacts are the JavaScript Object Notation (JSON) files: manifest.json,
run_results.json and catalog.json.
These artifacts can be generated by specific commands. They can be used for many
purposes, including finding changes in the tables in the data warehouses and collecting
the running time of execution [8].
2.6 Docker
Docker is used in this project to provide containers for the applications. This
section introduces what Docker is, the possible benefits of using Docker, and the Docker
architecture.
2.6.1 What is Docker?
Figure 2.6.1 shows an example with three containers running on an operating system.
The three containers are created by Docker and run as child processes
of Docker. The resources are separated between containers, which means one
container can only access its own resources.
Figure 2.6.1: Three Docker containers running on the operating system (redrawn after
[23])
2.6.2 Docker advantages
Docker can bring many benefits. The most common benefit is probably increased
portability [23]. For Windows users, some applications which are developed for Linux
machines are not available. One possible solution is to install those applications on a
virtual machine with Linux operating system, but it might take a lot of space and time.
A better approach would be using Docker because it does not need too much space and
the user can possibly do it with a few commands. In addition, Docker containers can
protect the computer from potential attacks or bugs [23]. Viruses or the trouble caused
by bugs can hardly touch the programs and resources out of the container because the
Docker containers are usually sandboxed within the hosting environment.
Furthermore, Docker can resolve version conflicts between the dependencies of
different applications. Two applications can be deployed in two containers with
the specific required version of packages, instead of installing two versions of
dependencies on the same operating system. In addition, Docker can also provide a
neatly deployed environment with necessary packages for the applications. To some
extent, messy and unnecessary dependencies are avoided.
Figure 2.6.2 shows the difference in dependencies of two applications in the operating
system and the Docker containers. Before using Docker containers, application A has
an unnecessary dependency on package z, and application B unnecessarily depends
on package x. In addition, they use different versions of package y. Installing two
versions of package y might also cause problems in the operating system. After
deploying them in different containers, the unnecessary dependencies could possibly
be identified and excluded by programmers. Two different versions of packages are
installed independently in different containers.
2.6.3 Docker architecture
Figure 2.6.3 shows how the Docker components interact with each other. The three
main components are the Docker client, the Docker server/host and the remote registry.
The Docker client is a CLI tool and it interacts with the Docker server via the Representational
state transfer (REST) API. Docker daemon handles the commands from the client
and it also handles Docker images and containers [13]. Docker registry is a remote
repository which owns a lot of Docker images. Those remote images can be pulled
to the local machine to generate containers. Figure 2.6.3 also shows the process in
which a container is generated from a remote image. Docker CLI sends the command
to Docker daemon, and Docker daemon then pulls the image from the remote registry.
The container is then generated from the local image.
Figure 2.6.3: Docker consists of Docker client, Docker server and remote registry
(redrawn after [13])
Chapter 3
Method
This chapter introduces the research method used to complete this project.
It also describes how the designed and developed solution was tested and
assessed.
3.2 Validation Tests and Evaluation
To test the feasibility, the project used a simple scenario which was to run a DBT project
in the Hopsworks platform connecting to Snowflake after applying the solution. The
project firstly needed to set up a DBT pipeline in a DBT CLI project locally. This
project was used as the input for a DBT job in Hopsworks. The tests covered managing
DBT jobs, managing DBT job executions, and executing a set of basic DBT commands.
As described, DBT commands should be executed as input arguments for DBT job
execution. Due to time constraints, not all the DBT commands were tested in the scope
of this project. The following DBT commands were selected for testing because they
can be considered the basic and necessary commands for getting the results of a DBT
pipeline in the data warehouses.
1. dbt deps: Download and install the dependencies specified in packages.yml [5].
2. dbt compile: Compile the files in the models, tests and analyses into SQL files [3].
3. dbt run: Execute the models against the target data warehouse.
4. dbt test: Run the tests defined against the models.
5. dbt build: Run the models, tests, seeds and snapshots [2].
The used DBT version in the project was the latest at the time of development: 1.1.0
for dbt-core and dbt-snowflake.
The expected results of starting an execution of a DBT job would include two parts.
The first necessary result is the logs, which should show the execution results. The
other result is the artifacts generated by DBT itself. Both of them should be available
to users.
This project used Docker to realize the support of running DBT jobs in Hopsworks
and a simple DBT pipeline project was used to test the feasibility of the solution. The
performance measurements were based on these conditions and assumptions. A
common way to assess the performance is to track the resource consumption and
the execution time.
Measuring the resource consumption consists of investigating the size of the Docker
containers and the Docker image in this project. The other commonly used attributes
like Central Processing Unit (CPU) utilization were not considered because they might
vary a lot depending on the actual input DBT project and the input DBT command.
The sizes of Docker containers are related to the size of their Docker image. To clarify,
Figure 3.2.1 shows that there are three Docker containers created using the base Docker
image. The base Docker image is actually reused by different containers and each
container adds a writable layer on top of the image [25]. Therefore, the size of a Docker
container could be considered as the size of the base reusable Docker image plus
the size of the writable layer. docker ps -s presents this size information, and
the Docker image size is presented as the ”virtual size” [25]. It is worth mentioning
that the size shown by this command does not include some disk space usage such as
volumes and configuration files for the containers [25].
Figure 3.2.1: The relationship between a Docker image and Docker containers
Other than the resource consumption, the execution time was also considered for
performance measurement. As Figure 3.2.2 shows, the execution time of a DBT job
could be divided into time for preparation, actual command execution and preparation
of return results. Those times were tracked by printing the current time in the
system. The system was configured with only one node to avoid possible
time differences between machines, so the execution time was recorded in a
stopwatch fashion on the server.
The development and tests were both completed on the Hopsworks platform,
which was installed in a virtual machine using Vagrant. Table 3.2.1 presents the
characteristics of the experimental environment.
1. Installing Hopsworks using Vagrant: https://hopsworks.readthedocs.io/en/stable/getting_started/installation_guide/platforms/vagrant.html
Chapter 4
Work
This section introduces the system design of the solution, including how to develop
the DBT support in Hopsworks and a DBT pipeline project. This DBT pipeline was
used to test the feasibility of the implemented solution. In addition, this section also
presents the implementation details of the solution. It is important to mention that this
project only targeted completing the back-end development, not including the User
Interface (UI) and front-end development.
4.1 System Design
4.1.1 DBT Jobs in Hopsworks
To support DBT jobs on Hopsworks, there are two vital use cases, as shown in Figure
4.1.1. To manage DBT jobs, the core functionalities would be to create, modify, get and
delete a DBT job. Moreover, managing the execution of DBT jobs would be the main
focus of this project. To fulfill this use case, the solution should be able to execute or
stop a DBT job after creation. An execution could also be viewed and
deleted.
Currently, the community version of the Hopsworks platform provides support for
running Spark and Flink jobs. They are based on YARN. In addition, the Hopsworks
Enterprise version also offers support for Docker and Python. Figure 4.1.2 shows the
relationship between those classes. YarnJob, DockerJob and PythonJob extend the
basic job type. In addition, SparkJob and FlinkJob are based on the YarnJob class.
With this knowledge, it was proposed that DbtJob should extend the basic Job class
since it does not need YARN services, just like DockerJob.
As mentioned before, the project should enable users to create, modify, get and delete
a DBT job. Since DBTJob is another extension of the basic Job class, it was possible to
reuse most of the existing code for managing DBT jobs. As a consequence, the main
focus and the difficulty of the solution would be how to execute and stop a DBT job in
the Hopsworks platform.
The proposed solution was to use Docker containers which can provide a configurable
light-weight environment so that the DBT project can run inside the container without
being affected by the complex system environment. Figure 4.1.3 presents the solution
for running a DBT job in a Docker container. Users can take advantage of the
created storage connectors in Hopsworks. When creating a DBT job, the user should
specify which storage connector they want to use. Then, a Docker container should
be generated according to the Docker image with the specified storage connector
information and other environment variables. This container should complete a few
main tasks. The main task would be to execute the defined DBT command. Then, the
execution results should be stored in logs and the execution might also cause some
DBT artifacts, such as the manifest.json file. The container should mount those files
to HopsFS so that they are visible to the users and available for further use.
4.1.2 A sample DBT project
To test the feasibility of the proposed solution, a sample DBT project with a simple
pipeline was designed. This sample DBT project used an open dataset from Kaggle,
which was about students’ exam performance. The dataset was loaded into Snowflake
in a table called students_exam. It simulated a scenario in which the data engineers want
to focus on the students who have not taken the preparation course but passed the
math exam. The DBT project should select the students who have not completed the test
preparation course. In the end, the records of those who have passed the math exam
should be collected in a view. The project also needed to pass a dbt_expectations
test to check that the math grades in the final results are between 60 and 100.
The complete pipeline is shown in Figure 4.1.4.
1. An introduction to the storage connector in Hopsworks: https://docs.hopsworks.ai/feature-store-api/2.5.9/generated/storage_connector/
2. Students Performance in Exams: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
Figure 4.1.5 presents the expected resulting tables and the view in the data warehouse.
students_exam was a table loaded from the open dataset. not_prepared_performance
had the same structure as students_exam, but all the testPreparationCourse should
be ’none’. Different from the two tables, passed_math_performance was designed to
be a view to save space and add diversity to the scenario to some extent.
4.2 Implementation Details
4.2.1 Management of DBT jobs
The management of DBT jobs could be divided into creation, reading, modification
and deletion of DBT jobs. Since DbtJob is an extension of the basic Job class, just
like DockerJob and YarnJob, the management of DBT jobs followed the existing
architecture of the management of the other job classes. Figure 4.2.1 shows the brief
architecture of the job management in Hopsworks. The front-end UI communicates
with Hopsworks via RESTful APIs. The jobs are stored in the database in the back-
end.
As stated, the APIs and architecture of creating, deleting, viewing and modifying the
other Job classes were reused for DbtJob class. The only work to do for the DBT job
management in this section was to add this new job type.
4.2.2 Execution management of DBT jobs
The execution management of DBT jobs was similar to the implementation of the
management of DBT jobs. As Figure 4.2.2 shows, clients can send requests
including executing a DBT job, viewing the execution information, stopping the
execution and deleting the execution. In those functionalities, the project reused the
existing code of viewing the execution and deleting execution, just like managing the
execution of other kinds of jobs in Hopsworks.
The main focus of this project would be developing the functionalities to start and stop
the execution of a DBT job, which will be introduced in the next sections.
4.2.3 Start the execution of DBT jobs
This functionality would be the core of the support for running DBT jobs on
Hopsworks. According to the design presented in Section 4.1.1, the implementation
would use Docker containers to run DBT jobs. Figure 4.2.3 demonstrates the core
implementation of starting an execution of a DBT job. The execution request would
be sent to DbtController class, which is a vital class to start and stop the execution.
DbtController would create a Docker container according to the constructed Docker
image, with necessary environment variables injected. The environment variables
include the DBT project directory, the execution id, the project id, the log directory, the
DBT profile information and the user groups for different directories. These variables
can be injected using -e when executing docker run command [14].
The tasks that the Docker container completes are also explained in Figure 4.2.3.
The main tasks are installing the dependencies for DBT CLI, generating certificates,
mounting to HopsFS, executing the DBT command and then exporting the generated
logs and DBT artifacts to the HopsFS directory. The generated certificates are
necessary for mounting to HopsFS. The logs and artifacts are copied from the original
directory to the HopsFS directories so that they are visible to users and will not
be overwritten. The artifacts can also be downloaded and used by users for further
purposes.
There are several ways to install DBT CLI, such as using pip or Homebrew [7]. This
project installed DBT CLI from source because this makes it easier to select a specific
version. To install dbt-core and dbt-snowflake, it is necessary to install git and pip
as dependencies on top of the basic Ubuntu environment in the container. Table
4.2.1 shows the version of dependencies in the Docker image.
ubuntu 18.04
pip 22.1.1
dbt-core v1.1.0
dbt-snowflake v1.1.0
In addition, it is also worth mentioning that each execution would be stored in the
database for all job types in Hopsworks. Therefore, the execution of DBT jobs also
reused this part of code to save each execution.
3. Install dbt from source: https://docs.getdbt.com/dbt-cli/install/from-source
4.2.4 Stop the execution of DBT jobs
Stopping the execution of DBT jobs was also implemented based on Docker. As
Figure 4.2.4 shows, the implementation killed the running Docker container to stop
the execution of a DBT job. Specifically, docker rm -f was used to force the removal of
the running container. This command can stop the container immediately and remove
all the data of the container.
Figure 4.2.4: Stopping a DBT job execution kills the created Docker container
The previous sections introduced the development of the management of DBT jobs
and the execution of DBT jobs. The most critical part of the solution was using Docker
containers to start and stop the execution of the DBT jobs. In addition, DbtController
was the core class to control the interaction between requests and Docker containers in
Hopsworks. The following section will introduce the details of constructing the sample
DBT project, which will be used to check the feasibility of the solution later.
4.2.5 The sample DBT project
As mentioned in Chapter 2, this project used DBT CLI to create a DBT project which
contained a data pipeline. The sample project was created by the dbt init command. In
addition to that, it was necessary to configure the data warehouse to connect to. When
developing locally, this connection configuration needs to be set in profiles.yml.
4. Introduction to the dbt init command: https://docs.getdbt.com/reference/commands/init
5. Introduction to configuring the profile: https://docs.getdbt.com/dbt-cli/configure-your-profile
Snowflake was used in this sample project. The implementation of the DBT project
included the models and the tests.
The models needed to be created in the models sub folder, as shown in Figure 4.2.5.
There were two SQL files to define the tables and views, and the schema.yml file to
configure the models.
The first model to create was the not_prepared_performance. Listing 4.1 shows that
this model took all columns from a source table with the records that didn’t take the
test preparation course. The source table was the table uploaded from the open data
set in Snowflake, which would be specified in schema.yml later. This model created a
new table in Snowflake named not_prepared_performance. Figure 4.2.6 shows part
of the data in the table in Snowflake.
select *
from {{ source('PUBLIC', 'STUDENTS_EXAM') }}
where testpreparationcourse = 'none'
After creating the SQL model, it was necessary to define it in schema.yml, as presented
in Listing 4.2. It defined the source table STUDENTS_EXAM in Snowflake from the
public data set and also defined the model to create. There were two tests for the
model not_prepared_performance.sql, checking that the students did not take the
preparation course and that the math scores are not null. These tests used the built-in
test framework of DBT.
models:
  - name: not_prepared_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: writingscore
      - name: readingscore
      - name: mathscore
        tests:
          - not_null
Listing 4.3 shows the second SQL model, which created a view of the records of students
who passed the math exam, based on the previous model not_prepared_performance.sql.
The reading score and writing score were excluded from this view. This model used the
ref function to refer to the previous model. Figure 4.2.7 presents part of the resulting
view in Snowflake.
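Since the original listing is not reproduced here, the following is only a sketch of what this model plausibly looks like, reconstructed from the description above: the columns follow schema.yml, the passing threshold of 60 follows the dbt_expectations test in the schema excerpt shown next, and views are the default materialization in DBT.

-- models/passed_math_performance.sql (sketch reconstructed from the description; not the original listing)
select
    gender,
    race,
    parentalEducation,
    lunch,
    testpreparationcourse,
    mathscore
from {{ ref('not_prepared_performance') }}
where mathscore >= 60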
# models/schema.yml (excerpt): the dbt_expectations test on the mathscore column
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 60
              max_value: 100
Again, all the models were defined in the schema.yml file. The external package used
was included in packages.yml and installed. This is a relatively simple and
basic DBT pipeline project. DBT also offers features such as macros and other
materializations to solve more complex tasks (a small sketch of a macro is shown below). The complete code is attached in
Appendix A.
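As a small illustration of a macro (hypothetical; not part of the sample project), a reusable predicate can be defined once and then used in several models:

-- macros/passed.sql (hypothetical macro)
{% macro passed(score_column, threshold=60) %}
    {{ score_column }} >= {{ threshold }}
{% endmacro %}

-- usage inside a model:
-- select * from {{ ref('not_prepared_performance') }} where {{ passed('mathscore') }}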
In Chapter 5, the results of applying the solution will be demonstrated. The constructed
DBT pipeline will be run on Hopsworks to check whether the solution is
feasible.
6. Introduction to dbt_expectations: https://hub.getdbt.com/calogica/dbt_expectations/0.1.2/
Chapter 5
Results and Discussion
This chapter first includes the test results of the management of DBT jobs and
executions. It also describes using the developed sample DBT project to run several
basic DBT commands for executing a DBT job on the Hopsworks platform, to check the
feasibility of the implemented solution. After making sure the solution was working,
the execution time and memory usage were tracked for measuring performance.
This section describes the tests for the use cases to manage DBT jobs and execution, and
also the tests of executing several basic DBT commands as the input arguments using
the created DBT pipeline project. Although this project didn’t include the development
of the front end, some UIs could be reused for DBT jobs. The tests took advantage
of the existing UIs and used APIs to test the rest of the functionalities. All the API tests
were executed using Postman v8.6.2. The DBT project created in Section 4.2.5 was
uploaded to HopsFS under the folder /DBTProjects/demo for test use.
The tests for managing DBT jobs include creating, modifying, viewing and deleting
a DBT job. The tests used the new UI of Hopsworks and the demo project named
demo_fs_meb10000 with project id 119, created by the Hopsworks tutorial.
Since this project did not include developing the UI, DBT jobs could currently only be created
through the API. To test the creation functionality, Listing 5.1 was used as the request
body in JSON. dbtProjectPath and dbtProfilesStorageConnectorName specified the
path of the project files in HopsFS and the name of an existing storage connector. The
name of the job could be specified in the PUT request Uniform Resource Locator (URL),
such as https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance.
Figure 5.1.1 shows the created DBT job in UI, reusing the UI for other types of
jobs.
Viewing a DBT job was also tested using the API. The GET request address
was https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance
and the returned result is shown in Listing 5.2.
Listing 5.2: Response of viewing the DBT job dbt_exam_performance
{
  "type": "jobDTO",
  "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance",
  "config": {
    "type": "dbtJobConfiguration",
    "appName": "dbt_exam_performance",
    "dbtProjectPath": "/DBTProjects/demo",
    "dbtProfilesStorageConnectorName": "snowflake_sc",
    "jobType": "DBT"
  },
  "creationTime": "2022-06-11T14:57:45Z",
  "creator": {
    "href": "https://localhost:8181/hopsworks-api/api/users/10000"
  },
  "executions": {
    "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions"
  },
  "id": 43,
  "jobType": "DBT",
  "name": "dbt_exam_performance"
}
This function also reused the code for modifying other types of jobs. Modifying a job
calls the same API as creating a job with the same name of the job to change. In this test,
the same PUT request was sent with a different body parameter, which was changing
dbtProfilesStorageConnectorName to be ”snowflake_new_sc”. The response body is
shown in Listing 5.3 to verify the modification succeeded. This was also the response
of the GET request of the DBT job dbt_exam_performance.
Listing 5.3: Response of modifying the DBT job dbt_exam_performance with a new
storage connector
{
  "type": "jobDTO",
  "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance",
  "config": {
    "type": "dbtJobConfiguration",
    "appName": "dbt_exam_performance",
    "dbtProjectPath": "/DBTProjects/demo",
    "dbtProfilesStorageConnectorName": "snowflake_new_sc",
    "jobType": "DBT"
  },
  "creationTime": "2022-06-11T14:57:45Z",
  "creator": {
    "href": "https://localhost:8181/hopsworks-api/api/users/10000"
  },
  "executions": {
    "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions"
  },
  "id": 43,
  "jobType": "DBT",
  "name": "dbt_exam_performance"
}
Users can click on the delete button in the Hopsworks UI to delete any type of job. After
clicking on the delete button and confirming the deletion, the job at
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance could not
be found using a GET request, as shown in Listing 5.4.
Besides testing the use cases of managing DBT jobs, the tests for managing DBT
job execution were also executed. The tests included starting, stopping, getting and deleting an
execution. Unlike job management, an execution cannot be modified.
Start an execution
The dbt_exam_performance job was created again for testing the execution. Starting
a DBT job execution simply reused the UI for starting the execution of Docker jobs. After
clicking on the Run button, users also need to input an argument to start execution,
as shown in Figure 5.1.2. This argument could be empty or a DBT CLI command. In
this test, an empty command was tested to check if the Docker container was created.
The tests of executing a set of DBT CLI commands are presented in
Section 5.1.2.
Figure 5.1.2: Users need to input an argument to start an execution of a DBT job
Figure 5.1.3: The execution record of the DBT job dbt_exam_performance with an
empty argument
Figure 5.1.4 shows that the execution with id 66 created a Docker container on the
server.
Figure 5.1.4: The Docker container of the execution of DBT job dbt_exam_performance
Get an execution
Besides the result shown in the UI, the execution could also be obtained through the API.
Listing 5.5 presents the result of the GET request to
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/66
with execution id 66.
Stop an execution
To test the stop function, an execution was first started using the UI, and then a PUT
request was sent to stop this execution. The request needed to specify the target
execution id, as in
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/67/status,
with payload "state":"stopped".
Listing 5.6 gives the response of stopping DBT job execution 67. The state became
”killed”.
Figure 5.1.5 shows that the Docker container for execution 67 was killed after calling
the API.
Delete an execution
Similar to the delete functionality for DBT jobs, this function also reused the
existing mechanism for deleting executions of other types of jobs. After calling the
DELETE request https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/67,
this execution disappeared from the execution list in the UI and could no longer
be obtained by a GET request, as shown in Listing 5.7.
To summarize, the tests for DBT job and execution management show that these
functionalities were completed and the use cases were fulfilled.
In this section, the tests for executing a set of basic DBT commands were recorded.
To check that the commands were successfully executed, it was necessary to verify
the generated logs and artifacts. The logs for an execution include stdout.log
for standard output and stderr.log for error output. The logs should be under
the directory root/Logs/DBT/{execution id} in HopsFS, where {execution id}
is replaced by the actual number. Artifacts of DBT jobs should be under
root/DBTArtifacts/{execution id}, and different DBT commands can result in
different DBT artifacts. In the tests, each DBT command was used as the input
argument to start a DBT job execution.
dbt deps
This command installs the specified packages, which does not result in any artifacts.
The error log was empty and the output log is shown in Figure 5.1.6, meaning that the
command installed the dependencies successfully.
dbt compile
dbt run
dbt run is a vital command to execute the models against the data warehouses.
Figure 5.1.9 shows the stdout.log; the corresponding run_results.json and
manifest.json were also available in HopsFS. Although the log showed that the
execution succeeded, there were some strange characters in the log near ”SUCCESS”
and ”Completed successfully”. The likely cause is the colouring of these words:
they appear in green when the command is run in a terminal, so the stray characters
are probably the colour codes emitted by DBT that are not rendered in the log file.
dbt test
Figure 5.1.10 shows the output log of executing dbt test. It executed all the defined tests,
including the not-null tests and the dbt_expectations test. The error log was empty, and two
artifact files, run_results.json and manifest.json, were generated as expected. The
stray characters are likely caused by the same reason as described above.
dbt build
dbt build ran the models and the tests, since there were no seeds or
snapshots defined in the developed DBT project. As expected, it resulted in an empty
error log, a successful output log shown in Figure 5.1.11, run_results.json and
manifest.json.
dbt ls
All the resources were listed in stdout.log, as shown in Figure 5.1.12. No error was
reported. manifest.json was made available to users, while dbt ls did not result in
any update to the other artifacts.
In all, the tests for executing a set of basic DBT commands passed. The standard output logs
showed the result summaries, and the artifacts were generated as expected.
There was a minor problem with stray characters in the logs, which are probably
caused by the character colours set by DBT itself.
The performance of the solution was measured by the storage used by the Docker image
and the Docker containers, and by the execution time of a DBT job.
Figure 5.2.1 shows that the created Docker image registry.service.consul:4443/dbt had
a size of 795 MB. It also presents the size of each layer, obtained using the command docker history
<image id>. Most of this space was used for installing dependencies, such as
dbt-core, dbt-snowflake, Python, git, wget and so on.
Figure 5.2.2 shows the size of one Docker container for executing a DBT job. The virtual
size was the size of its image which could be shared by different containers. 503kB was
the size of the writable layer on top of the image.
In this project, the time for executing dbt run against the created sample DBT project
was recorded. This command was used because it could be considered as the most
basic and necessary command to generate the models in the DBT project. Figure 5.2.3
presents the execution time separated by a few tasks.
Figure 5.2.3: Execution time of dbt run command of the sample DBT project
There was a 1-second gap between the moment the Docker container was requested with the
docker run command and the moment it actually started. A possible cause is that passing the
environment variables and volumes used to start the container takes some time.
In addition, Hopsworks uses Kubernetes to manage the Docker containers, so it
probably took some time to start the container. Then, 2 seconds were used to generate
certificates for connecting to HopsFS. Mounting HopsFS from the container took 1
second. After that, 44 seconds were consumed before executing the DBT command.
That time was mainly used for checking the existence of directories, generating DBT
profiles, changing file ownership and so on. This preparation took the longest
time apart from the actual DBT command execution; a possible reason is that
executing commands on the mounted HopsFS directory needs more time. As
shown in Figure 5.2.3, dbt run against the current DBT project took 25 seconds to
complete. The 25 seconds also included the time to write results to logs, so it might
be longer than usual. Then only 1 second was used for preparing the results, including logs
and artifacts.
5.3 Discussion
According to the validation test, the developed support worked as expected. With this
support, users can orchestrate SQL tasks in the Hopsworks Feature Store to do data
transformation for feature engineering. The DBT job executions could result in tables
or views in data warehouses, which is Snowflake in this project, and they can be used
to create feature groups in Hopsworks Feature Store. Users can create normal feature
groups or on-demand feature groups stored externally in Hopsworks based on the
results of DBT executions. Those feature groups could be used for machine learning
models in the feature store.
The performance measurements include measuring the size of a Docker container and
the time of executing dbt run command for the sample DBT project. The size of a
Docker container consists of 795 MB for the base image size and 503 kB for the writable
layer’s size. This indicates that the user might need at least around 800 MB of space
on the server to use the developed support for DBT jobs in this project. In addition,
compared to executing a DBT command locally, executing it in Hopsworks needs extra
time to generate certificates, mount to HopsFS and prepare the environment. The
operations like writing and reading to HopsFS also need more time than writing and
reading local files. However, this extra time is necessary so that users can view the
persistent results of each execution, including logs and artifacts, in the Hopsworks
platform.
1. Feature Groups in Hopsworks: https://docs.hopsworks.ai/feature-store-api/3.0.0-RC2/generated/feature_group/
2. On-Demand (External) Feature Groups in Hopsworks: https://docs.hopsworks.ai/feature-store-api/3.0.0-RC2/generated/on_demand_feature_group/
Chapter 6
Conclusion and Future Work
This chapter concludes the development and the tests of this project and also states
limitations and possible future work.
6.1 Conclusion
To summarize, this project designed and developed a working solution to support
executing DBT tasks in Hopsworks for feature engineering, which enabled analytics
engineers to transform data using SQL within the Hopsworks Feature Store. The
solution was based on Docker, which created lightweight environments for connecting to the
Snowflake data warehouse and executing DBT commands.
A DBT project connecting to Snowflake was used in the project to verify if the solution
was working. The solution passed the tests of creating, modifying, deleting, getting
a DBT job and starting, stopping, getting and deleting an execution of a DBT job,
meaning that the developed solution fulfilled the desired use cases. In addition, the
tests for executing a set of basic DBT commands demonstrated that the solution could
result in the expected logs and artifacts. Furthermore, it was measured that the
Docker image had a size of 795 MB on the server, which was mostly used for installing
the dependencies. Considering the time of executing a DBT command, the longest time
was spent on preparing the folders, the DBT profile file and log files on HopsFS. The
advantage is that users can read the logs and artifacts of each execution in HopsFS in
the Hopsworks platform.
In short, the project enabled feature engineering
by orchestrating SQL models on Hopsworks using DBT, and it was shown that the
solution was working as expected. Supporting DBT jobs might make Hopsworks
become one of the most complete feature engineering platforms, with full support of
Python, Docker, Spark, Flink and SQL.
6.2 Limitations
There were a few limitations to this project. For instance, the developed solution
only supports connecting to Snowflake, so users can currently only use Snowflake to
do the transformations. Officially, dbt Labs fully supports the following data warehouses
or platforms: Postgres, Snowflake, Redshift, Apache Spark and BigQuery [1]. There
are also plugins for other platforms provided by the community or vendors [1]. The
DBT project used for testing was built on Snowflake because
it provided a free trial. Due to the time limit, support for other data warehouses was
not considered in this project; this could also be future work.
Another limitation would be that the project tested only a set of necessary DBT
commands, without covering all of them. However, it would be difficult to
cover all possible DBT commands with all possible arguments within the available
time.
Moreover, this project was a pure back-end development, so one important future
work is developing UI for the support of DBT jobs. Although some web pages could
be reused, it is still necessary to add UI for creating a DBT job. The logs and artifacts
could also be shown on the details page of a DBT execution.
Another possible piece of future work would be to
automatically create on-demand feature groups using the result table or view in
the data warehouses. It means that the data will be stored in Snowflake outside
Hopsworks, but users can use those data to train models within Hopsworks [16]. This
new feature could possibly make the whole feature engineering process more automatic
and provide a more consistent user experience.
Bibliography
[2] dbt Labs. build. https://docs.getdbt.com/reference/commands/build. Retrieved 31 May 2022.
[4] dbt Labs. dbt Command reference. https://docs.getdbt.com/reference/dbt-commands. Retrieved 6 May 2022.
[5] dbt Labs. deps. https://docs.getdbt.com/reference/commands/deps. Retrieved 3 June 2022.
[6] dbt Labs. deps. https://docs.getdbt.com/reference/commands/deps. Retrieved 3 June 2022.
[9] dbt Labs. Packages. https://docs.getdbt.com/docs/building-a-dbt-project/package-management. Retrieved 27 May 2022.
[12] dbt Labs. What is dbt? https://docs.getdbt.com/docs/introduction. Retrieved 4 May 2022.
[13] Docker Inc. Docker overview. https://docs.docker.com/get-started/overview/. Retrieved 16 May 2022.
[14] Docker Inc. docker run. https://docs.docker.com/engine/reference/commandline/run/. Retrieved 30 May 2022.
[15] Great Expectations. Down with Pipeline debt / Introducing Great Expectations. https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a. Retrieved 11 May 2022.
[22] Ismail, Mahmoud, Niazi, Salman, Ronström, Mikael, Haridi, Seif, and Dowling, Jim. “Scaling HDFS to more than 1 million operations per second with HopsFS”. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE. 2017, pp. 683–688.
[23] Nickoloff, Jeffrey and Kuenzli, Stephen. “Docker in action”. In: Manning, 2019. Chap. 1. ISBN: 9781617294761.
[24] Ozdemir, Sinan. “Feature Engineering Bookcamp”. In: Manning, 2021. Chap. 1. ISBN: 9781617299797.
[25] thaJeztah. Explain the SIZE column in ”docker ps -s” and what ”virtual” keyword means. https://github.com/docker/docker.github.io/issues/1520#issuecomment-305179362. Retrieved 4 June 2022.
[26] Tristan Handy. What, exactly, is dbt? https://blog.getdbt.com/what-exactly-is-dbt/. Retrieved 30 March 2022.
Appendix - Contents
Appendix A
DBT Project Code
# dbt_project.yml (excerpt)
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

vars:
  'dbt_date:time_zone': 'Europe/Stockholm'
-- models/not_prepared_performance.sql (excerpt)
select *
from {{ source('PUBLIC', 'STUDENTS_EXAM') }}
where testpreparationcourse = 'none'

# models/schema.yml (excerpt)
sources:
  - name: PUBLIC
    tables:
      - name: STUDENTS_EXAM

models:
  - name: not_prepared_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: writingscore
      - name: readingscore
      - name: mathscore
        tests:
          - not_null

  - name: passed_math_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: mathscore
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 60
              max_value: 100
TRITA-EECS-EX-2022:
www.kth.se