Data Build Tool (DBT)
Zidi Chen
Examiner
Vladimir Vlassov
KTH Royal Institute of Technology
Supervisor
Jim Dowling
KTH Royal Institute of Technology
Sina Sheikholeslami
KTH Royal Institute of Technology
Abstract
Feature engineering at scale is always critical and challenging in the machine learning
pipeline. Modern data warehouses enable data analysts to do feature engineering by
transforming, validating and aggregating data in Structured Query Language (SQL).
To help data analysts do this work, Data Build Tool (DBT), an open-source tool, was
proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source scalable
feature store, would like to add support for DBT so that data scientists can do feature
engineering in Python, Spark, Flink, and SQL in a single platform. This project aims
to create a concept about how to build this support and then implement it. The
project checks the feasibility of the solution using a sample DBT project. According
to measurements, this working solution needs around 800 MB of space in the server
and it takes more time than executing DBT commands locally. However, it persistently
stores the results of each execution in HopsFS, which are available to users. By adding
this novel support for SQL using DBT, Hopsworks might be one of the most complete
platforms for feature engineering so far.
Keywords
Abstract
Feature engineering at scale is always critical and challenging in the machine learning
pipeline. Modern data warehouses enable data analysts to do feature engineering by
transforming, validating and aggregating data in Structured Query Language (SQL).
To help data analysts carry out this work, Data Build Tool (DBT), an open-source tool,
was proposed to build and orchestrate SQL pipelines. Hopsworks, an open-source
scalable feature store, would like to add support for DBT so that data scientists can do
feature engineering in Python, Spark, Flink and SQL on a single platform. This project
aims to create a concept for how to build this support and then implement it. The
project checks the feasibility of the solution using a sample DBT project. According
to measurements, this working solution needs around 800 MB of space on the server
and it takes more time than executing DBT commands locally. However, it persistently
stores the results of each execution in HopsFS, which are available to users. By adding
this novel support for SQL using DBT, Hopsworks may be one of the most complete
platforms for feature engineering so far.
Keywords
Acknowledgements
I would like to extend my deepest gratitude to Jim Dowling and Hopsworks for giving me
the opportunity to do this project. Special thanks to Gibson Chikafa; without his help,
I could not have completed this project. I also want to thank my examiner Vladimir Vlassov and
my supervisor Sina Sheikholeslami. Many thanks to all the colleagues in Hopsworks
for their support and help.
Acronyms
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose and Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Hopsworks Feature Store . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 ELT and ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 SQL and other data engineering languages . . . . . . . . . . . . . . . . 7
2.5 DBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1 What is DBT? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 Features of DBT CLI . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.3 DBT projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.4 Great expectations DBT plugin . . . . . . . . . . . . . . . . . . . 11
2.5.5 DBT artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.1 What is Docker? . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6.2 Docker advantages . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6.3 Docker architecture . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Method 16
3.1 Research Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Validation Tests and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Validation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Work 20
4.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 DBT Jobs in Hopsworks . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 A sample DBT project . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Management of DBT jobs . . . . . . . . . . . . . . . . . . . . . . 23
4.2.2 Execution management of DBT jobs . . . . . . . . . . . . . . . . 24
4.2.3 Start the execution of DBT jobs . . . . . . . . . . . . . . . . . . 25
4.2.4 Stop the execution of DBT jobs . . . . . . . . . . . . . . . . . . 27
4.2.5 The sample DBT project . . . . . . . . . . . . . . . . . . . . . . 27
References 49
Chapter 1
Introduction
This chapter gives a brief introduction to the thesis project. It starts with the motivation
of the project, and then it describes the problem to solve. The purposes and goals
are introduced to explain the research question and the expected results. In addition,
benefits, ethics, sustainability issues and methodology of the project are also included.
At the end of this chapter, an outline of the thesis is presented.
1.1 Motivation
Both data warehouses and traditional databases store data, but they differ
when used for machine learning purposes. Compared to
traditional databases, modern data warehouses do not perform as well for
transactional workloads, but they are designed for analysing data. Data warehouses can also handle the
situation of multiple data sources, which means that they can integrate data from
several sources such as several different databases.
Based on data warehouses, Extract, Load, Transform (ELT), which is a new kind
of data pipeline, is widely used to deal with raw data in a more efficient way. In
the ELT process, raw data are firstly extracted and loaded by data warehouses.
Transformations on data are then applied.
To transform data, modern data warehouses enable data engineers to write data
pipelines in SQL. In simple cases, a few SQL queries might complete the task.
However, when the tasks become complex, with multiple dependencies across the
queries, issues can arise. To solve this problem, DBT was proposed by Fishtown Analytics.
DBT helps to build models with dependencies and tests, using a combination of
SQL and Jinja [26]. DBT also offers other features, such as automatically generated
documentation, self-defined macros and so on. Data validation is also possible
by installing external plugins in the project. Fishtown Analytics provides the DBT
Command Line Interface (CLI) tool, an open-source implementation, to run DBT
projects. DBT is now a popular, and arguably dominant, tool for building pipelines of
SQL jobs.
This thesis aims to use DBT CLI to help data analysts orchestrate the SQL
transformation tasks for machine learning purposes.
1.2 Problem
Hopsworks is a scalable feature store that assists data engineers and data scientists
to do data transformations for feature engineering. This feature store makes creating
features and training datasets more automatic [18]. It also improves the reusability
and quality of the generated features and datasets.
In feature engineering, data scientists commonly use Spark, Flink and Python to
transform data as Resilient Distributed Datasets (RDDs), while data engineers and analysts
usually use SQL for this purpose. The Hopsworks platform now offers support for running Spark
and Flink jobs in the community version and Python jobs in the enterprise version
[18]. However, there is no support for executing and orchestrating SQL jobs.
This project investigates the capabilities of DBT as a tool in the feature engineering
stage of machine learning pipelines within the context of the open-sourced Hopsworks
platform and support for data provenance.
The research question of this project would be: How to apply DBT as a tool for
feature engineering in the Hopsworks platform to build data pipelines and support
data provenance?
1.3 Purpose and Goal
The goal of this project is to enable the support for the execution of DBT jobs, so that
users can do feature engineering using Flink, Spark, Python and SQL jobs within a
single platform. With this new support, data analysts can run and orchestrate SQL
jobs in the Hopsworks platform using DBT to create features, feature groups and
datasets.
The deliverables include the thesis, the code and the documentation to explain the
code.
1.4 Methodology
The methodology of this project can be considered as system research on Hopsworks.
The first step would be to figure out the components related to this project, and the
interactions between them. The next step is to build a concept of how to realize the
goals, and then to demonstrate whether the concept is feasible or not. In this case, it
is to generate a design about how to run a DBT project as a job in Hopsworks. Then, the
design should be implemented to check its feasibility. According to the situation, minor
changes could be applied to adjust the design in order to achieve a better result.
1.5 Outline
Chapter 2 introduces the necessary background for the project, including the
introduction of applied technologies. Chapter 3 presents the method and the
evaluation method of this project. Chapter 4 explains the implementation of the work
in detail. Chapter 5 shows the results and Chapter 6 gives the conclusion of the
project.
Chapter 2
Background
This section introduces the background required for this thesis project, including the
concept of feature engineering, the difference between using SQL and other data
engineering languages, and the comparison between ELT and Extract, Transform,
Load (ETL) process. It also presents the utilized technologies and frameworks in the
project, such as the Hopsworks platform, DBT CLI and Docker.
2.1 Feature Engineering
Intuitively, feature engineering processes the raw data from data sources, and the
results are input into machine learning models, as shown in Figure 2.1.2. These results
could be some data columns, which are transformed from the original data.
Feature engineering also has its own pipeline as shown in Figure 2.1.3. The first step is
to understand the data; then the data are transformed into formats that can be
evaluated and understood by machine learning models, for example, transforming
text into vectors and images into matrices [24]. Then, the obtained
structured features need to be improved for the machine learning models. We might
select some useful data columns and do necessary transformations such as one-hot
encoding. After applying several steps to the raw data, we can use a single model to
evaluate the designed feature engineering steps. According to the model performance,
we might do modifications to the designed transformation steps.
Data scientists and engineers usually do feature engineering to extract the features for
machine learning models to explore hidden patterns in the data. If the features, namely
data columns, are poorly formatted and not meaningful to the models, the training
results might not be worthwhile [24]. In contrast, well-prepared features could help data
scientists to achieve a faster training speed and a more accurate model. Feature
engineering aims to represent data as better features for models while preserving the
internal relationships of data.
2.2 Hopsworks Feature Store
A feature store is where data scientists and data engineers can manage and persist the
features. It can help them to do transformations of raw data and store the produced
features. The computed features can be reused for multiple machine learning models.
The popular feature stores now include Hopsworks, Feast, Featureform and Databricks
Feature Store.
Among those feature stores, Hopsworks distinguishes itself as a strong and scalable
platform with an open-source community version. The platform is project-based and
it supports dynamic roles [21]. As shown in Figure 2.2.1, Hopsworks provides an online
feature store with low latency and an offline feature store with high throughput, with
Application Programming Interface (API)s for Java, Scala and Python [17]. The online
feature store handles the latest feature values with high availability and provides them
to online applications. The offline feature store supports asynchronously transforming
and analysing big data based on HopsFS, which is a highly-scalable extension of
Hadoop Distributed File System (HDFS) [22]. The output of the offline feature store
can be large numbers of batches of features, which will be used by machine learning
models [17].
Figure 2.2.1: Online and offline features stores in Hopsworks (redrawn after [17])
Figure 2.2.2 shows the system architecture of the Hopsworks platform. Hopsworks can
provide service based on Amazon Web Services (AWS), Google Compute Engine (GCE)
and other platforms. HopsFS and RonDB are used in the storage layer. Users can run
Spark, Flink or Python jobs in the platform based on HopsYARN, which is the resource
management layer. In addition, Docker jobs can also be executed and managed by
Kubernetes in Hopsworks.
2.3 ELT and ETL
ELT is comparatively new compared to ETL. In practice, ETL is good for structured and
operational data with relational databases. If the data come from several different
sources, then we need to integrate and synchronize the data sources when transforming
them, which might not be very time-efficient. In contrast, ELT supports raw data and
is widely used for machine learning purposes. It offers quicker data processing speed
but less precision compared to ETL [20]. In Machine Learning Operations (MLOps),
the most common choice is ELT because it shows better performance in handling big
and real-time data [19].
2.5 DBT
DBT is a tool to help data engineers and scientists do the ”T” (transformation) in
the ELT process [26]. It can work with RedShift, BigQuery, Snowflake and many
other popular data warehouses or databases. This section introduces this popular data
processing tool including the features of DBT, DBT CLI, DBT project structure, DBT
artifacts and the great expectations plugin for DBT to validate data.
2.5.1 What is DBT?
DBT is a data processing tool that helps data analysts and engineers orchestrate data
engineering workflows. Fishtown Analytics provides the DBT CLI product to run DBT
projects, which is free and open source. The company also released DBT Cloud with an
Integrated Development Environment (IDE), which is more powerful than DBT CLI.
In this thesis, the DBT CLI tool is integrated.
Figure 2.5.1 shows that DBT works on transforming data in data warehouses and the
transformed data will be used by data consumers, such as machine learning models.
Data loader extracts raw data from sources, which is the ”E” in the ELT process. The
extracted raw data are then stored in data warehouses, which is the ”L”. Then DBT
helps to do the ”T” in the data warehouses. DBT completes its task by compiling the
given code into SQL code and executing it against the database. DBT code is a
combination of SQL and Jinja, which is a template engine for Python.
Figure 2.5.1: How DBT works in the data processing procedure (redrawn after [26])
1. Jinja documentation: https://jinja.palletsprojects.com/en/3.1.x/
2.5.2 Features of DBT CLI
DBT enables users to create or update SQL models with defined materialization by
writing SELECT queries, which means users do not write CREATE or UPDATE queries with
DBT [12]. The most commonly used materializations are views and tables. There are
also other predefined materializations, and custom materializations are
allowed as well.
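As a minimal sketch (the model, table and column names here are illustrative and not taken from the thesis project), a model file is just a SELECT statement, and the materialization can be set in a config block at the top of the file:

-- models/customers.sql (illustrative)
{{ config(materialized='table') }}  -- omit this line to use the default materialization (view)

select id, name, country
from raw_customers                  -- an existing table in the warehouse; illustrative name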
Another important feature of DBT is the ref function. DBT uses the ref function
to define the dependencies and orders between models. Figure 2.5.2 shows
my_second_dbt_model which is an example of using the ref function. It means to
select all the records whose id is 1 from the first DBT model. When we run the DBT
project, my_first_dbt_model will be executed before my_second_dbt_model, because
my_second_dbt_model depends on the result of my_first_dbt_model.
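Since Figure 2.5.2 is not reproduced here, the following sketch shows what such a pair of models looks like; it mirrors, in simplified form, the example models that dbt init generates, with the second model selecting from the first through the ref function:

-- models/example/my_first_dbt_model.sql (simplified sketch)
select 1 as id
union all
select null as id

-- models/example/my_second_dbt_model.sql
select *
from {{ ref('my_first_dbt_model') }}
where id = 1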
DBT can generate Directed Acyclic Graph (DAG)s according to the dependencies
between models. Figure 2.5.3 shows the DAG of my_first_dbt_model and
my_second_dbt_model.
DBT offers a built-in test framework. The tool allows users to write tests in a way that is
both convenient and highly customizable. The simplest method to test the DBT models
would be using predefined generic tests, such as not-null, unique and so on. In
addition, the models can also be tested by singular tests, which are also SQL SELECT
queries [11]. For example, if we want no values less than 0 in the age
column of the result table, we can write a singular test that selects all records
whose age is less than 0. If this query returns any rows, the singular test fails.
Furthermore, the most advanced way of testing in DBT is to parameterize the SELECT
queries of such tests [11]. This means that the value 0 in the previous example can be
replaced by any given value as an input parameter. In this way, DBT improves the
flexibility and reusability of the
code.
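As a hedged sketch of such a singular test (the model and column names are illustrative), the test is a SELECT query stored in the tests folder that returns the offending rows; dbt considers the test failed if the query returns any rows:

-- tests/assert_no_negative_ages.sql (illustrative file and model names)
-- The test fails if any row has a negative age.
select *
from {{ ref('customer_ages') }}
where age < 0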
Apart from these critical features, DBT CLI also provides other functionalities to help
users build their data transformation pipelines, such as DBT package manager and
macros. This tool assists data engineers and data analysts in integrating raw data and
converting them into the expected form by writing simple SQL SELECT queries with Jinja.
2.5.3 DBT projects
This section introduces the structure of a DBT project, the commands in DBT CLI to
run a project and the resulting artifacts.
Figure 2.5.4 shows an example of the directory structure of a DBT project whose name
is demo. This directory is automatically generated by dbt init command. In the
directory, dbt_project.yml stores the configuration of the project, including project
name, directory path, DBT version and so on. packages.yml is used to configure
necessary packages or plugins. The most important sub-folder in the project would
be the models folder, where all the models and schemas are defined. In addition, the
tests sub-folder contains the tests against the models and the target folder contains
the artifacts. Besides, to connect to the target data warehouses or databases, DBT users
need to configure their accounts in the file profiles.yml. This file is not located in the
project folder, but it can be found at ~/.dbt/ if the operating system is Linux.
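As an illustration (all values are placeholders, the exact set of fields follows the dbt-snowflake documentation, and the profile name must match the profile configured in dbt_project.yml), a Snowflake entry in ~/.dbt/profiles.yml might look like the following:

demo:                          # profile name referenced from dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <username>
      password: <password>
      role: <role>
      database: <database>
      warehouse: <warehouse>
      schema: PUBLIC
      threads: 1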
To run a DBT project, DBT CLI provides a set of commands. The commonly used ones
are dbt init to initialize a project, dbt deps to install the package dependencies, dbt
run to run the models and dbt test to run the tests [4].
2.5.4 Great expectations DBT plugin
Great Expectations is an open-source data validation tool that validates data pipelines
by declaring the expected data format [15]. It can be considered a kind of data unit
test using assertions, and it also offers reports and documentation of the tests.
The plugin dbt_expectations allows using great expectations in the DBT projects. To
use the plugin, the dbt_expectations plugin needs to be included in the packages.yml
file in the DBT project. After installing the package, we can write great expectations
tests as DBT tests.
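For illustration, a packages.yml along the following lines declares the plugin (the version range shown is only an example; the exact version used in the thesis project is not stated here), and dbt deps then installs it:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.5.0", "<0.6.0"]   # example version range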
2.5.5 DBT artifacts
DBT commands might result in some artifacts. As shown in Figure 2.5.5, the artifacts
are stored in the target directory under the DBT project folder. The three most
important artifacts are the JavaScript Object Notation (JSON) files: manifest.json,
run_results.json and catalog.json.
These artifacts can be generated by specific commands. They can be used for many
purposes, including finding changes in the tables in the data warehouses and collecting
the running time of execution [8].
2.6 Docker
Docker is used in this project to provide containers for the applications. This
section introduces what Docker is, the possible benefits of using Docker, and the Docker
architecture.
2.6.1 What is Docker?
Figure 2.6.1 shows an example with three containers running on an operating system.
The three containers are created by Docker and run as child processes
of Docker. The resources are separated between containers, which means one
container can only access its own resources.
Figure 2.6.1: Three Docker containers running on the operating system (redrawn after
[23])
2.6.2 Docker advantages
Docker can bring many benefits. The most common benefit is probably increased
portability [23]. For Windows users, some applications which are developed for Linux
machines are not available. One possible solution is to install those applications on a
virtual machine with Linux operating system, but it might take a lot of space and time.
A better approach would be using Docker because it does not need too much space and
the user can possibly do it with a few commands. In addition, Docker containers can
protect the computer from potential attacks or bugs [23]. Viruses or the trouble caused
by bugs can hardly touch the programs and resources out of the container because the
Docker containers are usually sandboxed within the hosting environment.
Furthermore, Docker can resolve version conflicts between the dependencies of
different applications. Two applications can be deployed in two containers with
the specific required version of packages, instead of installing two versions of
dependencies on the same operating system. In addition, Docker can also provide a
neatly deployed environment with necessary packages for the applications. To some
extent, messy and unnecessary dependencies are avoided.
Figure 2.6.2 shows the difference in dependencies of two applications in the operating
system and the Docker containers. Before using Docker containers, application A has
an unnecessary dependency on package z, and application B unnecessarily depends
on package x. In addition, they use different versions of package y. Installing two
versions of package y might also cause problems in the operating system. After
deploying them in different containers, the unnecessary dependencies could possibly
be identified and excluded by programmers. Two different versions of packages are
installed independently in different containers.
2.6.3 Docker architecture
Figure 2.6.3 shows how the Docker components interact with each other. The three
main components are the Docker client, the Docker server/host and the remote registry.
The Docker client is a CLI tool and it interacts with the Docker server via the Representational
state transfer (REST) API. Docker daemon handles the commands from the client
and it also handles Docker images and containers [13]. Docker registry is a remote
repository which owns a lot of Docker images. Those remote images can be pulled
to the local machine to generate containers. Figure 2.6.3 also shows the process in
which a container is generated from a remote image. Docker CLI sends the command
to Docker daemon, and Docker daemon then pulls the image from the remote registry.
The container is then generated from the local image.
Figure 2.6.3: Docker consists of Docker client, Docker server and remote registry
(redrawn after [13])
Chapter 3
Method
This chapter introduces the research method used to complete this project.
It also describes how the designed and developed solution was tested and
assessed.
3.2 Validation Tests and Evaluation
To test the feasibility, the project used a simple scenario which was to run a DBT project
in the Hopsworks platform connecting to Snowflake after applying the solution. The
project firstly needed to set up a DBT pipeline in a DBT CLI project locally. This
project was used as the input for a DBT job in Hopsworks. The tests covered managing
DBT jobs, managing DBT job executions, and executing a set of basic DBT commands.
As described, DBT commands should be executed as input arguments for DBT job
execution. Due to time constraints, not all the DBT commands were tested in the scope
of this project. The following DBT commands were selected for testing because they
can be considered the basic and necessary commands for getting the results of a DBT
pipeline in the data warehouses.
1. dbt deps: Download and install the dependencies specified in packages.yml [5].
2. dbt compile: Compile the files in the models, tests and analyses into SQL files [3].
3. dbt run: Execute the models against the target data warehouse.
4. dbt test: Run the tests defined against the models.
5. dbt build: Run the models, tests, seeds and snapshots [2].
The used DBT version in the project was the latest at the time of development: 1.1.0
for dbt-core and dbt-snowflake.
The expected results of starting an execution of a DBT job would include two parts.
The first necessary result is the logs, which should show the execution results. The
other result is the artifacts generated by DBT itself. Both of them should be available
to users.
This project used Docker to realize the support of running DBT jobs in Hopsworks
and a simple DBT pipeline project was used to test the feasibility of the solution. The
performance measurements were based on these conditions and assumptions. A
common way to assess the performance is to track the resource consumption and
the execution time.
Measuring the resource consumption consists of investigating the size of the Docker
containers and the Docker image in this project. The other commonly used attributes
like Central Processing Unit (CPU) utilization were not considered because they might
vary a lot depending on the actual input DBT project and the input DBT command.
The sizes of Docker containers are related to the size of their Docker image. To clarify,
Figure 3.2.1 shows that there are three Docker containers created using the base Docker
image. The base Docker image is actually reused by different containers and each
container adds a writable layer on top of the image [25]. Therefore, the size of a Docker
container could be considered as the size of the base reusable Docker image plus
the size of the writable layer. docker ps -s presents this size information, and
the Docker image size is presented as the ”virtual size” [25]. It is worth mentioning
that the size shown by this command does not include some disk space usage such as
volumes and configuration files for the containers [25].
Figure 3.2.1: The relationship between a Docker image and Docker containers
Other than the resource consumption, the execution time was also considered for
performance measurement. As Figure 3.2.2 shows, the execution time of a DBT job
could be divided into time for preparation, actual command execution and preparation
of return results. Those times were tracked by printing the current time in the
system. The system was configured with only one node to avoid possible
time differences between machines, so the execution time was recorded in a
stopwatch fashion on the server.
The development and tests were both completed on the Hopsworks platform,
which was installed in a virtual machine using Vagrant. Table 3.2.1 presents the
characteristics of the experimental environment.
1. Installing Hopsworks using Vagrant: https://hopsworks.readthedocs.io/en/stable/getting_started/installation_guide/platforms/vagrant.html
Chapter 4
Work
This section introduces the system design of the solution, including how to develop
the DBT support in Hopsworks and a DBT pipeline project. This DBT pipeline was
used to test the feasibility of the implemented solution. In addition, this section also
presents the implementation details of the solution. It is important to mention that this
project only targeted completing the back-end development, not including the User
Interface (UI) and front-end development.
4.1 System Design
4.1.1 DBT Jobs in Hopsworks
To support DBT jobs on Hopsworks, there are two vital use cases, as shown in Figure
4.1.1. To manage DBT jobs, the core functionalities would be to create, modify, get and
delete a DBT job. Moreover, managing the execution of DBT jobs would be the main
focus of this project. To fulfill this use case, the solution should be able to execute or
stop a DBT job after creation. An execution could also be viewed and
deleted.
Currently, the community version of the Hopsworks platform provides support for
running Spark and Flink jobs. They are based on YARN. In addition, the Hopsworks
Enterprise version also offers support for Docker and Python. Figure 4.1.2 shows the
relationship between those classes. YarnJob, DockerJob and PythonJob extend the
basic job type. In addition, SparkJob and FlinkJob are based on the YarnJob class.
With this knowledge, it was proposed that DbtJob should extend the basic Job class
since it does not need YARN services, just like DockerJob.
As mentioned before, the project should enable users to create, modify, get and delete
a DBT job. Since DBTJob is another extension of the basic Job class, it was possible to
reuse most of the existing code for managing DBT jobs. As a consequence, the main
focus and the difficulty of the solution would be how to execute and stop a DBT job in
the Hopsworks platform.
The proposed solution was to use Docker containers which can provide a configurable
light-weight environment so that the DBT project can run inside the container without
being affected by the complex system environment. Figure 4.1.3 presents the solution
for running a DBT job in a Docker container. Users can take advantage of the
created storage connectors in Hopsworks. When creating a DBT job, the user should
specify which storage connector they want to use. Then, a Docker container should
be generated according to the Docker image with the specified storage connector
information and other environment variables. This container should complete a few
main tasks. The main task would be to execute the defined DBT command. Then, the
execution results should be stored in logs and the execution might also cause some
DBT artifacts, such as the manifest.json file. The container should mount those files
to HopsFS so that they are visible to the users and available for further use.
4.1.2 A sample DBT project
To test the feasibility of the proposed solution, a sample DBT project with a simple
pipeline was designed. This sample DBT project used an open dataset from Kaggle,
which was about students’ exam performance. The dataset was loaded into Snowflake
in a table called students_exam. It simulated a scenario in which the data engineers want
to focus on the students who have not taken the preparation course but passed the
math exam. The DBT project should select the students who have not completed the test
preparation course. In the end, the records of those who have passed the math exam
should be collected in a view. The project also needed to pass a dbt_expectations
test to check that the math grades in the final results are between 60 and 100.
The complete pipeline is shown in Figure 4.1.4.
1. An introduction to the storage connector in Hopsworks: https://docs.hopsworks.ai/feature-store-api/2.5.9/generated/storage_connector/
2. Students Performance in Exams: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
Figure 4.1.5 presents the expected resulting tables and the view in the data warehouse.
students_exam was a table loaded from the open dataset. not_prepared_performance
had the same structure as students_exam, but all the testPreparationCourse should
be ’none’. Different from the two tables, passed_math_performance was designed to
be a view to save space and add diversity to the scenario to some extent.
4.2 Implementation Details
4.2.1 Management of DBT jobs
The management of DBT jobs could be divided into creation, reading, modification
and deletion of DBT jobs. Since DbtJob is an extension of the basic Job class, just
like DockerJob and YarnJob, the management of DBT jobs followed the existing
architecture of the management of the other job classes. Figure 4.2.1 shows the brief
architecture of the job management in Hopsworks. The front-end UI communicates
with Hopsworks via RESTful APIs. The jobs are stored in the database in the back-
end.
As stated, the APIs and architecture of creating, deleting, viewing and modifying the
other Job classes were reused for DbtJob class. The only work to do for the DBT job
management in this section was to add this new job type.
4.2.2 Execution management of DBT jobs
The execution management of DBT jobs was similar to the implementation of the
management of DBT jobs. As Figure 4.2.2 shows, clients can send requests
including executing a DBT job, viewing the execution information, stopping the
execution and deleting the execution. In those functionalities, the project reused the
existing code of viewing the execution and deleting execution, just like managing the
execution of other kinds of jobs in Hopsworks.
The main focus of this project would be developing the functionalities to start and stop
the execution of a DBT job, which will be introduced in the next sections.
4.2.3 Start the execution of DBT jobs
This functionality would be the core of the support for running DBT jobs on
Hopsworks. According to the design presented in Section 4.1.1, the implementation
would use Docker containers to run DBT jobs. Figure 4.2.3 demonstrates the core
implementation of starting an execution of a DBT job. The execution request would
be sent to DbtController class, which is a vital class to start and stop the execution.
DbtController would create a Docker container according to the constructed Docker
image, with necessary environment variables injected. The environment variables
include the DBT project directory, the execution id, the project id, the log directory, the
DBT profile information and the user groups for different directories. These variables
can be injected using -e when executing docker run command [14].
The tasks that the Docker container completes are also explained in Figure 4.2.3.
The main tasks are installing the dependencies for DBT CLI, generating certificates,
mounting to HopsFS, executing the DBT command and then exporting the generated
logs and DBT artifacts to the HopsFS directory. The generated certificates are
necessary for mounting to HopsFS. The logs and artifacts are copied from the original
directory to the HopsFS directories so that they are visible to users and will not
be overwritten. The artifacts can also be downloaded and used by users for further
purposes.
There are several ways to install DBT CLI, such as using pip or Homebrew [7]. This
project installed DBT CLI from source because this makes it easier to select a specific
version. To install dbt-core and dbt-snowflake, it is necessary to install git and pip
as dependencies on top of the basic Ubuntu environment in the container. Table
4.2.1 shows the version of dependencies in the Docker image.
ubuntu 18.04
pip 22.1.1
dbt-core v1.1.0
dbt-snowflake v1.1.0
In addition, it is also worth mentioning that each execution would be stored in the
database for all job types in Hopsworks. Therefore, the execution of DBT jobs also
reused this part of code to save each execution.
3. Install dbt from source: https://docs.getdbt.com/dbt-cli/install/from-source
4.2.4 Stop the execution of DBT jobs
Stopping the execution of DBT jobs was also implemented based on Docker. As
Figure 4.2.4 shows, the implementation killed the running Docker container to stop
the execution of a DBT job. Specifically, docker rm -f was used to force the removal of
the running container. This command can stop the container immediately and remove
all the data of the container.
Figure 4.2.4: Stopping a DBT job execution kills the created Docker container
The previous sections introduced the development of the management of DBT jobs
and the execution of DBT jobs. The most critical part of the solution was using Docker
containers to start and stop the execution of the DBT jobs. In addition, DbtController
was the core class to control the interaction between requests and Docker containers in
Hopsworks. The following section will introduce the details of constructing the sample
DBT project, which will be used to check the feasibility of the solution later.
4.2.5 The sample DBT project
As mentioned in Chapter 2, this project used DBT CLI to create a DBT project which
contained a data pipeline. The sample project was created by the dbt init command. In
addition to that, it was necessary to configure the data warehouse to connect to. When
developing locally, this connection configuration needs to be set in profiles.yml.
4. Introduction to the dbt init command: https://docs.getdbt.com/reference/commands/init
5. Introduction to configuring the profile: https://docs.getdbt.com/dbt-cli/configure-your-profile
Snowflake was used in this sample project. The implementation of the DBT project
included the models and the tests.
The models needed to be created in the models sub folder, as shown in Figure 4.2.5.
There were two SQL files to define the tables and views, and the schema.yml file to
configure the models.
The first model to create was the not_prepared_performance. Listing 4.1 shows that
this model took all columns from a source table with the records that didn’t take the
test preparation course. The source table was the table uploaded from the open data
set in Snowflake, which would be specified in schema.yml later. This model created a
new table in Snowflake named not_prepared_performance. Figure 4.2.6 shows part
of the data in the table in Snowflake.
select *
from {{ source('PUBLIC', 'STUDENTS_EXAM') }}
where testpreparationcourse = 'none'
After creating the SQL model, it was necessary to define it in schema.yml, as presented
in Listing 4.2. It defined the source table STUDENTS_EXAM in Snowflake from the
public data set and also defined the model to create. There were two tests for the
model not_prepared_performance.sql, checking that the students did not take the
preparation course and that the math scores are not null. These tests used the built-in
test framework of DBT.
models:
  - name: not_prepared_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: writingscore
      - name: readingscore
      - name: mathscore
        tests:
          - not_null
Listing 4.3 shows the second SQL model, which created a view of the records of students
who passed the math exam, based on the previous model not_prepared_performance.sql.
The reading score and writing score were excluded from this view. This model used the
ref function to refer to the previous model. Figure 4.2.7 presents part of the resulting
view in Snowflake.
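Since the original listing is not reproduced here, the following is only a sketch of what this model plausibly looks like, reconstructed from the description above: the columns follow schema.yml, the passing threshold of 60 follows the dbt_expectations test in the schema excerpt shown next, and views are the default materialization in DBT.

-- models/passed_math_performance.sql (sketch reconstructed from the description; not the original listing)
select
    gender,
    race,
    parentalEducation,
    lunch,
    testpreparationcourse,
    mathscore
from {{ ref('not_prepared_performance') }}
where mathscore >= 60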
# models/schema.yml (excerpt): the dbt_expectations test on the mathscore column
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 60
              max_value: 100
Again, all the models were defined in the schema.yml file. The external package used
was included in packages.yml and installed. This is a relatively simple and
basic DBT pipeline project. DBT also offers features such as macros and other
materializations to solve more complex tasks (a small sketch of a macro is shown below). The complete code is attached in
Appendix A.
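As a small illustration of a macro (hypothetical; not part of the sample project), a reusable predicate can be defined once and then used in several models:

-- macros/passed.sql (hypothetical macro)
{% macro passed(score_column, threshold=60) %}
    {{ score_column }} >= {{ threshold }}
{% endmacro %}

-- usage inside a model:
-- select * from {{ ref('not_prepared_performance') }} where {{ passed('mathscore') }}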
In Chapter 5, the results of applying the solution will be demonstrated. The constructed
DBT pipeline will be run on Hopsworks to check whether the solution is
feasible.
6. Introduction to dbt_expectations: https://hub.getdbt.com/calogica/dbt_expectations/0.1.2/
Chapter 5
Results and Discussion
This chapter first includes the test results of the management of DBT jobs and
executions. It also describes using the developed sample DBT project to run several
basic DBT commands for executing a DBT job on the Hopsworks platform, to check the
feasibility of the implemented solution. After making sure the solution was working,
the execution time and memory usage were tracked for measuring performance.
This section describes the tests for the use cases to manage DBT jobs and execution, and
also the tests of executing several basic DBT commands as the input arguments using
the created DBT pipeline project. Although this project didn’t include the development
of the front end, some UIs could be reused for DBT jobs. The tests took advantage
of the existing UIs and used APIs to test the rest of the functionalities. All the API tests
were executed using Postman v8.6.2. The DBT project created in Section 4.2.5 was
uploaded to HopsFS under the folder /DBTProjects/demo for test use.
The tests for managing DBT jobs include creating, modifying, viewing and deleting
a DBT job. The tests used the new UI of Hopsworks and the demo project named
demo_fs_meb10000 with project id 119, created by the Hopsworks tutorial.
Since this project did not include developing the UI, DBT jobs could currently only be created
through the API. To test the creation functionality, Listing 5.1 was used as the request
body in JSON. dbtProjectPath and dbtProfilesStorageConnectorName specified the
path of the project files in HopsFS and the name of an existing storage connector. The
name of the job could be specified in the PUT request Uniform Resource Locator (URL),
such as https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance.
Figure 5.1.1 shows the created DBT job in UI, reusing the UI for other types of
jobs.
Viewing a DBT job was also tested using the API. The GET request address
was https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance
and the returned result is shown in Listing 5.2.
Listing 5.2: Response of viewing the DBT job dbt_exam_performance
{
  "type": "jobDTO",
  "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance",
  "config": {
    "type": "dbtJobConfiguration",
    "appName": "dbt_exam_performance",
    "dbtProjectPath": "/DBTProjects/demo",
    "dbtProfilesStorageConnectorName": "snowflake_sc",
    "jobType": "DBT"
  },
  "creationTime": "2022-06-11T14:57:45Z",
  "creator": {
    "href": "https://localhost:8181/hopsworks-api/api/users/10000"
  },
  "executions": {
    "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions"
  },
  "id": 43,
  "jobType": "DBT",
  "name": "dbt_exam_performance"
}
This function also reused the code for modifying other types of jobs. Modifying a job
calls the same API as creating a job with the same name of the job to change. In this test,
the same PUT request was sent with a different body parameter, which was changing
dbtProfilesStorageConnectorName to be ”snowflake_new_sc”. The response body is
shown in Listing 5.3 to verify the modification succeeded. This was also the response
of the GET request of the DBT job dbt_exam_performance.
Listing 5.3: Response of modifying the DBT job dbt_exam_performance with a new
storage connector
{
  "type": "jobDTO",
  "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance",
  "config": {
    "type": "dbtJobConfiguration",
    "appName": "dbt_exam_performance",
    "dbtProjectPath": "/DBTProjects/demo",
    "dbtProfilesStorageConnectorName": "snowflake_new_sc",
    "jobType": "DBT"
  },
  "creationTime": "2022-06-11T14:57:45Z",
  "creator": {
    "href": "https://localhost:8181/hopsworks-api/api/users/10000"
  },
  "executions": {
    "href": "https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions"
  },
  "id": 43,
  "jobType": "DBT",
  "name": "dbt_exam_performance"
}
Users can click on the delete button in the Hopsworks UI to delete any type of job. After
clicking on the delete button and confirming the deletion, the job at
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance could not
be found using a GET request, as shown in Listing 5.4.
Besides testing the use cases of managing DBT jobs, the tests for managing DBT
job execution were also executed. The tests included starting, stopping, getting and deleting an
execution. Unlike job management, an execution cannot be modified.
Start an execution
The dbt_exam_performance job was created again for testing the execution. Starting
a DBT job execution simply reused the UI for starting the execution of Docker jobs. After
clicking on the Run button, users also need to input an argument to start execution,
as shown in Figure 5.1.2. This argument could be empty or a DBT CLI command. In
this test, an empty command was tested to check if the Docker container was created.
The tests of executing a set of DBT CLI commands are presented in
Section 5.1.2.
Figure 5.1.2: Users need to input an argument to start an execution of a DBT job
Figure 5.1.3: The execution record of the DBT job dbt_exam_performance with an
empty argument
Figure 5.1.4 shows that the execution with id 66 created a Docker container on the
server.
Figure 5.1.4: The Docker container of the execution of DBT job dbt_exam_performance
Get an execution
Besides the result shown in the UI, the execution could also be obtained through the API.
Listing 5.5 presents the result of the GET request to
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/66
with execution id 66.
Stop an execution
To test the stop function, an execution was first started using the UI, and then a PUT
request was sent to stop this execution. The request needed to specify the target
execution id, as in
https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/67/status,
with payload "state":"stopped".
Listing 5.6 gives the response of stopping DBT job execution 67. The state became
”killed”.
Figure 5.1.5 shows that the Docker container for execution 67 was killed after calling
the API.
Delete an execution
Similar to the delete functionality for DBT jobs, this function also reused the
existing mechanism for deleting executions of other types of jobs. After calling the
DELETE request https://localhost:8181/hopsworks-api/api/project/119/jobs/dbt_exam_performance/executions/67,
this execution disappeared from the execution list in the UI and could no longer
be obtained by a GET request, as shown in Listing 5.7.
To summarize, the tests for DBT job and execution management show that these
functionalities were completed and the use cases were fulfilled.
In this section, the tests for executing a set of basic DBT commands were recorded.
To check that the commands were successfully executed, it was necessary to verify
the generated logs and artifacts. The logs for an execution include stdout.log
for standard output and stderr.log for error output. The logs should be under
the directory root/Logs/DBT/{execution id} in HopsFS, where {execution id}
is replaced by the actual number. Artifacts of DBT jobs should be under
root/DBTArtifacts/{execution id}, and different DBT commands can result in
different DBT artifacts. In the tests, each DBT command was used as the input
argument to start a DBT job execution.
dbt deps
This command installs the specified packages, which does not result in any artifacts.
The error log was empty and the output log is shown in Figure 5.1.6, meaning that the
command installed the dependencies successfully.
dbt compile
dbt run
dbt run is a vital command to execute the models against the data warehouses.
Figure 5.1.9 shows the stdout.log; the corresponding run_results.json and
manifest.json were also available in HopsFS. Although the log showed that the
execution succeeded, there were some strange characters in the log near ”SUCCESS”
and ”Completed successfully”. The likely cause is the colouring of these words:
they appear in green when the command is run in a terminal, so the stray characters
are probably the colour codes emitted by DBT that are not rendered in the log file.
dbt test
Figure 5.1.10 shows the output log of executing dbt test. It executed all the defined tests,
including the not-null tests and the dbt_expectations test. The error log was empty, and two
artifact files, run_results.json and manifest.json, were generated as expected. The
stray characters are likely caused by the same reason as described above.
dbt build
dbt build ran the models and the tests, since there were no seeds or
snapshots defined in the developed DBT project. As expected, it resulted in an empty
error log, a successful output log shown in Figure 5.1.11, run_results.json and
manifest.json.
dbt ls
All the resources were listed in stdout.log, as shown in Figure 5.1.12. No error was
reported. manifest.json was made available to users, while dbt ls did not result in
any update to the other artifacts.
In all, the tests for executing a set of basic DBT commands passed. The standard output logs
showed the result summaries, and the artifacts were generated as expected.
There was a minor problem with stray characters in the logs, which are probably
caused by the character colours set by DBT itself.
The performance of the solution was measured by the storage used by the Docker image
and the Docker containers, and by the execution time of a DBT job.
Figure 5.2.1 shows that the created Docker image registry.service.consul:4443/dbt had
a size of 795 MB. It also presents the size of each layer, obtained using the command docker history
<image id>. Most of this space was used for installing dependencies, such as
dbt-core, dbt-snowflake, Python, git, wget and so on.
Figure 5.2.2 shows the size of one Docker container for executing a DBT job. The virtual
size was the size of its image which could be shared by different containers. 503kB was
the size of the writable layer on top of the image.
In this project, the time for executing dbt run against the created sample DBT project
was recorded. This command was used because it could be considered as the most
basic and necessary command to generate the models in the DBT project. Figure 5.2.3
presents the execution time separated by a few tasks.
Figure 5.2.3: Execution time of dbt run command of the sample DBT project
There was a 1-second gap between the moment the Docker container was requested with the
docker run command and the moment it actually started. A possible cause is that passing the
environment variables and volumes used to start the container takes some time.
In addition, Hopsworks uses Kubernetes to manage the Docker containers, so it
probably took some time to start the container. Then, 2 seconds were used to generate
certificates for connecting to HopsFS. Mounting HopsFS from the container took 1
second. After that, 44 seconds were consumed before executing the DBT command.
That time was mainly used for checking the existence of directories, generating DBT
profiles, changing file ownership and so on. This preparation took the longest
time apart from the actual DBT command execution; a possible reason is that
executing commands on the mounted HopsFS directory needs more time. As
shown in Figure 5.2.3, dbt run against the current DBT project took 25 seconds to
complete. The 25 seconds also included the time to write results to logs, so it might
be longer than usual. Then only 1 second was used for preparing the results, including logs
and artifacts.
5.3 Discussion
According to the validation test, the developed support worked as expected. With this
support, users can orchestrate SQL tasks in the Hopsworks Feature Store to do data
transformation for feature engineering. The DBT job executions could result in tables
or views in data warehouses, which is Snowflake in this project, and they can be used
to create feature groups in Hopsworks Feature Store. Users can create normal feature
groups or on-demand feature groups stored externally in Hopsworks based on the
results of DBT executions. Those feature groups could be used for machine learning
models in the feature store.
The performance measurements include measuring the size of a Docker container and
the time of executing dbt run command for the sample DBT project. The size of a
Docker container consists of 795 MB for the base image size and 503 kB for the writable
layer’s size. This indicates that the user might need at least around 800 MB of space
on the server to use the developed support for DBT jobs in this project. In addition,
compared to executing a DBT command locally, executing it in Hopsworks needs extra
time to generate certificates, mount to HopsFS and prepare the environment. The
operations like writing and reading to HopsFS also need more time than writing and
reading local files. However, this extra time is necessary so that users can view the
persistent results of each execution, including logs and artifacts, in the Hopsworks
platform.
1. Feature Groups in Hopsworks: https://docs.hopsworks.ai/feature-store-api/3.0.0-RC2/generated/feature_group/
2. On-Demand (External) Feature Groups in Hopsworks: https://docs.hopsworks.ai/feature-store-api/3.0.0-RC2/generated/on_demand_feature_group/
Chapter 6
Conclusion and Future Work
This chapter concludes the development and the tests of this project and also states
limitations and possible future work.
6.1 Conclusion
To summarize, this project designed and developed a working solution to support
executing DBT tasks in Hopsworks for feature engineering, which enabled analytics
engineers to transform data using SQL within the Hopsworks Feature Store. The
solution was based on Docker, which created lightweight environments for connecting to the
Snowflake data warehouse and executing DBT commands.
A DBT project connecting to Snowflake was used in the project to verify if the solution
was working. The solution passed the tests of creating, modifying, deleting, getting
a DBT job and starting, stopping, getting and deleting an execution of a DBT job,
meaning that the developed solution fulfilled the desired use cases. In addition, the
tests for executing a set of basic DBT commands demonstrated that the solution could
result in the expected logs and artifacts. Furthermore, it was measured that the
Docker image had a size of 795 MB on the server, which was mostly used for installing
the dependencies. Considering the time of executing a DBT command, the longest time
was spent on preparing the folders, the DBT profile file and log files on HopsFS. The
advantage is that users can read the logs and artifacts of each execution in HopsFS in
the Hopsworks platform.
In short, the project enabled feature engineering
by orchestrating SQL models on Hopsworks using DBT, and it was shown that the
solution was working as expected. Supporting DBT jobs might make Hopsworks
become one of the most complete feature engineering platforms, with full support of
Python, Docker, Spark, Flink and SQL.
6.2 Limitations
There were a few limitations to this project. For instance, the developed solution
only supports connecting to Snowflake, so users can currently only use Snowflake to
do the transformations. Officially, dbt Labs fully supports the following data warehouses
or platforms: Postgres, Snowflake, Redshift, Apache Spark and BigQuery [1]. There
are also plugins for other platforms provided by the community or vendors [1]. The
DBT project used for testing was built on Snowflake because
it provided a free trial. Due to the time limit, support for other data warehouses was
not considered in this project; this could also be future work.
Another limitation would be that the project tested only a set of necessary DBT
commands, without covering all of them. However, it would be difficult to
cover all possible DBT commands with all possible arguments within the available
time.
Moreover, this project was a pure back-end development, so one important future
work is developing UI for the support of DBT jobs. Although some web pages could
be reused, it is still necessary to add UI for creating a DBT job. The logs and artifacts
could also be shown on the details page of a DBT execution.
Another possible piece of future work would be to
automatically create on-demand feature groups using the result table or view in
the data warehouses. It means that the data will be stored in Snowflake outside
Hopsworks, but users can use those data to train models within Hopsworks [16]. This
new feature could possibly make the whole feature engineering process more automatic
and provide a more consistent user experience.
Bibliography
[2] dbt Labs. build. https://docs.getdbt.com/reference/commands/build. Retrieved 31 May 2022.
[4] dbt Labs. dbt Command reference. https://docs.getdbt.com/reference/dbt-commands. Retrieved 6 May 2022.
[5] dbt Labs. deps. https://docs.getdbt.com/reference/commands/deps. Retrieved 3 June 2022.
[6] dbt Labs. deps. https://docs.getdbt.com/reference/commands/deps. Retrieved 3 June 2022.
[9] dbt Labs. Packages. https://docs.getdbt.com/docs/building-a-dbt-project/package-management. Retrieved 27 May 2022.
[12] dbt Labs. What is dbt? https://docs.getdbt.com/docs/introduction. Retrieved 4 May 2022.
[13] Docker Inc. Docker overview. https://docs.docker.com/get-started/overview/. Retrieved 16 May 2022.
[14] Docker Inc. docker run. https://docs.docker.com/engine/reference/commandline/run/. Retrieved 30 May 2022.
[15] Great Expectations. Down with Pipeline debt / Introducing Great Expectations. https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a. Retrieved 11 May 2022.
[22] Ismail, Mahmoud, Niazi, Salman, Ronström, Mikael, Haridi, Seif, and Dowling, Jim. “Scaling HDFS to more than 1 million operations per second with HopsFS”. In: 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE. 2017, pp. 683–688.
[23] Nickoloff, Jeffrey and Kuenzli, Stephen. “Docker in action”. In: Manning, 2019. Chap. 1. ISBN: 9781617294761.
[24] Ozdemir, Sinan. “Feature Engineering Bookcamp”. In: Manning, 2021. Chap. 1. ISBN: 9781617299797.
[25] thaJeztah. Explain the SIZE column in ”docker ps -s” and what ”virtual” keyword means. https://github.com/docker/docker.github.io/issues/1520#issuecomment-305179362. Retrieved 4 June 2022.
[26] Tristan Handy. What, exactly, is dbt? https://blog.getdbt.com/what-exactly-is-dbt/. Retrieved 30 March 2022.
Appendix - Contents
Appendix A
DBT Project Code
# dbt_project.yml (excerpt)
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

vars:
  'dbt_date:time_zone': 'Europe/Stockholm'
-- models/not_prepared_performance.sql (excerpt)
select *
from {{ source('PUBLIC', 'STUDENTS_EXAM') }}
where testpreparationcourse = 'none'

# models/schema.yml (excerpt)
sources:
  - name: PUBLIC
    tables:
      - name: STUDENTS_EXAM

models:
  - name: not_prepared_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: writingscore
      - name: readingscore
      - name: mathscore
        tests:
          - not_null

  - name: passed_math_performance
    columns:
      - name: gender
      - name: race
      - name: parentalEducation
      - name: lunch
      - name: testpreparationcourse
        tests:
          - accepted_values:
              values: ['none']
      - name: mathscore
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 60
              max_value: 100
TRITA-EECS-EX-2022:
www.kth.se