

Data2Bots Data Engineering Technical Assessment

This is a Data2Bots take-home assignment for the Data Engineer role at Data2Bots.

Overview:

This guide describes the technical assessment phase for the Data Engineering role at Data2Bots.
Once you complete the assessment, your submission will be graded, and you will receive further correspondence from our HR
Team based on the result.

This guide details the following:

The goal of the assessment
The assessment preamble
Submission requirements
Important guidelines

Goal

The purpose of this exercise is for you to demonstrate how you would solve a real-world Data Engineering problem in the absence of the
pressure of a live coding exercise.

This assessment especially tests your thought process and approach.

Assessment Preamble

You are the sole data engineer at ABC Inc, and a business stakeholder has come to you with the following requirements:

Hey, we have data coming into our central data lake (i.e. a file system) every day.

Our central data lake is an Amazon S3 Bucket:

Bucket Name: d2b-internal-assessment-bucket

Data Locations: s3://d2b-internal-assessment-bucket/orders_data/*

This directory contains the following files:


orders.csv: This data is a fact table about orders placed on our website ABC.com
reviews.csv: This data is a fact table on reviews given for a particular delivered product
shipments_deliveries.csv: This is a fact table on shipments and their delivery dates

YOU CAN USE THIS SNIPPET OF CODE TO ACCESS THE S3 BUCKET:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "d2b-internal-assessment-bucket"
response = s3.list_objects(Bucket=bucket_name, Prefix="orders_data")

# For example, to download orders.csv into the current working directory:
s3.download_file(bucket_name, "orders_data/orders.csv", "orders.csv")
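The listing call in the snippet returns a dictionary rather than a plain list of files. As a minimal sketch, assuming the standard S3 listing shape (a "Contents" list whose entries carry a "Key"), the object keys can be pulled out as follows. The helper name keys_from_listing is hypothetical and not part of the assessment.

```python
def keys_from_listing(response: dict) -> list:
    """Return the object keys found in an S3 list_objects response.

    "Contents" is absent when the prefix matches nothing, so default
    to an empty list instead of raising KeyError.
    """
    return [entry["Key"] for entry in response.get("Contents", [])]


# Illustrative response only; a real one comes from s3.list_objects(...).
sample_response = {
    "Contents": [
        {"Key": "orders_data/orders.csv"},
        {"Key": "orders_data/reviews.csv"},
        {"Key": "orders_data/shipments_deliveries.csv"},
    ]
}
csv_keys = [k for k in keys_from_listing(sample_response) if k.endswith(".csv")]
```

Each key can then be passed to s3.download_file in a loop, which avoids hard-coding the three file names.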

The ER diagram below shows the data model of our data warehouse:


Please note all dim_* tables are loaded into the if_common schema.

For example, you can access the dim_customers table with if_common.dim_customers
You have only select access on these tables.

We would like to load the raw data files into a schema {your_id}_staging within our enterprise data warehouse, which is a Postgres
database.

The schema that holds your tables has already been created within the data warehouse.

And you can create and access tables in that schema using {your_id}_staging.{table_name}
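One possible way to load a downloaded CSV into the staging schema is Postgres COPY ... FROM STDIN. The sketch below assumes a psycopg2 cursor (its copy_expert method) and a staging table that already exists with columns matching the CSV header. The helper names staging_table, copy_statement and load_csv are hypothetical, and the brief does not prescribe a driver.

```python
import re

# Schema and table names are interpolated into SQL, so only accept
# plain identifiers (letters, digits, underscores).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def staging_table(user_id: str, table_name: str) -> str:
    """Build the qualified name {your_id}_staging.{table_name}."""
    for part in (user_id, table_name):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{user_id}_staging.{table_name}"


def copy_statement(user_id: str, table_name: str) -> str:
    """Return a COPY statement that reads CSV (with a header row) from STDIN."""
    target = staging_table(user_id, table_name)
    return f"COPY {target} FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv(cursor, csv_file, user_id: str, table_name: str) -> None:
    """Stream an open CSV file into the staging table.

    Assumption: `cursor` is a psycopg2 cursor, which provides
    copy_expert(sql, file).
    """
    cursor.copy_expert(copy_statement(user_id, table_name), csv_file)
```

With the example id user1234, copy_statement("user1234", "orders") targets user1234_staging.orders.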

We want to know the following:


The total number of orders placed on a public holiday every month, for the past year.
A public holiday is a day with a day_of_the_week number in the range 1 - 5 and a working_day value of false.
After your transformation, the derived table agg_public_holiday should be loaded into the {your_id}_analytics schema,
using the below table schema.
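The public-holiday rule above can be written as a small reference function, which may be handy for checking the SQL transformation against hand-built cases. This is only a sketch: the calendar mapping and the sample dates are made up for illustration, and the "past year" filter is left out. The production transformation itself is expected in SQL.

```python
from collections import Counter
from datetime import date


def is_public_holiday(day_of_the_week: int, working_day: bool) -> bool:
    """A public holiday has day_of_the_week in 1..5 and working_day false."""
    return 1 <= day_of_the_week <= 5 and not working_day


def holiday_orders_per_month(order_dates, calendar):
    """Count orders placed on public holidays, keyed by month number.

    `calendar` is a hypothetical mapping: date -> (day_of_the_week, working_day).
    """
    counts = Counter()
    for order_date in order_dates:
        day_of_the_week, working_day = calendar[order_date]
        if is_public_holiday(day_of_the_week, working_day):
            counts[order_date.month] += 1
    return dict(counts)


# Illustrative data only, not taken from the warehouse.
sample_calendar = {
    date(2022, 1, 3): (1, False),  # weekday, non-working: public holiday
    date(2022, 1, 4): (2, True),   # ordinary working day
    date(2022, 1, 8): (6, False),  # weekend, so not a public holiday
}
sample_orders = [date(2022, 1, 3), date(2022, 1, 3), date(2022, 1, 4), date(2022, 1, 8)]
monthly = holiday_orders_per_month(sample_orders, sample_calendar)
```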

The total number of late shipments.
A late shipment is one whose shipment_date is 6 or more days after the order_date and whose delivery_date is NULL.
The total number of undelivered shipments.
An undelivered shipment is one whose delivery_date is NULL, whose shipment_date is NULL, and for which the current_date is 15 days
after the order_date.

NB: current_date here refers to 2022-09-05.
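The two shipment rules can be captured as reference predicates, again as a sketch for validating the SQL rather than a replacement for it. Two readings here are assumptions: a late shipment must actually have a shipment_date, and "15 days after order_date" is taken to mean at least 15 days. If strict equality is intended, swap >= for == in is_undelivered.

```python
from datetime import date, timedelta
from typing import Optional

# Fixed by the brief: current_date refers to 2022-09-05.
CURRENT_DATE = date(2022, 9, 5)


def is_late(order_date: date,
            shipment_date: Optional[date],
            delivery_date: Optional[date]) -> bool:
    """Shipped 6 or more days after the order and not yet delivered."""
    return (
        delivery_date is None
        and shipment_date is not None
        and shipment_date >= order_date + timedelta(days=6)
    )


def is_undelivered(order_date: date,
                   shipment_date: Optional[date],
                   delivery_date: Optional[date],
                   current_date: date = CURRENT_DATE) -> bool:
    """Neither shipped nor delivered, with current_date 15+ days past the order.

    Assumption: "15 days after" is read as "at least 15 days after".
    """
    return (
        delivery_date is None
        and shipment_date is None
        and current_date >= order_date + timedelta(days=15)
    )
```

For example, an order placed on 2022-08-21 with no shipment and no delivery counts as undelivered, because 2022-08-21 plus 15 days is exactly 2022-09-05.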

Write the two transformations into the following table in your {your_id}_analytics schema:

The product with the highest reviews, the day it was ordered the most, whether that day was a public holiday, total review points,
percentage distribution of the review points, and percentage distribution of early shipments to late shipments for that particular
product.
Write this transformation into the following table in your {your_id}_analytics schema:
Every ingestion_date is the date on which the table was generated.
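The "total review points" and "percentage distribution of the review points" metrics can be illustrated with a short sketch. It assumes each review carries a single integer score. The sample scores are invented, and the column layout of the target table is not reproduced here.

```python
from collections import Counter


def percentage_distribution(review_points):
    """Return {score: share of reviews in percent}, rounded to 2 decimals."""
    counts = Counter(review_points)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {score: round(100 * n / total, 2) for score, n in sorted(counts.items())}


# Illustrative scores for one product, not real data.
sample_points = [5, 5, 4, 1]
total_review_points = sum(sample_points)
distribution = percentage_distribution(sample_points)
```

The same function applies to the early-versus-late shipment split by passing labels such as "early" and "late" instead of scores.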

We would also like an export of these tables to be loaded into the analytics_export folder on our data lake, as
analytics_export/{your_id}/best_performing_product.csv

For example, say you have done the transformation of the best_performing_product, and your id is user1234 you will export
the table to the following s3 location.

s3://d2b-internal-assessment-bucket/analytics_export/user1234/best_performing_product.csv
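The export location follows a fixed pattern, so it can be derived instead of typed by hand. The sketch below assumes a boto3 S3 client is passed in (its upload_file method takes a local filename, a bucket and a key) and that the table has already been written to a local CSV. The function names export_key and export_table are hypothetical.

```python
BUCKET = "d2b-internal-assessment-bucket"


def export_key(user_id: str, table_name: str) -> str:
    """Build the object key analytics_export/{your_id}/{table_name}.csv."""
    return f"analytics_export/{user_id}/{table_name}.csv"


def export_table(s3_client, local_path: str, user_id: str, table_name: str) -> str:
    """Upload a local CSV export and return its s3:// location.

    Assumption: `s3_client` is a boto3 S3 client with write access to
    the analytics_export prefix.
    """
    key = export_key(user_id, table_name)
    s3_client.upload_file(local_path, BUCKET, key)
    return f"s3://{BUCKET}/{key}"
```

For the id user1234 and the table best_performing_product, the returned location matches the example path above.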

Submission Requirements

After reading and understanding the above preamble:

Create a private git repository and do your work in there as if this was a real project at work.

Please share with us the repo once you're done.

IMPORTANT WARNING! Your submission WILL NOT be considered if your repository is NOT PRIVATE.

Build an ELT pipeline (batch or streaming) that loads the business data in our data warehouse and performs the transformation.
Explain your work in a README file.

Important Guidelines

We are not judging you based on submission time. If there are things you wanted to include but had no time for, explain them in the
README.
Please write your extract and load code in Python and your transformations in SQL. If you plan to use an infrastructure-as-code
framework, preferably use Terraform.
Feel free to be creative and come up with requirements to make the assignment feel more real. Please note these down for us in the
README.
We don’t have a concrete solution in mind. We are heavily interested in your thinking process, and the way you design and build a
solution.
Although this is small dummy data and a toy exercise, try to aim for a production-grade
solution. Think of scalability, maintainability, and reliability.
