Data2Bots Data Engineering Technical Assessment
This is the Data2Bots take-home assignment for the Data Engineer role.
Overview:
This guide describes the technical assessment phase for the Data Engineering role at Data2Bots.
Once you complete the assessment, your submission will be graded, and based on the result you will receive further correspondence from our HR team.
Goal
The purpose of this exercise is for you to demonstrate how you would solve a real-world Data Engineering problem without the pressure of a live coding exercise.
Assessment Preamble
As the sole data engineer at ABC Inc, a business stakeholder has come to you with the following requirements:
Hey, we have data coming into our central data lake (i.e. a file system) every day.
You can list the raw files in the assessment bucket without AWS credentials, for example:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Anonymous client: the assessment bucket allows unsigned (public) reads.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = "d2b-internal-assessment-bucket"

# List the raw order files under the orders_data prefix.
response = s3.list_objects(Bucket=bucket_name, Prefix="orders_data")
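
A minimal sketch of how the listed objects could be pulled down for the extract step, assuming the same anonymous client as above; the local raw/ directory is an illustrative choice, not part of the brief.

import os

import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = "d2b-internal-assessment-bucket"

# Download every object under the orders_data prefix into a local raw/ folder.
response = s3.list_objects(Bucket=bucket_name, Prefix="orders_data")
os.makedirs("raw", exist_ok=True)
for obj in response.get("Contents", []):
    key = obj["Key"]
    if key.endswith("/"):  # skip folder placeholder keys, if any
        continue
    s3.download_file(bucket_name, key, os.path.join("raw", os.path.basename(key)))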
Some reference tables are already available to you in the shared if_common schema of the warehouse. For example, you can access the dim_customers table as if_common.dim_customers.
You have only SELECT access on these tables.
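
A minimal sketch of reading one of these shared tables from Python with psycopg2; the host, database name, and credentials below are placeholders for whatever connection details you were given, not values from the brief.

import psycopg2

# Placeholder connection details -- substitute the credentials shared with you.
conn = psycopg2.connect(
    host="<warehouse-host>",
    dbname="<warehouse-db>",
    user="<your_id>",
    password="<your_password>",
)

with conn, conn.cursor() as cur:
    # SELECT-only access: read a few rows from a shared dimension table.
    cur.execute("SELECT * FROM if_common.dim_customers LIMIT 5;")
    for row in cur.fetchall():
        print(row)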
We would like to load the raw data files into a schema {your_id}_staging within our enterprise data warehouse, which is a Postgres database.
The schema that holds your tables has already been created within the data warehouse, and you can create and access tables in that schema as {your_id}_staging.{table_name}.
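
A minimal sketch of the load step, using the example id user1234 from the export section below; the orders table name, its column list, and the raw/orders.csv file name are illustrative assumptions that would need to match the actual raw files.

import psycopg2

conn = psycopg2.connect(host="<warehouse-host>", dbname="<warehouse-db>",
                        user="user1234", password="<your_password>")

# Illustrative staging table -- the real columns must mirror the raw orders file.
create_sql = """
CREATE TABLE IF NOT EXISTS user1234_staging.orders (
    order_id    INTEGER,
    customer_id INTEGER,
    order_date  DATE,
    product_id  INTEGER,
    quantity    INTEGER
);
"""

with conn, conn.cursor() as cur:
    cur.execute(create_sql)
    # COPY streams the downloaded CSV into the staging table in one round trip.
    with open("raw/orders.csv") as f:
        cur.copy_expert("COPY user1234_staging.orders FROM STDIN WITH CSV HEADER", f)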
Write the two transformations into the following tables in your {your_id}_analytics schema.
For the best_performing_product table: the product with the highest reviews, the day it was ordered the most, whether that day was a public holiday, the total review points, the percentage distribution of the review points, and the percentage distribution of early shipments to late shipments for that particular product.
All ingestion_date values are the current date on which the table was generated.
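
A hedged sketch of how this transformation step could be wired up, with the logic expressed in SQL as the guidelines below request; the target columns, the if_common.dim_products source, and the placeholder query are illustrative assumptions, not the real table specification.

import psycopg2

conn = psycopg2.connect(host="<warehouse-host>", dbname="<warehouse-db>",
                        user="user1234", password="<your_password>")

# Placeholder query only: the real SQL must derive the metrics listed above
# (most-reviewed product, its peak order day, public-holiday flag, review-point
# distribution, early vs. late shipment split).
transform_sql = """
INSERT INTO user1234_analytics.best_performing_product (ingestion_date, product_name)
SELECT CURRENT_DATE,   -- ingestion_date is the date the table was generated
       p.product_name
FROM if_common.dim_products p
LIMIT 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(transform_sql)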
We would also like an export of these tables to be loaded into the analytics_export folder on our data lake, as analytics_export/{your_id}/best_performing_product.csv.
For example, if you have done the transformation of the best_performing_product and your id is user1234, you will export the table to the following S3 location:
s3://d2b-internal-assessment-bucket/analytics_export/user1234/best_performing_product.csv
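
A minimal sketch of the export step: copy the analytics table out as CSV and upload it to the agreed analytics_export path. Unlike the unsigned read example above, writing to the bucket presumably requires signed AWS credentials; that is an assumption, not something the brief states.

import io

import boto3
import psycopg2

conn = psycopg2.connect(host="<warehouse-host>", dbname="<warehouse-db>",
                        user="user1234", password="<your_password>")

# Dump the analytics table to an in-memory CSV buffer.
buf = io.StringIO()
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY user1234_analytics.best_performing_product TO STDOUT WITH CSV HEADER", buf)

# Upload the CSV to the agreed export path (assumes your AWS credentials allow writes).
s3 = boto3.client("s3")
s3.put_object(
    Bucket="d2b-internal-assessment-bucket",
    Key="analytics_export/user1234/best_performing_product.csv",
    Body=buf.getvalue().encode("utf-8"),
)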
Submission Requirements
Create a private git repository and do your work in there as if this was a real project at work.
IMPORTANT WARNING! Your submission WILL NOT be considered if your repository is NOT PRIVATE.
Build an ELT pipeline (batch or streaming) that loads the business data in our data warehouse and performs the transformation.
Explain your work in a README file.
Important Guidelines
We are not judging you based on submission time. If there are things you wanted to include but had no time for, explain them in the README.
Please write your extract and load code in Python and your transformations in SQL; if you plan to use any infrastructure-as-code framework, preferably use Terraform.
Feel free to be creative and come up with requirements to make the assignment feel more real. Please note these down for us in the
README.
We don’t have a concrete solution in mind. We are heavily interested in your thinking process, and the way you design and build a
solution.
Although this is small dummy data and a toy exercise, try to aim for a production-grade
solution. Think of scalability, maintainability, and reliability.