Talend ETL Project
Introduction
ETL, which stands for extract, transform, and load, is the process data engineers use to
extract data from different sources, transform it into a usable and trusted
resource, and load it into the systems end users can access and use
downstream to solve business problems.
ETL Process
Extract:
- Reads data from multiple data sources and extracts the required subset of data.
- Retrieves the necessary data while using resources efficiently.
Transform:
- Filters, cleanses, and prepares the extracted data, using lookup tables where needed.
- Validates records, rejects invalid ones, and integrates the data.
- Sorts, filters, clears, standardizes, translates, or verifies the data for
consistency.
Load:
- Writes the transformed output to a data warehouse.
- Either physically inserts each record as a new row in a database table or links
processes for each record from the main source.
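The three stages above can be illustrated with a small sketch. This is not the Talend job itself, only a minimal Python analogy: the column names (`id`, `country`), the sample data, and the list standing in for a warehouse table are all assumptions made for illustration.

```python
import csv
import io

def extract(csv_text):
    """Extract: read rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: trim whitespace, reject rows without an id, standardize case."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if not row["id"]:
            continue  # reject records that fail validation
        row["country"] = row["country"].upper()
        cleaned.append(row)
    return cleaned

def load(rows, target):
    """Load: append each record as a new row (a list stands in for a table)."""
    target.extend(rows)
    return len(rows)

# Hypothetical sample input: one row has a missing id and should be rejected.
SAMPLE = "id,country\n1, uk \n,France\n2,usa\n"
warehouse = []
loaded = load(transform(extract(SAMPLE)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as a separate function mirrors how an ETL tool chains components: each step can be checked on its own before the next one runs.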
Project Overview
In this ETL project, I extracted data from four different data sources, transformed
it, and uploaded the result to AWS S3 storage so that it can be moved to Redshift
later.
https://github.com/neo4j-contrib/northwind-neo4j/tree/master/data
https://docs.google.com/spreadsheets/d/1amQgBgIaUMVEj8gYKbvmlzuoA21ABDiLe0v1orZjjkg/edit#gid=1531710140
Objectives
Many business questions can be answered from the Northwind data
warehouse, including the following:
2. Request fulfillment report: track how long each order takes to reach the
customer and analyze whether the delivery time can be improved.
3. Employee-level reporting: track the performance of the employees and see
how it can be improved by rewarding the best performers, training the worst
performers, or both.
● What are the best-selling products, and do we need to stock more of them?
● How many units of each product are left in the inventory?
● Are we going to run out of any products for delivery?
● Which products are going unsold, and should we improve how they are sold
or discontinue them?
● Can we offer discounts on unsold products to attract buyers?
Source:
https://medium.com/@kmsbmadhan/dimensional-modelling-visualization-of-northwind-database-beaac7fecb20
Northwind schema:
Solving Errors When Splitting the Data Across Different Data Sources
Fixing incorrect column header names when uploading the data to BigQuery:
https://medium.com/google-cloud/bigquery-create-table-from-google-sheets-causing-incorrect-column-names-string-field-0-134f6ecd3fc8
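As a rough illustration of this kind of fix, the sketch below assumes the symptom named in the article title: the table arrives with placeholder column names such as `string_field_0`, and the real header ended up as the first data row. That cause is an assumption here, not something this project confirms, and `promote_header_row` is a hypothetical helper rather than a BigQuery or Talend function.

```python
import re

# Placeholder names of the form string_field_0, string_field_1, ...
PLACEHOLDER = re.compile(r"^string_field_\d+$")

def promote_header_row(columns, rows):
    """If every column name is a placeholder, use the first data row as the header.

    Returns (column_names, data_rows). Input is left unchanged when the
    column names already look like real headers.
    """
    if rows and all(PLACEHOLDER.match(name) for name in columns):
        return list(rows[0]), rows[1:]
    return list(columns), rows

# Hypothetical sample: the real header was loaded as the first data row.
columns, data = promote_header_row(
    ["string_field_0", "string_field_1"],
    [["CustomerID", "City"], ["ALFKI", "Berlin"]],
)
print(columns, data)
```

An alternative that avoids post-processing is to declare the schema or the number of header rows to skip when the table is created, if the loading tool offers those options.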
Transformation phase
- Concatenating first name and last name for employees and customers
- Parsing dates from the CSV files
- Converting dates to the target format
- Appending the unique rows from the four sources together
- Creating a date table
- Uploading the new tables to AWS S3 (Simple Storage Service), a widely used
storage service from Amazon Web Services. Talend provides out-of-the-box
connectivity with S3.
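The first four steps in the list above can be sketched in plain Python. The project performs them with Talend components, so this is only an equivalent illustration: the field names, the `%m/%d/%Y` source date format, and the ISO target format are assumptions.

```python
from datetime import datetime

def full_name(first, last):
    """Concatenate first and last name with a single space."""
    return f"{first.strip()} {last.strip()}"

def to_iso_date(text, source_format="%m/%d/%Y"):
    """Parse a date string and convert it to YYYY-MM-DD (assumed target format)."""
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")

def append_unique(sources, key):
    """Append rows from several sources, keeping the first row seen for each key."""
    seen = set()
    merged = []
    for rows in sources:
        for row in rows:
            if row[key] not in seen:
                seen.add(row[key])
                merged.append(row)
    return merged

# Hypothetical rows from two of the sources; order 10248 appears in both.
source_a = [{"order_id": 10248, "order_date": "07/04/1996"}]
source_b = [{"order_id": 10248, "order_date": "07/04/1996"},
            {"order_id": 10249, "order_date": "07/05/1996"}]

orders = append_unique([source_a, source_b], key="order_id")
for order in orders:
    order["order_date"] = to_iso_date(order["order_date"])

print(full_name("Nancy ", "Davolio"), orders)
```

Deduplicating on a business key such as the order id, rather than on the whole row, keeps the merge stable even when the same record is formatted slightly differently in two sources.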
Orders Table
Products Table
Employees Table
Customers Table
Dim time
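A date table of this kind can be generated with one row per calendar day. The sketch below is a minimal Python version; the column names and the `YYYYMMDD` integer key are assumptions and may differ from the table built in the Talend job.

```python
from datetime import date, timedelta

def build_date_table(start, end):
    """Return one row per day from start to end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 19960704
            "date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day": current.day,
            "weekday": current.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        current += timedelta(days=1)
    return rows

# Hypothetical range covering three days of order dates.
dim_time = build_date_table(date(1996, 7, 4), date(1996, 7, 6))
print(len(dim_time), dim_time[0])
```

Fact tables such as orders can then join to this table on the date key to group results by year, quarter, or month.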
Upload the final tables to AWS S3
Iterate through the files in the data staging folder and upload them to AWS S3 storage so
that they can be moved to the Redshift data warehouse.
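Outside Talend, the same iterate-and-upload loop could look like the following sketch. It assumes a client object with an `upload_file(Filename, Bucket, Key)` method, which is what a boto3 S3 client provides; the folder name, bucket name, and `staging/` key prefix are hypothetical.

```python
from pathlib import Path

def upload_staging_folder(folder, bucket, client, prefix="staging/"):
    """Upload every file in the folder to the bucket and return the object keys."""
    uploaded = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():  # skip sub-folders
            key = f"{prefix}{path.name}"
            client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded

# Example usage with boto3 (requires AWS credentials to be configured):
#   import boto3
#   upload_staging_folder("data_staging", "my-example-bucket", boto3.client("s3"))
```

Passing the client in as a parameter keeps the loop easy to check locally with a stand-in object before it is pointed at a real bucket.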
Courses
I studied these data engineering courses before building the ETL project.
Talend course
https://www.youtube.com/playlist?list=PLOr008ImHvfan_fuDr5RVyexpeYJAp9FX