Talend ETL Project

This document summarizes an ETL project that extracted data from 4 sources (CSV files, Excel sheets, BigQuery, PostgreSQL database), transformed the data (concatenated names, parsed dates, appended rows), and loaded it into AWS S3 storage and eventually Redshift. The project used the Northwind sample database and involved solving errors during data extraction and transformation. The transformed data was loaded into fact and dimension tables to enable sales, order fulfillment, employee, and inventory reporting.

ETL Project With Talend

Introduction
ETL stands for extract, transform, and load: the process data engineers use to
extract data from different sources, transform it into a usable and trusted
resource, and load it into systems that end users can access and use
downstream to solve business problems.

ETL Process

Image credit: https://www.databricks.com/glossary/extract-transform-load

Extract:
- Reads data from multiple data sources and extracts the required set of data.
- Retrieves the necessary data with optimal use of resources.
Transform:
- Filters, cleanses, and prepares the extracted data, often with the help of lookup tables.
- Validates records, rejects invalid ones, and integrates data across sources.
- Sorts, filters, standardizes, translates, and verifies the data for consistency.
Load:
- Writes the transformed output to a data warehouse.
- Either physically inserts each record as a new row in a database table, or links
processes for each record from the main source.
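The three steps above can be sketched end to end in a few lines. This is a minimal illustration, not the project's actual Talend job: the column names and the in-memory SQLite "warehouse" are stand-ins chosen for the example.

```python
import csv
import io
import sqlite3

# Toy input standing in for one of the project's CSV extracts
# (these column names are illustrative, not the actual Northwind export).
RAW_CSV = """first_name,last_name,order_date
Nancy,Davolio,1996-07-04
Andrew, Fuller ,1996-07-05
"""

def extract(text):
    """Extract: read rows from a CSV source into dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace and concatenate the name columns."""
    return [
        {
            "full_name": f"{r['first_name'].strip()} {r['last_name'].strip()}",
            "order_date": r["order_date"].strip(),
        }
        for r in rows
    ]

def load(rows, conn):
    """Load: insert the transformed rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS staging (full_name TEXT, order_date TEXT)")
    conn.executemany("INSERT INTO staging VALUES (:full_name, :order_date)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
```

In a real pipeline each stage would point at an external system (files, BigQuery, PostgreSQL, a warehouse), but the shape of the code stays the same.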

Project Overview

In this ETL project, I extracted data from 4 different data sources, transformed
it, and uploaded it to AWS S3 storage so it could later be moved to Redshift.

Data Sources: 1. CSV files, 2. Excel sheets, 3. BigQuery, 4. PostgreSQL database

Staging Area: Local PC

Final Destination: AWS S3, then AWS Redshift

Northwind is a sample database created by Microsoft for training and educational
purposes.

Links to the data:

https://github.com/engindemirog/Northwind-Database-Script-for-Postgre-Sql/blob/master/script.sql

https://github.com/neo4j-contrib/northwind-neo4j/tree/master/data

https://docs.google.com/spreadsheets/d/1amQgBgIaUMVEj8gYKbvmlzuoA21ABDiLe0v1orZjjkg/edit#gid=1531710140

Related article for this ETL project:

https://medium.com/@kmsbmadhan/dimensional-modelling-visualization-of-northwind-database-beaac7fecb20

Objectives

There are many business insights that can be derived from the Northwind data
warehouse, as follows:

1. Sales reporting to track sales by customer, employee, product, and
supplier, answering the following questions:

● What is our overall sales number?
● How much have we sold of each product?
● Which products are our best and worst sellers?
● Which of our clients order the most products? What do they order?
● How do our sales look when broken down by region?

2. Order fulfillment reporting to track how long each order takes to be
delivered to the customer, and to analyze whether delivery times can be
improved.

3. Employee-level reporting to track employee performance and see how it can
be improved, whether by rewarding the best performers, training the weakest
performers, or both.

4. Order distribution and product inventory analysis to see how orders are
distributed to customers across the world, and to track the company's
inventory, order levels, and re-order levels, answering important questions
such as:

● What are the best-selling products, and do we need to stock more of them?
● How many units of each product are left in inventory?
● Are we about to run out of any products needed for delivery?
● Which products are going unsold, and should we improve how we sell them
or discontinue them?
● Can we offer discounts on unsold products to attract buyers?
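Once the fact table is loaded, questions like "what are the best-selling products?" reduce to a group-and-sum. A minimal sketch, using hypothetical in-memory rows in place of Northwind's real `order_details` joined to `products`:

```python
from collections import Counter

# Hypothetical order-detail rows; in the real warehouse these would come from
# the order_details fact table (product_id, quantity) joined to products.
order_details = [
    {"product": "Chai", "quantity": 10},
    {"product": "Chang", "quantity": 4},
    {"product": "Chai", "quantity": 5},
]

def best_sellers(rows, top=2):
    """Sum the quantity sold per product and return the top sellers."""
    totals = Counter()
    for r in rows:
        totals[r["product"]] += r["quantity"]
    return totals.most_common(top)

print(best_sellers(order_details))  # highest-volume products first
```

The same aggregation, run against the dimension keys (customer, employee, region), answers the rest of the sales questions above.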

Source:
https://medium.com/@kmsbmadhan/dimensional-modelling-visualization-of-northwind-database-beaac7fecb20

Northwind schema:

Target DWH schema:

Solving some errors encountered while splitting the data across the different data sources

Fixing column header names when uploading the data to BigQuery:
https://medium.com/google-cloud/bigquery-create-table-from-google-sheets-causing-incorrect-column-names-string-field-0-134f6ecd3fc8

Importing a CSV file into PostgreSQL (COPY permission denied):
https://www.neilwithdata.com/copy-permission-denied#:~:text=What%20the%20error%20means,the%20server%2C%20not%20the%20client.

Solving an error when importing data from Excel:
https://community.talend.com/s/feed/0D73p000004kIYlCAM?language=en_US

Transformation phase

- Concatenating first and last names for employees and customers
- Parsing dates from the CSV files
- Transforming dates into a consistent format
- Appending unique rows from the 4 sources together
- Creating a date dimension table
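In the project these transformations were built as Talend components; the list above can be sketched as plain functions to make each step concrete. The day/month/year input format is an assumption about the CSV export, not a confirmed detail:

```python
from datetime import date, datetime, timedelta

def full_name(first, last):
    """Concatenate first and last name, trimming stray whitespace."""
    return f"{first.strip()} {last.strip()}"

def parse_order_date(raw, fmt="%d/%m/%Y"):
    """Parse a source date string and re-emit it in a single ISO format.
    The default input format is an assumption for illustration."""
    return datetime.strptime(raw, fmt).date().isoformat()

def append_unique(*sources):
    """Append rows from several sources, keeping only the first copy of duplicates."""
    seen, merged = set(), []
    for source in sources:
        for row in source:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged

def build_date_dim(start, end):
    """Generate one date-dimension row per calendar day in [start, end]."""
    rows, d = [], start
    while d <= end:
        rows.append({"date_key": d.isoformat(), "year": d.year,
                     "month": d.month, "day": d.day,
                     "weekday": d.strftime("%A")})
        d += timedelta(days=1)
    return rows
```

The date dimension is generated rather than extracted, which is why it appears as its own step: every fact row can then join to it on `date_key`.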

Load the data

- Upload the new tables to AWS S3 (Simple Storage Service), Amazon Web
Services' widely used storage service. Talend provides out-of-the-box
connectivity with S3.

Orders Table

Products Table

Employees Table

Customers Table

Dim time

Upload the final tables to AWS S3

Iterate through the files in the data staging folder and upload them to AWS S3 storage in
order to move them to Redshift data warehouse.

Link for the method: https://www.youtube.com/watch?v=TqJAd6RQypU&t=4s
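The iterate-and-upload step was done with Talend's S3 components; the equivalent logic looks roughly like the sketch below using boto3. The bucket name and key prefix are hypothetical placeholders, and the client is injectable so the function can be exercised without real AWS credentials:

```python
import os

def upload_staging_folder(folder, bucket, s3_client=None, prefix="northwind/"):
    """Iterate over the files in the local staging folder and upload each one
    to S3 under the given key prefix. Returns the list of uploaded keys."""
    if s3_client is None:
        import boto3  # real AWS client only when none is injected
        s3_client = boto3.client("s3")
    uploaded = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip subdirectories
            s3_client.upload_file(path, bucket, prefix + name)
            uploaded.append(prefix + name)
    return uploaded
```

From S3, a Redshift `COPY` command (or Talend's Redshift components) can then pull each object into the warehouse tables.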

Courses

I studied these data engineering courses before building the ETL project.

University course on data warehousing:
https://www.youtube.com/playlist?list=PLiJhHdYdI84DzwH47lZQWN6de1tOq8LFu

Big Data Engineering in Depth course:
https://www.youtube.com/playlist?list=PLxNoJq6k39G_m6DYjpz-V92DkaQEiXxkF

Talend course:
https://www.youtube.com/playlist?list=PLOr008ImHvfan_fuDr5RVyexpeYJAp9FX

Talend ETL project (this project was inspired by it):
https://www.youtube.com/playlist?list=PLKdHo47jRFvf1VzST1RDYiAs8ItMFychE

Data engineering podcast:
https://www.youtube.com/playlist?list=PLiKvD85qG0l6pLvsQChJO6UBFLfMFfwOw

Thanks for reading; I would appreciate your feedback.

Kindly check my previous projects here: https://linktr.ee/mhmod36

See you in the next project,

Mahmoud Sallam
