A Beginner’s Guide to ETL with Python


Jesús Cantú
11 min read · Jul 9, 2023


ETL Process

ETL, which stands for Extract, Transform, Load, is a crucial process in data
management. It involves retrieving data from various sources, modifying it to fit
business needs or target system requirements, and then loading it into a central
location such as a data warehouse. In this article, we will simplify the ETL process
for beginners, delve into each step, and illustrate it with a real-world Python
example using publicly available data.

What is Data Extraction?


Data extraction is the initial phase of the ETL process. This involves pulling data
from a range of source systems, including but not limited to databases, CRM
systems, or flat files like CSVs. The objective is to collect as much relevant data as
possible to power our analytics systems.
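
For a concrete picture, here is a minimal extraction sketch using pandas, assuming a
hypothetical sales.csv flat file and a hypothetical SQLite database named crm.db
standing in for a CRM system:

import sqlite3
import pandas as pd

# Extract from a flat file (hypothetical CSV)
sales_df = pd.read_csv('sales.csv')

# Extract from a relational source (hypothetical SQLite database)
conn = sqlite3.connect('crm.db')
customers_df = pd.read_sql_query('SELECT * FROM customers', conn)
conn.close()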

The Transformation Process


Once the data is extracted, it enters the transformation stage. Here, the data is
cleaned, validated, and prepared to meet the requirements of the business or the
target system. This could involve removing duplicates, correcting errors, or
converting data into compatible formats. The goal is to maintain data integrity and
quality before it is loaded into the data warehouse.
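
As a small illustration (a toy DataFrame, not the article’s data), typical
transformation steps look like this in pandas:

import pandas as pd

# Toy data with a duplicate ID, a missing value, and numbers stored as text
df = pd.DataFrame({'id': [1, 1, 2], 'amount': ['10', '20', None]})

# Remove duplicate records
df = df.drop_duplicates(subset='id')

# Handle missing values
df['amount'] = df['amount'].fillna(0)

# Convert data into a compatible format
df['amount'] = df['amount'].astype(float)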

Loading Data
After the data has been transformed, it is then loaded into a data warehouse or other
target system. This is where the processed data will reside for further use, usually
for analytics, business intelligence, or reporting purposes.
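
A minimal loading sketch, using a local SQLite database as a stand-in for a data
warehouse (the table and file names here are hypothetical):

import sqlite3
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'amount': [10.0, 20.0]})

# Load the transformed data into the target system
conn = sqlite3.connect('warehouse.db')
df.to_sql('sales_clean', conn, if_exists='replace', index=False)
conn.close()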

ETL in Python
Python is a great language for performing ETL because of its readability and the
wide range of data-centric libraries it offers. Here, we’re going to use Python to
perform ETL on two datasets from the UCI Machine Learning Repository: the Wine
dataset and the Wine Quality dataset. We’ll use the Pandas and NumPy libraries for
data manipulation.

The Wine Dataset


This dataset is the result of a chemical analysis of wines grown in the same region in
Italy but derived from three different cultivars. It consists of 178 samples with 13
chemical analysis attributes. Columns are labeled 1–13 and correspond to the
following:

1. Alcohol: The alcohol content in the wine.

2. Malic Acid: A type of acid found in wine.

3. Ash: A measure of the wine’s “ash” content, which is the inorganic matter
remaining after evaporation and incineration.

4. Alkalinity of Ash: A measure of the wine’s alkalinity, specifically the ash’s
alkalinity.

5. Magnesium: The magnesium content in the wine.

6. Total Phenols: A measure of the total phenol content in the wine.

7. Flavonoids: A subset of phenols, flavonoids are a type of antioxidant.

8. Non-flavonoid Phenols: This attribute measures the non-flavonoid phenols in
the wine.

9. Proanthocyanidins: Another group of phenols.

10. Color Intensity: A measure of the wine’s color intensity.

11. Hue: A wine attribute related to its color.

12. OD280/OD315 of Diluted Wines: This is a measurement of the wine’s absorbance
at 280nm and 315nm wavelengths after dilution.

13. Proline: A measure of the proline (an amino acid) content in the wine.

First five rows of the Wine dataset (raw data)

The Wine Quality Dataset


This dataset contains physicochemical attributes from red variants of the
Portuguese “Vinho Verde” wine. It includes 1,599 samples and 12 attributes,
including a quality score:

1. Fixed Acidity: The non-volatile acids present in the wine.

2. Volatile Acidity: The amount of acetic acid in the wine.

3. Citric Acid: The citric acid content, which can add freshness and flavor to wines.

4. Residual Sugar: The amount of sugar remaining after fermentation.

5. Chlorides: The amount of salt in the wine.

6. Free Sulfur Dioxide: The free form of SO2 present in the wine.

7. Total Sulfur Dioxide: The total amount of SO2 in the wine.

8. Density: The density of the wine, which can provide some idea about the alcohol
content and sugar content.

9. pH: Describes how acidic or basic the wine is on a scale from 0 (very acidic) to
14 (very basic).

10. Sulphates: A wine additive which can contribute to sulfur dioxide gas (SO2) levels.

11. Alcohol: The alcohol content in the wine.

12. Quality: A score between 0 and 10 given by experts.

First five rows of the Wine Quality dataset (raw data)

Coding Example

# Required Libraries
import pandas as pd
import numpy as np

# Extraction
wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine_data = pd.read_csv(wine_url, header=None)

wine_quality_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_quality_data = pd.read_csv(wine_quality_url, sep=";")

# Initial look at the data
print(wine_data.head())
print(wine_quality_data.head())

# Transformation
# Assigning meaningful column names
wine_data.columns = ['class', 'alcohol', 'malic acid', 'ash',
                     'alcalinity of ash', 'magnesium', 'total phenols',
                     'flavonoids', 'nonflavonoid phenols', 'proanthocyanidins',
                     'color intensity', 'hue', 'OD280/OD315 of diluted wines',
                     'proline']

# Converting the 'class' column into a categorical datatype
wine_data['class'] = wine_data['class'].astype('category')

# Checking for any missing values in both datasets
print(wine_data.isnull().sum())
print(wine_quality_data.isnull().sum())

# Normalizing the 'alcohol' column in wine_data using Min-Max normalization
wine_data['alcohol'] = (wine_data['alcohol'] - wine_data['alcohol'].min()) / \
                       (wine_data['alcohol'].max() - wine_data['alcohol'].min())

# Creating an average quality column in wine_quality_data
# (averaging the physicochemical attribute columns)
wine_quality_data['average_quality'] = wine_quality_data[['fixed acidity', 'volatile acidity', 'citric acid',
                                                          'residual sugar', 'chlorides', 'free sulfur dioxide',
                                                          'total sulfur dioxide', 'density', 'pH', 'sulphates',
                                                          'alcohol']].mean(axis = 1)

# Creating a 'quality_label' column based on 'average_quality'
# (bins = 3 assumed: three equal-width intervals labelled low/medium/high)
wine_quality_data['quality_label'] = pd.cut(wine_quality_data['average_quality'],
                                            bins = 3,
                                            labels = ['low', 'medium', 'high'])

# Loading
# Saving the transformed data as csv files
wine_data.to_csv('wine_dataset.csv', index = False)
wine_quality_data.to_csv('wine_quality_dataset.csv', index = False)

In this Python script, we extract the Wine and Wine Quality datasets from the UCI
repository, and then carry out several transformation steps. For the Wine dataset,
we assign meaningful column names, convert the ‘class’ column into a categorical
datatype, check for missing values, and normalize the ‘alcohol’ column using Min-
Max normalization.

Min-Max normalization is a data preprocessing technique that scales numerical
data to a common range, typically between 0 and 1. It is useful for ensuring features
are on a similar scale, handling outliers, improving interpretability and
visualization, and enhancing convergence speed in optimization algorithms. By
normalizing the data, it prevents certain features from dominating and facilitates
comparisons between variables. However, other normalization methods may be
more suitable depending on the data characteristics and analysis requirements.
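
As a quick illustration of the idea (a sketch, not part of the ETL script above),
min-max scaling maps each value x to (x - min) / (max - min), which is easy to wrap
in a small helper:

import pandas as pd

def min_max_normalize(series: pd.Series) -> pd.Series:
    # Scale a numeric Series to the [0, 1] range
    return (series - series.min()) / (series.max() - series.min())

# Made-up example values
values = pd.Series([11.0, 12.5, 14.0])
print(min_max_normalize(values))  # 0.0, 0.5, 1.0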

For the Wine Quality dataset, we create an ‘average_quality’ column by taking the
mean of several columns, and then create a ‘quality_label’ column based on this
average. Finally, we load the transformed data into two new CSV files.

First five rows of the Wine dataset (processed data).

First five rows of the Wine Quality dataset (processed data).

Once the ETL process is complete and your data has been extracted, transformed,
and loaded into a data warehouse or other system, there are a number of different
ways this data can be utilized. Data analytics, machine learning, data visualization,
and reporting are some common applications. Here are some examples:

Data Analytics
The primary purpose of performing ETL is to prepare your data for analysis. You
can use descriptive, predictive, or prescriptive analytics techniques depending on
your needs. For instance, with the Wine Quality dataset, you could perform a
statistical analysis to understand the factors that influence wine quality. This might
involve using correlation matrices or regression analysis to identify the variables
most strongly associated with high-quality wine. Here’s a simple example that uses
Seaborn and Matplotlib to create and plot a correlation matrix:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation Matrix
corr = wine_quality_data.corr(numeric_only = True)

# Plot heatmap
plt.figure(figsize = (12, 10))
sns.heatmap(corr, annot = True, cmap = 'coolwarm')
plt.title('Correlation Matrix of Wine Quality Data')
# Save the figure
plt.savefig('correlation_matrix.png', dpi = 300, bbox_inches = 'tight')

plt.show()

When reading a correlation matrix and its corresponding heatmap, you can focus
on a few key aspects. First, examine the diagonal elements, which represent the
correlations of variables with themselves and should always be 1 since a variable is
perfectly correlated with itself. Next, observe the off-diagonal elements. Positive
values indicate a positive correlation, while negative values indicate a negative
correlation. The closer the value is to 1 or -1, the stronger the correlation. Pay
attention to the color intensity, with darker shades indicating stronger correlations.
Additionally, patterns in the heatmap can reveal clusters of variables with similar
correlations. This information helps identify relationships between variables and
can guide further analysis or modeling decisions.
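
If you want the same information numerically rather than visually, you can rank the
correlations with the quality score directly. This small sketch reuses the corr matrix
computed above (the key argument to sort_values requires pandas 1.1 or later):

# Variables most strongly correlated with 'quality', ranked by absolute value
quality_corr = corr['quality'].drop('quality').sort_values(key=abs, ascending=False)
print(quality_corr)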

Machine Learning
ETL also paves the way for machine learning. Once the data is cleaned and
transformed, it’s possible to train machine learning models on this data to make
predictions or classifications. For example, with the Wine dataset, you could
develop a model to classify wines into their respective classes based on the provided
features. Here’s a simple example using Scikit-Learn’s Random Forest Classifier:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define predictors and target
X = wine_data.drop('class', axis = 1)
y = wine_data['class']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                     random_state = 42)

# Initialize the model
clf = RandomForestClassifier(random_state = 42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Check accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0

The result `Accuracy: 1.0` signifies that your model correctly classified all samples
in your test set. Accuracy is a measure of the model’s performance and is calculated
as the ratio of the correctly predicted instances to the total instances in the dataset.
An accuracy of 1.0 means that the model made no mistakes when classifying the
data in the test set.
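
You can verify the same number by hand, which makes the definition concrete
(assuming the y_test and y_pred objects from the snippet above):

import numpy as np

# Accuracy = correctly predicted instances / total instances
manual_accuracy = np.mean(y_pred == y_test.to_numpy())
print("Manual accuracy:", manual_accuracy)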

However, an accuracy score of 1.0 (or 100%) could indicate a few things:

1. The model has learned the training data perfectly: This might be the case if the
dataset is simple and all features contribute directly to the outcome. However,
this is rare in complex real-world datasets.

2. Overfitting: This is a more common reason for a perfect accuracy score.
Overfitting occurs when the model learns the training data too well, capturing
not just the general patterns, but also the noise or the outliers. It means the
model will perform perfectly on the training data (and, due to the random
chance of splitting, potentially on your test set), but it may not generalize well to
new, unseen data.

3. Data leakage: This occurs when information from the test set leaks into the
training set. In such cases, the model has access to the data it should not know,
thereby leading to overly optimistic results.

If your model is consistently giving you an accuracy of 1.0, it’s worth checking to
make sure you haven’t accidentally introduced data leakage or overfitting. You could
try cross-validation, introduce regularization, or check your feature selection and
data cleaning steps to prevent these issues.

For example, using cross-validation might give a more realistic picture of your
model’s performance:

from sklearn.model_selection import cross_val_score

# Initialize the model
clf = RandomForestClassifier(random_state = 42)

# Compute cross-validation score
scores = cross_val_score(clf, X, y, cv = 5)

print("Cross-validation scores:", scores)
print("Average cross-validation score:", scores.mean())

Cross-validation scores: [0.97222222 0.94444444 0.97222222 0.97142857 1. ]
Average cross-validation score: 0.9720634920634922

This gives you an array of five scores (one per fold, because cv=5), each computed
on a different train/test split, along with their average. If
these scores are all high and relatively close to 1.0, your model is likely performing
well. If there is large variation or the scores are significantly less than 1.0, then the
original score of 1.0 on the test set may be due to overfitting or random chance.

Data Visualization
Data visualization is an excellent way to understand the patterns, trends, and
correlations in your data. For instance, you could use Seaborn or Matplotlib to
create scatter plots, histograms, or bar plots to better understand the Wine and
Wine Quality datasets.

import numpy as np
import matplotlib.pyplot as plt

# Plot histogram
plt.figure(figsize = (10, 7))
wine_quality_data['quality'].plot(kind = 'hist', rwidth = 0.95,
bins = np.arange(2.5, 9))

plt.title('Distribution of Wine Quality Ratings')
plt.xlabel('Quality Ratings')
plt.ylabel('Count')

# Adjust x-ticks
plt.xticks(np.arange(3, 9, step=1))

# Save the figure
plt.savefig('histogram_wine_quality_ratings.png', dpi = 300,
            bbox_inches = 'tight')
plt.show()

In this code, the bins parameter is set to np.arange(2.5, 9), which generates bin
edges at 2.5, 3.5, …, 8.5, so each bin is one unit wide and centered on an integer
rating. The plt.xticks(np.arange(3, 9, step=1)) line places the x-ticks at 3 through 8,
i.e., at the center of each bar.

Please note that the starting and ending values for the bins parameter (2.5 and 9 in
this case) and the values passed to np.arange() inside plt.xticks() (3 and 9 in this
case) should be adjusted based on the actual range of your 'quality' ratings.

Reporting
ETL also enables you to conduct more effective reporting. Clean, organized data is
crucial for generating accurate, up-to-date reports. For example, you could use
Python’s Pandas library to generate monthly reports of wine quality or use a BI tool
like Tableau or Power BI to create interactive dashboards based on your cleaned and
transformed data.
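
The Wine Quality data used here has no date column, so a true monthly report is not
possible with this particular dataset, but the same pattern works for any grouping
key. Here is a minimal sketch of a summary report grouped by the quality_label
column created during the transformation step:

# Mean alcohol, pH, and quality score per quality label
report = (wine_quality_data
          .groupby('quality_label', observed=True)[['alcohol', 'pH', 'quality']]
          .mean()
          .round(2))

report.to_csv('wine_quality_report.csv')
print(report)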

ETL to Infinity & Beyond


As our world continues to generate vast amounts of data, the demand for efficient
data management processes such as ETL grows. As we’ve demonstrated, utilizing
ETL best practices can significantly enhance the quality of data and the efficiency of
its analysis, making it a powerful asset in today’s data-driven landscape.

Looking into the future, the importance of ETL and data engineering is set to rise.
The emergence of Big Data, machine learning, artificial intelligence, and IoT
technologies all depend on reliable, clean data, making ETL processes more critical
than ever. Businesses that can efficiently extract, transform, and load their data are
more likely to leverage these technologies successfully, providing a competitive
edge.

The rise of cloud computing is also shaping the ETL landscape, with ETL tasks
becoming increasingly integrated with cloud services. Such integration allows for
greater scalability and performance, enabling businesses to handle larger datasets
faster and more effectively.

For individuals considering a career in data science, business intelligence, or
similar fields, learning data engineering and ETL processes is a valuable
investment. The demand for professionals proficient in ETL processes and tools is
steadily increasing as more companies realize the potential value of their data.

Understanding ETL not only opens up career opportunities but also allows you to
contribute to this exciting field. With the ongoing advancement in data processing
technologies, there’s no doubt that the importance and complexity of ETL processes
will only continue to grow. By learning and implementing ETL best practices, you’ll
be well-equipped to navigate and thrive in this evolving landscape.

Embracing ETL and data engineering practices is more than just a technological
decision — it is a strategic move that can propel businesses into the next phase of
digital innovation and secure a competitive edge in the future. Those who are well-
versed in these areas will be leading the way, making the journey from raw data to
actionable insight more streamlined and enlightening than ever.

Resources
1. Data Profiling: Getting Started with Data Profiling (Talend)

2. Data Validation Rules: Data Quality Rules & Validation (Informatica)

3. ETL: Translate raw data into actionable data (Databricks)

4. ETL testing: A comprehensive guide to ensuring data quality and integration (Talend)

5. Documentation: Overview Software Documentation (Geeksforgeeks)

6. Continuous Monitoring and Optimization: 5 Ideas to Expand Your ETL Process
Monitoring (Paessler)

ETL is a fundamental part of data management. While it might seem complex at
first, understanding the basic steps and practicing with real-world examples can
simplify the process. Python, with its extensive range of data-focused libraries, can
be a powerful tool for ETL. Remember, effective data management is key to any
data-driven organization. Happy Learning!!

Did you enjoy this post? Would you like to sponsor my caffeine addiction? If YES, you
can BUY ME A COFFEE and help me stay awake to continue creating awesome things!
