A Beginner's Guide To ETL With Python - by Jesús Cantú - Medium
ETL Process
ETL, which stands for Extract, Transform, Load, is a crucial process in data
management. It involves retrieving data from various sources, modifying it to fit
business needs or target system requirements, and then loading it into a central
location such as a data warehouse. In this article, we will simplify the ETL process
for beginners, delve into each step, and illustrate it with a real-world Python
example using publicly available data.
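At a glance, any ETL pipeline can be sketched as three functions. Here is a minimal, illustrative skeleton (the function names and the dropna() cleanup step are placeholders, not a prescribed design):

import pandas as pd

def extract(source_url: str) -> pd.DataFrame:
    # Pull raw data out of a source system (here, a CSV file over HTTP)
    return pd.read_csv(source_url)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Clean and reshape the data to fit business or target-system needs
    return df.dropna()

def load(df: pd.DataFrame, target_path: str) -> None:
    # Persist the processed data where downstream users can reach it
    df.to_csv(target_path, index=False)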
Loading Data
After the data has been transformed, it is then loaded into a data warehouse or other
target system. This is where the processed data will reside for further use, usually
for analytics, business intelligence, or reporting purposes.
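What "loading" looks like depends on the target. As one minimal sketch, assuming a local SQLite database stands in for the warehouse (the database file and table name here are hypothetical):

import sqlite3
import pandas as pd

# A stand-in "warehouse": a local SQLite database file
conn = sqlite3.connect('warehouse.db')

# Write the processed DataFrame into a table for analytics and reporting
processed_df = pd.DataFrame({'id': [1, 2], 'value': [10.5, 20.1]})
processed_df.to_sql('processed_data', conn, if_exists='replace', index=False)
conn.close()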
ETL in Python
Python is a great language for performing ETL because of its readability and the
wide range of data-centric libraries it offers. Here, we’re going to use Python to
perform ETL on two datasets from the UCI Machine Learning Repository: the Wine
dataset and the Wine Quality dataset. We’ll use the Pandas and NumPy libraries for
data manipulation.
Selected attributes from the Wine dataset include:

3. Ash: A measure of the wine's "ash" content, which is the inorganic matter remaining after evaporation and incineration.
13. Proline: A measure of the proline (an amino acid) content in the wine.

Selected attributes from the Wine Quality dataset include:

3. Citric Acid: The citric acid content, which can add freshness and flavor to wines.
6. Free Sulfur Dioxide: The free form of SO2 present in the wine.
7. Total Sulfur Dioxide: The total amount of SO2 in the wine.
8. Density: The density of the wine, which can provide some idea about the alcohol and sugar content.
9. pH: Describes how acidic or basic the wine is on a scale from 0 (very acidic) to 14 (very basic).
10. Sulphates: A wine additive that can contribute to sulfur dioxide (SO2) levels.
Coding Example
# Required Libraries
import pandas as pd
import numpy as np
# Extraction
wine_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine_data = pd.read_csv(wine_url, header=None)
# The red-wine file is used here; a winequality-white.csv file is also available
wine_quality_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_quality_data = pd.read_csv(wine_quality_url, sep=";")
# Transformation
# Assigning meaningful column names
wine_data.columns = ['class', 'alcohol', 'malic acid', 'ash',
'alcalinity of ash', 'magnesium', 'total phenols',
'flavonoids', 'nonflavonoid phenols', 'proanthocyanidins',
'color intensity', 'hue', 'OD280/OD315 of diluted wines',
'proline']
# Converting the class column into a categorical datatype
wine_data['class'] = wine_data['class'].astype('category')
# Checking for missing values
print(wine_data.isnull().sum())
# Min-Max normalization of the alcohol column
wine_data['alcohol'] = (wine_data['alcohol'] - wine_data['alcohol'].min()) / \
                       (wine_data['alcohol'].max() - wine_data['alcohol'].min())
# Creating an average_quality column (the columns averaged here are illustrative)
wine_quality_data['average_quality'] = wine_quality_data[
    ['alcohol', 'sulphates', 'citric acid']].mean(axis=1)
# Creating a quality_label column from that average (the binning is illustrative)
wine_quality_data['quality_label'] = pd.cut(wine_quality_data['average_quality'],
                                            bins=3, labels=['low', 'medium', 'high'])
# Loading
# Saving the transformed data as CSV files
wine_data.to_csv('wine_dataset.csv', index=False)
wine_quality_data.to_csv('wine_quality_dataset.csv', index=False)
In this Python script, we extract the Wine and Wine Quality datasets from the UCI
repository, and then carry out several transformation steps. For the Wine dataset,
we assign meaningful column names, convert the ‘class’ column into a categorical
datatype, check for missing values, and normalize the ‘alcohol’ column using Min-
Max normalization.
For the Wine Quality dataset, we create an ‘average_quality’ column by taking the
mean of several columns, and then create a ‘quality_label’ column based on this
average. Finally, we load the transformed data into two new CSV files.
[Figure: First five rows of the Wine dataset (processed data).]
Once the ETL process is complete and your data has been extracted, transformed,
and loaded into a data warehouse or other system, there are a number of different
ways this data can be utilized. Data analytics, machine learning, data visualization,
and reporting are some common applications. Here are some examples:
Data Analytics
The primary purpose of performing ETL is to prepare your data for analysis. You
can use descriptive, predictive, or prescriptive analytics techniques depending on
your needs. For instance, with the Wine Quality dataset, you could perform a
statistical analysis to understand the factors that influence wine quality. This might
involve using correlation matrices or regression analysis to identify the variables
most strongly associated with high-quality wine. Here’s a simple example that uses
Seaborn and Matplotlib to create and plot a correlation matrix:
# Required libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric columns
corr = wine_quality_data.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Wine Quality Data')

# Save the figure
plt.savefig('correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
When reading a correlation matrix and its corresponding heatmap, you can focus
on a few key aspects. First, examine the diagonal elements, which represent the
correlations of variables with themselves and should always be 1 since a variable is
perfectly correlated with itself. Next, observe the off-diagonal elements. Positive
values indicate a positive correlation, while negative values indicate a negative
correlation. The closer the value is to 1 or -1, the stronger the correlation. Pay
attention to the color intensity, with darker shades indicating stronger correlations.
Additionally, patterns in the heatmap can reveal clusters of variables with similar
correlations. This information helps identify relationships between variables and
can guide further analysis or modeling decisions.
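Regression analysis, mentioned above, is a natural next step for ranking the drivers of quality. A minimal sketch using Scikit-Learn's LinearRegression (both the choice of library and the choice to regress quality on all other numeric columns are illustrative):

from sklearn.linear_model import LinearRegression

# Regress quality on every other numeric column
X = wine_quality_data.drop(columns=['quality']).select_dtypes(include='number')
y = wine_quality_data['quality']
reg = LinearRegression().fit(X, y)

# List features by the magnitude of their coefficients
for feature, coef in sorted(zip(X.columns, reg.coef_),
                            key=lambda pair: abs(pair[1]), reverse=True):
    print(f"{feature}: {coef:.3f}")

Keep in mind that raw coefficients are scale-dependent; standardizing the features first makes them easier to compare.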
Machine Learning
ETL also paves the way for machine learning. Once the data is cleaned and
transformed, it’s possible to train machine learning models on this data to make
predictions or classifications. For example, with the Wine dataset, you could
develop a model to classify wines into their respective classes based on the provided
features. Here’s a simple example using Scikit-Learn’s Random Forest Classifier:
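First, a minimal training sketch that defines the clf, X_test, and y_test used in the prediction step below, assuming the transformed wine_data from the ETL example (the 80/20 split and random_state are illustrative choices):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Separate the features from the target class
X = wine_data.drop(columns=['class'])
y = wine_data['class']

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train a Random Forest Classifier on the training split
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)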
# Make predictions
y_pred = clf.predict(X_test)
# Check accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 1.0
The result `Accuracy: 1.0` signifies that your model correctly classified all samples
in your test set. Accuracy is a measure of the model’s performance and is calculated
as the ratio of the correctly predicted instances to the total instances in the dataset.
An accuracy of 1.0 means that the model made no mistakes when classifying the
data in the test set.
However, an accuracy score of 1.0 (or 100%) could indicate a few things:
1. The model has learned the training data perfectly: This might be the case if the
dataset is simple and all features contribute directly to the outcome. However,
this is rare in complex real-world datasets.
2. Overfitting: The model may have memorized the training data rather than
learning patterns that generalize, which can inflate test scores when the test
set is small or too similar to the training data.
3. Data leakage: This occurs when information from the test set leaks into the
training set. In such cases, the model has access to data it should not know,
thereby leading to overly optimistic results.
If your model is consistently giving you an accuracy of 1.0, it’s worth checking to
make sure you haven’t accidentally introduced data leakage or overfitting. You could
try cross-validation, introduce regularization, or check your feature selection and
data cleaning steps to prevent these issues.
For example, using cross-validation might give a more realistic picture of your
model’s performance:
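A minimal sketch using Scikit-Learn's cross_val_score, assuming the clf, X, and y defined in the classification example above:

from sklearn.model_selection import cross_val_score

# Evaluate the classifier on five different train/test splits
scores = cross_val_score(clf, X, y, cv=5)
print("Cross-validation scores:", scores)
print("Average score:", scores.mean())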
This will give you an array of five scores (one per fold, due to cv=5), each from
a different train/test split, along with their average. If these scores are all
high and close to 1.0, your model is likely performing well. If there is large
variation, or the scores are significantly below 1.0, then the original score of
1.0 on the test set may be due to overfitting or random chance.
Data Visualization
Data visualization is an excellent way to understand the patterns, trends, and
correlations in your data. For instance, you could use Seaborn or Matplotlib to
create scatter plots, histograms, or bar plots to better understand the Wine and
Wine Quality datasets.
import numpy as np
import matplotlib.pyplot as plt

# Plot a histogram of the quality ratings
plt.figure(figsize=(10, 7))
wine_quality_data['quality'].plot(kind='hist', rwidth=0.95,
                                  bins=np.arange(2.5, 9))

# Adjust x-ticks so each label sits under its bar
plt.xticks(np.arange(3, 9, step=1))
plt.xlabel('Quality rating')
plt.title('Distribution of Wine Quality Ratings')
plt.show()
In this code, the bins parameter is set to np.arange(2.5, 9), which generates bin
edges from 2.5 to 8.5 in steps of 1, so that each integer rating falls at the
center of its own bin. The plt.xticks(np.arange(3, 9, step=1)) line then places
the x-ticks at those centers, directly under each bar.
Please note that the values passed to np.arange() for the bins (2.5 and 9 in this
case) and for the x-ticks (3 and 9 in this case) should be adjusted to match the
actual range of your 'quality' ratings.
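Along the same lines, here is a short scatter-plot sketch (the alcohol-versus-quality pairing is an illustrative choice):

import matplotlib.pyplot as plt

# Scatter plot of alcohol content against quality rating
plt.figure(figsize=(10, 7))
plt.scatter(wine_quality_data['alcohol'], wine_quality_data['quality'], alpha=0.3)
plt.xlabel('Alcohol (% by volume)')
plt.ylabel('Quality rating')
plt.title('Alcohol vs. Quality')
plt.show()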
Reporting
ETL also enables you to conduct more effective reporting. Clean, organized data is
crucial for generating accurate, up-to-date reports. For example, you could use
Python’s Pandas library to generate monthly reports of wine quality or use a BI tool
like Tableau or Power BI to create interactive dashboards based on your cleaned and
transformed data.
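The wine datasets have no time dimension, so a per-rating summary can stand in for a monthly report here. A minimal sketch with Pandas (the grouping and the output file name are illustrative):

# Summarize each chemical property per quality rating
report = wine_quality_data.groupby('quality').mean(numeric_only=True).round(2)

# Save the summary for distribution or as input to a BI dashboard
report.to_csv('wine_quality_report.csv')
print(report)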
The rise of cloud computing is also shaping the ETL landscape, with ETL tasks
becoming increasingly integrated with cloud services. Such integration allows for
greater scalability and performance, enabling businesses to handle larger datasets
faster and more effectively.
Understanding ETL not only opens up career opportunities but also allows you to
contribute to this exciting field. With the ongoing advancement in data processing
technologies, there’s no doubt that the importance and complexity of ETL processes
will only continue to grow. By learning and implementing ETL best practices, you’ll
be well-equipped to navigate and thrive in this evolving landscape.
Embracing ETL and data engineering practices is more than just a technological
decision — it is a strategic move that can propel businesses into the next phase of
digital innovation and secure a competitive edge in the future. Those who are well-
versed in these areas will be leading the way, making the journey from raw data to
actionable insight more streamlined and enlightening than ever.