Create A Web Scraping Pipeline With Python Using Data Contracts
by Stephen David-Williams · Feb 2024 · Level Up Coding
Preface
This is a practical end-to-end data pipeline demo showing what a data project incorporating data contracts looks like.
We’ll be scraping the Premier League table standings for the 2023/24 season, as of 13th February 2024 (the date this article was posted). The scraped data will be loaded into a Postgres database through multiple stages governed by data contracts, and then saved to AWS S3 programmatically.
Pseudo-code
Here’s a rough brain dump on the steps we want the program to follow:
• Scrape the data if we’re allowed to do so, otherwise find out if they have an API
we can extract data from instead
• Check the transformation steps have shaped the data into the expected output
• Upload to AWS S3
Technologies
We’ve defined the basic steps our pipeline should follow, so now we can map the
right modules to support the process:
• os
• boto3
• pandas
• requests
• selenium
• python-dotenv
• soda-core
• soda-core-postgres
• soda-core-contracts
Architectural diagram
This is a visual representation of what the data pipeline looks like:
GIF by author
Folder structure
│ .env
│ .gitignore
│ check_robots
│ requirements.txt
│
├───config
│ extraction_config.yml
│ transformation_config.yml
│
├───data
│ transformed_data.csv
│
├───drivers
│ chromedriver.exe
│
├───src
│ │ etl_pipeline.py
│ │ __init__.py
│ │
│ ├───extraction
│ │ alter_col_type.py
│ │ main.py
│ │ scraping_bot.py
│ │ __init__.py
│ │
│ ├───loading
│ │ s3_uploader.py
│ │ __init__.py
│ │
│ ├───transformation
│ │ main.py
│ │ transformations.py
│ │ __init__.py
│ │
│ └───web_checker
│ robots_txt_checker.py
│ __init__.py
│
├───tests
│ │ __init__.py
│ │
│ ├───data_contracts
│ │ extraction_data_contract.yml
│ │ transformation_data_contract.yml
│ │
│ └───data_quality_checks
│ scan_extraction_data_contract.py
│ scan_transformation_data_contract.py
│
└───utils
aws_utils.py
db_utils.py
__init__.py
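Judging by this layout, etl_pipeline.py is presumably a thin entry point that runs each stage in order. Here's a minimal sketch of what that might look like; the module paths and import style are assumptions based on the folder structure above, and the job functions are the ones shown later in this post:

# etl_pipeline.py (sketch; paths assumed from the folder structure above)
from src.extraction.main import extraction_job
from src.transformation.main import transformation_job
from src.loading.s3_uploader import loading_job


def run_pipeline():
    # Run each stage of the pipeline in sequence
    extraction_job()
    transformation_job()
    loading_job()


if __name__ == "__main__":
    run_pipeline()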
Before we scrape the website, we need to verify if we’re allowed to do this in the first
place. No one likes a bunch of bots overloading their websites with traffic that
doesn’t benefit them, especially if it’s used for commercial purposes.
To do this, we’ll need to check the robots.txt of the site. Doing things
programmatically reduces the chances of humans misinterpreting the response of
this check.
Click here for the Python code that performs this operation (it's also available on GitHub).
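As a rough illustration, a check like this can be done with Python's built-in urllib.robotparser. This is a minimal sketch rather than the repo's actual robots_txt_checker.py; the site URL and path below are placeholders for whichever pages you intend to scrape:

from urllib import robotparser

# Placeholder URL: substitute the site you intend to scrape
SITE_URL = "https://www.example.com"

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE_URL}/robots.txt")
parser.read()

# can_fetch() tells us whether the given user agent may request the page
if parser.can_fetch("*", f"{SITE_URL}/league-table"):
    print("Scraping this path is permitted by robots.txt")
else:
    print("robots.txt disallows scraping this path - stop here")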
The response contains an Allow directive implying all user agents may access any area of the site, and there are no Disallow entries.
It’s safe to assume there are no areas of the site we’re not allowed to access, but it’s still important that we scrape respectfully and in accordance with their terms of service (ToS).
Data contracts
We’ll be using data contracts in this blog, so here’s a TLDR:
A data contract lets data consumers articulate everything they expect from the developers producing the data (including how those expectations should be met, the timelines for meeting them and so on), and the data producers fulfil those expectations using the details in this document (the data contract).
One key benefit of data contracts is that they make it difficult for changes to occur without consuming users and tools being informed first. Any change at the source level has to be approved before it merges into or replaces the existing schema structure.
On that note, consider this blog as a mini proof of concept into why coupling data
contracts with your data pipelines (or even data products) may be a good approach
to improving data governance initiatives with minimal effort.
In the real world, we would need some of the following listed below to implement
them in a production environment (among other tools & considerations of course):
• Data contract as a file, for defining the schema, data rules and other constraints
• Version control system — to keep and maintain each version of the data contract
details
• Orchestration tool — to automatically run the data quality tests once changes
are detected in the data source
• CI/CD pipeline — to merge the changes if they pass the data quality tests, or
circuit break the entire operation if they fail
For simplicity’s sake, we’ll just use the data contract and leave out the other components mentioned, because the current release of Soda data contracts doesn’t support Docker (as of the time of this writing). Once it does, I’ll write a separate blog post on it.
We’ll also play the role of the data producers and simulate the data consumers for each data contract created.
Disclaimer: I have no affiliate or financial links with Soda whatsoever. I’m simply using it objectively for experimental purposes, so the opinions are my own and developed progressively through POCs like this one.
Extract (E)
To scrape the data we want, we first need to understand what we need to scrape in the first place, from the perspective of the users who will actually be using it.
• contract.yml — adds the information about the data source, the schema and the data quality checks to run
For this extraction stage, the data consumers are the transformation team, who have submitted their list of expectations to us (the data producers) for the scraped data they want landed in Postgres:
• 20 unique rows
• 16 fields
├───extraction
│ alter_col_type.py
│ main.py
│ scraping_bot.py
│ __init__.py
dataset: scraped_fb_data
columns:
- name: pos
data_type: integer
unique: true
- name: team
data_type: varchar
not_null: true
- name: p
data_type: integer
not_null: true
- name: w1
data_type: integer
not_null: true
- name: d1
data_type: integer
not_null: true
- name: l1
data_type: integer
not_null: true
- name: gf1
data_type: integer
not_null: true
- name: ga1
data_type: integer
not_null: true
- name: w2
data_type: integer
not_null: true
- name: d2
data_type: integer
not_null: true
- name: l2
data_type: integer
not_null: true
- name: gf2
data_type: integer
not_null: true
- name: ga2
data_type: integer
not_null: true
- name: gd
data_type: integer
not_null: true
- name: pts
data_type: integer
not_null: true
- name: date
data_type: date
not_null: true
checks:
- row_count = 20
1. Scrape the content and print a message to signal the job succeeded
We need to establish where the content we want to scrape sits on the webpage, so that we know how to go about scraping it in the first place. This requires looking at the DOM the website generates when it first loads.
The DOM (Document Object Model) is just a representation of the HTML objects that form the content you see on a website. So each paragraph, heading and button on a webpage is represented as a node within the DOM, in a tree-like structure.
If you’re reading this on a desktop, you can check the DOM of this page by pressing
F12 on your keyboard (or right click on a section of the webpage and click Inspect).
This should open up the DevTools panel — you can find the DOM under the Elements
tab.
Selenium provides different methods to choose from. We could access the DOM’s
elements by:
• ID
• Name
• Tag name
• CSS selector
• XPath
• the entire Premier League table standings is represented by the <table> tag
• the same <table> tag has a class attribute with the value “leaguetable”
• each row in the Premier League table standings is represented by the <tr> tag
• Iterate through each row within the table (i.e. each <tr> element within the <table> tag)
• For each row, extract the data from each of its cells (see the sketch below)
The extraction stage is made up of the following modules:
• db_utils.py
• scraping_bot.py
• main.py
The db_utils.py module will hold the commands to interact with the Postgres database:
import os

import pandas as pd
import psycopg2
from dotenv import load_dotenv

# Load the credentials defined in the .env file
load_dotenv()


def connect_to_db():
    """
    Uses environment variables to connect to the Postgres database
    """
    HOST = os.getenv('HOST')
    PORT = os.getenv('PORT')
    DATABASE = os.getenv('DATABASE')
    POSTGRES_USERNAME = os.getenv('POSTGRES_USERNAME')
    POSTGRES_PASSWORD = os.getenv('POSTGRES_PASSWORD')

    try:
        # NOTE: the try/connect lines were cut at a page break in the original;
        # they are reconstructed here from the arguments that follow
        db_connection = psycopg2.connect(
            host=HOST,
            port=PORT,
            dbname=DATABASE,
            user=POSTGRES_USERNAME,
            password=POSTGRES_PASSWORD,
        )
        db_connection.set_session(autocommit=True)
        print("Connection to the database established successfully.")
        return db_connection
    except Exception as e:
        raise Exception(f"[ERROR - DB CONNECTION]: Error connecting to the database: {e}")
def create_extraction_schema_and_table(db_connection, schema_name, table_name):
    # NOTE: this function's name and signature were truncated in the original
    # excerpt; they are assumptions based on how the table is used elsewhere
    cursor = db_connection.cursor()
    create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {schema_name}.{table_name} (
            "pos" INTEGER,
            "team" TEXT NOT NULL,
            "p" INTEGER,
            "w1" INTEGER,
            "d1" INTEGER,
            "l1" INTEGER,
            "gf1" INTEGER,
            "ga1" INTEGER,
            "w2" INTEGER,
            "d2" INTEGER,
            "l2" INTEGER,
            "gf2" INTEGER,
            "ga2" INTEGER,
            "gd" INTEGER,
            "pts" INTEGER,
            "date" DATE
        )
    """
    cursor.execute(create_table_query)
    db_connection.commit()
    cursor.close()
def insert_extracted_data(db_connection, schema_name, table_name, dataframe):
    """
    Inserts data from a pandas dataframe into the extraction Postgres table.
    """
    # NOTE: the function name/signature above are assumptions; the original
    # definition was truncated at a page break
    cursor = db_connection.cursor()
    for index, row in dataframe.iterrows():
        data = tuple(row)
        placeholders = ",".join(["%s"] * len(row))
        insert_query = f"""
            INSERT INTO {schema_name}.{table_name} (
                "pos", "team", "p", "w1", "d1", "l1", "gf1", "ga1",
                "w2", "d2", "l2", "gf2", "ga2", "gd", "pts", "date"
            )
            VALUES ({placeholders})
        """
        try:
            cursor.execute(insert_query, data)
            db_connection.commit()
        except Exception as e:
            print(f"Failed to insert data: {data}. Error: {e}")
    cursor.close()
# Transformation
def create_transformed_schema_and_table(db_connection, schema_name, table_name):
    # NOTE: the parameter list is an assumption; the original definition was
    # truncated, but the name matches the call in transformation_job further down
    cursor = db_connection.cursor()
    create_table_query = f"""
        CREATE TABLE IF NOT EXISTS {schema_name}.{table_name} (
            "position" INTEGER,
            "team_name" VARCHAR,
            "games_played" INTEGER,
            "goals_for" INTEGER,
            "goals_against" INTEGER,
            "goal_difference" INTEGER,
            "points" INTEGER,
            "match_date" DATE
        )
    """
    cursor.execute(create_table_query)
    db_connection.commit()
    cursor.close()
def insert_transformed_data(db_connection, schema_name, table_name, dataframe):
    # NOTE: the signature and the insert_query construction are assumptions;
    # the original definition was truncated at a page break
    cursor = db_connection.cursor()

    # Building placeholders for the VALUES part of the INSERT INTO statement
    placeholders = ', '.join(['%s' for _ in dataframe.columns])
    column_names = ', '.join([f'"{col}"' for col in dataframe.columns])
    insert_query = f"""
        INSERT INTO {schema_name}.{table_name} ({column_names})
        VALUES ({placeholders})
    """

    # Execute the INSERT INTO statement for each row in the dataframe
    for index, row in dataframe.iterrows():
        # Print the row to be inserted for debugging purposes
        # print(f"Row data: {tuple(row)}")
        try:
            cursor.execute(insert_query, tuple(row))
            db_connection.commit()
        except Exception as e:
            print(f"Failed to insert transformed data: {tuple(row)}. Error: {e}")
    cursor.close()
# Loading
def fetch_transformed_data(db_connection):
    """
    Pulls data from the transformation table in Postgres.
    """
    print("Fetching transformed data from the database...")
    try:
        query = "SELECT * FROM staging.transformed_fb_data;"
        df = pd.read_sql(query, db_connection)
        print("Data fetched successfully.")
        return df
    except Exception as e:
        raise Exception(f"[ERROR - FETCH DATA]: {e}")
The .env file will hold the database credentials our scraper needs to connect to the
Postgres database:
# Postgres
HOST="localhost"
PORT=5434
DATABASE="test_db"
POSTGRES_USERNAME=${POSTGRES_USERNAME}
POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
# AWS
...
The scraping_bot.py module holds the Selenium logic. Here is the tail end of the scraping function, which prints the standings and packs them into a dataframe:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

    # ... (driver setup and table scraping omitted in this excerpt) ...

    if show_output:
        print(f"Premier League Table Standings (as of {formatted_date}):")
        print('-' * 60)
        for row_data in all_data:
            print(' '.join(row_data))
        print('\n' + '-' * 60)

    driver.implicitly_wait(2)
    driver.quit()

    # I've commented out the normal-cased columns because Soda's data contract
    # parser (at the time of writing) works with lowercase column names
    # columns = ["Pos", "Team", "P", "W1", "D1", "L1", "GF1", "GA1", "W2", "D2",
    #            "L2", "GF2", "GA2", "GD", "Pts", "Date"]
    columns = ["pos", "team", "p", "w1", "d1", "l1", "gf1", "ga1", "w2", "d2",
               "l2", "gf2", "ga2", "gd", "pts", "date"]
    df = pd.DataFrame(all_data, columns=columns)

    return df
def run_dq_checks_for_extraction_stage():
    """
    Performs data quality checks for the extraction stage using Soda,
    based on the predefined data contract.

    1. Pulls the YAML files for the config + data contracts
    2. Reads the data source, schema and data quality checks specified in them
    3. Executes the data quality checks
    """
    # Correctly set the path to the project root directory
    project_root_directory = os.path.dirname(
        os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    )

    # ... (Soda scan setup omitted in this excerpt) ...

    scan.assert_no_checks_fail()


if __name__ == "__main__":
    run_dq_checks_for_extraction_stage()
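For reference, this is roughly how a scan is wired up with plain soda-core's Scan API. It's a sketch only: the experimental data contracts interface the post uses may differ, and the data source name and file paths below are placeholders drawn from the project layout:

from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("raw")  # placeholder data source name
# Postgres connection details for Soda (placeholder path from the config/ folder)
scan.add_configuration_yaml_file(file_path="config/extraction_config.yml")
# Checks to run against the scraped table (placeholder path)
scan.add_sodacl_yaml_file("tests/data_quality_checks/extraction_checks.yml")

scan.execute()
print(scan.get_logs_text())
scan.assert_no_checks_fail()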
…and the main.py file is where we scrape and load the data to Postgres and then run
the data quality tests:
def extraction_job():
    # Flag for running the data quality checks only
    RUN_DQ_CHECKS_ONLY = True

    if not RUN_DQ_CHECKS_ONLY:
        # Connect to the database
        connection = connect_to_db()
        # ... (scrape the standings and load them into Postgres; omitted in this excerpt) ...
    else:
        run_dq_checks_for_extraction_stage()


if __name__ == "__main__":
    extraction_job()
The scan states there was a mismatch between the expected data type of the team
column (varchar) and the actual data type for the team column in the Postgres
database (text).
This now forces us to consider one of these responses (among many potential
others):
• raise the flagged error message with the ‘data consumer’ for an adequate
response (change it, leave it, or do something else?)
• ignore the previous option, enforce one ourselves and update the consumers on
which approach we’ve adopted (provided we have the official green light from
the consumers to do so)
Whichever option we advance with, a good thing about going down the data contract route is that the data quality issue is immediately exposed, giving the data producers and consumers a chance to resolve it before it spreads downstream.
In a real-world scenario, an issue like this would be more severe because high revenue-generating products may depend on that column’s data type being accurate.
If issues like these go unnoticed, they can break downstream tools and make it harder to figure out where the bugs are occurring. That inevitably means more time (and money) spent diagnosing and firefighting an avoidable issue, one that could have been prevented if the data quality expectations were articulated in an accessible, version-controllable file like a data contract (not too different from this simple YAML example).
Now we can troubleshoot the issue (for example, by aligning the contract’s team data type with the text type Postgres actually uses, or altering the column to varchar) and re-run the checks to confirm everything passes:
But let’s take a quick glance into Postgres just to be extra certain…:
Success! Our data quality checks have passed for the extraction stage!
Transform (T)
This is where we define the steps for curating the data into the expected format. These steps are a direct function of the data consumers’ expected version of the dataframe.
The data consumers for this stage are the loading team. Here is their list of expectations:
• using the home + away columns to calculate the points (but deducting 10 points from Everton FC due to PSR violations)
• sorting + resetting the index (the ’position’ field) once the points have been recalculated
Now that the expectations have been defined by the consumer, we can begin our
development.
• Connect to Postgres
├───transformation
│ add_columns.py
│ main.py
│ transformations.py
│ __init__.py
import pandas as pd


def rename_fields(df):
    """
    Rename each field name to a longer form version.
    """
    df_renamed = df.rename(columns={
        'pos': 'position',
        'team': 'team_name',
        'p': 'games_played',
        'w1': 'home_wins',
        'd1': 'home_draws',
        'l1': 'home_losses',
        'gf1': 'home_goals_for',
        'ga1': 'home_goals_against',
        'w2': 'away_wins',
        'd2': 'away_draws',
        'l2': 'away_losses',
        'gf2': 'away_goals_for',
        'ga2': 'away_goals_against',
        'gd': 'goal_difference',
        'pts': 'points',
        'date': 'match_date'
    })
    return df_renamed
def calculate_points(df):
    """
    Use the home and away columns to calculate the points, and
    deduct 10 points from Everton FC due to PSR violations starting
    from November 2023.
    """
    # Calculate points normally for all rows
    df['points'] = (
        df['home_wins'] * 3 + df['away_wins'] * 3 +
        df['home_draws'] + df['away_draws']
    )
    return df
def deduct_points_from_everton(df):
    """
    Deduct points for Everton FC if the match_date is in or after November 2023
    """
    psr_violation_start_date = pd.to_datetime('2023-11-01')
    everton_mask = (df['team_name'] == 'Everton') & (df['match_date'] >= psr_violation_start_date)
    df.loc[everton_mask, 'points'] -= 10
    return df
def sort_and_reset_index(df):
    """
    Sort the dataframe based on the Premier League table standings rules
    and reset the 'position' column to reflect the new ranking.
    """
    # Sort by points, then goal difference, then goals for (all descending;
    # the ascending flags were truncated in the original excerpt)
    df_sorted = df.sort_values(
        by=['points', 'goal_difference', 'goals_for'],
        ascending=[False, False, False]
    )
    return df_sorted
def transform_data(df):
    """
    Apply all the transformation intents on the dataframe.
    """
    df_renamed = rename_fields(df)
    df_points_calculated = calculate_points(df_renamed)
    df_points_deducted = deduct_points_from_everton(df_points_calculated)

    # ... (combining the home/away splits into totals and dropping the split
    #      columns happens here; those lines were cut from the excerpt) ...
    df_cleaned = df_points_deducted

    # Sort the dataframe by points, goal_difference, and goals_for to apply the league standings rules
    df_final = df_cleaned.sort_values(
        by=['points', 'goal_difference', 'goals_for'],
        ascending=[False, False, False]
    )

    # Reset the position column to reflect the new ranking after sorting
    df_final.reset_index(drop=True, inplace=True)
    df_final['position'] = df_final.index + 1
    return df_final
We’re:
• deducting points from Everton FC starting from November 2023 (due to PSR violations)
• dropping the home and away fields once we’re done calculating the points
Most of these are fairly straightforward, so let’s go straight into the Everton FC situation.
Just for context, there’s a rule known as PSR (Profit & Sustainability Rules) which states that every Premier League club is allowed to lose a maximum of £105 million. Everton FC lost £124.5 million up to the 2021/22 period, which exceeded the PSR threshold by almost £20 million.
As far as the independent commission reviewing their case was concerned, Everton violated this rule, and they were therefore penalized with a 10-point deduction. This is no small punishment by any means…it has impacted Everton’s position in the Premier League table, potentially placing them in danger of being relegated from the league for the first time in their history. Everton have naturally appealed the decision, arguing the commission did not calculate the losses accurately, but at the time of this writing (February 2024) the points deduction still stands.
To implement this, we:
• set the start date for the penalty (17th November 2023)
• highlight the rows that correspond to every game played by Everton FC after the penalty date
def deduct_points_from_everton(df):
    """
    Deduct points for Everton FC if the match_date is on or after 17th November 2023
    """
    psr_violation_start_date = pd.to_datetime('2023-11-17')
    everton_mask = (df['team_name'] == 'Everton') & (df['match_date'] >= psr_violation_start_date)
    df.loc[everton_mask, 'points'] -= 10
    return df
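As a quick sanity check of this logic (not part of the original repo), you can run the function over a couple of hypothetical rows and confirm that only Everton's points drop by 10:

import pandas as pd

# Hypothetical rows purely for testing the deduction logic
sample = pd.DataFrame({
    'team_name': ['Everton', 'Arsenal'],
    'match_date': pd.to_datetime(['2024-02-13', '2024-02-13']),
    'points': [35, 55],
})

result = deduct_points_from_everton(sample)
print(result[['team_name', 'points']])
# Expected: Everton 25, Arsenal 55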
dataset: transformed_fb_data
columns:
- name: position
data_type: integer
unique: true
- name: team_name
data_type: varchar
not_null: true
- name: games_played
data_type: integer
not_null: true
- name: wins
data_type: integer
not_null: true
- name: draws
data_type: integer
not_null: true
- name: losses
data_type: integer
not_null: true
- name: goals_for
data_type: integer
not_null: true
- name: goals_against
data_type: integer
not_null: true
- name: goal_difference
data_type: integer
not_null: true
- name: points
data_type: integer
not_null: true
valid_min: 0
- name: match_date
data_type: date
not_null: true
checks:
Now, how do we incorporate these point deductions into the data quality checks in the data contract?
This requires us to break down the logical sequence of steps the check needs to take. We would need to:
• calculate the total number of points before the penalty date (i.e. November 2023)
As of the time of this writing, creating two CTEs was the best approach I could come up with to incorporate this into the checks using SodaCL. There may be a better approach down the line, but this seems to be the most sensible way of going about it with the current Soda release for YAML-based data contracts.
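The exact SodaCL syntax is omitted here, but the logic such a check wraps can be illustrated in pandas. This is a hedged sketch of the equivalent validation, assuming a dataframe with the transformed columns: recompute the points from wins and draws, mirror the Everton deduction, and compare against the stored points column.

import pandas as pd


def validate_points_column(df):
    # Recompute expected points from wins and draws (3 for a win, 1 for a draw)
    expected_points = df['wins'] * 3 + df['draws']

    # Mirror the PSR deduction applied in the transformation step
    penalty_mask = (df['team_name'] == 'Everton') & (df['match_date'] >= pd.to_datetime('2023-11-17'))
    expected_points = expected_points.where(~penalty_mask, expected_points - 10)

    # Flag any rows where the stored points disagree with the recomputed value
    mismatches = df[df['points'] != expected_points]
    assert mismatches.empty, f"Points mismatch for: {mismatches['team_name'].tolist()}"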
def transformation_job():
    # Establish a connection to the database
    connection = connect_to_db()

    # Define schema and table names for extracted and transformed data
    extracted_schema_name = 'raw'
    extracted_table_name = 'scraped_fb_data'
    transformed_schema_name = 'staging'
    transformed_table_name = 'transformed_fb_data'

    # Create schema and table for the transformed data if they don't exist
    create_transformed_schema_and_table(connection, transformed_schema_name, transformed_table_name)

    # ... (fetching the extracted data, applying transform_data, inserting the
    #      results and running the DQ checks; omitted in this excerpt) ...


if __name__ == "__main__":
    transformation_job()
There are no data quality errors returned, so we’re good to advance to the next stage.
Load (L)
Now we need to write the dataframe to a CSV file so it’s ready to upload to our AWS S3 bucket.
At this point we don’t need to apply data quality checks, because no more transformations are applied at the data level. All that’s left is to convert the transformed data into CSV format and upload it to AWS S3.
Here are the steps we need to take for this stage:
• Set up the S3 client
• Create the bucket (if it doesn’t exist) to hold the Premier League table standings data
For this to occur, the load layer directory needs the following files:
├───loading
│ s3_uploader.py
│ __init__.py
You’ll need an AWS account, so be sure to set one up to follow along with this part.
AWS Utilities
The aws_utils.py module helps us interact with AWS services using Python. It manages the AWS configuration, sets up the bucket if it doesn’t exist, and handles errors with logs that are easy to read (including emojis for visual cues).
import os

import boto3
from dotenv import load_dotenv
from boto3.exceptions import Boto3Error

# Load the AWS credentials defined in the .env file
load_dotenv()


def connect_to_aws_s3():
    print("Connecting to AWS S3...")
    try:
        s3_client = boto3.client(
            's3',
            aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
            region_name=os.getenv("REGION_NAME")
        )
        print("Connected to AWS S3 successfully.")
        return s3_client
    except Boto3Error as e:
        raise Exception(f"[ERROR - AWS S3 CONNECTION]: {e}")
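The CSV-conversion, bucket-creation and upload helpers referenced by the loading job further down aren't shown in the excerpt. Here's a minimal sketch of what they might look like with boto3 and pandas, assuming the imports at the top of aws_utils.py above; the function names match how they're called later, but the bodies are assumptions rather than the repo's actual code:

def convert_dataframe_to_csv(df, filename):
    # Write the transformed dataframe to a local CSV file (no index column)
    df.to_csv(filename, index=False)
    print(f"Saved dataframe to {filename}")


def create_bucket_if_not_exists(s3_client, bucket_name):
    # head_bucket raises an error if the bucket is missing or inaccessible
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        print(f"Bucket '{bucket_name}' already exists.")
    except Exception:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": os.getenv("REGION_NAME")},
        )
        print(f"Created bucket '{bucket_name}'.")
    return bucket_name


def upload_file_to_s3(s3_client, filename, bucket_name, s3_folder):
    # Store the file under the configured folder (prefix) inside the bucket
    s3_key = f"{s3_folder}/{filename}"
    s3_client.upload_file(filename, bucket_name, s3_key)
    print(f"Uploaded {filename} to s3://{bucket_name}/{s3_key}")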
# Postgres
...
# AWS
AWS_ACCESS_KEY_ID="xxxxxxxxx"
AWS_SECRET_ACCESS_KEY="xxxxxxxxx"
REGION_NAME="eu-west-2"
S3_BUCKET="premier-league-standings-2024"
S3_FOLDER="football_data"
Loading to S3
The s3_uploader.py file is responsible for uploading the CSV file to the S3 bucket of
our choice.
def loading_job():
    print("Starting data transfer process...")
    connection = None
    try:
        connection = connect_to_db()
        df = fetch_transformed_data(connection)

        local_filename = 'transformed_data.csv'
        convert_dataframe_to_csv(df, local_filename)

        s3_client = connect_to_aws_s3()
        bucket_name = create_bucket_if_not_exists(s3_client, os.getenv("S3_BUCKET"))
        s3_folder = os.getenv("S3_FOLDER")
        upload_file_to_s3(s3_client, local_filename, bucket_name, s3_folder)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if connection:
            print("Closing the database connection.")
            connection.close()


if __name__ == "__main__":
    loading_job()
Once we’ve run s3_uploader.py, the CSV file is successfully uploaded to the S3 bucket, like so:
Results
Now that the scripts have run, we can compare the actual Premier League table to the outputs we’ve generated in our CSV file and Postgres table:
Postgres
Key takeaways
• Data contracts are not difficult to incorporate into a data pipeline (especially
programmatic ones)
• Using data contracts ensures only valid and accurate data progresses through
each stage without any surprises.
• We can increase the quality of data even with Soda’s experimental data contracts
feature
Resources
You can find all the code examples used in this article on GitHub here.
Looking ahead
The GA version of Soda’s data contracts (at the time of this writing) promises to support Docker, which would in turn enable more advanced integrations with Airflow. That means it will be possible to automate the orchestration of each stage of the pipeline.
In a future post I’ll deep dive into another end-to-end project using data contracts for more sophisticated real-world tasks.
Feel free to share your feedback, questions and comments if you have any!