Extract and Transform Data With Web Scraping - Learn Python Basics - OpenClassrooms
https://openclassrooms.com/en/courses/6902811-learn-python-basics/7091156-extract-and-transform-data-with-web-scraping Page 1 of 16
Extract and Transform Data With Web Scraping - Learn Python Basics - OpenClassrooms 08/01/2023, 15:49
Web scraping is the automated process of
retrieving (or scraping) data from a website.
Instead of manually collecting data, you can write
Python scripts (a fancy way of saying a code
process) that can collect the data from a website
and save it to a .txt or .csv file.
Load
ETL (extract, transform, load) is the “general
procedure of copying data from one or more
sources into a destination system which represents
the data differently from the source” (Wikipedia).
That’s just a fancy way to say that ETL is the
process of taking data from one place, massaging
it a little, and saving it in another place.
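To make the three stages concrete, here is a minimal sketch of an ETL pipeline that runs entirely in memory. The function names (extract, transform, load) and the sample records are hypothetical and chosen only for illustration; a real pipeline would read from a website or file and write to a file or database.

```python
import io


def extract():
    """Extract: return raw records (a hard-coded stand-in for a real source)."""
    return ["  Shirt ", "SOCKS", " hat"]


def transform(raw_items):
    """Transform: strip stray whitespace and normalize the case."""
    return [item.strip().lower() for item in raw_items]


def load(items, destination):
    """Load: write one item per line to a file-like object."""
    for item in items:
        destination.write(item + "\n")


# An in-memory buffer plays the role of the destination here;
# an open text file (e.g. open("items.txt", "w")) could be passed instead.
buffer = io.StringIO()
load(transform(extract()), buffer)
print(buffer.getvalue())
```

Keeping each stage in its own function means the source or the destination can be swapped without touching the other two steps.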
Tags
<p class="clothing">shirt</p>
<p class="clothing">socks</p>
This way, you know that all the items with the
“clothing” class will have a clothing item contained
within. You can use this “clothing” class later to get
all the items marked with the same class.
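As a rough illustration of selecting elements by class, the sketch below uses only Python's built-in html.parser module as a stand-in for a full parsing library. The ClassTextCollector helper and the extra "food" paragraph are assumptions added for the example, and the matching logic is deliberately simple (it does not handle nested elements).

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside_match = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.inside_match = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current match.
        self.inside_match = False

    def handle_data(self, data):
        if self.inside_match and data.strip():
            self.items.append(data.strip())


html = (
    '<p class="clothing">shirt</p>\n'
    '<p class="clothing">socks</p>\n'
    '<p class="food">apple</p>'  # hypothetical non-clothing item
)

collector = ClassTextCollector("clothing")
collector.feed(html)
print(collector.items)
```

Only the two paragraphs marked with the "clothing" class are collected; the "food" paragraph is skipped.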
To extract data from the website, we need to use
the requests library. Remember that it provides
functionality for making HTTP requests. We can
use it since we’re trying to get data from a website
that uses the HTTP protocol
(e.g., http://google.com).
import requests

url = "https://www.gov.uk/search/news-and-communications"
page = requests.get(url)
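A request can fail (for example, with a 404 or a timeout), so it may be worth checking the response before parsing it. The sketch below is one possible way to do that; the fetch_html name and its http_get parameter are hypothetical, introduced only so the function can be tried without a network connection.

```python
import requests


def fetch_html(url, http_get=requests.get):
    """Return the page body as text, raising an error if the request failed."""
    response = http_get(url, timeout=10)  # give up after 10 seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Example call (performs a real HTTP request, so it is left commented out):
# html = fetch_html("https://www.gov.uk/search/news-and-communications")
```

Passing a timeout avoids a script that hangs indefinitely when the server does not answer.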
Now that we have the HTML source, we need to
parse it. We will locate the elements we want by
the HTML class and id attributes
mentioned earlier.
import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/search/news-and-communications"
page = requests.get(url)
# Parse the downloaded HTML so it can be searched by tag, class, or id
soup = BeautifulSoup(page.content, "html.parser")
Now let’s see what data we can get from the news
and communications page. First, let’s get the titles
of all the stories. After inspecting the HTML page,
we can see that the titles of all the news stories
are in link elements denoted by <a> tags and have
the same class:
gem-c-document-list__item-title.
Here’s an example:
<a data-ecommerce-path="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-ecommerce-row="1"
   data-ecommerce-index="1"
   data-track-category="navFinderLinkClicked"
   data-track-action="News and communications.1"
   data-track-label="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-track-options='{"dimension28":20,"dimension29":"Restart of the UK in JAPAN campaign"}'
   class="gem-c-document-list__item-title gem-c-document-list__item-link"
   href="/government/news/restart-of-the-uk-in-japan-campaign--2">Restart of the UK in JAPAN campaign</a>
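As a small offline sketch of how such a link could be selected by its class, the example below runs Beautiful Soup on a trimmed copy of the anchor above instead of the live page. The sample_html variable is an assumption made for illustration, and the data-* attributes are left out for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the anchor shown above (data-* attributes omitted).
sample_html = """
<a class="gem-c-document-list__item-title gem-c-document-list__item-link"
   href="/government/news/restart-of-the-uk-in-japan-campaign--2">Restart of the UK in JAPAN campaign</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any element whose class list contains the given name,
# so the second class on the tag does not prevent a match.
links = soup.find_all("a", class_="gem-c-document-list__item-title")

for link in links:
    print(link.string, "->", link["href"])
```

The same find_all call should work on the soup built from the real page, provided the site still uses this class name.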
<p class="gem-c-document-list__item-description">Joint Statement by the Missions of the United States, the United Kingdom, Switzerland and the European Union on behalf of the EU Member States represented in Minsk on the use of violence and repression in Belarus</p>
descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")
Requests
Beautiful Soup
Transform Data
# Find every title link, then keep only the text of each one
bs_titles = soup.find_all("a", class_="gem-c-document-list__item-title")
titles = []
for title in bs_titles:
    titles.append(title.string)
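One detail worth knowing about the loop above: a tag's .string attribute is None when the tag contains more than one child (for example, text mixed with a nested tag). The sketch below compares it with get_text() on two made-up links; the sample HTML is hypothetical and not taken from the gov.uk page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second link mixes text with a nested <span>.
sample_html = """
<a class="gem-c-document-list__item-title" href="/a">Plain title</a>
<a class="gem-c-document-list__item-title" href="/b">Title with <span>nested</span> markup</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
bs_titles = soup.find_all("a", class_="gem-c-document-list__item-title")

# .string works only when the tag has a single child node.
strings = [title.string for title in bs_titles]

# get_text() joins all nested text pieces, so it always returns a str.
texts = [title.get_text(separator=" ", strip=True) for title in bs_titles]

print(strings)
print(texts)
```

If some scraped titles come back as None, switching from .string to get_text() is a reasonable first thing to try.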
# Find every description paragraph, then keep only the text of each one
bs_descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")
descriptions = []
for desc in bs_descriptions:
    descriptions.append(desc.string)
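Once the titles and descriptions are plain Python lists, they can be paired up and saved, which is the "load" part of ETL. The following is only a sketch: the write_stories function is a hypothetical helper, and the two lists hold placeholder values rather than real scraped data.

```python
import csv
import io

# Placeholder values standing in for the scraped lists.
titles = ["First story title", "Second story title"]
descriptions = ["First story description", "Second story description"]


def write_stories(titles, descriptions, destination):
    """Write paired titles and descriptions as CSV rows to a file-like object."""
    writer = csv.writer(destination)
    writer.writerow(["title", "description"])
    # zip pairs items by position and stops at the shorter list.
    writer.writerows(zip(titles, descriptions))


# An in-memory buffer is used here; to write a real file, pass
# open("stories.csv", "w", newline="") instead.
buffer = io.StringIO()
write_stories(titles, descriptions, buffer)
print(buffer.getvalue())
```

Because zip stops at the shorter list, it is worth checking that both lists have the same length before writing, so no story is silently dropped.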
Let’s Recap!
- Web scraping is the automated process of retrieving data from the internet.
- ETL stands for extract, transform, load, and is a widely used industry acronym representing the process of taking data from one place, changing it up a little, and storing it in another place.
- HTML is the backbone of any web page, and understanding its structure will help you figure out how to get the data you need.
- Requests and Beautiful Soup are third-party Python libraries that can help you retrieve and parse data from the internet.
- Parsing data means preparing it for transformation or storage.
Teachers
Will Alexander
Scottish developer, teacher and musician based in Paris.
Raye Schiller
Raye Schiller is a backend software engineer based in New York City and has
an MEng. in Computer Science from Cornell University