Extract and Transform Data With Web Scraping - Learn Python Basics - OpenClassrooms


Extract and Transform Data With Web Scraping - Learn Python Basics - OpenClassrooms 08/01/2023, 15:49


Learn Python Basics

6 hours, Easy


Last updated on 5/24/22

Extract Data From the Web Using Python Libraries

Extract and Transform Data With Web Scraping

https://openclassrooms.com/en/courses/6902811-learn-python-basics/7091156-extract-and-transform-data-with-web-scraping Page 1 of 16

What Is Web Scraping?

Web scraping is the automated process of retrieving (or scraping) data from a website. Instead of manually collecting data, you can write Python scripts (a fancy way of saying a code process) that can collect the data from a website and save it to a .txt or .csv file.

Let’s say you’re a digital marketer, and you’re planning a campaign for a new type of blazer. It would be helpful to collect information like the price and description for similar blazers. Instead of manually searching and copy/pasting that information into a spreadsheet, you can write Python code to automatically collect data from the internet and save it to a CSV file.

Throughout these next two chapters, I’ll be taking you step by step through a web scraping exercise. You’ll learn some cool new things and get to practice some of the tools you’ve used already, like functions and variables. Make sure to follow along in your text editor. You’ll get much more out of this if you carry out the steps on your end along the way!

For this web scraping exercise, we’re going to extract data about news and communications from the UK government’s services and information website, transform the data into our desired format, and load the data into a CSV file.


Web scraping allows you to collect data from the web.

CSV stands for comma-separated values. The CSV file format is used to store tabular data (i.e., information structured as a table), such as a spreadsheet or database.
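
To make the idea of tabular data more concrete, here is a minimal sketch using Python's built-in csv module. The rows are made-up illustration values (they are not from the course), and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical rows for illustration: a header row followed by two records.
rows = [
    ["item", "category"],
    ["shirt", "clothing"],
    ["socks", "clothing"],
]

# Write the rows to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row becomes one line of text, and the values in a row are separated by commas, which is how a spreadsheet program knows where one column ends and the next begins.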

ETL: Extract, Transform, Load

ETL (extract, transform, load) is the “general procedure of copying data from one or more sources into a destination system which represents the data differently from the source” (Wikipedia). That’s just a fancy way to say that ETL is the process of taking data from one place, massaging it a little, and saving it in another place.

Web scraping is one form of ETL: you extract data from a website, transform it to fit the format you want, and load it into a CSV file.
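
The three ETL steps can be sketched as three small functions. This is only an illustration under stated assumptions: the function names are hypothetical, extract() returns hard-coded values instead of scraping a real page, and load() builds CSV text in memory instead of writing a file.

```python
import csv
import io


def extract():
    # Stand-in for scraping: hypothetical raw values, as they might come off a page.
    return ["  shirt ", "SOCKS", " blazer"]


def transform(raw_items):
    # Tidy each value: trim surrounding whitespace and lowercase it.
    return [item.strip().lower() for item in raw_items]


def load(items):
    # Save one item per row as CSV text (in memory here, a file in practice).
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item"])
    for item in items:
        writer.writerow([item])
    return buffer.getvalue()


csv_text = load(transform(extract()))
print(csv_text)
```

Keeping each step in its own function makes it easy to swap one out later, for example replacing the hard-coded extract() with real scraping code.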

To extract data from the web, you need to know a few basics about HTML, the backbone of each web page you see on the internet. If you haven't worked with HTML yet, don't worry; we'll go over what you need to know for web scraping below.

The Basics of Reading HTML Tags

For a complementary overview of HTML basics, don't hesitate to refer to the What is HTML? chapter of our Understanding the Web course.


HTML is the language used for all the web pages you see on the internet. If you right-click on any website (including this one) and press View Page Source, you’ll be able to see the HTML that is used to display what you are seeing.

HTML is built in a tree-like structure called the Document Object Model, or DOM. The DOM is made up of tags that can be nested inside each other. Different tags represent different parts of an HTML page, and most elements have both an opening and a closing tag.

An opening tag looks like <element_name>, and a closing tag usually has the same element_name with a / in front, e.g., </element_name>. For example, every web page has the opening tag <html> and the closing tag </html>. All the information you want in that element needs to be between the opening and closing tags.

HTML is built in a tree-like structure called the DOM.

There are many different types of tags, each representing a different element that you can put into your web page. Here are some common ones:


<p> for a paragraph element.

<b> for a bold element.

<a> for a hyperlink.

<div> for a new section or division.

The <html> tag represents the top level of the tree in the DOM, and every other tag is a child of the tag that encloses it.
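
To see the parent/child nesting for yourself, here is a small sketch that prints each tag indented by its depth in the tree. It uses html.parser from Python's standard library, not the libraries this chapter uses later; the TreePrinter class and the sample HTML are made up for illustration, and the depth counting assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag: record it, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Called for every closing tag: come back up one level.
        self.depth -= 1


html = "<html><body><div><p>Hello <b>world</b></p></div></body></html>"

printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

The output lists html first with no indentation, then body, div, p, and b, each indented one step further than its parent.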

A couple of important things to know about HTML tags are the class and id attributes, which give HTML elements identifiers. A class can be shared by many elements, while an id is meant to be unique within a page. For example, if you want to identify all the items of clothing with a single identifier, you can write the following:
html

<p class="clothing">shirt</p>
<p class="clothing">socks</p>

This way, you know that every element with the “clothing” class contains a clothing item. You can use this “clothing” class later to get all the items marked with the same class.
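
As a preview of the Beautiful Soup library introduced later in this chapter, here is a minimal sketch of collecting every element that shares a class. The third paragraph (class "food") is an assumption added here to show that elements with a different class are left out.

```python
from bs4 import BeautifulSoup

html = """
<p class="clothing">shirt</p>
<p class="clothing">socks</p>
<p class="food">apple</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) is used because class is a Python keyword.
clothing = [p.get_text() for p in soup.find_all("p", class_="clothing")]
print(clothing)  # ['shirt', 'socks']
```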

Similarly, to get all the titles and descriptions from the UK services and information site, we can find the class or ID that each of those items has. We can use the “View Page Source” button to see the HTML of the page and look for the identifier we need.

Sometimes HTML looks like it's all one line of text, which can be overwhelming and confusing. Running it through an HTML formatter makes your HTML look structured and organized!


If you scroll down on the source page or use ctrl + f to find the first news title, you can see that the title and description are right there in the HTML!

Here is a sample of some of the HTML that we need to extract from the web page:
html

<li class="gem-c-document-list__item ">

<a data-ecommerce-path="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-ecommerce-row="1"
   data-ecommerce-index="1"
   data-track-category="navFinderLinkClicked"
   data-track-action="News and communications.1"
   data-track-label="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-track-options='{"dimension28":20,"dimension29":"Restart of the UK in JAPAN campaign"}'
   class="gem-c-document-list__item-title gem-c-document-list__item-link"
   href="/government/news/restart-of-the-uk-in-japan-campaign--2">Restart of the UK in JAPAN campaign</a>

<p class="gem-c-document-list__item-description">The British Embassy, British Consulate-General and the British Council, in partnership with principal partners Jaguar Land Rover and Standard Chartered Bank are proud to announce the resumption of our ambitious UK in JAPAN c…</p>

Don’t get overwhelmed by all this code! You just have to look for the elements whose class names contain “title” and “description”. Don’t worry if you don’t find them right away; we’ll go into more detail on this later.

If you'd like to learn more about HTML, try a few chapters of the course Build Your First Web Pages With HTML and CSS.


Since this information is available on the web, we can write a Python script to extract the data we want straight from the page. We’ll be using the requests and Beautiful Soup libraries to help!

The Requests Library

To extract data from the website, we need to use the requests library. Remember that it provides functionality for making HTTP requests. We can use it since we’re trying to get data from a website that uses HTTP (e.g., http://google.com).

The requests library contains a .get() function that we can use to get the HTML from the site.

To apply this to the web scraping exercise, we’ll use the requests library to get the HTML of the UK news and communications page into our Python code. In the code below, we import the library, save the URL we want to web scrape in a url variable, and then use the .get() function to retrieve the HTML data. If you run the code below, you'll see the HTML source printed out in the console.
python

import requests

url = "https://www.gov.uk/search/news-and-communications"
page = requests.get(url)

# See the HTML source
print(page.content)
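
Requests can fail (a typo in the URL, no network, a server error), so in practice it can help to check the response before using it. The sketch below is one possible way to do that; fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary choice, not something the course prescribes.

```python
import requests


def fetch_html(url):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        # timeout stops the script from waiting forever on a slow server.
        page = requests.get(url, timeout=10)
        # Raise an exception for 4xx/5xx responses instead of returning an error page.
        page.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    # .text is the decoded string version of the response; .content is raw bytes.
    return page.text
```

You could then call fetch_html("https://www.gov.uk/search/news-and-communications") and only carry on with parsing when the result is not None.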

Even though we have all the HTML saved in our code, it still looks like a whole lot of mumbo jumbo. We have to figure out how to parse out the exact elements that we want, and we can use Beautiful Soup to do just that!

Parse means "to split a file or other input into pieces of data that can be easily manipulated or stored." (Wiktionary)

The Beautiful Soup Library

Now that we have the HTML source, we need to parse it. The way to parse the HTML is through the HTML attributes of class and id mentioned earlier.

We can use Beautiful Soup to help find the elements that can be identified with the class or the ID that we want to find. As with any third-party library, we’ll use pip to install Beautiful Soup.
shell

pip install beautifulsoup4

Next, we’ll import Beautiful Soup and create a “soup object” out of the HTML we got using requests:
python

import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/search/news-and-communications"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

The soup variable we made using Beautiful Soup has all these extra features that make it easier to get data from the HTML. Before we get data from the UK news and communications page, we’ll go through some of the awesome functionality of Beautiful Soup using the sample HTML snippet below.
html

<html><head><title>The Cutest Dogs Around</title></head>
<body>
<p class="title"><b>Best dog breeds</b></p>

<p class="dogs">There are many awesome dog breeds, the best ones are:
<a href="http://example.com/goldendoodle" class="breed" id="link1">GoldenDoodle</a>,
<a href="http://example.com/retriever" class="breed" id="link2">Golden Retriever</a> and
<a href="http://example.com/pug" class="breed" id="link3">Pug</a>;
</p>

</body>
</html>

Once we create a “soup object” out of this HTML, we can access all the elements of the page really easily!
python

# Get the HTML page title
>>> soup.title
<title>The Cutest Dogs Around</title>

# Get the string inside the HTML title
>>> soup.title.string
'The Cutest Dogs Around'

# Find all elements with an <a> tag
>>> soup.find_all('a')
[<a href="http://example.com/goldendoodle" class="breed" id="link1">GoldenDoodle</a>,
 <a href="http://example.com/retriever" class="breed" id="link2">Golden Retriever</a>,
 <a href="http://example.com/pug" class="breed" id="link3">Pug</a>]

# Find the element with an id of "link1"
>>> soup.find(id="link1")
<a href="http://example.com/goldendoodle" class="breed" id="link1">GoldenDoodle</a>

# Find all p elements with class "title" (the result is a list of elements)
>>> soup.find_all("p", class_="title")
[<p class="title"><b>Best dog breeds</b></p>]

This is just a taste of how Beautiful Soup helps you easily get the specific elements that you need from an HTML page. You can get items by tag, ID, or class.
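
Once you have an element, you will often want its text or one of its attributes rather than the whole tag. The sketch below shows one way to do that on a cut-down version of the dog breeds snippet (two links instead of three, to keep it short).

```python
from bs4 import BeautifulSoup

html = """
<p class="dogs">
<a href="http://example.com/goldendoodle" class="breed" id="link1">GoldenDoodle</a>
<a href="http://example.com/pug" class="breed" id="link3">Pug</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Build (name, url) pairs: get_text() reads the tag's text, ["href"] reads an attribute.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="breed")]
print(links)
```

Attributes are read with square brackets, like keys in a dictionary, so a["id"] and a["class"] work the same way as a["href"].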

Let’s apply this to the UK government services and information exercise. We already made the web page into a soup object using this line: soup = BeautifulSoup(page.content, 'html.parser').

Now let’s see what data we can get from the news and communications page. First, let’s get the titles of all the stories. After inspecting the HTML page, we can see that the titles of all the news stories are in link elements denoted by <a> tags and have the same class: gem-c-document-list__item-title.

Here’s an example:
html

<a data-ecommerce-path="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-ecommerce-row="1"
   data-ecommerce-index="1"
   data-track-category="navFinderLinkClicked"
   data-track-action="News and communications.1"
   data-track-label="/government/news/restart-of-the-uk-in-japan-campaign--2"
   data-track-options='{"dimension28":20,"dimension29":"Restart of the UK in JAPAN campaign"}'
   class="gem-c-document-list__item-title gem-c-document-list__item-link"
   href="/government/news/restart-of-the-uk-in-japan-campaign--2">Restart of the UK in JAPAN campaign</a>

We can use the tag name and the class together to get a list of all the title elements:
python

titles = soup.find_all("a", class_="gem-c-document-list__item-title")

This gives us a list of all the elements with the gem-c-document-list__item-title class. To view just the string value within each element, we can loop through each item in the list and print out its string.
python

>>> for title in titles:
...     print(title.string)
...
Restart of the UK in JAPAN campaign
Joint Statement on the use of violence and repression in Belarus
Foreign Secretary commits to more effective and accountable aid spending under new Foreign, Commonwealth and Development Office
[...]
UK military dog to receive PDSA Dickin Medal after tackling Al Qaeda insurgents.

Since this data is time-sensitive, you’ll see different titles here as you follow along.
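
One thing to watch out for: .string only returns text when a tag contains a single piece of text. If a tag has several children (for example, some bold text followed by plain text), .string is None, and get_text() is a safer choice. The HTML below is a made-up example, not taken from the gov.uk page.

```python
from bs4 import BeautifulSoup

html = '<a class="story"><b>Breaking:</b> new campaign</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="story")

# .string is None here because the <a> tag has more than one child node.
print(link.string)

# get_text() joins the text of every nested node; strip=True trims each piece.
print(link.get_text(" ", strip=True))
```

The first print shows None, and the second shows the full text, Breaking: new campaign.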

Now that you know how to get the page titles, try getting the page descriptions on your own! We’ll cover the rest of this below.

The descriptions are in a <p> tag and have the gem-c-document-list__item-description class, like below:
html

<p class="gem-c-document-list__item-description">Joint Statement by the Missions of the United States, the United Kingdom, Switzerland and the European Union on behalf of the EU Member States represented in Minsk on the use of violence and repression in Belarus</p>

To get all the descriptions, type:
python

descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")


Now that we've gone through this step by step, it's your turn to look at the code. The full code is available for download on the course page. Go through it on your own and check that you understand all of it.

Now that we have extracted the web data, we need to transform it to fit the format we want to save it in.

The libraries we're using in this part of the course each have their own public documentation. If you're feeling curious, don't hesitate to practice referring to the documentation as you get experience with these tools:

Requests
Beautiful Soup

Transform Data

You transform data when you convert it from one format into another. It can be as simple as converting a string to a list, or thousands of lists into dictionaries. Usually, it entails combining different data points. There are many ways to transform data, and ultimately the decisions depend on the type of data and the format you want it in.

Some examples of transforming data:

Converting a date field format from December 28, 2019 to 28/12/19.
Converting a money amount in dollars to euros.
Standardizing email or postal addresses.
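
The first example in the list above, reformatting a date, can be sketched with Python's built-in datetime module. The function name reformat_date is just an illustration, and the sketch assumes the input always looks like "December 28, 2019" with an English month name.

```python
from datetime import datetime


def reformat_date(text):
    """Convert a date like 'December 28, 2019' to day/month/year, e.g. '28/12/19'."""
    # %B = full month name, %d = day of the month, %Y = four-digit year.
    parsed = datetime.strptime(text, "%B %d, %Y")
    # %d/%m/%y = two-digit day, month, and year separated by slashes.
    return parsed.strftime("%d/%m/%y")


print(reformat_date("December 28, 2019"))  # 28/12/19
```

Parsing the text into a datetime object first, then formatting it, means the same function can produce any other output format by changing only the strftime pattern.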


For our UK news and communications example, we’ll save all the titles and descriptions from the HTML page into lists of strings. Instead of printing the string information to the console like earlier, we want to save the elements into a list:
python

# Find every title element, then keep only the text of each one
bs_titles = soup.find_all("a", class_="gem-c-document-list__item-title")
titles = []
for title in bs_titles:
    titles.append(title.string)

Here we start with an empty list called titles. Then we loop through all the elements in the bs_titles list and append just the string version of each title to our titles list. Now the titles list will be a list of strings of all the titles on the HTML page.
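
The same loop can also be written as a list comprehension, which builds the list in one line. To keep the sketch runnable without a network connection, it uses a cut-down, hypothetical stand-in for the real page's HTML (the two story titles and links are invented).

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical stand-in for the real page's HTML.
html = """
<a class="gem-c-document-list__item-title" href="/a">First story</a>
<a class="gem-c-document-list__item-title" href="/b">Second story</a>
"""
soup = BeautifulSoup(html, "html.parser")
bs_titles = soup.find_all("a", class_="gem-c-document-list__item-title")

# One-line equivalent of the loop: build the list of strings directly.
titles = [title.string for title in bs_titles]
print(titles)  # ['First story', 'Second story']
```

Both versions produce the same list; the explicit loop is easier to read when you are starting out, while the comprehension is more compact.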


Follow the same approach to extract and save the page descriptions.
python

# Find every description element, then keep only the text of each one
bs_descriptions = soup.find_all("p", class_="gem-c-document-list__item-description")
descriptions = []
for desc in bs_descriptions:
    descriptions.append(desc.string)
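
Since transforming data often means combining different data points, one possible next step is to pair each title with its description. The sketch below uses placeholder strings instead of real scraped values, and it assumes both lists are in the same order as the stories appear on the page.

```python
# Placeholder values standing in for the scraped lists.
titles = ["First story", "Second story"]
descriptions = ["Summary of the first story", "Summary of the second story"]

# Pair each title with its description; zip() stops at the shorter list.
stories = [
    {"title": title, "description": description}
    for title, description in zip(titles, descriptions)
]
print(stories[0])
```

A list of dictionaries like this maps naturally onto rows and columns, which is convenient when the data is later saved in a tabular format such as CSV.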

Let’s Recap!

Web scraping is the automated process of retrieving data from the internet.
ETL stands for extract, transform, load, and is a widely used industry acronym representing the process of taking data from one place, changing it up a little, and storing it in another place.
HTML is the backbone of any web page, and understanding its structure will help you figure out how to get the data you need.
Requests and Beautiful Soup are third-party Python libraries that can help you retrieve and parse data from the internet.
Parsing data means preparing it for transformation or storage.

Now that you've seen how to extract and transform web data, you’ll learn how to load web data!


Teachers
Will Alexander
Scottish developer, teacher and musician based in Paris.

Raye Schiller
Raye Schiller is a backend software engineer based in New York City and has
an MEng. in Computer Science from Cornell University
