Web Scraping

This document provides an overview of web scraping using Python, focusing on techniques for gathering data from websites. It covers essential concepts such as HTML, CSS, and the use of libraries like BeautifulSoup and requests for effective scraping. Key points include understanding the rules and limitations of web scraping, as well as practical examples for extracting data and images from web pages.

Uploaded by

jrmelania4
Web Scraping

Complete Python Bootcamp

● Web scraping is a general term for techniques involving automating the gathering of data from a website.
● In this section we will learn how to use Python to conduct web scraping tasks, such as downloading images or information off a website.

● In order to web scrape with Python we need to understand the basic concepts of how a website works.
● When a browser loads a website, the user gets to see what is known as the “front-end” of the website.

<!DOCTYPE html>
<html>
<head>
<title>Title on Browser Tab</title>
</head>
<body>
<h1> Website Header </h1>
<p> Some Paragraph </p>
</body>
</html>
["Germany", "France", "Spain"]

● Main things we need to understand:
○ Rules of Web Scraping
○ Limitations of Web Scraping
○ Basic HTML and CSS

● Rules of Web Scraping
○ Always try to get permission before scraping!
○ If you make too many scraping attempts or requests, your IP address could get blocked!
○ Some sites automatically block scraping software.
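
The rules above can be partly automated. The following is a minimal sketch, assuming the site publishes a robots.txt file (a convention the slides do not mention) and that pausing between requests is enough to stay under its limits. The robots.txt text, the `my-scraper` user agent, and the helper names are hypothetical; only the standard library is used.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real script would load it from the
# site itself, for example with RobotFileParser.set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="my-scraper"):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


parser = build_parser(ROBOTS_TXT)
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
]
to_fetch = allowed_urls(parser, candidates)
# Use the site's requested delay if it gives one, otherwise wait 1 second.
delay = parser.crawl_delay("my-scraper") or 1
print(to_fetch, delay)
```

In a real script, `fetch` would be a function that downloads the page. A robots.txt check does not replace asking for permission; it only tells you what the site owner has already published.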

● Limitations of Web Scraping
○ In general every website is unique, which means every web scraping script is unique.
○ A slight change or update to a website may completely break your web scraping script.
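
To illustrate how a small site update can break a script, here is a sketch using two made-up versions of the same page. The HTML strings, the `h1.headline` selector, and the `first_text` helper are hypothetical. The built-in `html.parser` backend is used so the example runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Two hypothetical versions of one page: the site renamed its headline
# tag and class between the old and the new version.
OLD_PAGE = '<html><body><h1 class="headline">Site News</h1></body></html>'
NEW_PAGE = '<html><body><h2 class="page-title">Site News</h2></body></html>'


def first_text(html, selector):
    """Return the text of the first match, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    if not matches:
        # Indexing matches[0] here would raise IndexError instead.
        return None
    return matches[0].get_text(strip=True)


old_result = first_text(OLD_PAGE, "h1.headline")
new_result = first_text(NEW_PAGE, "h1.headline")
print(old_result, new_result)
```

Checking for an empty result before indexing lets the script report that the page layout changed rather than crash with an unrelated-looking error.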

Main front end components of a website

HTML

<!DOCTYPE html>
<html>
<head>
<title>Title on Browser Tab</title>
</head>
<body>
<h1> Website Header </h1>
<p> Some Paragraph </p>
</body>
</html>

CSS

p{
color: red;
font-family: courier;
font-size: 160%;
}
.someclass{
color: green;
font-family: verdana;
font-size: 300%;
}
#someid{
color: blue;
}

JS

var values = ["Volvo", "Saab", "Fiat"];
var person = {
firstName: "John",
lastName: "Doe",
age: 50,
eyeColor: "blue"
};

● When viewing a website, the browser doesn’t show you all the source code behind the website; instead it shows you the HTML and some CSS and JS that the website sends to your browser.

● HTML is used to create the basic structure and content of a webpage.
● CSS is used for the design and style of a webpage: where elements are placed and how they look.
● JavaScript is used to define the interactive elements of a webpage.

● For effective basic web scraping we only need to have a basic understanding of HTML and CSS.
● Python can view these HTML and CSS elements programmatically, and then extract information from the website.
● Let’s explore HTML and CSS in more detail.

● HTML is Hypertext Markup Language and is present on every website on the internet.
● You can right-click on a website and select “View Page Source” to get an example.
● Let’s see a small example of HTML code.
<!DOCTYPE html>
<html>
<head>
<title>Title on Browser Tab</title>
</head>
<body>
<h1> Website Header </h1>
<p> Some Paragraph </p>
</body>
</html>

● CSS stands for Cascading Style Sheets.
● CSS gives “style” to a website, such as changing colors and fonts.
● CSS uses selectors (tag names, classes, and ids) to define which HTML elements will be styled.
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="styles.css">
<title>Some Title</title>
</head>
<body>
<p id="para2"> Some Text </p>
</body>
</html>

Example of the styles.css file:

#para2 {
color: red;
}
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="styles.css">
<title>Some Title</title>
</head>
<body>
<p class="cool"> Some Text </p>
</body>
</html>

Example of the styles.css file:

.cool {
color: red;
font-family: verdana;
}
p{
color: red;
font-family: courier;
font-size: 160%;
}
.someclass{
color: green;
font-family: verdana;
font-size: 300%;
}
#someid{
color: blue;
}

● Don’t worry about memorizing this! We’ll see lots of examples. Main ideas to note:
○ HTML contains the information
○ CSS contains the styling
○ We can use HTML tags and CSS selectors to locate specific information on a page

● To web scrape with Python we can use the BeautifulSoup and requests libraries.
● These are external libraries that are not part of Python’s standard library, so you need to install them with either conda or pip at your command line.

● Directly at your command line use:
○ pip install requests
○ pip install lxml
○ pip install bs4
● Or for Anaconda distributions, use conda install instead of pip install.
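
Once the libraries are installed, fetching and parsing a page might look like the sketch below. The `fetch_soup` helper and the `timeout` value are illustrative choices and do not come from the slides. The `"lxml"` parser name assumes the lxml package from the install list is present.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    # "lxml" is the parser installed above; "html.parser" ships with
    # Python and can be used instead if lxml is not available.
    return BeautifulSoup(response.text, "lxml")


# Example usage (needs network access, so it is left commented out):
# soup = fetch_soup("http://example.com")
# print(soup.title)
```

If both imports succeed without an `ImportError`, the installation worked.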

● Let’s work through some examples of web scraping with Python!

Setting Up For Web Scraping

● Install the necessary libraries
● Explore how to inspect elements and view source of a webpage
● Note: We will suggest you use Chrome so you can follow along exactly as we do, but these tools are available in all major browsers.
Grabbing a Page Title
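
A minimal sketch of grabbing a page title, using the small HTML example from earlier in this deck as an inline string so that no network request is needed. The built-in `html.parser` backend is an assumption made to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

HTML = """<!DOCTYPE html>
<html>
<head>
<title>Title on Browser Tab</title>
</head>
<body>
<h1> Website Header </h1>
<p> Some Paragraph </p>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# select() always returns a list of matching tags, even for one match.
title_tags = soup.select("title")
title_text = title_tags[0].get_text()
print(title_text)
```

With a live page, `HTML` would instead be the `.text` attribute of a `requests.get(...)` response.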
Grabbing All Elements of a Class
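
As a sketch of this idea, the snippet below collects the text of every element that carries a given class. The list markup is made up for illustration; the class name `cool` is borrowed from the CSS example earlier in the deck.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the three items share the class "cool".
HTML = """
<ul>
  <li class="cool">First item</li>
  <li class="plain">Second item</li>
  <li class="cool">Third item</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A leading dot in the selector means "match by class name".
cool_texts = [tag.get_text(strip=True) for tag in soup.select(".cool")]
print(cool_texts)
```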
● We previously mentioned a big part of web scraping with the BeautifulSoup library is figuring out what string syntax to pass into the soup.select() method.
● Let’s go through a table with some common examples (these make a lot of sense if you know CSS syntax).
Syntax                        Match Results
soup.select('div')            All elements with 'div' tag
soup.select('#some_id')       Elements containing id='some_id'
soup.select('.some_class')    Elements containing class='some_class'
soup.select('div span')       Any elements named span within a div element.
soup.select('div > span')     Any elements named span directly within a div element, with nothing in between.
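
The selector syntax in the table can be tried out against a small inline document. The markup below is hypothetical and was written so that each selector matches something different. The `texts` helper is only a convenience for this sketch.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="some_id">
  <span>direct child</span>
  <p><span class="some_class">nested deeper</span></p>
</div>
<span>outside any div</span>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


div_count = len(soup.select("div"))           # every <div> element
id_tag_name = soup.select("#some_id")[0].name  # element with that id
by_class = texts(".some_class")                # elements with that class
descendants = texts("div span")                # spans anywhere inside a div
children = texts("div > span")                 # spans directly inside a div

print(div_count, id_tag_name, by_class, descendants, children)
```

The last two selectors differ only in depth: `div span` also finds the span wrapped in a `<p>`, while `div > span` does not.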
Grabbing an Image
● Now that we understand how to grab text information based on tags and element names, let’s explore how to grab images from a website.
● Images on a website typically have their own URL link (ending in .jpg or .png)

● Beautiful Soup can scan a page, locate the <img> tags and grab these URLs.
● Then we can download the URLs as images and write them to the computer.
● Note: You should always check copyright permission before downloading and using an image from a website.
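
The two steps above (locate the `<img>` tags, then write the image to disk) might be sketched as follows. The page URL, the markup, and the helper names are hypothetical. Resolving relative `src` values with `urljoin` is an addition not covered in the slides.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "http://example.com/gallery/index.html"  # hypothetical page

HTML = """
<body>
  <img src="/images/cat.jpg" alt="A cat">
  <img src="photos/dog.png" alt="A dog">
  <img alt="No source attribute">
</body>
"""


def image_urls(html, page_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.select("img"):
        src = img.get("src")  # None when the attribute is missing
        if src:
            urls.append(urljoin(page_url, src))
    return urls


def save_image(content, path):
    """Write raw image bytes to disk; 'wb' mode is needed for binary data."""
    with open(path, "wb") as image_file:
        image_file.write(content)


urls = image_urls(HTML, PAGE_URL)
print(urls)

# With network access, an image could then be downloaded, for example:
# save_image(requests.get(urls[0], timeout=10).content, "cat.jpg")
```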
Working with Multiple Pages and Items

● We’ve seen how to grab elements one at a time, but realistically, we want to be able to grab multiple elements, most likely across multiple pages.
● This is where we can combine our prior Python knowledge with the web scraping libraries to create powerful scripts!

● We will use a site specifically designed to practice web scraping: www.toscrape.com
● We will practice grabbing elements across multiple pages.
● Let’s get started!
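
A rough sketch of the multi-page pattern follows. The URL template and the markup (`product_pod`, `star-rating`) are assumptions about how the practice site's book catalogue is laid out and should be checked against the live page source. The sample page and book titles are made up so the example runs offline.

```python
from bs4 import BeautifulSoup

# Assumed URL pattern for numbered catalogue pages (not from the slides).
PAGE_URL = "http://books.toscrape.com/catalogue/page-{}.html"

# Made-up page in the assumed layout: one <article> per book.
SAMPLE_PAGE = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="book-a.html" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a href="book-b.html" title="Book B">Book B</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="book-c.html" title="Book C">Book C</a></h3>
</article>
"""


def page_urls(first, last):
    """Build the URL of every page from first to last, inclusive."""
    return [PAGE_URL.format(number) for number in range(first, last + 1)]


def titles_with_rating(html, rating):
    """Return titles of books on one page whose star rating matches."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for book in soup.select(".product_pod"):
        # An empty list means this book does not have the given rating.
        if book.select(".star-rating." + rating):
            titles.append(book.select("a")[0]["title"])
    return titles


urls = page_urls(1, 3)
two_star = titles_with_rating(SAMPLE_PAGE, "Two")
print(urls[-1], two_star)
```

In a full script, a loop over `page_urls(...)` would download each page with requests and pass the response text to `titles_with_rating`, collecting the results into one list.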
Working with Multiple Pages
Web Scraping Exercises Overview
Web Scraping Exercises Solutions
Web Scraping Exercises Solutions - Part Two
