100% found this document useful (2 votes)

563 views35 pages

Web Scraping With Python Tutorials From A To Z

This document provides a tutorial on web scraping using Python. It discusses why Python is well-suited for web scraping due to its easy syntax, diverse library support, and large community. The tutorial then outlines the steps to build a basic web scraper in Python, including preparing the environment, importing libraries like BeautifulSoup and Selenium, defining functions to extract and export data, and best practices for web scraping. It also provides instructions for scraping images from websites.

Uploaded by

twixfix

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (2 votes)

563 views35 pages

Web Scraping With Python Tutorials From A To Z

Uploaded by

twixfix

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 35

Comprehensive Guide on

Data Collection:

Web Scraping with Python

From A to Z
Introduction
In today’s fast-changing business world, data gathering is essential
for every data-driven business, so the concept of web scraping
becomes more and more known to many. Data collection at scale
manually is a time-consuming task, so by automating the whole
process with web scraping, companies can focus on more vital tasks.

Getting started in web scraping is simple, except when it isn’t, which

is why you are here. It requires some time and effort to understand
the main principles of web scraping with Python. If you're interested
in starting web scraping, we can assure you that you’re in the right
place.

Python is the most popular programming language for web

scraping because it can handle almost all data extraction processes
smoothly. This is the reason why we prepared you two tutorials: how
to scrape text-based data and how to scrape images with Python.

Over the years we spent in the web scraping industry, we’ve

collected many technical insights for you to begin web scraping
easily. By following the steps outlined in our articles, you will
understand web scraping basics.

2
Why is Python Used For Web Scraping 4
Python advantages for web scraping 4
Python libraries used for web scraping 5

Python Web Scraping Tutorial: Step-By-Step 7

Building a web scraper: Python prepwork 7
Getting to the libraries 8
WebDrivers and browsers 8
Finding a cozy place for our Python web scraper 9
Importing and using libraries 9
Picking a URL 10
Defining objects and building lists 11
Extracting data with our Python web scraper 12
Exporting the data 15
More lists. More! 18
Web scraping with Python best practices 21
Scrape Images From a Website with Python 24
Libraries: new and old 24
Back to square one 25
Moving forward with defined functions 27
Time to extract images from the website 28
Putting it all together 29
Cleaning up 31

Conclusion 34

3
Why is Python Used For Web Scraping
Python is an interpreted, general-purpose, and high-level
programming language. Python is used for pretty much anything
you would need, from building web apps to data analysis. Python’s
creators gave attention to its syntax and code readability, so now it
allows developers to express concepts in fewer lines of code. This is
the main reason why Python was created in the first place.

You can find comparisons that Python is like a chameleon of the

programming world. Well, it’s not a lie. If you ever wonder where
Python is used for, it is everywhere, and you may not realize how
widespread it is. The most common fields where Python is
indispensable are web development, Machine Learning (ML) and
Artificial Intelligence (AI) development, data science, video game
development, and, of course, web scraping.

Python advantages for web scraping

The best part is that Python, compared to other programming
languages, is easy to learn, clear to read, and simple to write in.

Diverse libraries. P
ython has a fantastic collection of libraries such as
BeautifulSoup, Selenium, lxml, and much more. These libraries are a
perfect fit for web scraping and, also, for further work with extracted
data. You'll find more information about these libraries below.

Easy to use. T
o put it simply, Python is easy to code. Of course, it’s
wrong to believe that you would easily write a code for web scraping
without any programming knowledge. But, compared to other
languages, it’s much easier to use as you do not have to add
semicolons like “;” or curly-brackets “{}” everywhere. Many developers
agree that this is the reason why Python is less messy. Furthermore,
Python syntax is clear and easy to read. Developers can simply
navigate between different blocks in the code.

4
Saves time. A
s you probably know, web scraping was created to
simplify time-consuming tasks like collecting vast amounts of data
manually. Using Python for web scraping is similar because you are
able to write a little bit of code that completes a large task. Python
saves a bunch of developers’ time.

Community. A
s Python is one of the most popular programming
languages, it also has a very active community. Developers are
sharing their knowledge on various questions, so if you are
struggling while writing the code, you can always search for help.

Python libraries used for web scraping

Powerful frameworks and libraries, explicitly built for web scraping,
are the main reason why Python is a popular choice for data
extraction. We’ll take a closer look at all the essential libraries that
makes every developer’s web scraping tasks much easier.

Selenium. T
he primary purpose of Selenium is to test web
applications. However, it’s not limited to do just that as you can use
Selenium for web scraping. It automates script processes because,
for web scraping, the script needs to interact with a browser to
perform repetitive tasks like clicking, scrolling, etc.

BeautifulSoup. BeautifulSoup is widely used for parsing the HTML

files. According to their documentation, BeautifulSoup library is
precisely built for pulling data out of HTML and XML files. It saves
developers hours or even days of work.

Pandas. According to their official site, Pandas in web scraping is

used for data manipulation and analysis. Pandas features include
flexible reshaping and pivoting of data sets, reading and writing data
between in-memory data structures and different formats,
aggregating or transforming data, etc.

Requests (HTTP for Humans). This library is used for making various
types of HTTP requests like GET, POST. Python Requests library

5
retrieves only static content of the page. This library doesn’t parse the
HTML data extracted from web sites. However, r equests library can
be used for basic web scraping tasks.

lxml. This library is similar to BeautifulSoup because developers use

lxml for processing XML and HTML files in the Python language.

Now that we know what Python is good for, it should be easier to
understand its appeal, especially for web scraping.

6
Python Web Scraping Tutorial:
Step-By-Step
Python is one of the easiest ways to get started as it is an
object-oriented language. Python’s classes and objects are
significantly easier to use than in any other language. Additionally,
many libraries exist that make building a tool for web scraping in
Python an absolute breeze.

In this web scraping Python tutorial, we’ll outline everything needed

to get started with a simple application. It’ll acquire text-based data
from page sources, store it into a file and sort the output according
to set parameters. Options for more advanced features when using
Python for web scraping will be outlined at the very end with
suggestions for implementation. By following the steps outlined
below you will be able to understand how to do web scraping.

This web scraping tutorial will work for all operating systems. There
will be slight differences when installing either Python or
development environments but not in anything else.

Building a web scraper: Python prepwork

Throughout this entire web scraping tutorial, P
ython 3.4+ version will
be used. Specifically, we used 3.8.3 but any 3.4+ version should work
just fine.

For Windows installations, when installing Python make sure to

check “PATH installation”. PATH installation adds executables to the
default Windows Command Prompt executable search. Windows
will then recognize commands like “pip” or “python” without
requiring users to point it to the directory of the executable (e.g.
C:/tools/python/…/python.exe). If you have already installed Python
but did not mark the checkbox, just rerun the installation and select
modify. On the second screen select “Add to environment variables”.

7
Getting to the libraries
A barebones installation isn’t enough for web scraping. We’ll be
using three important libraries – BeautifulSoup v4, Pandas, and
Selenium.
To install these libraries, start the terminal of your OS. Type in:

pip install BeautifulSoup4 pandas selenium

Each of these installations take anywhere from a few seconds to a

few minutes to install. If your terminal freezes, gets stuck when
downloading or extracting the package or any other issue outside of
a total meltdown arises, use CTRL+C to abort any running
installation.

Further steps in this web scraping with Python tutorial assume a

successful installation of the previously listed libraries. If you receive a
“NameError: name * is not defined” it is likely that one of these
installations has failed.

WebDrivers and browsers

Every web scraper uses a browser as it needs to connect to the
destination URL. For testing purposes we highly recommend using a
regular browser (or not a headless one), especially for newcomers.
Seeing how written code interacts with the application allows simple
troubleshooting and debugging, and grants a better understanding
of the entire process.

Headless browsers can be used later on as they are more efficient for
complex tasks. Throughout this web scraping tutorial we will be
using the Chrome web browser although the entire process is
almost identical with Firefox.

To get started, use your preferred search engine to find the

“webdriver for Chrome” (or Firefox). Take note of your browser’s

8
current version. Download the webdriver that matches your
browser’s version.

If applicable, select the requisite package, download and unzip it.

Copy the driver’s executable file to any easily accessible directory.
Whether everything was done correctly, we will only be able to find
out later on.

Finding a cozy place for our Python web scraper

One final step needs to be taken before we can get to the
programming part of this web scraping tutorial: using a good coding
environment. There are many options, from a simple text editor, with
which simply creating a *.py file and writing the code down directly is
enough, to a fully-featured IDE (Integrated Development
Environment).

If you already have Visual Studio Code installed, picking this IDE
would be the simplest option. Otherwise, I’d highly recommend
PyCharm for any newcomer as it has very little barrier to entry and
an intuitive UI. We will assume that PyCharm is used for the rest of
the web scraping tutorial.

In PyCharm, right click on the project area and “New -> Python File”.
Give it a nice name!

Importing and using libraries

Time to put all those pips we installed previously to use:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

PyCharm might display these imports in grey as it automatically
marks unused libraries. Don’t accept its suggestion to remove
unused libs (at least yet).

9
We should begin by defining our browser. Depending on the
webdriver we picked back in “WebDriver and browsers” we should
type in:

driver =
webdriver.Chrome(executable_path='c:\path\to\windows\webdri
ver\executable.exe')

driver =
webdriver.Firefox(executable_path='/nix/path/to/webdriver/e
xecutable')

Picking a URL
Before performing our first test run, choose a URL. As this web
scraping tutorial is intended to create an elementary application, we
highly recommended picking a simple target URL:

● Avoid data hidden in Javascript elements. These sometimes

need to be triggered by performing specific actions in order to
display required data. Scraping data from Javascript elements
requires more sophisticated use of Python and its logic.

● Avoid image scraping. Images can be downloaded directly with

Selenium.

● Before conducting any scraping activities ensure that you are

scraping public data, and are in no way breaching third party
rights. Also, don’t forget to check robots.txt file for guidance.

Select the landing page you want to visit and input the URL into the
driver.get(‘URL’) parameter. Selenium requires that the connection
protocol is provided. As such, it is always necessary to attach “http://”
or “https://” to the URL.

10
driver.get('https://your.url/here?yes=brilliant')

Try doing a test run by clicking the green arrow at the bottom left or

by right clicking the coding environment and selecting ‘Run’.

Follow the red pointer

If you receive an error message stating that a file is missing then turn

double check if the path provided in the driver “webdriver.*” matches
the location of the webdriver executable. If you receive a message
that there is a version mismatch redownload the correct webdriver
executable.

Defining objects and building lists

Python allows coders to design objects without assigning an exact
type. An object can be created by simply typing its title and
assigning a value.

# Object is “results”, brackets make the object an empty

list.
# We will be storing our data here.
results = []

11
Lists in Python are ordered, mutable and allow duplicate members.
Other collections, such as sets or dictionaries, can be used but lists
are the easiest to use. Time to make more objects!

# Add the page source to the variable `content`.

content = driver.page_source

# Load the contents of the page, its source, into

BeautifulSoup

# class, which analyzes the HTML as a nested data structure

and allows to select

# its elements by using various selectors.

soup = BeautifulSoup(content)

Before we go on with, let’s recap on how our code should look so far:

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

content = driver.page_source
soup = BeautifulSoup(content)

Try rerunning the application again. There should be no errors

displayed. If any arise, a few possible troubleshooting options were
outlined in earlier chapters.

12
Extracting data with our Python web scraper
We have finally arrived at the fun and difficult part – extracting data
out of the HTML file. Since in almost all cases we are taking small
sections out of many different parts of the page and we want to store
it into a list, we should process every smaller section and then add it
to the list:

# Loop over all elements returned by the `findAll` call. It

has the filter `attrs` given

# to it in order to limit the data returned to those

elements with a given class only.

for element in soup.findAll(attrs={'class': 'list-item'}):

…

“soup.findAll” accepts a wide array of arguments. For the purposes of

this tutorial we only use “attrs” (attributes). It allows us to narrow
down the search by setting up a statement “if attribute is equal to X
is true then…”. Classes are easy to find and use therefore we shall use
those.

Let’s visit the chosen URL in a real browser before continuing. Open
the page source by using CTRL+U (Chrome) or right click and select
“View Page Source”. Find the “closest” class where the data is nested.
Another option is to press F12 to open DevTools to select Element
Picker. For example, it could be nested as:

<a href="...">This is a Title</a>

</h4>

13
Our attribute, “class”, would then be “title”. If you picked a simple
target, in most cases data will be nested in a similar way to the
example above. Complex targets might require more effort to get
the data out. Let’s get back to coding and add the class we found in
the source:

# Change ‘list-item’ to ‘title’.

for element in soup.findAll(attrs={'class': 'title'}):

...

Our loop will now go through all objects with the class “title” in the
page source. We will process each of them:

name = element.find('a')

Let’s take a look at how our loop goes through the HTML:

<a href="...">This is a Title</a>

</h4>

Our first statement (in the loop itself) finds all elements that match
tags, whose “class” attribute contains “title”. We then execute
another search within that class. Our next search finds all the <a>
tags in the document (<a> is included while partial matches like
<span> are not). Finally, the object is assigned to the variable “name”.

We could then assign the object name to our previously created list
array “results” but doing this would bring the entire <a href…> tag
with the text inside it into one element. In most cases, we would only
need the text itself without any additional tags.

# Add the object of “name” to the list “results”.

14
# `<element>.text` extracts the text in the element,
omitting the HTML tags.

results.append(name.text)

Our loop will go through the entire page source, find all the
occurrences of the classes listed above, then append the nested data
to our list:

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

content = driver.page_source

soup = BeautifulSoup(content)

for element in soup.findAll(attrs={'class': 'title'}):

name = element.find('a')
results.append(name.text)

Note that the two statements after the loop are indented. Loops
require indentation to denote nesting. Any consistent indentation
will be considered legal. Loops without indentation will output an
“IndentationError” with the offending statement pointed out with
the “arrow”.

Exporting the data

Even if no syntax or runtime errors appear when running our
program, there still might be semantic errors. You should check

15
whether we actually get the data assigned to the right object and
move to the array correctly.

One of the simplest ways to check if the data you acquired during
the previous steps is being collected correctly is to use “print”. Since
arrays have many different values, a simple loop is often used to
separate each entry to a separate line in the output:

for x in results:

print(x)

Both “print” and “for” should be self-explanatory at this point. We are

only initiating this loop for quick testing and debugging purposes. It
is completely viable to print the results directly:

print(results)

So far our code should look like this:

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

content = driver.page_source

soup = BeautifulSoup(content)

for a in soup.findAll(attrs={'class': 'class'}):

name = a.find('a')

if name not in results:

results.append(name.text)

for x in results:
print(x)

16
Running our program now should display no errors and display
acquired data in the debugger window. While “print” is great for
testing purposes, it isn’t all that great for parsing and analyzing data.

You might have noticed that “import pandas” is still greyed out so
far. We will finally get to put the library to good use. I recommend
removing the “print” loop for now as we will be doing something
similar but moving our data to a csv file.

df = pd.DataFrame({'Names': results})

df.to_csv('names.csv', index=False, encoding='utf-8')

Our two new statements rely on the pandas library. Our first
statement creates a variable “df” and turns its object into a
two-dimensional data table. “Names” is the name of our column
while “results” is our list to be printed out. Note that pandas can
create multiple columns, we just don’t have enough lists to utilize
those parameters (yet).

Our second statement moves the data of variable “df” to a specific

file type (in this case “csv”). Our first parameter assigns a name to our
soon-to-be file and an extension. Adding an extension is necessary as
“pandas” will otherwise output a file without one and it will have to
be changed manually. “index” can be used to assign specific starting
numbers to columns. “encoding” is used to save data in a specific
format. UTF-8 will be enough in almost all cases.

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

17
driver.get('https://your.url/here?yes=brilliant')

results = []

content = driver.page_source

soup = BeautifulSoup(content)

for a in soup.findAll(attrs={'class': 'class'}):

name = a.find('a')

if name not in results:

results.append(name.text)

df = pd.DataFrame({'Names': results})

df.to_csv('names.csv', index=False, encoding='utf-8')

No imports should now be greyed out and running our application

should output a “names.csv” into our project directory. Note that a
“Guessed At Parser” warning remains. We could remove it by
installing a third party parser but for the purposes of this Python web
scraping tutorial the default HTML option will do just fine.

More lists. More!

Many web scraping operations will need to acquire several sets of
data. For example, extracting just the titles of items listed on an
e-commerce website will rarely be useful. In order to gather
meaningful information and to draw conclusions from it at least two
data points are needed.

For the purposes of this tutorial, we will try something slightly

different. Since acquiring data from the same class would just mean
appending to an additional list, we should attempt to extract data
from a different class but, at the same time, maintain the structure of
our table.

Obviously, we will need another list to store our data in.

18
import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

other_results = []

for b in soup.findAll(attrs={'class': 'otherclass'}):

# Assume that data is nested in ‘span’.

name2 = b.find('span')

other_results.append(name.text)

Since we will be extracting an additional data point from a different

part of the HTML, we will need an additional loop. If needed we can
also add another “if” conditional to control for duplicate entries:

Finally, we need to change how our data table is formed:

df = pd.DataFrame({'Names': results, 'Categories':

other_results})

So far the newest iteration of our code should look something like
this:

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

19
driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

other_results = []

content = driver.page_source

for a in soup.findAll(attrs={'class': 'class'}):

name = a.find('a')

if name not in results:

results.append(name.text)

for b in soup.findAll(attrs={'class': 'otherclass'}):

name2 = b.find('span')

other_results.append(name.text)

df = pd.DataFrame({'Names': results, 'Categories':

other_results})
df.to_csv('names.csv', index=False, encoding='utf-8')

If you are lucky, running this code will output no error. In some cases
“pandas” will output an “ValueError: arrays must all be the same
length” message. Simply put, the length of the lists “results” and
“other_results” is unequal, therefore pandas cannot create a
two-dimensional table.

There are dozens of ways to resolve that error message. From

padding the shortest list with “empty” values, to creating
dictionaries, to creating two series and listing them out. We shall do
the third option:

series1 = pd.Series(results, name = 'Names')

series2 = pd.Series(other_results, name = 'Categories')

20
df = pd.DataFrame({'Names': series1, 'Categories':
series2})

df.to_csv('names.csv', index=False, encoding='utf-8')

Note that data will not be matched as the lists are of uneven length
but creating two series is the easiest fix if two data points are
needed. Our final code should look something like this:

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

other_results = []

content = driver.page_source

soup = BeautifulSoup(content)

for a in soup.findAll(attrs={'class': 'class'}):

name = a.find('a')

if name not in results:

results.append(name.text)

for b in soup.findAll(attrs={'class': 'otherclass'}):

name2 = b.find('span')

other_results.append(name.text)

series1 = pd.Series(results, name = 'Names')

series2 = pd.Series(other_results, name = 'Categories')

df = pd.DataFrame({'Names': series1, 'Categories':

series2})

21
df.to_csv('names.csv', index=False, encoding='utf-8')

Running it should create a csv file named “names” with two columns
of data.

Web scraping with Python best practices

Our first web scraper should now be fully functional. Of course it is so
basic and simplistic that performing any serious data acquisition
would require significant upgrades. Before moving on to greener
pastures, I highly recommend experimenting with some additional
features:

● Create matched data extraction by creating a loop that would

make lists of an even length.

● Scrape several URLs in one go. There are many ways to
implement such a feature. One of the simplest options is to
simply repeat the code above and change URLs each time.
That would be quite boring. Build a loop and an array of URLs
to visit.

● Another option is to create several arrays to store different sets

of data and output it into one file with different rows. Scraping
several different types of information at once is an important
part of e-commerce data acquisition.

● Once a satisfactory web scraper is running, you no longer need

to watch the browser perform its actions. Get headless versions
of either Chrome or Firefox browsers and use those to reduce
load times.

● Create a scraping pattern. Think of how a regular user would

browse the internet and try to automate their actions. New
libraries will definitely be needed. Use “import time” and “from
random import randint” to create wait times between pages.
Add “scrollto()” or use specific key inputs to move around the

22
browser. It’s nearly impossible to list all of the possible options
when it comes to creating a scraping pattern.

● Create a monitoring process. Data on certain websites might

be time (or even user) sensitive. Try creating a long-lasting loop
that rechecks certain URLs and scrapes data at set intervals.
Ensure that your acquired data is always fresh.

● Make use of the Python Requests library. Requests is a powerful

asset in any web scraping toolkit as it allows to optimize HTTP
methods sent to servers.

● Finally, integrate proxies into your web scraper. Using location

specific request sources allows you to acquire data that might
otherwise be inaccessible.

23
Scrape Images From a Website with
Python
Previously we outlined how to scrape text-based data with Python.
Throughout the tutorial we went through the entire process: all the
way from installing Python, getting the required libraries, setting
everything up to coding a basic web scraper and outputting the
acquired data into a .csv file. In the second installment, we will learn
how to scrape images from a website and store them in a set
location.

Before conducting image scraping please consult with legal

professionals to be sure that you are not breaching third party
rights, including but not limited to, intellectual property rights.

Libraries: new and old

We will need quite a few libraries in order to extract images from a
website. In the basic web scraper tutorial we used BeautifulSoup,
Selenium and pandas to gather and output data into a .csv file. We
will do all these previous steps to export scraped data (i.e. image
URLs).

Of course, gathering image URLs into a list is not enough. We will

use several other libraries to store the c
ontent of the URL into a
variable, convert it into an image object and then save it to a
specified location. Our newly acquired libraries are Pillow and
Requests.

If you missed the previous installment:

pip install beautifulsoup4 selenium pandas

Install these libraries as well:

24
#install the Pillow library (used for image processing)

pip install Pillow

#install the requests library (used to send HTTP requests)

pip install requests

Additionally, we will use built-in libraries to d

ownload images from a
website, mostly to store our acquired files in a specified folder.

Back to square one

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content)

Our data extraction process begins almost exactly the same (we will
import libraries as needed). We assign our preferred webdriver,
select the URL from which we’ll scrape image links and create a list
to store them in. As our Chrome driver arrives at the URL, we use the
variable ‘content’ to point to the page source and then “soupify” it
with BeautifulSoup.

In the previous tutorial, we performed all actions by using built-in

and library defined functions. While we could do another tutorial
without defining any functions, it is an extremely u
seful tool for just
about any project:

25
# Example on how to define a function and select custom
arguments for the

# code that goes into it.

def function_name(arguments):

# Function body goes here.

We’ll move our URL scraper into a defined function. Additionally,

we’ll reuse the same code as we used in the previous tutorial and
repurpose it to scrape full URLs.

Before

for a in soup.findAll(attrs={'class': 'class'}):

name = a.find('a')

if name not in results:

results.append(name.text)

After

#picking a name that represents the functions will be

useful later on.

def parse_image_urls(classes, location, source):

for a in soup.findAll(attrs={'class': classes}):

name = a.find(location)

if name not in results:

results.append(name.get(source))

Note that we now append in a different manner. Instead of

appending the text, we use another function ‘get()’ and add a new

26
parameter ‘source’ to it. We use ‘source’ to indicate the field in the
website where image links are stored . They will be nested in a ‘src’,
‘data-src’ or other similar HTML tags.

Moving forward with defined functions

Let’s assume that our target URL has image links nested in the
classes ‘blog-card__link’, ‘img’ and that the URL itself is in the ‘src’
attribute of the element. We would call our newly defined function
as such:

parse_image_urls("blog-card__link", "img", "src")

Our code should now look something like this:

import pandas as pd

from bs4 import BeautifulSoup

from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')

driver.get('https://your.url/here?yes=brilliant')

results = []

content = driver.page_source

soup = BeautifulSoup(content)

def parse_image_urls(classes, location, source):

for a in soup.findAll(attrs={'class': classes}):

name = a.find(location)

if name not in results:

results.append(name.get(source))

27
parse_image_urls("blog-card__link", "img", "src")

Since we sometimes want to export scraped data and we had

already used pandas before, we can check by outputting everything
into a “.csv” file. If needed, we can always check for any possible
semantic errors this way.

df = pd.DataFrame("links": results})
df.to_csv('links.csv', index=False, encoding='utf-8')

If we run our code right now, we should get a “links.csv” file

outputted right into the running directory.

Time to extract images from the website

Assuming that we didn’t run into any issues at the end of the
previous section, we can continue to download images from
websites.

#import library requests to send HTTP requests

import requests
for b in results:
#add the content of the url to a variable
image_content = requests.get(b).content

We will use the requests library to acquire the content stored in the
image URL. Our “for” loop above will iterate over our ‘results’ list.

#io manages file-related in/out operations

import io

#creates a byte object out of image_content and point the

variable image_file to it
image_file = io.BytesIO(image_content)

We are not done yet. So far the “image” we have above is just a
Python object.

28
#we use Pillow to convert our object to an RGB image

from PIL import Image

image = Image.open(image_file).convert('RGB')

We are still not done as we need to find a place to save our images.
Creating a folder “Test” for the purposes of this tutorial would be the
easiest option.

#pathlib let's us point to specific locations. Will be used

to save our images.

import pathlib

#hashlib allows us to get hashes. We will be using sha1 to

name our images.

import hashlib

#sets a file_path variable which is pointed to

#our directory and creates a file based on #the sha1 hash

of 'image_content'

#and uses .hexdigest to convert it into a string.

file_path = pathlib.Path('nix/path/to/test',
hashlib.sha1(image_content).hexdigest()[:10] + '.png')

image.save(file_path, "PNG", quality=80)

Putting it all together

Let’s combine all of the previous steps without any comments and
see how it works out. Note that pandas are greyed out as we are not
extracting data into any tables. We kept it in for the sake of
convenience. Use it if you need to see or double-check the outputs.

import hashlib
import io
from pathlib import Path
import pandas as pd
import requests

29
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

driver =
webdriver.Chrome(executable_path='/nix/path/to/webdriver/ex
ecutable')
driver.get('https://your.url/here?yes=brilliant')
driver.execute_script("window.scrollTo(0,
document.body.scrollHeight);")
results = []
content = driver.page_source
soup = BeautifulSoup(content)

def gets_url(classes, location, source):

results = []
for a in soup.findAll(attrs={'class': classes}):
name = a.find(location)
if name not in results:
results.append(name.get(source))
return results

driver.quit()

if __name__ == "__main__":
returned_results = gets_url("blog-card__link", "img",
"src")
for b in returned_results::
image_content = requests.get(b).content
image_file = io.BytesIO(image_content)
image = Image.open(image_file).convert('RGB')
file_path = pathlib.Path('nix/path/to/test',
hashlib.sha1(image_content).hexdigest()[:10] + '.png')
image.save(file_path, "PNG", quality=80)

For efficiency, we quit our webdriver by using “driver.quit()” after

retrieving the URL list we need. We no longer need that browser as
everything is stored locally.

Running our application will output one of two results:

● Images are outputted into the folder we selected by defining

the ‘file_path’ variable.

30
● Python outputs a 403 Forbidden HTTP error.

Obviously, getting the first result means we are finished. We would

receive the second outcome if we were to scrape our /blog/ page.
Fixing the second outcome will take a little bit of time in most cases,
although, at times, there can be more difficult scenarios.

Whenever we use the requests library to send a request to the

destination server, a default user-agent
“Python-urllib/version.number” is assigned. Some web services
might block these user-agents specifically as they are guaranteed to
be bots. Fortunately, the requests library allows us to assign any
user-agent (or an entire header) we want:

image_content = requests.get(b, headers={'User-agent':

'Mozilla/5.0'}).content

Adding a user-agent will be enough for most cases. There are more
complex cases where servers might try to check other parts of the
HTTP header in order to confirm that it is a genuine user.

Cleaning up
Our task is finished but the code is still messy. We can make our
application more readable and reusable by putting everything under
defined functions:

import io
import pathlib
import hashlib
import pandas as pd
import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

def get_content_from_url(url):
driver = webdriver.Chrome() # add "executable_path=" if
driver not in running directory

31
driver.get(url)
driver.execute_script("window.scrollTo(0,
document.body.scrollHeight);")
page_content = driver.page_source
driver.quit() # We do not need the browser instance for
further steps.
return page_content

def parse_image_urls(content, classes, location, source):

soup = BeautifulSoup(content)
results = []
for a in soup.findAll(attrs={"class": classes}):
name = a.find(location)
if name not in results:
results.append(name.get(source))
return results

def save_urls_to_csv(image_urls):
df = pd.DataFrame({"links": image_urls})
df.to_csv("links.csv", index=False, encoding="utf-8")

def get_and_save_image_to_file(image_url, output_dir):

response = requests.get(image_url,
headers={"User-agent": "Mozilla/5.0"})
image_content = response.content
image_file = io.BytesIO(image_content)
image = Image.open(image_file).convert("RGB")
filename = hashlib.sha1(image_content).hexdigest()[:10]
+ ".png"
file_path = output_dir / filename
image.save(file_path, "PNG", quality=80)

def main():
url = "https://your.url/here?yes=brilliant"
content = get_content_from_url(url)
image_urls = parse_image_urls(
content=content, classes="blog-card__link",
location="img", source="src",
)
save_urls_to_csv(image_urls)

for image_url in image_urls:

get_and_save_image_to_file(

32
image_url,
output_dir=pathlib.Path("nix/path/to/test"),
)

if name == "main": #only executes if imported as

main file
main()

Everything is now nested under clearly defined functions and can be
called when imported. Otherwise it will run as it had previously.

By using the code outlined above, you should now be able to
complete basic image scraping tasks such as to download all images
from a website in one go.

33
Conclusion
Python is a perfect fit for building web scrapers and extracting data
as it has a large selection of libraries, and an active community to
search for help if you have issues with coding. One of the most
important parts why use Python for web scraping is that Python is
easy to learn, clear to read, and simple to write in.

Building web scrapers, acquiring data, and drawing conclusions from

large amounts of information is inherently an interesting and
complicated process. We have provided several tutorials on how to
start web scraping in Python. From here onwards, you are on your
own.

By applying our advices on how to begin web scraping in Python,

you'll be able to implement web scraping in your company’s daily
tasks to get the required data for making data-driven decisions.
Don’t forget that the web scraping industry is constantly evolving, so
continuous learning is one of the successful web scraping elements.

If the whole web scraping process seems like a time-consuming task

and you would rather spend time analyzing data rather than
gathering it, you should contact us! Oxylabs are always ready to help
businesses with their data-gathering processes.

Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
100% (18)
Learning The Pandas Library Python Tools For Data Munging Analysis and Visual PDF
208 pages
Dash User Guide and Documentation
100% (2)
Dash User Guide and Documentation
376 pages
Practical Projects
100% (30)
Practical Projects
478 pages
Python Web Scraping Tutorial
92% (12)
Python Web Scraping Tutorial
65 pages
Web Scraping Cheat Sheet (2021), Python For Web Scraping by Frank Andrade Geek Culture - Medium
100% (3)
Web Scraping Cheat Sheet (2021), Python For Web Scraping by Frank Andrade Geek Culture - Medium
26 pages
How To Scrape Websites With Python and BeautifulSoup PDF
100% (2)
How To Scrape Websites With Python and BeautifulSoup PDF
10 pages
Python Cookbook-O'Reilly (2005)
No ratings yet
Python Cookbook-O'Reilly (2005)
1,078 pages
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
100% (3)
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
9 pages
Python Web Frameworks
100% (3)
Python Web Frameworks
83 pages
How To Scrap Any Website's Content Using Scrapy
0% (1)
How To Scrap Any Website's Content Using Scrapy
20 pages
Data Structure and Algorithms With Python
100% (14)
Data Structure and Algorithms With Python
369 pages
Getting Started With Beautiful Soup Build Your Own Web Scraper and Learn All About Web Scraping With Beautiful Soup (PDFDrive)
100% (2)
Getting Started With Beautiful Soup Build Your Own Web Scraper and Learn All About Web Scraping With Beautiful Soup (PDFDrive)
130 pages
Web Scraping Handbook
No ratings yet
Web Scraping Handbook
115 pages
Become A Web Scraping Pro: With These 5 Tips
No ratings yet
Become A Web Scraping Pro: With These 5 Tips
6 pages
Scrapy Tutorial PDF
100% (3)
Scrapy Tutorial PDF
114 pages
Web Scraping With Python - Sample Chapter
100% (3)
Web Scraping With Python - Sample Chapter
26 pages
Python Flask For Web Development. Build Web Applications... 2022
100% (2)
Python Flask For Web Development. Build Web Applications... 2022
114 pages
20+ Real-World Java and Python Projects To Expand Your Dev Portfolio
100% (1)
20+ Real-World Java and Python Projects To Expand Your Dev Portfolio
25 pages
Python Flask Framework A Step by Step Gu
No ratings yet
Python Flask Framework A Step by Step Gu
167 pages
Professional Python (2024)
100% (1)
Professional Python (2024)
256 pages
Flask Docs
100% (1)
Flask Docs
300 pages
Muhammad Yasoob Ullah Khalid - Practical Python Projects-Muhammad Yasoob Ullah Khalid (2021)
100% (3)
Muhammad Yasoob Ullah Khalid - Practical Python Projects-Muhammad Yasoob Ullah Khalid (2021)
329 pages
Advanced Python Tips
No ratings yet
Advanced Python Tips
50 pages
Practical Introduction To Web Scraping in Python
100% (1)
Practical Introduction To Web Scraping in Python
14 pages
Flask Tutorial
100% (5)
Flask Tutorial
71 pages
Python Learn Python in One Day and Learn It Well. Python For Beginners With Hands-On Project. (Learn Coding Fast With Hands-On Project Book 1) by LCF Publishing Jamie Chan (Publishing, LCF) (Z-Li
100% (2)
Python Learn Python in One Day and Learn It Well. Python For Beginners With Hands-On Project. (Learn Coding Fast With Hands-On Project Book 1) by LCF Publishing Jamie Chan (Publishing, LCF) (Z-Li
114 pages
Python 101 - Michael Driscoll
No ratings yet
Python 101 - Michael Driscoll
296 pages
The Python Master
100% (6)
The Python Master
192 pages
Coding With Replit Export
No ratings yet
Coding With Replit Export
421 pages
Super Advanced Python PDF
No ratings yet
Super Advanced Python PDF
120 pages
Python Master The Art of Design Patterns by Dusty Phillips, Chetan Giridhar, Sakis Kasampalis 1787125181 2016 SAMPLE
33% (3)
Python Master The Art of Design Patterns by Dusty Phillips, Chetan Giridhar, Sakis Kasampalis 1787125181 2016 SAMPLE
21 pages
Python Web Development Libraries
100% (1)
Python Web Development Libraries
67 pages
(David Phillips) Web Scraping With Excel How To U (B-Ok - CC)
100% (3)
(David Phillips) Web Scraping With Excel How To U (B-Ok - CC)
59 pages
Setting Up A Simple OCR Server: by Real Python 37 Comments
No ratings yet
Setting Up A Simple OCR Server: by Real Python 37 Comments
8 pages
Create Graphical User Interfaces With Python
100% (12)
Create Graphical User Interfaces With Python
156 pages
Serious Python - 2019 PDF
100% (8)
Serious Python - 2019 PDF
242 pages
Learn Python Visually
100% (9)
Learn Python Visually
134 pages
100 Skills To Better Python
100% (10)
100 Skills To Better Python
80 pages
30 Python Best Practices, Tips, and Tricks by Erik Van Baaren Python Land Medium
No ratings yet
30 Python Best Practices, Tips, and Tricks by Erik Van Baaren Python Land Medium
23 pages
Python For Data Analysis
100% (4)
Python For Data Analysis
227 pages
60 Python Projects With Source Code by Aman Kharwal Coders Camp Medium
No ratings yet
60 Python Projects With Source Code by Aman Kharwal Coders Camp Medium
13 pages
Building Web Applications With Flask - Sample Chapter
0% (1)
Building Web Applications With Flask - Sample Chapter
10 pages
Python Flask 2
No ratings yet
Python Flask 2
66 pages
Mastering Python For Data Science
86% (7)
Mastering Python For Data Science
572 pages
Scikit Learn Docs PDF
100% (3)
Scikit Learn Docs PDF
2,204 pages
Getting Started With TensorFlow - Js - TensorFlow - Medium
No ratings yet
Getting Started With TensorFlow - Js - TensorFlow - Medium
6 pages
Python Web Flask
No ratings yet
Python Web Flask
118 pages
Asyncio Event in Python
No ratings yet
Asyncio Event in Python
7 pages
Asyncio Documentation Documentation: Release 0.0
No ratings yet
Asyncio Documentation Documentation: Release 0.0
39 pages
Pandas Python For Data Science
100% (1)
Pandas Python For Data Science
1 page
Api-Demo: Platform-As-A-Service (Paas) Based Solution
No ratings yet
Api-Demo: Platform-As-A-Service (Paas) Based Solution
6 pages
Module 4
No ratings yet
Module 4
14 pages
Web Scraping Presentation With Images
No ratings yet
Web Scraping Presentation With Images
4 pages
Web Scraping With Python_ a Complete Step-By-Step Guide + Code _ by Anthony Heath _ Geek Culture _ Medium
No ratings yet
Web Scraping With Python_ a Complete Step-By-Step Guide + Code _ by Anthony Heath _ Geek Culture _ Medium
42 pages
Web Scraping
No ratings yet
Web Scraping
4 pages
Web Crawling - Python
No ratings yet
Web Crawling - Python
34 pages
Web Scraping Using Python
No ratings yet
Web Scraping Using Python
18 pages
Data Analysis by Web Scraping Using Python
No ratings yet
Data Analysis by Web Scraping Using Python
6 pages
Sample From A Python Guide For Web Scraping
No ratings yet
Sample From A Python Guide For Web Scraping
9 pages
Introduction To C Programming
No ratings yet
Introduction To C Programming
27 pages
Function Unit Test
No ratings yet
Function Unit Test
1 page
Programming Concepts and Embedded Programming in C and C++
No ratings yet
Programming Concepts and Embedded Programming in C and C++
55 pages
OOPS Sample Programs
No ratings yet
OOPS Sample Programs
14 pages
Ict Introduction 103
No ratings yet
Ict Introduction 103
4 pages
Phyton
No ratings yet
Phyton
12 pages
CS101 Solved MCQs Alot of Solved MCQs in One File
No ratings yet
CS101 Solved MCQs Alot of Solved MCQs in One File
86 pages
Chapter 7,8
No ratings yet
Chapter 7,8
12 pages
Lecture 2 - Introduction To Programming (Problem Solving)
No ratings yet
Lecture 2 - Introduction To Programming (Problem Solving)
19 pages
Dcit23a Midterm Reviewer
No ratings yet
Dcit23a Midterm Reviewer
7 pages
Omnithreadlibrary Sample
No ratings yet
Omnithreadlibrary Sample
85 pages
CH 7 MATLAB
No ratings yet
CH 7 MATLAB
41 pages
Control Statements in Python
No ratings yet
Control Statements in Python
5 pages
Python Programming
No ratings yet
Python Programming
151 pages
Heba DSBook 2022
No ratings yet
Heba DSBook 2022
337 pages
Chapter 1
No ratings yet
Chapter 1
42 pages
ICSE IX Iterative
No ratings yet
ICSE IX Iterative
5 pages
ISPSoft UM EN 20170614 PDF
No ratings yet
ISPSoft UM EN 20170614 PDF
840 pages
Tutorial Programacion USER RPL (HP50g)
No ratings yet
Tutorial Programacion USER RPL (HP50g)
46 pages
Problem-Solving Notes
No ratings yet
Problem-Solving Notes
2 pages
BSIT New 2019 PDF
No ratings yet
BSIT New 2019 PDF
92 pages
Question Paper
No ratings yet
Question Paper
4 pages
Labtask 3
No ratings yet
Labtask 3
14 pages
Computer-Aided Engineering PDF
No ratings yet
Computer-Aided Engineering PDF
104 pages
VB Note For Students
No ratings yet
VB Note For Students
41 pages
OCR GCSE Computer Science Solutions and Marking Scheme
No ratings yet
OCR GCSE Computer Science Solutions and Marking Scheme
3 pages
PDI Mod 2 Printable PDF
No ratings yet
PDI Mod 2 Printable PDF
29 pages
Module 2 - Repetition - Iteration Flowchart
No ratings yet
Module 2 - Repetition - Iteration Flowchart
18 pages
Decision Making & Looping
No ratings yet
Decision Making & Looping
32 pages
Notes For Fintech Assesment, Cheatsheet
No ratings yet
Notes For Fintech Assesment, Cheatsheet
19 pages