Python Web Scraping Tutorial
This tutorial will teach you various concepts of web scraping and make you comfortable
with scraping various types of websites and their data.
Audience
This tutorial will be useful for graduates, postgraduates, and research students who either
have an interest in this subject or have this subject as a part of their curriculum. The
tutorial suits the learning needs of both beginners and advanced learners.
Prerequisites
The reader must have basic knowledge of HTML, CSS, and JavaScript. He/she should
also be aware of the basic terminology used in web technology, along with Python
programming concepts. If you do not have knowledge of these concepts, we suggest you
go through tutorials on these concepts first.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or
republishing any contents or part of the contents of this e-book in any manner without written
consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our
website or its contents, including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at contact@tutorialspoint.com
1. Python Web Scraping – Introduction
Web scraping is an automatic process of extracting information from the web. This chapter
will give you an in-depth idea of web scraping, its comparison with web crawling, and why
you should opt for web scraping. You will also learn about the components and working of
a web scraper.
Data is indispensable for any programmer, and the basic requirement of every programming
project is a large amount of useful data. There are many ways to get such data: in general,
we may get it from a database, a data file or other sources. But what if we need a large
amount of data that is available online? One way to get such data is to manually search
(clicking away in a web browser) and save (copy-pasting into a spreadsheet or file) the
required data. This method is quite tedious and time consuming. Another way to get such
data is by using web scraping.
Web scraping, also called web data mining or web harvesting, is the process of
constructing an agent which can extract, parse, download and organize useful information
from the web automatically. In other words, instead of manually saving data from
websites, web scraping software will automatically load and extract data from multiple
websites as per our requirements.
Web crawling is basically used to index the information on a page using bots, aka
crawlers. It is also called indexing. On the other hand, web scraping is an automated way of
extracting information using bots, aka scrapers. It is also called data extraction.
To understand the difference between these two terms, remember that web crawling is
about visiting and indexing pages, while web scraping is about extracting specific pieces of
data from them. Web scraping is used in many fields; some common uses are listed below.
E-commerce Websites: Web scrapers can collect data specifically related to the price of a
product from various e-commerce websites for comparison.
Marketing and Sales Campaigns: Web scrapers can be used to get data like email ids,
phone numbers etc. for sales and marketing campaigns.
Search Engine Optimization (SEO): Web scraping is widely used by SEO tools
like SEMRush, Majestic etc. to tell businesses how they rank for the search keywords that
matter to them.
Data for Machine Learning Projects: Machine learning projects often depend upon web
scraping for the retrieval of their data.
Data for Research: Researchers can collect useful data for the purpose of their research
work, saving time through this automated process.
Extractor
The extractor processes the fetched HTML content and extracts the data into a semi-
structured format. It is also called a parser module and uses different parsing
techniques like regular expressions, HTML parsing, DOM parsing or artificial intelligence
for its functioning.
Storage Module
After extracting the data, we need to store it as per our requirement. The storage module
will output the data in a standard format that can be stored in a database or JSON or CSV
format.
We can understand the working of a web scraper in a few simple steps.
First, a web scraper downloads the requested contents from one or more web pages.
The data on websites is HTML and mostly unstructured, so in the next step the web scraper
parses and extracts structured data from the downloaded contents.
Then, the web scraper stores the extracted data in a format such as CSV or JSON, or in a
database.
After all these steps are successfully done, the web scraper analyzes the data thus
obtained.
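As a minimal sketch of these steps, using libraries introduced later in this tutorial (the URL and file name are only examples):

import csv
import requests
from bs4 import BeautifulSoup

# Step 1: download the content of a page.
html = requests.get('https://example.com/').text

# Step 2: parse and extract structured data from the downloaded content.
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.text

# Step 3: store the extracted data, here in a CSV file.
with open('scraped.csv', 'w', newline='') as f:
    csv.writer(f).writerow([title])

# Step 4: analyze the stored data as required.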
2. Python Web Scraping – Getting Started with Python
In the first chapter, we learned what web scraping is all about. In this chapter, let us
see how to implement web scraping using Python.
The Python programming language is gaining huge popularity, and the reasons that make
Python a good fit for web scraping projects are as below:
Syntax Simplicity
Python has a simple and readable syntax compared to many other programming languages.
This feature of Python makes testing easier, and a developer can focus more on
programming.
Inbuilt Modules
Another reason for using Python for web scraping is the inbuilt as well as external useful
libraries it possesses. We can perform many implementations related to web scraping by
using Python as the base for programming.
Python has huge support from the community because it is an open source programming
language.
Python can be used for various programming tasks ranging from small shell scripts to
enterprise web applications.
Installation of Python
A Python distribution is available for platforms like Windows, macOS and Unix/Linux. We need
to download only the binary code applicable to our platform to install Python. If the binary
code for our platform is not available, we must have a C compiler so that the
source code can be compiled manually.
Step 2: Download the zipped source code available for Unix/Linux from the official Python downloads page.
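The extraction and build steps are not shown above; the typical sequence, which is an assumption and may vary slightly by release, is:

./configure
make
make install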
You can find installed Python at the standard location /usr/local/bin and its libraries
at /usr/local/lib/pythonXX, where XX is the version of Python.
Step 2: Download the Windows installer python-XYZ.msi file, where XYZ is the version
we need to install.
Step 3: Now, save the installer file to your local machine.
Step 4: At last, run the MSI file to bring up the Python install wizard.
For updating the package manager, we can use the following command:
$ brew update
With the help of the following command, we can install Python3 on our MAC machine:
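$ brew install python3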
To make the installed Python available on the command line, add its directory to the PATH environment variable, for example:
PATH="$PATH:/usr/local/bin/python"
Running Python
We can start Python using any of the following three ways:
Interactive Interpreter
An operating system such as UNIX and DOS that is providing a command-line interpreter
or shell can be used for starting Python.
Enter python at the command line to start the interpreter; then we can start coding right away in the interactive interpreter.
$python # Unix/Linux
or
python% # Unix/Linux
or
C:> python # Windows/DOS
IDE for Windows: Windows has PythonWin IDE which has GUI too.
IDE for Macintosh: Macintosh has IDLE IDE which is downloadable as either MacBinary
or BinHex'd files from the main website.
3. Python Web Scraping – Python Modules for Web Scraping
In this chapter, let us learn various Python modules that we can use for web scraping.
Now, we need to create a directory that will represent the project, move into it, and
create a virtual environment inside it with the help of the following commands:
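The directory and environment names below are assumptions chosen to match the activation prompt shown further down (and assume the virtualenv package is already installed):

mkdir webscrap
cd webscrap
virtualenv websc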
Now, activate the virtual environment with the command given below. Once successfully
activated, you will see the name of it on the left hand side in brackets.
(base) D:\ProgramData\webscrap>websc\scripts\activate
For deactivating the virtual environment, we can use the following command:
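deactivate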
In this section, we are going to discuss useful Python libraries for web scraping.
Requests
It is a simple and efficient Python HTTP library used for accessing web pages. With the
help of Requests, we can get the raw HTML of web pages, which can
then be parsed for retrieving the data. Before using requests, let us understand its
installation.
Installing Requests
We can install it either in our virtual environment or in the global installation. With the
help of the pip command, we can easily install it as follows:
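pip install requests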
Example
In this example, we are making a GET HTTP request for a web page. For this we need to
first import requests library as follows:
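In [1]: import requests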
In the following line of code, we use requests to make a GET HTTP request for the URL
https://authoraditiagarwal.com/:
In [2]: r = requests.get('https://authoraditiagarwal.com/')
In [5]: r.text[:200]
This gives us the first 200 characters of the raw HTML of the page.
Urllib3
It is another Python library that can be used for retrieving data from URLs similar to the
requests library. You can read more on this at its technical documentation at
https://urllib3.readthedocs.io/en/latest/.
Installing Urllib3
Using the pip command, we can install urllib3 either in our virtual environment or in
global installation.
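pip install urllib3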
In the following example, we fetch a web page with urllib3 and parse it with BeautifulSoup to print its title:
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
r = http.request('GET', 'https://authoraditiagarwal.com')
soup = BeautifulSoup(r.data, 'lxml')
print (soup.title)
print (soup.title.text)
When you run this code, it prints the title element of the page and its text.
Selenium
It is an open source automated testing suite for web applications across different browsers
and platforms. It is not a single tool but a suite of software. We have selenium bindings
for Python, Java, C#, Ruby and JavaScript. Here we are going to perform web scraping by
using selenium and its Python bindings. You can learn more about Selenium with Java on
the link https://www.tutorialspoint.com/selenium.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like
Firefox, IE, Chrome, Remote etc. The current supported Python versions are 2.7, 3.5 and
above.
Installing Selenium
Using the pip command, we can install Selenium either in our virtual environment or in the
global installation.
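pip install selenium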
As selenium requires a driver to interface with the chosen browser, we need to download
it. The following table shows different browsers and their links for downloading the same.
Chrome https://sites.google.com/a/chromium.org/chromedriver/downloads
Edge https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox https://github.com/mozilla/geckodriver/releases
Safari https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Example
This example shows web scraping using Selenium. Selenium can also be used for testing,
which is called Selenium testing.
After downloading the particular driver for the specified version of browser, we need to do
programming in Python.
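First, we need to import webdriver from the selenium module:

from selenium import webdriver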
Now, provide the path of web driver which we have downloaded as per our requirement:
path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path = path)
Now, provide the url which we want to open in that web browser now controlled by our
Python script.
browser.get('https://authoraditiagarwal.com/leadershipmanagement')
We can also scrape a particular element by providing its XPath, as in lxml.
browser.find_element_by_xpath('/html/body').click()
You can check the browser, controlled by Python script, for output.
Scrapy
Scrapy is a fast, open-source web crawling framework written in Python, used to extract
data from web pages with the help of selectors based on XPath. Scrapy was first
released on June 26, 2008 under a BSD license, with the milestone 1.0 release in June
2015. It provides us all the tools we need to extract, process and structure data from
websites.
Installing Scrapy
Using the pip command, we can install Scrapy either in our virtual environment or in the
global installation.
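pip install scrapy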
4. Python Web Scraping — Legality of Web Scraping
With Python, we can scrape any website or particular elements of a web page, but do you
have any idea whether it is legal or not? Before scraping any website, we must know about
the legality of web scraping. This chapter explains the concepts related to the legality of
web scraping.
Introduction
Generally, if you are going to use the scraped data for personal use, then there may not
be any problem. But if you are going to republish that data, then before doing so
you should make a download request to the owner or do some background research about
the policies as well as about the data you are going to scrape.
Analyzing robots.txt
Actually, most publishers allow programmers to crawl their websites to some extent.
In other words, publishers want only specific portions of their websites to be crawled. To define
this, websites must put some rules stating which portions can be crawled and which
cannot. Such rules are defined in a file called robots.txt.
robots.txt is a human-readable file used to identify the portions of the website that crawlers
are allowed, as well as not allowed, to scrape. There is no enforced standard format for the
robots.txt file, and the publishers of a website can modify it as per their needs. We can check
the robots.txt file for a particular website by appending a slash and robots.txt after the URL of
that website. For example, if we want to check it for Google.com, then we need to type
https://www.google.com/robots.txt and we will get something as follows:
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on……..
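We can also check such rules programmatically with Python's built-in urllib.robotparser module; a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.google.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.google.com/'))        # True: the home page is not disallowed
print(rp.can_fetch('*', 'https://www.google.com/search'))  # False: /search is disallowed for all agents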
Some of the most common rules that are defined in a website’s robots.txt file are as
follows:
User-agent: BadCrawler
Disallow: /
The above rule means the robots.txt file asks a crawler with BadCrawler user agent not
to crawl their website.
User-agent: *
Crawl-delay: 5
Disallow: /trap
The above rules mean that the robots.txt file asks every crawler to wait 5 seconds between
download requests to avoid overloading the server, and the /trap link will try to block
malicious crawlers who follow disallowed links. There are many more rules that can be
defined by the publisher of the website as per their requirements. Some of them are
discussed here:
Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml
Sitemap: https://www.microsoft.com/en-us/store/locations/store-locations-sitemap.xml
The above content shows that a sitemap lists the URLs of a website and allows a
webmaster to specify additional information about each URL, such as its last updated date,
how often its contents change, and its importance in relation to other URLs.
Checking the size of a website, for example with a site: search in Google, also helps: if the
search shows only around 60 results, it is not a big website and crawling it would not lead
to efficiency issues.
Example
In this example we are going to check the technology used by the website
https://authoraditiagarwal.com with the help of Python library builtwith. But before using
this library, we need to install it as follows:
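pip install builtwith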
Now, with the help of the following simple lines of code, we can check the technology used by
a particular website:
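In [1]: import builtwith
In [2]: builtwith.parse('https://authoraditiagarwal.com')

The call returns a dictionary mapping technology categories to the technologies detected on the site.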
Example
In this example we are going to check the owner of the website say microsoft.com with
the help of Whois. But before using this library, we need to install it as follows:
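The library commonly used for this is the python-whois package, which is imported below as whois:

pip install python-whois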
Now, with the help of the following simple lines of code, we can check the owner information of
a particular website:
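A sketch of the corresponding code; the output below is an excerpt of what it prints:

In [1]: import whois
In [2]: print(whois.whois('microsoft.com'))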
],
"emails": [
"abusecomplaints@markmonitor.com",
"domains@microsoft.com",
"msnhst@microsoft.com",
"whoisrelay@markmonitor.com"
],
5. Python Web Scraping – Data Extraction
Analyzing a web page means understanding its structure. Now, the question arises: why is it
important for web scraping? In this chapter, let us understand this in detail.
Regular Expression
Regular expressions are a highly specialized mini-language embedded in Python, which we can
use through the re module of Python. They are also called REs, regexes or regex patterns. With
the help of regular expressions, we can specify some rules for the possible set of strings we
want to match in the data.
If you want to learn more about regular expression in general, go to the link
https://www.tutorialspoint.com/automata_theory/regular_expressions.htm and if you
want to know more about re module or regular expression in Python, you can follow the
link https://www.tutorialspoint.com/python/python_reg_expressions.htm.
Example
In the following example, we are going to scrape data about India from
http://example.webscraping.com after matching the contents of <td> with the help of
regular expression.
import re
import urllib.request
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>',text)
Output
The output is a list of the values captured from the <td class="w2p_fw"> cells, that is, the
details about the country India extracted by using the regular expression.
Beautiful Soup
Suppose we want to collect all the hyperlinks from a web page, then we can use a parser
called BeautifulSoup which can be known in more detail at
https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple words,
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It can be used
with requests, because it needs an input (document or URL) to create a soup object, as it
cannot fetch a web page by itself. You can use the following Python script to gather the
title of web page and hyperlinks.
Example
Note that in this example, we are extending the above example implemented with the requests
Python module. We are using r.text for creating a soup object which will further be used
to fetch details like the title of the webpage.
import requests
from bs4 import BeautifulSoup
In the following line of code, we use requests to make a GET HTTP request for the URL
https://authoraditiagarwal.com/:
r = requests.get('https://authoraditiagarwal.com/')
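The rest of the script is not shown above; a minimal sketch that builds the soup object from r.text and prints the title and the hyperlinks could look like this:

soup = BeautifulSoup(r.text, 'lxml')
print(soup.title.text)

# Collect all hyperlinks found on the page.
for link in soup.find_all('a'):
    print(link.get('href'))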
Output
The output shows the title of the web page followed by the hyperlinks found on it.
Lxml
Another Python library we are going to discuss for web scraping is lxml. It is a high-
performance HTML and XML parsing library. It is comparatively fast and straightforward.
You can read about it more on https://lxml.de/.
Installing lxml
Using the pip command, we can install lxml either in our virtual environment or in global
installation.
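pip install lxml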
First, we need to import the requests and html from lxml library as follows:
import requests
from lxml import html
url = 'https://authoraditiagarwal.com/leadershipmanagement/'
Now we need to provide the path (XPath) to a particular element of that web page:
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
response = requests.get(url)
byte_string = response.content
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)
print(tree[0].text_content())
Output
The output is the text content of the paragraph selected by the given XPath.
6. Python Web Scraping – Data Processing
In earlier chapters, we learned about extracting the data from web pages or web scraping
by various Python modules. In this chapter, let us look into various techniques to process
the data that has been scraped.
Introduction
To process the data that has been scraped, we must store the data on our local machine
in a particular format like spreadsheet (CSV), JSON or sometimes in databases like MySQL.
import requests
from bs4 import BeautifulSoup
import csv
In the following line of code, we use requests to make a GET HTTP request for the URL
https://authoraditiagarwal.com/:
r = requests.get('https://authoraditiagarwal.com/')
Now, with the help of next lines of code, we will write the grabbed data into a CSV file
named dataprocessing.csv.
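The writing code is not shown above; a minimal sketch that grabs the page title and writes it to dataprocessing.csv might be:

soup = BeautifulSoup(r.text, 'lxml')
with open('dataprocessing.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title'])
    writer.writerow([soup.title.text])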
After running this script, the textual information or the title of the webpage will be saved
in the above mentioned CSV file on your local machine.
Similarly, we can save the collected information in a JSON file. The following is an easy-to-
understand Python script for doing the same, in which we grab the same
information as we did in the last Python script, but this time the grabbed information is saved
in JSONFile.txt by using the json Python module.
import requests
from bs4 import BeautifulSoup
import csv
import json
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
y = json.dumps(soup.title.text)
with open('JSONFile.txt', 'wt') as outfile:
    json.dump(y, outfile)
After running this script, the grabbed information i.e. title of the webpage will be saved in
the above mentioned text file on your local machine.
We can follow the following steps for storing data in AWS S3:
Step1: First we need an AWS account which will provide us the secret keys for using in
our Python script while storing the data. It will create a S3 bucket in which we can store
our data.
Step2: Next, we need to install boto3 Python library for accessing S3 bucket. It can be
installed with the help of the following command:
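pip install boto3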
Step3: Next, we can use the following Python script for scraping data from web page and
saving it to AWS S3 bucket.
First, we need to import Python libraries for scraping, here we are working with requests,
and boto3 saving data to S3 bucket.
import requests
import boto3
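Next, we read the contents of a web page into a variable named data that will later be uploaded to S3 (the URL here is only an example):

data = requests.get('https://authoraditiagarwal.com/').text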
s3 = boto3.client('s3')
bucket_name = "our-content"
s3.create_bucket(Bucket=bucket_name, ACL='public-read')
s3.put_object(Bucket=bucket_name, Key='', Body=data, ACL="public-read")
Now you can check the bucket with name our-content from your AWS account.
With the help of following steps, we can scrape and process data into MySQL table:
Step1: First, by using MySQL we need to create a database and a table in which we want
to save our scraped data. In this example we create a database named Scrap and a table
named Scrap_pages with columns for the title and the content of each scraped page.
Step2: Next, we need to deal with Unicode. Note that MySQL does not handle Unicode by
default; we need to turn this on by changing the default character set (for example to
utf8mb4, via ALTER DATABASE and ALTER TABLE statements) for the database, for the table
and for both of the columns.
Step3: Now, integrate MySQL with Python. For this, we will need PyMySQL which can be
installed with the help of the following command
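pip install PyMySQL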
Step4: Now, our database named Scrap, created earlier, is ready to save the data scraped
from the web into the table named Scrap_pages. In our example we are going to
scrape data from Wikipedia, and it will be saved into our database.
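The script below relies on some setup that is not shown here. A minimal sketch of it, assuming a local MySQL server and the Scrap database and Scrap_pages table created above (credentials are placeholders), might look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import pymysql

# Connect to the local MySQL server and select the Scrap database.
conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None,
                       db='Scrap', charset='utf8')
cur = conn.cursor()

def store(title, content):
    # Insert one scraped page into the Scrap_pages table.
    cur.execute('INSERT INTO Scrap_pages (title, content) VALUES (%s, %s)',
                (title, content))
    conn.commit()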
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org' + articleUrl)
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    content = bs.find('div', {'id': 'mw-content-text'}).find('p').get_text()
    store(title, content)
    return bs.find('div', {'id': 'bodyContent'}).findAll('a',
        href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
This will save the data gathered from Wikipedia into the table named Scrap_pages. If you are
familiar with MySQL and web scraping, then the above code would not be tough to
understand.
If you are not familiar with PostgreSQL then you can learn it at
https://www.tutorialspoint.com/postgresql/. And with the help of following command we
can install psycopg2 Python library:
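pip install psycopg2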
7. Python Web Scraping – Processing Images and Videos
Web scraping usually involves downloading, storing and processing the web media
content. In this chapter, let us understand how to process the content downloaded from
the web.
Introduction
The web media content that we obtain during scraping can be images, audio and video
files, in the form of non-web pages as well as data files. But can we trust the downloaded
data, especially regarding the extension of the data we are going to download and store on
our computer? This makes it essential to know about the type of data we are going
to store locally.
import requests
Now, provide the URL of the media content we want to download and store locally.
url = "https://authoraditiagarwal.com/wp-
content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
r = requests.get(url)
With the help of following line of code, we can save the received content as .png file.
with open("ThinkBig.png",'wb') as f:
f.write(r.content)
After running the above Python script, we will get a file named ThinkBig.png, which would
have the downloaded image.
With the help of following Python script, using urlparse, we can extract the filename from
URL:
from urllib.parse import urlparse
import os
url = "https://authoraditiagarwal.com/wp-
content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
a = urlparse(url)
a.path
'/wp-content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg'
os.path.basename(a.path)
'MetaSlider_ThinkBig-1080x180.jpg'
Once you run the above script, we will get the filename from URL.
import requests
Now, we need to provide the URL of the media content we want to download and store
locally.
url = "https://authoraditiagarwal.com/wp-
content/uploads/2018/05/MetaSlider_ThinkBig-1080x180.jpg"
r = requests.get(url, allow_redirects=True)
Now, we can check what type of information about the content is provided by the web server by examining the response headers.
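A simple way to list the header names (a sketch):

for header in r.headers.keys():
    print(header)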
Date
Server
Upgrade
Connection
Last-Modified
Accept-Ranges
Content-Length
Keep-Alive
Content-Type
With the help of following line of code we can get the particular information about content
type, say content-type:
print (r.headers.get('content-type'))
image/jpeg
With the help of the following line of code, we can get particular information about the
content, say the ETag:
print (r.headers.get('ETag'))
None
print (r.headers.get('content-length'))
12636
With the help of the following line of code, we can get particular information about the
content, say the Server header:
print (r.headers.get('Server'))
Apache
For this Python script, we need to install the Python library named Pillow, a fork of the Python
Imaging Library (PIL), having useful functions for manipulating images. It can be installed with
the help of the following command:
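pip install pillow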
The following Python script will create a thumbnail of the image and will save it to the
current directory by prefixing thumbnail file with Th_
import glob
from PIL import Image
for infile in glob.glob("ThinkBig.png"):
    img = Image.open(infile)
    img.thumbnail((128, 128), Image.ANTIALIAS)
    if infile[0:3] != "Th_":
        img.save("Th_" + infile, "png")
The above code is very easy to understand and you can check for the thumbnail file in the
current directory.
After running the screenshot script, you can check your current directory for the screenshot.png file.
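The script itself is not shown above; a minimal sketch using Selenium, reusing the Chromedriver path from the earlier examples (an assumption), is:

from selenium import webdriver

path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
browser = webdriver.Chrome(executable_path=path)
browser.get('https://authoraditiagarwal.com/')

# Save a screenshot of the rendered page to the current directory.
browser.save_screenshot('screenshot.png')
browser.quit()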
The following Python script will generate thumbnail of the video and will save it to our local
directory:
import subprocess

video_MP4_file = r"C:\Users\gaurav\desktop\solar.mp4"
thumbnail_image_file = 'thumbnail_solar_video.jpg'

subprocess.call(['ffmpeg', '-i', video_MP4_file, '-ss', '00:00:20.000',
                 '-vframes', '1', thumbnail_image_file, "-y"])
After running the above script, we will get the thumbnail named
thumbnail_solar_video.jpg saved in our local directory.
Now, after successfully installing moviepy, with the help of the following script we can
convert an MP4 to an MP3.
import moviepy.editor as mp
clip = mp.VideoFileClip(r"C:\Users\gaurav\Desktop\1234.mp4")
clip.audio.write_audiofile("movie_audio.mp3")
The above script will save the audio MP3 file in the local directory.
8. Python Web Scraping – Dealing with Text
In the previous chapter, we have seen how to deal with videos and images that we obtain
as a part of web scraping content. In this chapter we are going to deal with text analysis
by using Python library and will learn about this in detail.
Introduction
You can perform text analysis by using the Python library called Natural Language Toolkit
(NLTK). Before proceeding into the concepts of NLTK, let us understand the relation
between text analysis and web scraping.
Analyzing the words in the text can lead us to know about which words are important,
which words are unusual, how words are grouped. This analysis eases the task of web
scraping.
Installing NLTK
You can use the following command to install NLTK in Python:
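pip install nltk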
If you are using Anaconda, then NLTK can be installed by using the following
conda command:
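conda install -c anaconda nltk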
import nltk
Now, with the help of following command NLTK data can be downloaded:
nltk.download()
Installation of all available packages of NLTK will take some time, but it is always
recommended to install all the packages.
gensim: A robust semantic modeling library which is useful for many applications. It
can be installed by the following command:
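pip install gensim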
pattern: Used to make gensim package work properly. It can be installed by the
following command:
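pip install pattern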
Tokenization
The process of breaking the given text into smaller units called tokens is called
tokenization. These tokens can be words, numbers or punctuation marks. It is also
called word segmentation.
Example
Input: Ram, Mohan and Sohan are my friends.
NLTK module provides different packages for tokenization. We can use these packages as
per our requirement. Some of the packages are described here:
sent_tokenize package: This package will divide the input text into sentences. You can
use the following command to import this package:
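from nltk.tokenize import sent_tokenize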
word_tokenize package: This package will divide the input text into words. You can use
the following command to import this package:
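from nltk.tokenize import word_tokenize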
WordPunctTokenizer package: This package will divide the input text as well as the
punctuation marks into words. You can use the following command to import this package:
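from nltk.tokenize import WordPunctTokenizer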
Stemming
In any language, there are different forms of a word. A language includes lots of variations
due to grammatical reasons. For example, consider the words democracy,
democratic, and democratization. For machine learning as well as for web scraping
projects, it is important for machines to understand that these different words have the
same base form. Hence, we can say that it can be useful to extract the base forms of the
words while analyzing the text.
This can be achieved by stemming which may be defined as the heuristic process of
extracting the base forms of the words by chopping off the ends of words.
NLTK module provides different packages for stemming. We can use these packages as
per our requirement. Some of these packages are described here:
PorterStemmer package: This package uses Porter's algorithm. For example, after giving the
word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.
LancasterStemmer package: This package uses Lancaster's algorithm. For example, after giving
the word ‘writing’ as the input to this stemmer, the output would be the word ‘writ’ after stemming.
SnowballStemmer package: This package uses the Snowball algorithm. For example, after giving
the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.
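The corresponding NLTK imports and a small usage sketch:

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

print(PorterStemmer().stem('writing'))             # write
print(LancasterStemmer().stem('writing'))          # writ
print(SnowballStemmer('english').stem('writing'))  # write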
Lemmatization
Another way to extract the base form of words is by lemmatization, normally aiming to
remove inflectional endings by using vocabulary and morphological analysis. The base
form of any word after lemmatization is called lemma.
WordNetLemmatizer package: It will extract the base form of the word depending upon
whether it is used as a noun or as a verb. You can use the following command to import this
package:
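from nltk.stem import WordNetLemmatizer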
Chunking
Chunking, which means dividing the data into small chunks, is one of the important
processes in natural language processing to identify parts of speech and short phrases
like noun phrases. Chunking labels groups of tokens; we can get the structure of
the sentence with the help of the chunking process.
Example
In this example, we are going to implement Noun-Phrase chunking by using NLTK Python
module. NP chunking is a category of chunking which will find the noun phrases chunks in
the sentence.
Steps for implementing noun phrase chunking
We need to follow the steps given below for implementing noun-phrase chunking:
Step 1: Chunk grammar definition
In the first step we will define the grammar for chunking. It would consist of the rules
which we need to follow.
Step 2: Chunk parser creation
Now, we will create a chunk parser. It would parse the grammar and give the output.
Step 3: The output
In this last step, the output would be produced in a tree format.
import nltk
Next, we need to define the sentence. Here DT is the determiner, VBP the verb, JJ the
adjective, IN the preposition and NN the noun.
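The sentence definition itself is not shown above; any list of (word, POS-tag) tuples works, for example (an illustrative assumption):

sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("bush", "NN")]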
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, next line of code will define a parser for parsing the grammar.
parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)
With the help of the following code, we can draw our output in the form of a tree:
output.draw()
Bag of Words (BoW) Model: Extracting and Converting the Text into
Numeric Form
Bag of Words (BoW) is a useful model in natural language processing, basically used to
extract features from text. After extracting the features from the text, they can be used
in machine learning algorithms, because raw text cannot be used directly in ML
applications.
Example
Suppose we have the following two sentences:
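The sentences themselves are not shown above; based on the word list that follows, they are along the lines of:
Sentence 1: This is an example of bag of words model.
Sentence 2: We can extract features by using bag of words model.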
Now, by considering these two sentences, we have the following 14 distinct words:
1. This
2. is
3. an
4. example
5. bag
6. of
7. words
8. model
9. we
10. can
11. extract
12. features
13. by
14. using
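The code that builds the model is not shown here; one common way is scikit-learn's CountVectorizer, as in this sketch:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is an example of bag of words model.",
             "We can extract features by using bag of words model."]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(sentences)

# The vocabulary maps each of the 14 distinct words to a feature index.
print(vectorizer.vocabulary_)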
Output
The printed vocabulary shows that we have 14 distinct words in the above two sentences.
Text Classification
Classification can be improved by topic modeling because it groups similar words together
rather than using each word separately as a feature.
Recommender Systems
We can build recommender systems by using similarity measures.
Latent Dirichlet Allocation (LDA): It is one of the most popular algorithms and uses
probabilistic graphical models for implementing topic modeling.
Non-Negative Matrix Factorization (NMF): It is based upon linear algebra and, like LDA,
is used for implementing topic modeling.
9. Python Web Scraping – Scraping Dynamic Websites
In this chapter, let us learn how to perform web scraping on dynamic websites and the
concepts involved in detail.
Introduction
Web scraping is a complex task and the complexity multiplies if the website is dynamic.
According to United Nations Global Audit of Web Accessibility more than 70% of the
websites are dynamic in nature and they rely on JavaScript for their functionalities.
import re
import urllib.request
response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
re.findall('<td class="w2p_fw">(.*?)</td>',text)
Output
[ ]
The above output shows that the example scraper failed to extract information because the
element we are trying to find is populated dynamically by JavaScript and is empty in the raw HTML.
For doing this, we need to inspect the page (Inspect Element) for the specified URL. Next, we
click the NETWORK tab to find all the requests made for that web page, including
search.json with a path of /ajax. Instead of accessing AJAX data from the browser or via the
NETWORK tab, we can do it with the help of the following Python script too:
import requests
url = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
url.json()
Example
The above script allows us to access the JSON response by using the Python json method.
Similarly, we can download the raw string response and load it with Python's json.loads
method. We are doing this with the help of the following Python script. It basically
scrapes all of the countries by searching each letter of the alphabet and then iterating through
the resulting pages of the JSON responses.
import requests
import string
PAGE_SIZE = 15
url = 'http://example.webscraping.com/ajax/search.json?page={}&page_size={}&search_term={}'
countries = set()
for letter in string.ascii_lowercase:
    print('Searching with %s' % letter)
    page = 0
    while True:
        response = requests.get(url.format(page, PAGE_SIZE, letter))
        data = response.json()
        print('adding %d records from the page %d' % (len(data.get('records')), page))
        for record in data.get('records'):
            countries.add(record['country'])
        page += 1
        if page >= data['num_pages']:
            break

with open('countries.txt', 'w') as countries_file:
    countries_file.write('\n'.join(sorted(countries)))
After running the above script, we will get the following output and the records would be
saved in the file named countries.txt.
Output:
Searching with a
adding 15 records from the page 0
adding 15 records from the page 1
...
Rendering JavaScript
In the previous section, we reverse engineered how the web page's API worked and
how we could use it to retrieve the results in a single request. However, we can face the
following difficulties while doing reverse engineering:
Sometimes websites can be very complex. For example, if a website is made with an
advanced browser tool such as Google Web Toolkit (GWT), the resulting JS
code is machine-generated and difficult to understand and reverse engineer.
Some higher level frameworks like React.js can make reverse engineering difficult
by abstracting already complex JavaScript logic.
The solution to the above difficulties is to use a browser rendering engine that parses
HTML, applies the CSS formatting and executes JavaScript to display a web page.
Example
In this example, for rendering Java Script we are going to use a familiar Python module
Selenium. The following Python code will render a web page with the help of Selenium:
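First, we need to import webdriver from the selenium module:

from selenium import webdriver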
Now, provide the path of web driver which we have downloaded as per our requirement:
path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
driver = webdriver.Chrome(executable_path = path)
Now, provide the url which we want to open in that web browser now controlled by our
Python script.
driver.get('http://example.webscraping.com/search')
Now, we can use ID of the search toolbox for setting the element to select.
driver.find_element_by_id('search_term').send_keys('.')
Next, we can use JavaScript to set the select box content as follows:
js = "document.getElementById('page_size').options[1].text = '100';"
driver.execute_script(js)
The following line of code shows that search is ready to be clicked on the web page:
driver.find_element_by_id('search').click()
The next line of code makes the driver wait up to 45 seconds for the AJAX request to complete.
driver.implicitly_wait(45)
Now, for selecting country links, we can use the CSS selector as follows:
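# A sketch; this assumes the search results appear inside an element with id "results".
links = driver.find_elements_by_css_selector('#results a')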
Now the text of each link can be extracted for creating the list of countries:
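countries = [link.text for link in links]
print(countries)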
10. Python Web Scraping — Scraping Form based Websites
In the previous chapter, we have seen scraping dynamic websites. In this chapter, let us
understand scraping of websites that work on user based inputs, that is form based
websites.
Introduction
These days the WWW (World Wide Web) is moving towards social media as well as user-
generated content. So the question arises: how can we access information
that is beyond the login screen? For this we need to deal with forms and logins.
In previous chapters, we worked with HTTP GET method to request information but in this
chapter we will work with HTTP POST method that pushes information to a web server for
storage and analysis.
In this section, we are going to deal with a simple submit form with the help of Python
requests library.
import requests
Now, we need to provide the information for the fields of the login form.
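A sketch of the data, mirroring the form fields used later in this chapter (the field names are placeholders for whatever the target form expects):

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}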
In the next line of code, we provide the URL on which the action of the form happens, and post the data to it.
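r = requests.post("enter the URL", data=parameters)
print(r.text)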
After running the script, it will return the content of the page where the action has happened.
Suppose you want to submit an image with the form; it is very easy with
requests.post(). You can understand it with the help of the following Python script:
import requests
file = {'Uploadfile': open(r'C:\Users\desktop\123.png', 'rb')}
r = requests.post("enter the URL", files=file)
print(r.text)
In the context of dealing with login forms, cookies can be of two types. The first, which we
dealt with in the previous section, allows us to submit information to a website; the second
lets us remain in a permanent "logged-in" state throughout our visit to the website. For
the second kind of form, websites use cookies to keep track of who is logged in and who
is not.
Step1: First, the site authenticates our login credentials and stores them in our browser's
cookie. This cookie generally contains a server-generated token, a time-out and tracking
information.
Step2: Next, the website will use the cookie as a proof of authentication. This
authentication is always shown whenever we visit the website.
Cookies are very problematic for web scrapers because if web scrapers do not keep track
of the cookies, the submitted form is sent back and at the next page it seems that they
never logged in. It is very easy to track the cookies with the help of Python requests
library, as shown below:
import requests

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)
In the above line of code, the URL would be the page which will act as the processor for
the login form.
After running the above script, we will retrieve the cookies from the result of last request.
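For example (a sketch):

print(r.cookies.get_dict())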
Another issue with cookies is that websites sometimes modify them frequently and without
warning. Such a situation can be dealt with using requests.Session(), as follows:
import requests

session = requests.Session()
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}
r = session.post("enter the URL", data=parameters)
In the above line of code, the URL would be the page which will act as the processor for
the login form.
Observe that you can easily understand the difference between the script with a session and
the script without a session.
Mechanize module
The Mechanize module provides us a high-level interface to interact with forms. Before starting
to use it, we need to install it with the following command:
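pip install mechanize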
Example
In this example, we are going to automate the process of filling a login form having two
fields namely email and password:
import mechanize

brwsr = mechanize.Browser()
brwsr.open('Enter the URL of login')
brwsr.select_form(nr=0)
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()
The above code is very easy to understand. First, we imported mechanize module. Then
a Mechanize browser object has been created. Then, we navigated to the login URL and
selected the form. After that, names and values are passed directly to the browser object.
11. Python Web Scraping — Processing CAPTCHA
In this chapter, let us understand how to perform web scraping and process CAPTCHA,
which is used to test whether a user is human or a robot.
What is CAPTCHA?
The full form of CAPTCHA is Completely Automated Public Turing test to tell
Computers and Humans Apart, which clearly suggests that it is a test to determine
whether the user is human or not.
A CAPTCHA is a distorted image which is usually not easy for a computer program to detect,
but a human can somehow manage to understand it. Most websites use CAPTCHA
to prevent bots from interacting with them.
import lxml.html
import urllib.request as urllib2
import pprint
import http.cookiejar as cookielib
def form_parsing(html):
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

REGISTER_URL = 'http://example.webscraping.com/user/register'
ckj = cookielib.CookieJar()
browser = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckj))
html = browser.open('http://example.webscraping.com/places/default/user/register?_next=/places/default/index').read()
form = form_parsing(html)
pprint.pprint(form)
In the above Python script, we first defined a function that parses the form by using the
lxml Python module, and then it prints the form fields as follows:
{'_formkey': '5e306d73-5774-4146-a94e-3541f22c95ab',
'_formname': 'register',
'_next': '/places/default/index',
'email': '',
'first_name': '',
'last_name': '',
'password': '',
'password_two': '',
'recaptcha_response_field': None}
You can check from the above output that all the information except
recaptcha_response_field is understandable and straightforward. Now the question
arises: how can we handle this complex information and download the CAPTCHA? It can
be done with the help of the pillow Python library as follows:
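The script itself is not shown here; a minimal sketch of such a get_captcha() function, based on the description below and assuming the CAPTCHA is embedded as a Base64-encoded image inside an element with id "recaptcha", could look like this:

from io import BytesIO
import base64
import lxml.html
from PIL import Image

def get_captcha(html):
    # Parse the registration page and locate the inline CAPTCHA image
    # (the 'div#recaptcha img' selector is an assumption about the example site).
    tree = lxml.html.fromstring(html)
    img_data = tree.cssselect('div#recaptcha img')[0].get('src')
    # The src looks like "data:image/png;base64,...."; keep only the encoded part.
    img_data = img_data.partition(',')[-1]
    binary_img_data = base64.b64decode(img_data)
    img = Image.open(BytesIO(binary_img_data))
    return img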
The above Python script uses the pillow Python package and defines a function for loading the
CAPTCHA image. It must be used with the function named form_parsing() that is defined
in the previous script to get information about the registration form. This script
saves the CAPTCHA image in a useful format which can further be extracted as a string.
Example
Here we will extend the above Python script, which loaded the CAPTCHA by using Pillow
Python Package, as follows:
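This step relies on the pytesseract OCR wrapper (and on the underlying Tesseract OCR engine being installed on the system); the Python package itself can be installed with:

pip install pytesseract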
import pytesseract
img = get_captcha(html)
img.save('captcha_original.png')
gray = img.convert('L')
gray.save('captcha_gray.png')
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')
The above Python script will read the CAPTCHA in black and white mode which would be
clear and easy to pass to tesseract as follows:
pytesseract.image_to_string(bw)
After running the above script, we will get the text of the CAPTCHA of the registration form as the output.
12. Python Web Scraping — Testing with Scrapers
This chapter explains how to perform testing using web scrapers in Python.
Introduction
In large web projects, automated testing of a website's backend is performed regularly, but
frontend testing is often skipped. The main reason behind this is that the programming
of websites is like a net of various markup and programming languages. We can write
unit tests for one language, but it becomes challenging if the interaction is being done in
another language. That is why we must have a suite of tests to make sure that our code is
performing as per our expectations.
At least one aspect of the functionality of a component is tested in each unit
test.
A unit test does not interfere with the success or failure of any other test.
Unit tests can run in any order and must contain at least one assertion.
Example
In this example, we are going to combine web scraping with unittest. We will test the
Wikipedia page for the search string ‘Python’. It will basically do two tests: the first checks whether
the page title is the same as the search string, i.e. ‘Python’, or not, and the second test makes sure
that the page has a content div.
First, we will import the required Python modules. We are using BeautifulSoup for web
scraping and of course unittest for testing.
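from urllib.request import urlopen
from bs4 import BeautifulSoup
import unittest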
Now we need to define a class which will extend unittest.TestCase. The global object bs
will be shared between all tests; the unittest-specified function setUpClass will
accomplish this. Here we will define two functions, one for testing the page title and the other
for testing the page content.
class Test(unittest.TestCase):
    bs = None

    def setUpClass():
        url = 'https://en.wikipedia.org/wiki/Python'
        Test.bs = BeautifulSoup(urlopen(url), 'html.parser')

    def test_titleText(self):
        pageTitle = Test.bs.find('h1').get_text()
        self.assertEqual('Python', pageTitle)

    def test_contentExists(self):
        content = Test.bs.find('div', {'id': 'mw-content-text'})
        self.assertIsNotNone(content)

if __name__ == '__main__':
    unittest.main()
After running the above script we will get the following output:
----------------------------------------------------------------------
Ran 2 tests in 2.773s
OK
An exception has occurred, use %tb to see the full traceback.
SystemExit: False
D:\ProgramData\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
Example
With the help of next Python script, we are creating test script for the automation of
Facebook Login page. You can modify the example for automating other forms and logins
of your choice, however the concept would be same.
First, for connecting to the web browser, we will import webdriver from the selenium module (along with Keys, which is used later to press Enter):
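from selenium import webdriver
from selenium.webdriver.common.keys import Keys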
Next we need to provide username and password for login into our facebook account.
user = "gauravleekha@gmail.com"
pwd = ""
path = r'C:\\Users\\gaurav\\Desktop\\Chromedriver'
driver = webdriver.Chrome(executable_path=path)
driver.get("http://www.facebook.com")
With the help of following line of code we are sending values to the email section. Here we
are searching it by its id but we can do it by searching it by name as
driver.find_element_by_name("email").
element = driver.find_element_by_id("email")
element.send_keys(user)
With the help of following line of code we are sending values to the password section. Here
we are searching it by its id but we can do it by searching it by name as
driver.find_element_by_name("pass").
element = driver.find_element_by_id("pass")
element.send_keys(pwd)
Next line of code is used to press enter/login after inserting the values in email and
password field.
element.send_keys(Keys.RETURN)
driver.close()
After running the above script, Chrome web browser will be opened and you can see email
and password is being inserted and clicked on login button.
For example, we are rewriting the above Python script for the automation of the Facebook login
by combining Selenium with unittest as follows:
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class InputFormsCheck(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome(r'C:\Users\gaurav\Desktop\chromedriver')

    def test_singleInputField(self):
        user = "gauravleekha@gmail.com"
        pwd = ""
        pageUrl = "http://www.facebook.com"
        driver = self.driver
        driver.maximize_window()
        driver.get(pageUrl)
        assert "Facebook" in driver.title
        elem = driver.find_element_by_id("email")
        elem.send_keys(user)
        elem = driver.find_element_by_id("pass")
        elem.send_keys(pwd)
        elem.send_keys(Keys.RETURN)

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()