Python Web Scraping Data Extraction
Analyzing a web page means understanding its structure. Now, the question arises: why is this important for web
scraping? In this chapter, let us understand this in detail.
Page analysis is a way to understand how a web page is structured by examining its source code. To do this, we need
to right-click the page and select the View page source option. We will then get the data of our
interest from that web page in the form of raw HTML. The main concern with this raw markup is the whitespace and
formatting, which makes it difficult for us to process directly.
Regular Expression
Regular expressions are a highly specialized, small language embedded in Python, used through the re module. They
are also called REs, regexes or regex patterns. With the help of regular expressions, we can specify rules that
describe the set of strings we want to match in the data.
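As a quick, minimal sketch of this idea (the pattern and the sample string below are purely illustrative), the re module can pull every substring matching a rule out of a larger piece of text −
import re
sample = 'India covers 3,287,590 square kilometres and has 28 states.'
# The rule \d+ matches every run of one or more digits.
print(re.findall(r'\d+', sample))   # ['3', '287', '590', '28']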
If you want to learn more about regular expressions in general, go to the link
https://www.tutorialspoint.com/automata_theory/regular_expressions.htm and if you want to know more about the
re module or regular expressions in Python, you can follow the link
https://www.tutorialspoint.com/python/python_reg_expressions.htm.
Example
In the following example, we are going to scrape data about India from http://example.webscraping.com by
matching the contents of <td> elements with the help of a regular expression.
import re
import urllib.request

# Download the country page and decode the response bytes into text.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()

# Extract the contents of every table cell having the class "w2p_fw".
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))
Output
[
'<img src="/places/static/images/flags/in.png" />',
'3,287,590 square kilometres',
'1,173,108,018',
'IN',
'India',
'New Delhi',
'<a href="/places/default/continent/AS">AS</a>',
'.in',
'INR',
'Rupee',
'91',
'######',
'^(\\d{6})$',
'en-IN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
'<div>
<a href="/places/default/iso/CN">CN </a>
<a href="/places/default/iso/NP">NP </a>
<a href="/places/default/iso/MM">MM </a>
<a href="/places/default/iso/BT">BT </a>
<a href="/places/default/iso/PK">PK </a>
<a href="/places/default/iso/BD">BD </a>
</div>'
]
Observe that in the above output you can see the details about the country India, extracted by using a regular expression.
Beautiful Soup
Suppose we want to collect all the hyperlinks from a web page; then we can use a parser called BeautifulSoup,
which is described in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple
words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is often used together with
requests, because it needs an input document or URL to create a soup object, as it cannot fetch a web page by itself.
You can use the following Python script to gather the title of a web page and its hyperlinks.
Using the pip command, we can install beautifulsoup either in our virtual environment or in the global installation.
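For example, the following command installs the package (published on PyPI as beautifulsoup4 and imported as bs4) −
pip install beautifulsoup4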
Example
Note that in this example, we are extending the above example implemented with the requests Python module. We are
using r.text for creating a soup object, which will further be used to fetch details like the title of the web page.
import requests
from bs4 import BeautifulSoup
In the following line of code we use requests to make a GET HTTP request for the URL
https://authoraditiagarwal.com/.
r = requests.get('https://authoraditiagarwal.com/')
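The request above only downloads the page. A minimal sketch of the remaining steps, using the standard BeautifulSoup calls, might look as follows (html.parser is Python's built-in parser, and the exact hyperlinks printed depend on the live page) −
soup = BeautifulSoup(r.text, 'html.parser')
# Print the <title> element of the web page.
print(soup.title)
# Print the target of every hyperlink found in the document.
for link in soup.find_all('a'):
    print(link.get('href'))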
Output
The title element of the web page, followed by the hyperlinks extracted from it.
Lxml
Another Python library we are going to discuss for web scraping is lxml. It is a high-performance HTML and XML
parsing library that is comparatively fast and straightforward. You can read more about it at https://lxml.de/.
Installing lxml
Using the pip command, we can install lxml either in our virtual environment or in the global installation.
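For example −
pip install lxml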
In the following example, we are scraping a particular element of the web page from authoraditiagarwal.com
by using lxml and requests −
First, we need to import requests and the html module from the lxml library as follows −
import requests
from lxml import html
url = 'https://authoraditiagarwal.com/leadershipmanagement/'
Now we need to provide the XPath of the particular element of that web page −
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
# Fetch the page and keep the raw response bytes.
response = requests.get(url)
byte_string = response.content

# Parse the HTML and evaluate the XPath expression against it.
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)

# Print the text content of the first matching element.
print(tree[0].text_content())
Output
The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.