Python Web Scraping Data Extraction
Analyzing a web page means understanding its structure. Now, the question arises: why is this important for web
scraping? In this chapter, let us understand this in detail.
Page analysis is a way to understand how a web page is structured by examining its source code. To do this, we need
to right-click the page and select the View page source option. We will then get the data of our
interest from that web page in the form of raw HTML. The main concern with this raw markup is the whitespace and
formatting, which makes it difficult for us to process directly.
Regular Expression
Regular expressions are a highly specialized, small language embedded in Python, used through the re module. They
are also called REs, regexes or regex patterns. With the help of regular expressions, we can specify rules that
describe the set of strings we want to match in the data.
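As a quick, minimal sketch of this idea (the pattern and the sample string below are purely illustrative), the re module can pull every substring matching a rule out of a larger piece of text −
import re
sample = 'India covers 3,287,590 square kilometres and has 28 states.'
# The rule \d+ matches every run of one or more digits.
print(re.findall(r'\d+', sample))   # ['3', '287', '590', '28']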
If you want to learn more about regular expressions in general, go to the link
https://www.tutorialspoint.com/automata_theory/regular_expressions.htm and if you want to know more about the
re module or regular expressions in Python, you can follow the link
https://www.tutorialspoint.com/python/python_reg_expressions.htm.
Example
In the following example, we are going to scrape data about India from http://example.webscraping.com by
matching the contents of <td> elements with the help of a regular expression.
import re
import urllib.request

# Download the country page and decode the response bytes into text.
response = urllib.request.urlopen('http://example.webscraping.com/places/default/view/India-102')
html = response.read()
text = html.decode()

# Extract the contents of every table cell having the class "w2p_fw".
print(re.findall('<td class="w2p_fw">(.*?)</td>', text))
Output
[
'<img src="/places/static/images/flags/in.png" />',
'3,287,590 square kilometres',
'1,173,108,018',
'IN',
'India',
'New Delhi',
'<a href="/places/default/continent/AS">AS</a>',
'.in',
'INR',
'Rupee',
'91',
'######',
'^(\\d{6})$',
'en-IN,hi,bn,te,mr,ta,ur,gu,kn,ml,or,pa,as,bh,sat,ks,ne,sd,kok,doi,mni,sit,sa,fr,lus,inc',
'<div>
<a href="/places/default/iso/CN">CN </a>
<a href="/places/default/iso/NP">NP </a>
<a href="/places/default/iso/MM">MM </a>
<a href="/places/default/iso/BT">BT </a>
<a href="/places/default/iso/PK">PK </a>
<a href="/places/default/iso/BD">BD </a>
</div>'
]
Observe that in the above output you can see the details about the country India, extracted by using a regular expression.
Beautiful Soup
Suppose we want to collect all the hyperlinks from a web page; then we can use a parser called BeautifulSoup,
which is described in more detail at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. In simple
words, BeautifulSoup is a Python library for pulling data out of HTML and XML files. It is often used together with
requests, because it needs an input document or URL to create a soup object, as it cannot fetch a web page by itself.
You can use the following Python script to gather the title of a web page and its hyperlinks.
Using the pip command, we can install beautifulsoup either in our virtual environment or in the global installation.
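For example, the following command installs the package (published on PyPI as beautifulsoup4 and imported as bs4) −
pip install beautifulsoup4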
Example
Note that in this example, we are extending the above example implemented with the requests Python module. We are
using r.text for creating a soup object, which will further be used to fetch details like the title of the web page.
import requests
from bs4 import BeautifulSoup
In the following line of code we use requests to make a GET HTTP request for the URL
https://authoraditiagarwal.com/.
r = requests.get('https://authoraditiagarwal.com/')
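The request above only downloads the page. A minimal sketch of the remaining steps, using the standard BeautifulSoup calls, might look as follows (html.parser is Python's built-in parser, and the exact hyperlinks printed depend on the live page) −
soup = BeautifulSoup(r.text, 'html.parser')
# Print the <title> element of the web page.
print(soup.title)
# Print the target of every hyperlink found in the document.
for link in soup.find_all('a'):
    print(link.get('href'))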
Output
The title element of the web page, followed by the hyperlinks extracted from it.
Lxml
Another Python library we are going to discuss for web scraping is lxml. It is a high-performance HTML and XML
parsing library that is comparatively fast and straightforward. You can read more about it at https://lxml.de/.
Installing lxml
Using the pip command, we can install lxml either in our virtual environment or in the global installation.
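For example −
pip install lxml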
In the following example, we are scraping a particular element of the web page from authoraditiagarwal.com
by using lxml and requests −
First, we need to import requests and the html module from the lxml library as follows −
import requests
from lxml import html
url = 'https://authoraditiagarwal.com/leadershipmanagement/'
Now we need to provide the XPath of the particular element of that web page −
path = '//*[@id="panel-836-0-0-1"]/div/div/p[1]'
# Fetch the page and keep the raw response bytes.
response = requests.get(url)
byte_string = response.content

# Parse the HTML and evaluate the XPath expression against it.
source_code = html.fromstring(byte_string)
tree = source_code.xpath(path)

# Print the text content of the first matching element.
print(tree[0].text_content())
Output
The Sprint Burndown or the Iteration Burndown chart is a powerful tool to communicate
daily progress to the stakeholders. It tracks the completion of work for a given sprint
or an iteration. The horizontal axis represents the days within a Sprint. The vertical
axis represents the hours remaining to complete the committed work.