Scraping HTML Chapter2

The document discusses XPath navigation techniques in Python for web scraping, including:
- Using slashes and brackets to look forward in the HTML structure and narrow selections.
- Examples of selecting elements by tag name, attribute values, and relative position.
- The wildcard character "*" to select all child elements, and contains() to match partial text in attributes.
- Creating Selector objects in Scrapy to parse HTML content and extract data using XPath queries.

Uploaded by

Mikistli Yowaltekutli

XPath Navigation

WEB SCRAPING IN PYTHON

Thomas Laetsch
Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation

Double forward slash // looks forward all future generations

Square brackets [] help narrow in on specific elements

To Bracket or not to Bracket
xpath = '/html/body'

xpath = '/html[1]/body[1]'

Both expressions give the same selection, because a document has only one html element and one body element

A Body of P
xpath = '/html/body/p'

The Birds and the Ps
xpath = '/html/body/div/p'

xpath = '/html/body/div/p[2]'

Double Slashing the Brackets
xpath = '//p'

xpath = '//p[1]'

The Wildcard
xpath = '/html/body/*'
The asterisk * is the "wildcard": it matches any element, regardless of tag name

Xposé
Off the Beaten XPath
(At)tribute
@ represents "attribute"
@class

@id

@href

Brackets and Attributes
xpath = '//p[@class="class-1"]'

xpath = '//*[@id="uid"]'

xpath = '//div[@id="uid"]/p[2]'

Content with Contains
XPath contains notation:

contains(@attr-name, "string-expr")

Contain This
xpath = '//*[contains(@class,"class-1")]'

xpath = '//*[@class="class-1"]'

Get Classy
xpath = '/html/body/div/p[2]'

xpath = '/html/body/div/p[2]/@class'

End of the Path
Introduction to the scrapy Selector
Setting up a Selector
from scrapy import Selector

html = '''
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
'''

sel = Selector(text=html)

This creates a scrapy Selector object from a string containing the HTML code

The selector sel has selected the entire HTML document

Selecting Selectors
We can use the xpath call within a Selector to create new Selectors of specific pieces of the HTML code

The return is a SelectorList of Selector objects

sel.xpath("//p")

# outputs the SelectorList:

[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

Extracting Data from a SelectorList
Use the extract() method

>>> sel.xpath("//p")

out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
      <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

>>> sel.xpath("//p").extract()

out: ['<p>Hello World!</p>',
      '<p>Enjoy DataCamp!</p>']

We can use extract_first() to get the first element of the list

>>> sel.xpath("//p").extract_first()

out: '<p>Hello World!</p>'

Extracting Data from a Selector
ps = sel.xpath('//p')

second_p = ps[1]

second_p.extract()

out: '<p>Enjoy DataCamp!</p>'

Select This Course!
"Inspecting the HTML"

Thomas Laetsch, PhD

Data Scientist, NYU
"Source" = HTML Code

Inspecting Elements

HTML text to Selector
from scrapy import Selector
import requests

url = 'https://www.datacamp.com/courses/all'

# Download the page source, then hand it to a Selector
html = requests.get(url).content

sel = Selector(text=html)

You Know Our Secrets