Scraping HTML Chapter2

The document discusses various XPath navigation techniques in Python for web scraping, including:
- Using slashes and brackets to look forward in the HTML structure and narrow selections.
- Examples of selecting elements by tag name, attribute values, and relative position.
- The wildcard character "*" to select all child elements, and contains() to match partial text in attributes.
- Creating Selector objects in Scrapy to parse HTML content and extract data using XPath queries.

XPath Navigation

WEB SCRAPING IN PYTHON

Thomas Laetsch
Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation

Double forward slash // looks forward all future generations

Square brackets [] help narrow in on specific elements

To Bracket or not to Bracket
xpath = '/html/body'

xpath = '/html[1]/body[1]'

Both give the same selection, since a document has only one html element and one body element

A Body of P
xpath = '/html/body/p'

The Birds and the Ps
xpath = '/html/body/div/p'

xpath = '/html/body/div/p[2]'

Double Slashing the Brackets
xpath = '//p'

xpath = '//p[1]'

The Wildcard
xpath = '/html/body/*'
The asterisk * is the "wildcard": it matches any element, whatever its tag

Xposé
Off the Beaten XPath
(At)tribute
@ represents "attribute"
@class

@id

@href

Brackets and Attributes
xpath = '//p[@class="class-1"]'

xpath = '//*[@id="uid"]'

xpath = '//div[@id="uid"]/p[2]'

Content with Contains
XPath contains notation:

contains( @attri-name, "string-expr" )

Contain This
xpath = '//*[contains(@class,"class-1")]'

xpath = '//*[@class="class-1"]'

Get Classy
xpath = '/html/body/div/p[2]'

xpath = '/html/body/div/p[2]/@class'

End of the Path
Introduction to the scrapy Selector
Setting up a Selector
from scrapy import Selector

html = '''
<html>
<body>
<div class="hello datacamp">
<p>Hello World!</p>
</div>
<p>Enjoy DataCamp!</p>
</body>
</html>
'''

sel = Selector( text = html )

This creates a scrapy Selector object from a string containing the HTML code

The Selector sel has selected the entire HTML document

Selecting Selectors
We can call the xpath method on a Selector to create new Selectors for specific pieces of the HTML code

The return value is a SelectorList of Selector objects

sel.xpath("//p")

# outputs the SelectorList:
[<Selector xpath='//p' data='<p>Hello World!</p>'>,
 <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

Extracting Data from a SelectorList
Use the extract() method

>>> sel.xpath("//p")

out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
      <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]

>>> sel.xpath("//p").extract()

out: ['<p>Hello World!</p>',
      '<p>Enjoy DataCamp!</p>']

We can use extract_first() to get the first element of the list

>>> sel.xpath("//p").extract_first()

out: '<p>Hello World!</p>'

Extracting Data from a Selector
ps = sel.xpath('//p')

second_p = ps[1]

second_p.extract()

out: '<p>Enjoy DataCamp!</p>'

Select This Course!
"Inspecting the HTML"

Thomas Laetsch, PhD
Data Scientist, NYU
"Source" = HTML Code

Inspecting Elements

HTML text to Selector
from scrapy import Selector

import requests

url = 'https://www.datacamp.com/courses/all'

html = requests.get( url ).content

sel = Selector( text = html )

You Know Our Secrets