Web Scraping With Python
WEB SCRAPING IN PYTHON
Thomas Laetsch
Data Scientist, NYU
Business Savvy
What are businesses looking for?
Comparing prices
Satisfaction of customers
Generating potential leads
...and much more!
It's Personal
What could you do?
Search for your favorite memes on your favorite sites.
Automatically look through classified ads for your favorite gadgets.
Scrape social site content looking for hot topics.
Scrape cooking blogs looking for particular recipes, or recipe reviews.
...and much more!
About My Work
Pipe Dream
Pipe Dream: Setup
Setup
Understand what we want to do.
Find sources to help us do it.
Pipe Dream: Acquisition
Acquisition
Read in the raw data from online.
Format these data to be usable.
Pipe Dream: Processing
Processing
Many options!
How do you do?
Our Focus
Acquisition!
(Using scrapy via Python)
Are you in?
HyperText Markup Language
The main example
HTML tags
<html> ... </html>
<body> ... </body>
<div> ... </div>
<p> ... </p>
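To see these tags in a complete document, here is a minimal sketch that uses `html.parser` from Python's standard library (not scrapy) to list each opening tag in the order it appears. The HTML string and the `TagCollector` class are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical example document built from the four tags above.
html = """
<html>
  <body>
    <div>
      <p>Hello World!</p>
    </div>
    <div>
      <p>Thanks for reading!</p>
    </div>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Records the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(html)
opened_tags = collector.tags
print(opened_tags)
```

Each opening tag such as `<div>` is matched by a closing tag such as `</div>`, and everything between the two belongs to that element.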
The HTML tree
The HTML tree: Example 1
The HTML tree: Example 2
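One way to picture the HTML tree is to print each element indented under its parent. The sketch below is an illustration only: the HTML string and the `outline` helper are hypothetical, and it uses the standard library's `xml.etree.ElementTree`, which assumes the markup is well-formed (real web pages often are not).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed example document.
html = """<html>
  <body>
    <div>
      <p>Hello World!</p>
      <p>Second paragraph.</p>
    </div>
    <p>Thanks for reading!</p>
  </body>
</html>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(html))
print("\n".join(tree_lines))
```

Here `body` is a child of `html`, the `div` and the last `p` are siblings (both children of `body`), and the two `p` elements inside the `div` are descendants of `body`, one generation further down.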
Introduction to HTML Outro
HTML Tags and Attributes
Do we have to?
Information within HTML tags can be valuable
Extract link URLs
Easier way to select elements
Tag, you're it!
We've seen tag names such as html, div, and p.
Tags can also contain attributes: the attribute name is followed by =, followed by the information assigned to that attribute, usually quoted text.
Let's "div"vy up the tag
id attribute should be unique
class attribute doesn't need to be unique
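As a rough illustration of why `id` and `class` make elements easier to select, the sketch below looks up `div` elements by attribute value. It uses the standard library's `xml.etree.ElementTree` as a stand-in for scrapy; the HTML string and the attribute values `intro`, `details`, and `section` are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each div has its own id, both share one class.
html = """<html>
  <body>
    <div id="intro" class="section"><p>First</p></div>
    <div id="details" class="section"><p>Second</p></div>
  </body>
</html>"""

root = ET.fromstring(html)

# An id should be unique, so this is expected to match a single element.
intro = root.findall(".//div[@id='intro']")

# A class may be shared, so this can match several elements.
sections = root.findall(".//div[@class='section']")

# The attributes of an element are available as a dictionary.
intro_attributes = intro[0].attrib
print(len(intro), len(sections), intro_attributes)
```

Note that the `[@class='section']` predicate here compares the whole attribute string, so it would not match an element whose `class` lists several names.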
"a" be linkin'
a tags are for hyperlinks
href attribute tells what link to go to
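Extracting link URLs then comes down to reading the `href` attribute of every `a` element. A minimal sketch, again with the standard library's `xml.etree.ElementTree` rather than scrapy, and with a made-up HTML string and placeholder `example.com` URLs:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet containing two hyperlinks.
html = """<html>
  <body>
    <p>Visit <a href="https://www.example.com">the example site</a>
    or <a href="https://www.example.com/about">its about page</a>.</p>
  </body>
</html>"""

root = ET.fromstring(html)

# Pair each link's visible text with the URL stored in its href attribute.
links = [(a.text, a.get("href")) for a in root.iter("a")]
print(links)
```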
Tag Traction
Et Tu, Attributes?
Crash Course X
Another Slasher Video?
xpath = '/html/body/div[2]'
Simple XPath:
Single forward-slash / used to move forward one generation.
Tag names between slashes give the direction to the element(s) to move to.
Brackets [] after a tag name tell us which of the selected siblings to choose.
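The three rules above can be tried out on a small document. This is a sketch under stated assumptions: it uses the limited XPath support in the standard library's `xml.etree.ElementTree` instead of scrapy, and the HTML string is made up. Because ElementTree evaluates paths relative to the root `html` element, `./body/div[2]` plays the role of `/html/body/div[2]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose body has two div children.
html = """<html>
  <body>
    <div><p>First div</p></div>
    <div><p>Second div</p></div>
  </body>
</html>"""

root = ET.fromstring(html)  # root is the <html> element

# html -> body -> the 2nd div child of body (XPath counts from 1, not 0).
second_divs = root.findall("./body/div[2]")
second_div_text = second_divs[0].find("p").text
print(second_div_text)
```

Without the bracket, `./body/div` would select both `div` children of `body`.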
Slasher Double Feature?
Direct to all table elements within the entire HTML code:
xpath = '//table'
Direct to all table elements which are descendants of the 2nd div child of the body element:
xpath = '/html/body/div[2]//table'
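The difference between the two expressions can be checked on a small document. As a sketch only, this uses the standard library's `xml.etree.ElementTree` in place of scrapy; the HTML string and the `id` values `t1` to `t3` are made up. ElementTree paths are relative to the root `html` element, so `.//table` stands in for `//table` and `./body/div[2]//table` for `/html/body/div[2]//table`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three tables at different depths.
html = """<html>
  <body>
    <div>
      <table id="t1"></table>
    </div>
    <div>
      <table id="t2"></table>
      <div>
        <table id="t3"></table>
      </div>
    </div>
  </body>
</html>"""

root = ET.fromstring(html)

# Double slash: look through all future generations, not just children.
all_tables = [t.get("id") for t in root.findall(".//table")]

# Only tables that descend from the 2nd div child of body.
tables_in_second_div = [t.get("id") for t in root.findall("./body/div[2]//table")]
print(all_tables, tables_in_second_div)
```

The table nested one level deeper (`t3`) is still found by the second expression, because the double slash reaches every descendant of the selected `div`.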
Ex(path)celent