08 Web Scraping

The document provides an overview of web scraping, explaining its purpose and the importance of understanding HTML and robots.txt files. It discusses how to access HTML code and introduces libraries like Requests and Beautiful Soup for effective web scraping. The content emphasizes the need to adhere to website scraping rules to avoid being blocked.


DATA SCIENCE

WEB SCRAPING
AGENDA

I. WHAT IS WEB SCRAPING?
II. ROBOTS.TXT
III. HTML & HTML TAGS
IV. HOW TO LOOK AT HTML CODE
V. BEAUTIFUL SOUP

I. WHAT IS WEB SCRAPING?
WHAT IS WEB SCRAPING?

• A way of systematically pulling information from a website
• Allows you to simulate a human viewing the page and copying information
• A well-known example: “Hacking OkCupid”, where profile data was scraped at scale
• Pull data by finding patterns in the structured data
• This is one way of “getting the data”.
• But be careful… you can get blocked

II. ROBOTS.TXT
ROBOTS.TXT

• The robots exclusion standard allows website owners to specify whether they allow web “robots” or not.
• This tells you whether you can scrape a website or not.
• Located in the root directory of a website and called “robots.txt”.
• Examples: “www.google.com/robots.txt” or “http://www.dataschool.io/robots.txt”
• Read more: http://www.robotstxt.org/robotstxt.html
ROBOTS.TXT

• Things to look for:
  • User-agent: which type of robots the following rules apply to
  • Disallow: which parts of the website you are not allowed to scrape
• Note that you may be able to scrape some parts of a website but not others (see the example below)
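
For instance, a robots.txt file might look like this (the paths and the bot name are made up for illustration):

    User-agent: *
    Disallow: /search
    Disallow: /private/

    User-agent: BadBot
    Disallow: /

Python's standard library can check these rules for you. A minimal sketch using urllib.robotparser, with example.com standing in for a real site:

    from urllib.robotparser import RobotFileParser

    # Point the parser at the site's robots.txt (example.com is a placeholder).
    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the rules

    # can_fetch(user_agent, url) returns True if that robot may fetch the URL
    print(rp.can_fetch("*", "http://www.example.com/index.html"))
    print(rp.can_fetch("*", "http://www.example.com/private/data.html"))

Checking can_fetch() before scraping is a simple way to respect the rules above and reduce the risk of being blocked.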

III. HTML & HTML TAGS


HTML & HTML TAGS

• HTML is the structured markup underneath webpages
• Your web browser takes this code and interprets its meaning
• Basic format of an HTML tag:
  • <tag class="class_name" id="id_name"> … </tag>
  • Open and close tags: <tag> and </tag>
  • Attributes of the tag: class="class_name", id="id_name"
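
A small, made-up fragment showing nested tags with a class and an id (all names here are invented for illustration):

    <div class="profile" id="user-42">
      <h2 class="name">Jane Doe</h2>
      <p class="bio">Likes data science and long walks.</p>
    </div>

The class and id attributes are what a scraper typically searches on, since they label the pieces of the page you care about.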

IV. HOW TO LOOK AT HTML CODE
HOW TO LOOK AT HTML CODE

• View source code
  • Shows you the entire HTML code that makes up the webpage
  • Good to have it all, but hard to find specific things
• Inspect Element
  • Brings up the highlighted HTML for a specific item on the page
  • This is the preferred method

V. BEAUTIFUL SOUP
BEAUTIFUL SOUP

• We’ll be using two libraries to help us build web-scraping robots:
  • Requests
  • Beautiful Soup
• Requests “gets” the webpage’s HTML from the web.
• Beautiful Soup parses that HTML into a searchable, structured object (a short sketch follows).
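
Putting the two together, a minimal sketch of a scraper. The URL and the class name are placeholders, not a real target:

    import requests
    from bs4 import BeautifulSoup

    # Step 1: Requests "gets" the raw HTML (example.com is a placeholder URL).
    response = requests.get("http://www.example.com")
    response.raise_for_status()  # fail loudly if the request was blocked or errored

    # Step 2: Beautiful Soup parses the HTML into a searchable object.
    soup = BeautifulSoup(response.text, "html.parser")

    # Search by tag, class, or id, mirroring the HTML structure from section III.
    title = soup.find("title")                       # first <title> tag
    links = soup.find_all("a")                       # every <a> tag
    cards = soup.find_all("div", class_="profile")   # hypothetical class name

    print(title.text if title else "no title found")
    print(len(links), "links on the page")

find() returns the first match and find_all() returns a list; both accept tag names and attribute filters such as class_ (the trailing underscore avoids clashing with Python’s class keyword).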
