
Web Scraping - YL

Uploaded by

rui91seu
Copyright
© All Rights Reserved

Web Data Crawling

Agenda
● What is HTML
● URL and Page Structure: indeed.com as an example
● Hands-On
What is HTTP?
● HTTP: HyperText Transfer Protocol
○ client/server model
○ client (browser, program, curl…) opens a connection and sends a message to a server (Nginx, Apache, ...)
○ server answers with a response and closes the connection
● Example HTTP Request Header
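
As a rough illustration of what a request header looks like on the wire, the sketch below builds a minimal GET request as text. The host `www.example.com`, the path, and the header values are placeholders chosen for this example, not taken from the slides.

```python
# Build the text of a minimal HTTP/1.1 GET request by hand.
# Host, path and header values are illustrative placeholders.

def build_get_request(host: str, path: str = "/") -> str:
    lines = [
        f"GET {path} HTTP/1.1",        # request line: method, path, protocol version
        f"Host: {host}",               # the Host header is required in HTTP/1.1
        "User-Agent: demo-client/0.1",
        "Accept: text/html",
        "Connection: close",           # ask the server to close the connection afterwards
    ]
    # Each header line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_get_request("www.example.com", "/index.html")
print(request_text)
```

A browser or a tool such as curl sends text of this shape for you; writing it out by hand is only meant to show the structure.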
What is HTTP?
● Example HTTP Response Header
HTTP codes:
● 2XX for successful requests
● 3XX for redirects
● 4XX for client errors such as bad requests (the most famous being 404 Not Found)
● 5XX for server errors
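
To make the response header and the code classes concrete, here is a small sketch that parses the status line of a hand-written response and maps a code to its class. The response text is a made-up placeholder, and `parse_status_line` and `status_class` are hypothetical helper names.

```python
from http import HTTPStatus

# A hand-written response header; the values are placeholders, not captured traffic.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 1256\r\n"
    "\r\n"
)


def parse_status_line(raw: str):
    """Split the first line of a response into version, numeric code and reason."""
    status_line = raw.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


def status_class(code: int) -> str:
    """Map a status code to the class named by its first digit."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")  # e.g. 1XX informational codes


version, code, reason = parse_status_line(RAW_RESPONSE)
print(version, code, reason, "->", status_class(code))

for c in (301, 404, 500):
    print(c, HTTPStatus(c).phrase, "->", status_class(c))
```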

If you send this HTTP request with your web browser, the browser
parses the HTML code, fetches any assets the page refers to
(JavaScript files, CSS files, images…) and renders the result in
the main window.
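
The asset-discovery step can be sketched with Python's standard `html.parser` module. The page below and the `AssetCollector` class are invented for illustration; a real browser handles many more cases.

```python
from html.parser import HTMLParser

# A made-up page that refers to one stylesheet, one script and one image.
PAGE = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
</head><body><img src="logo.png"></body></html>"""


class AssetCollector(HTMLParser):
    """Collect the URLs of assets a browser would fetch after parsing the HTML."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.assets.append(attrs["href"])


collector = AssetCollector()
collector.feed(PAGE)
print(collector.assets)
```

A scraper usually skips this step: it only needs the HTML itself, not the rendered page.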
What is HTML?

● HTML: HyperText Markup Language
○ a computer language used to create documents on the World Wide Web
○ simple and logical
○ a markup language that uses <tags> rather than programming-language constructs
● Web pages are plain-text files that consist of HTML tags.
What is HTML?
Tags
● Tags are instructions that mark up the text shown in your web browser.
● Tags are written in the format <tag>.
● Most tags are paired with a closing tag </tag>; a few, such as <br>, have none.
● An element is made up of a start tag, an end tag, and the content
between them.

<title>Business Analytics</title>
Toy Example

● Browsers use HTML tags to decide how to display the document.
○ <html> is the root element of an HTML page
○ <head> contains elements that describe the document and are not displayed in
the page itself; <title> is one such element
○ <body> is the web page itself
○ <h1> defines a large heading and <p> defines a paragraph
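
A toy page using exactly these tags might look like the string below. The heading and paragraph text are invented; the sketch uses the standard `html.parser` module to print how the tags nest, and `OutlinePrinter` is a hypothetical helper.

```python
from html.parser import HTMLParser

# A toy page built from the tags listed above; the text content is made up.
TOY_PAGE = """<html>
<head><title>Business Analytics</title></head>
<body>
<h1>A large heading</h1>
<p>A paragraph of text.</p>
</body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Record each start tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = OutlinePrinter()
printer.feed(TOY_PAGE)
print("\n".join(printer.outline))
```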
Beautiful Soup

● Beautiful Soup is a Python library for parsing HTML documents,
including those with malformed markup; its name alludes to
“tag soup”, a nickname for such markup.
● It helps you pull data out of HTML and XML files.
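
A minimal sketch of pulling data out of a page with Beautiful Soup, assuming the `beautifulsoup4` package is installed. The HTML string is a made-up example, and Python's built-in `html.parser` is chosen as the backend so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# A made-up page; in practice this text would come from an HTTP response body.
HTML = """<html>
<head><title>Business Analytics</title></head>
<body>
<h1>Course notes</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""

# Parse the text into a tree of tags that can be searched.
soup = BeautifulSoup(HTML, "html.parser")

title = soup.title.string                  # text inside <title>
heading = soup.find("h1").get_text()       # first <h1> on the page
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]  # every <p>

print(title)
print(heading)
print(paragraphs)
```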
Scrapy
● Scrapy is a Python web scraping framework. It handles the most common tasks involved in web
scraping at scale:
a. Concurrency (requests are handled asynchronously rather than with threads)
b. Crawling (going from link to link)
c. Extracting the data
d. Validating
e. Saving to different formats / databases
