Web Scraping - YL

Uploaded by

rui91seu

Web Data Crawling

Agenda
● What is HTML
● URL and Page Structure: indeed.com as an example
● Hands-On
What is HTTP?
● HTTP: HyperText Transfer Protocol
○ client/server model
○ client (browser, program, curl…) opens a connection and sends a message to a server (Nginx, Apache, ...)
○ server answers with a response and closes the connection
● Example HTTP Request Header
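
The request header itself is not reproduced in the text, so here is a minimal sketch of what a client sends. It only builds the request text and does not open a connection. The host `example.com`, the path `/jobs`, and the `demo-client/0.1` User-Agent are placeholder values chosen for illustration.

```python
def build_get_request(host: str, path: str = "/") -> str:
    """Return the text of a minimal HTTP/1.1 GET request (headers only)."""
    lines = [
        f"GET {path} HTTP/1.1",          # request line: method, path, protocol version
        f"Host: {host}",                 # the Host header is required in HTTP/1.1
        "User-Agent: demo-client/0.1",   # hypothetical client name
        "Accept: text/html",             # the client prefers an HTML response
        "Connection: close",             # ask the server to close after responding
    ]
    # Each header line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    print(build_get_request("example.com", "/jobs"))
```

A browser sends many more headers (cookies, accepted languages, and so on), but the request line plus `Host` is enough for a server to answer.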
What is HTTP?
● Example HTTP Response Header
HTTP codes:
● 2XX for successful requests
● 3XX for redirects
● 4XX for bad requests (the most famous being 404 Not Found)
● 5XX for server errors
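
As a rough illustration of the code families above, the sketch below reads the status line of a response header and maps the code to its family. The sample header text and the function names are made up for this example, and the parser assumes the status line includes a reason phrase.

```python
# A made-up response header, written the way a server would send it.
SAMPLE_RESPONSE_HEAD = (
    "HTTP/1.1 404 Not Found\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_status_line(response_head: str) -> tuple:
    """Split the first header line into (version, code, reason)."""
    status_line = response_head.split("\r\n", 1)[0]
    # Assumes the form "<version> <code> <reason phrase>".
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


def status_class(code: int) -> str:
    """Map a status code to the family named in the list above."""
    families = {
        2: "success",
        3: "redirect",
        4: "bad request",
        5: "server error",
    }
    return families.get(code // 100, "other")


if __name__ == "__main__":
    version, code, reason = parse_status_line(SAMPLE_RESPONSE_HEAD)
    print(code, reason, "->", status_class(code))
```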

When you send this HTTP request from a web browser, the browser
parses the returned HTML, fetches any assets the page references
(JavaScript files, CSS files, images…), and renders the result in
the main window.
What is HTML?

● HTML: HyperText Markup Language

○ a computer language that is used to create documents on the World Wide Web
○ simple and logical
○ a markup language that describes content with <Tags> rather than with programming-language constructs
● Web pages are delivered as plain text files made up of HTML tags.
What is HTML?
Tags
● Tags are instructions that mark up the text shown in your Web browser.
● All tags are in the format <Tags>
● Most tags must be accompanied by a closing tag </Tags> (a few, such as <br> and <img>, have none)
● Elements are made up of two tags (start one and end one) and the
element content.

<title>Business Analytics</title>
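
To make the start tag / content / end tag split concrete, here is a small sketch using Python's standard `html.parser` module. The class and function names are illustrative only.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record each start tag, piece of content, and end tag as it is parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("content", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def element_parts(html: str) -> list:
    """Return the parser events for a snippet of HTML."""
    parser = ElementLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for kind, value in element_parts("<title>Business Analytics</title>"):
        print(kind, value)
```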
Toy Example

● Browsers use HTML tags to decide how to display the document.

○ <html> root element of an HTML page
○ <head> contains elements that describe the document and are not displayed in
the page itself; <title> is one such element
○ <body> is the web page itself
○ <h1> defines a large heading and <p> defines a paragraph
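
A toy page using exactly these tags might look like the one embedded below. The sketch walks it with the standard `html.parser` module and prints the nesting of the elements. The heading and paragraph text are invented for illustration, as are the class and function names.

```python
from html.parser import HTMLParser

# A hypothetical toy page built from the tags listed above.
TOY_PAGE = """<html>
  <head>
    <title>Business Analytics</title>
  </head>
  <body>
    <h1>Web Data Crawling</h1>
    <p>HTML tags tell the browser how to display this text.</p>
  </body>
</html>"""


class OutlineBuilder(HTMLParser):
    """Collect an indented outline of tags and their text content."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1          # children are indented one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1          # leaving the element

    def handle_data(self, data):
        text = data.strip()
        if text:                 # skip whitespace between tags
            self.lines.append("  " * self.depth + repr(text))


def outline(html: str) -> list:
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.lines


if __name__ == "__main__":
    print("\n".join(outline(TOY_PAGE)))
```

The outline shows <title> nested inside <head>, and <h1> and <p> nested inside <body>, which mirrors how a browser sees the page.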
Beautiful Soup

● Beautiful Soup is a Python library for parsing HTML documents
(including ones with malformed markup). Its name comes from the Lewis
Carroll poem in Alice's Adventures in Wonderland and also nods to
"tag soup", a term for poorly structured HTML.
● It helps you pull data out of HTML and XML files.
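
A minimal sketch of pulling data out of a page with Beautiful Soup, assuming the `beautifulsoup4` package is installed. It uses the `html.parser` backend from the standard library so that no extra parser is required. The sample markup, including the `job` class name and the job titles, is made up for this example; it deliberately leaves the last tags unclosed to show that malformed input is still parsed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the final <p>, <body>, and <html> tags are never closed.
SAMPLE_HTML = """<html><head><title>Business Analytics</title></head>
<body>
<h1>Job listings</h1>
<p class="job">Data Analyst</p>
<p class="job">Data Engineer
"""


def extract_title_and_jobs(html: str):
    """Return the page title and the text of every <p class="job"> element."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True)
    jobs = [p.get_text(strip=True) for p in soup.find_all("p", class_="job")]
    return title, jobs


if __name__ == "__main__":
    title, jobs = extract_title_and_jobs(SAMPLE_HTML)
    print(title)
    for job in jobs:
        print("-", job)
```

On a real site the HTML would come from an HTTP response body rather than a string literal, and the tag and class names would have to be read off the actual page structure.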
Scrapy
● Scrapy is a Python web scraping framework. It handles the most common use cases when doing web
scraping at scale:
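a. Concurrency (many requests in flight at once; Scrapy does this with asynchronous I/O built on Twisted rather than with threads)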
b. Crawling (going from link to link)
c. Extracting the data
d. Validating
e. Saving to different formats / databases
