Web Scraping - Unit 1

This document provides a syllabus for a course on web scraping. The course covers topics like introduction to web scraping and its uses, web scraping with Beautiful Soup and Selenium, and APIs. It includes 3 units - introduction to scraping tools and techniques, scraping with Beautiful Soup, and scraping with Selenium. Case studies and software requirements are also mentioned. Key terms related to web scraping and crawling are defined. Differences between server-side and client-side rendering are explained. Direct HTTP requests for scraping client-side rendered websites are also discussed.

WEB SCRAPING

DR.R.GUNAVATHI
ASSOCIATE PROFESSOR
CHRIST UNIVERSITY
MOBILE NO. : 9486354525
EMAIL-ID: [email protected]
Syllabus
Unit 1: Introduction to web scraping Hours: 6
Introduction to Web Scraping, Need and usage of web scraping. Web Scraping and web Crawling. Introduction to Client-Server
architecture, APIs.
Introduction to HTML, XML, XPath, CSS, Lxml, Understanding URLs decipher, Basic HTML tags to inspect web pages.
UNIT 2 Hours: 12
Web Scraping with Beautiful Soup
Regular expressions with Python, Introduction and installation of Beautiful Soup, Scraping HTML contents, Usage of the Requests package and the parse tree. Navigating through tags. Scraping hidden websites. Generating CSV files of scraped data. Scrapy – Introduction, Visualizing scraped data with plots.
Case Study (Any one) –
Scraping a job portal to see which company has the most jobs.
Scraping a stock exchange website and providing insights.
Scraping a customer reports website and providing insights.

UNIT 3 Hours: 12
Web Scraping with Selenium
Introduction and installation of Selenium. Scraping data from dynamic websites. Searching data as per the input. Navigating internal URLs of webpages. Scraping images from websites and maintaining them in folders.

Case Study (Any one) –
Scraping data from an Instagram account and performing basic analysis.
Scraping data from a travel website and performing basic analysis.
Software required
• Python/Anaconda

Useful links for downloading Python:
https://www.python.org/downloads/

Anaconda (Individual Edition):
https://www.anaconda.com/products/individual
Web Scraping (Definition)
• Web scraping refers to the extraction of data from a website.
• This technique mostly focuses on the transformation of unstructured
data (HTML format) on the web into structured data (database or
spreadsheet).
Need and usage of web scraping
• Innovation(create new products)
• Better access to company data
• Marketing automation without limits
• Brand monitoring for everyone
• Data(base) enrichment on demand
• Machine learning and large datasets
• SEO, etc.
Applications
• Scraping stock prices into an app API
• Scraping data from YellowPages to generate leads
• Scraping data from a store locator to create a list of business locations
• Scraping product data from sites like Amazon or eBay for competitor analysis
• Scraping sports stats for betting or fantasy leagues
• Scraping site data before a website migration
• Scraping product details for comparison shopping
• Scraping financial data for market research and insights
• The list of things you can do with web scraping is almost endless. After all, it is all
about what you can do with the data you collect.
Web Scraping vs Web Crawling

Web scraping:
• It is basically extracting data from websites in an automated manner.
• It is automated because it uses bots to scrape the information or content from websites.
• It is a programmatic analysis of a web page to download information from it.
• Data scraping involves locating data and then extracting it. It does not copy and paste but directly fetches the data in a precise and accurate manner.

Web crawling:
• It is basically an internet bot that systematically browses (read: crawls) the World Wide Web, usually for the purpose of web indexing.
• It is used for indexing the information on a page using bots, also known as crawlers.
• It involves looking at a page in its entirety and indexing it, down to the last letter and dot on the page, in the quest for information.
• Web crawlers or bots navigate through heaps of data and information and procure whatever is relevant for the project.
S.No. | Web Scraping | Web Crawling
1. | The tool used is a Web Scraper. | The tool used is a Web Crawler (also called a Spider).
2. | It is used for downloading information. | It is used for indexing web pages.
3. | It need not visit all the pages of a website for information. | It visits each and every page, down to the last line, for information.
4. | A Web Scraper does not obey robots.txt in most cases. | It always obeys robots.txt.
5. | It is done on both small and large scales. | It is mostly employed at large scale.
6. | Application areas include retail marketing, equity research and machine learning. | It is used in search engines to give search results to the user.
7. | Data de-duplication is not necessarily a part of web scraping. | Data de-duplication is an integral part of web crawling.
8. | It needs a crawl agent and a parser for parsing the response. | It needs only a crawl agent.
9. | ProWebScraper and Web Scraper.io are examples. | Google, Yahoo and Bing do web crawling.
CLIENT SERVER ARCHITECTURE - TERMINOLOGIES
• Scraping has become a common method of data collection.
• A common approach is scraping by traversing the HTML DOM using automated tools such as Chrome extensions.

1. Server Side Rendering — the ability of an application to display the web page by rendering it on the server instead of rendering it in the browser.
2. Client Side Rendering — a technique for rendering content in the browser using JavaScript.
3. Single Page Application — a web application or website that interacts with the web browser by dynamically rewriting the current web page with new data from the web server, instead of the browser's default method of loading entire new pages.
4. HTTP endpoint — a connection point where data, HTML files, or active server pages are exposed.
5. HTML DOM (Document Object Model) — a programming API for HTML and XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated.
SERVER SIDE VS CLIENT SIDE RENDERING

Server side rendering: When you open a URL link, your browser sends a request to the server requesting the website. The server responds by giving back the full HTML markup of the website, which then appears in the browser.

Client side rendering: Single Page Applications (SPA) are websites that utilize client side rendering to serve dynamic content.

Hydrating and Populating:
✔ With an SPA and client side rendering, when our browser accesses a URL link, what the server returns and what we initially receive is an empty shell HTML file with no content.
✔ However, a JavaScript file is included in the HTML that will hydrate and populate the empty HTML with content.
✔ This process of hydrating and populating is done by JavaScript running in the browser.
Direct HTTP Request

1. Scraping becomes faster since we make a direct HTTP request and never need to load the full web page.
2. The data we receive is richer, since servers often provide more data than what the website shows.
3. It is as simple as making an HTTP request. In fact, after identifying and obtaining the link, we can open that link in a new tab and see the results (often in JSON form) in their bare form.
4. It is suitable for client side rendered websites, SPAs, or websites that incorporate infinite loading, especially those that heavily rely on a REST API or GraphQL.
5. We obtain structured data from the HTTP response that we can instantly download to a file. This saves a lot of time compared to matching and mapping selectors to each HTML DOM element (I personally hated this process).
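
As a concrete illustration, here is a minimal sketch of such a direct request. The endpoint URL below is hypothetical; in practice you find the real one in the browser's developer tools (Network tab, filtered to XHR/Fetch) while the page loads.

import requests

# Hypothetical JSON endpoint discovered in the browser's Network tab
url = "https://example.com/api/products?page=1"
headers = {"User-Agent": "Mozilla/5.0"}  # some servers reject requests without one

response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on 4xx/5xx errors
data = response.json()       # structured data, no HTML parsing needed
print(data)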
WARNINGS
1. We must make sure that the websites we are trying to scrape in fact use the client side rendering method of populating data.
2. Identifying the HTTP endpoint requires some trial and error, and it is often not a universal solution since each website will have a different implementation.
3. It requires some workarounds to make an HTTP request that gets past protection measures such as Cross-Origin Resource Sharing, Same-Origin policies, cookies, and authentication tokens.
4. When the website updates its endpoint, we will also need to update our scraper algorithm and accessible endpoints.
API - What is an API?
• An API, or Application Programming Interface, is a server that you can use to retrieve and send data to using code.
• APIs are most commonly used to retrieve data.

import requests

There are many different types of requests. The most commonly used one, a GET request, is used to retrieve data.
We'll use the requests.get() function, which requires one argument: the URL we want to make the request to.
API – Does not exist

import requests
response = requests.get("http://api.open-notify.org/this-api-doesnt-exist")
print('response is:', response.status_code)

Output:
response is: 404
API Status Codes
200: Everything went okay, and the result has been returned (if any).
301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
400: The server thinks you made a bad request. This can happen when you don't send along the right data, among other things.
401: The server thinks you're not authenticated. Many APIs require login credentials, so this happens when you don't send the right credentials to access an API.
403: The resource you're trying to access is forbidden: you don't have the right permissions to see it.
404: The resource you tried to access wasn't found on the server.
503: The server is not ready to handle the request.
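
To make the status codes concrete, here is a short sketch that checks the code before using the response. It uses the same public Open Notify API as the 404 example above; its /astros.json endpoint returns the people currently in space.

import requests

response = requests.get("http://api.open-notify.org/astros.json")

if response.status_code == 200:
    data = response.json()  # parse the JSON body
    print("People in space:", data["number"])
else:
    print("Request failed with status:", response.status_code)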
Basic HTML Tags

Tag          Description
<!DOCTYPE>   Defines the document type
<html>       Defines an HTML document
<head>       Contains metadata/information for the document
<title>      Defines a title for the document
<body>       Defines the document's body
<h1> to <h6> Defines HTML headings
<p>          Defines a paragraph
<br>          Inserts a single line break
<hr>          Defines a thematic change in the content
<!--...-->   Defines a comment

Meta Info Tags

Tag          Description
<head>       Defines information about the document
<meta>       Defines metadata about an HTML document
<base>       Specifies the base URL/target for all relative URLs in a document
<basefont>   Not supported in HTML5; use CSS instead. Specifies a default color, size, and font for all text in a document

Reference: https://www.w3schools.com/tags/ref_byfunc.asp
Components of a Webpage
(Basic syntax of any web page)

<!DOCTYPE html>
<html markdown="1">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
</head>
<body>
<h1 class="heading"> My first Web Scraping with Beautiful Soup </h1>
<p>Let's scrape the website using Python. </p>
</body>
</html>

Understanding and Inspecting the Data (right-click and select Inspect on a website)
Steps for Scraping Any Website

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
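
As a minimal sketch of these six steps end to end (the URL and selector below are illustrative, not from a real exercise):

import csv
import requests
from bs4 import BeautifulSoup

# Steps 1-2: choose the URL and inspect it in the browser first
page = requests.get("https://example.com/products")  # hypothetical URL

# Steps 3-4: locate the data you want and write the extraction code
soup = BeautifulSoup(page.content, "html.parser")
rows = [{"heading": h.text.strip()} for h in soup.select("h2")]

# Steps 5-6: run the code and store the data in the required format (CSV here)
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["heading"])
    writer.writeheader()
    writer.writerows(rows)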
Libraries required for web scraping

Urllib/Urllib2/Urllib3: a Python module which can be used for fetching URLs.
• It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.).
BeautifulSoup: a tool for pulling information out of a webpage.
• It is used to extract tables, lists and paragraphs, and you can also apply filters to extract information from web pages.
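
For instance, a one-line fetch with urllib from the standard library might look like this (the page is then typically handed to BeautifulSoup for parsing):

from urllib.request import urlopen

with urlopen("https://www.python.org") as response:
    html_bytes = response.read()  # raw bytes of the page
print(html_bytes[:200])           # first 200 bytes, just to verify the fetch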
Other Libraries
• mechanize
• scrapemark
• scrapy
Importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

Reference: https://data-flair.training/blogs/python-libraries/
PYTHON CODING

# Basic scraping of web pages (HTML)
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-1.php")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title text of page
page_title = soup.title.text
page_title

# Extract the <title> tag itself
title = soup.title
title

# Extract first <h1>(...)</h1> text
first_h1 = soup.select('h1')[0].text
first_h1

# Create all_h1_tags as empty list
all_h1_tags = []

# Set all_h1_tags to all h1 tags of the soup
for element in soup.select('h1'):
    all_h1_tags.append(element.text)
print(all_h1_tags)
Image file

# Extracting all images from the page
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create image_data as empty list
image_data = []

# Extract src and alt of every image and store them in image_data
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    image_data.append({"src": src, "alt": alt})
print(image_data)
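
A possible follow-up, in the spirit of the Unit 3 topic of maintaining scraped images in folders, is to download each image. This is a sketch assuming image_data holds absolute URLs:

import os
import requests

os.makedirs("images", exist_ok=True)  # create the target folder if missing

for i, item in enumerate(image_data):
    src = item["src"]
    if not src or not src.startswith("http"):
        continue  # this simple sketch skips missing or relative URLs
    response = requests.get(src)
    if response.status_code == 200:
        with open(os.path.join("images", f"image_{i}.jpg"), "wb") as f:
            f.write(response.content)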
Without Coding, How to Scrape the Data?

• https://www.octoparse.com/tutorial/web-scraping-case-study-scraping-product-information-from-jabongcom
• https://nocodewebscraping.com/web-scraping-for-dummies-tutorial-with-import-io-without-coding/

Useful links:
https://www.octoparse.com/blog/9-free-web-scrapers-that-you-cannot-miss
XML
• XML stands for eXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to store and transport data
• XML was designed to be self-descriptive
• XML is a W3C Recommendation

To install XML Notepad:
https://xml-notepad.en.softonic.com/
XML Does Not DO Anything

Code:
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

Output:
Note
To: Tove
From: Jani
Reminder
Don't forget me this weekend!
What is XPath?

• XPath is a major element in the XSLT standard.
• XPath can be used to navigate through elements and attributes in an XML document.

XPath Example
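
The original slides show the example as images; a minimal sketch of the same idea using Python's lxml, with a made-up bookstore document, is:

from lxml import etree

xml = """<bookstore>
  <book><title>Everyday Italian</title><price>30.00</price></book>
  <book><title>Harry Potter</title><price>29.99</price></book>
</bookstore>"""

root = etree.fromstring(xml)

# Select the text of every <title> element anywhere in the document
print(root.xpath("//title/text()"))  # ['Everyday Italian', 'Harry Potter']

# Select titles of books whose price is greater than 29.99
print(root.xpath("//book[price>29.99]/title/text()"))  # ['Everyday Italian']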
What is lxml in Python?

• lxml is one of the fastest and most feature-rich libraries for processing XML and HTML in Python.
• Using the Python lxml library, XML and HTML documents can be created, parsed (read into a tree of elements) and queried.
URLs decipher – Splitting the URL
http://www.youtube.com/watch?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5
Protocol – http://
Subdomain – www.
Domain – youtube.com/
Path to file – watch
Query string – ?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5

So we look for patterns. In the query string you'll notice a question mark (?) at the beginning and the regular occurrence of ampersands (&) and equals signs (=) within. Let's split it up with that in mind.

?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5

? – Start
Says "hey, from here on is a query string"

Name/Value Pairs
v=DBJ7mBxi8LM
list=FLjtfXY-PoLnBtJTrAVIHR3A
index=5
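
The same split can be done in code with the standard library's urllib.parse, for example:

from urllib.parse import urlparse, parse_qs

url = "http://www.youtube.com/watch?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5"
parts = urlparse(url)

print(parts.scheme)  # http (protocol)
print(parts.netloc)  # www.youtube.com (subdomain + domain)
print(parts.path)    # /watch (path to file)
print(parts.query)   # v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5

# Split the query string into its name/value pairs
print(parse_qs(parts.query))
# {'v': ['DBJ7mBxi8LM'], 'list': ['FLjtfXY-PoLnBtJTrAVIHR3A'], 'index': ['5']}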
lxml
• lxml is a Python library which allows for easy handling of XML
and HTML files, and can also be used for web scraping.
• lxml provides a very simple and powerful API for parsing XML
and HTML.
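
As a small illustration of that API, here is a sketch that parses an HTML fragment with lxml and queries it with XPath (the fragment mirrors the sample page shown earlier):

from lxml import html

page_source = """
<html><body>
  <h1 class="heading">My first Web Scraping with Beautiful Soup</h1>
  <p>Let's scrape the website using Python.</p>
</body></html>
"""

tree = html.fromstring(page_source)
print(tree.xpath("//h1/text()"))  # ['My first Web Scraping with Beautiful Soup']
print(tree.xpath("//p/text()"))   # ["Let's scrape the website using Python."]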
Basic HTML tags to inspect web pages

Go to the URL, right-click, and select the Inspect option.