Web Scraping - Unit 1
DR.R.GUNAVATHI
ASSOCIATE PROFESSOR
CHRIST UNIVERSITY
MOBILE NO. : 9486354525
EMAIL-ID: [email protected]
Syllabus
Unit 1: Introduction to web scraping Hours: 6
Introduction to Web Scraping, Need and usage of web scraping. Web Scraping and web Crawling. Introduction to Client-Server
architecture, APIs.
Introduction to HTML, XML, XPath, CSS, Lxml, Understanding URLs decipher, Basic HTML tags to inspect web pages.
UNIT 2 Hours: 12
Web Scraping with Beautiful Soup
Regular expressions with Python, Introduction and Installation of Beautiful Soup, Scraping HTML contents, Usage of the Requests package and the Parse Tree. Navigating through Tags. Scraping hidden websites. Generating CSV of scraped data. Scrapy – Introduction, Visualizing scraped data with plots.
Case Study (Any one) –
Scraping a job portal and see which company has more jobs.
Scrape a stock exchange website and provide insights.
Scraping Customer Reports website and provide insights.
UNIT 3 Hours: 12
Web Scraping with Selenium
Introduction and Installation of Selenium. Scraping the data from dynamic website. Searching data as per the input. Navigating in
internal URLs of Webpages. Scraping images from website and maintain in folders.
Case Study (Any one) –
Scraping data from Instagram account and perform basic analysis.
Scraping data from a travel website and perform basic analysis.
Software required
• Python/Anaconda
Anaconda(Individual Edition)
https://www.anaconda.com/products/individual
Web Scraping(Definition)
• Web scraping refers to the extraction of data from a website.
• This technique mostly focuses on the transformation of unstructured
data (HTML format) on the web into structured data (database or
spreadsheet).
Need and usage of web scraping
• Innovation(create new products)
• Better access to company data
• Marketing automation without limits
• Brand monitoring for everyone
• Data(base) enrichment on demand
• Machine learning and large datasets
• SEO, etc
Applications
• Scraping stock prices into an app API
• Scraping data from YellowPages to generate leads
• Scraping data from a store locator to create a list of business locations
• Scraping product data from sites like Amazon or eBay for competitor analysis
• Scraping sports stats for betting or fantasy leagues
• Scraping site data before a website migration
• Scraping product details for comparison shopping
• Scraping financial data for market research and insights
• The list of things you can do with web scraping is almost endless. After all, it is all about what you can do with the data you collect.
Web Scraping vs Web Crawling
Web scraping: It is basically extracting data from websites in an automated manner.
Web crawling: It is basically an internet bot that systematically browses (crawls) the World Wide Web, usually for the purpose of web indexing.

Web scraping: It is automated because it uses bots to scrape the information or content from websites.
Web crawling: It is used for indexing the information on the page using bots, also known as crawlers.

Web scraping: It is a programmatic analysis of a web page to download information from it.
Web crawling: It involves looking at a page in its entirety and indexing it, down to the last letter and dot on the page, in the quest for information.

Web scraping: Data scraping involves locating data and then extracting it. It does not copy and paste but directly fetches the data in a precise and accurate manner.
Web crawling: Web crawlers or bots navigate through heaps of data and information and procure whatever is relevant for your project.

Web scraping: It need not visit all the pages of a website for information.
Web crawling: It visits each and every page, down to the last line, for information.

Web scraping: ProWebScraper and Web Scraper.io are examples of scrapers.
Web crawling: Google, Yahoo, and Bing do web crawling.
CLIENT SERVER ARCHITECTURE-TERMINOLOGIES
• Scraping has become a common method of data collection
• Scraping is often done by traversing the HTML DOM using automated tools such as Chrome extensions
1.Server Side Rendering — is the ability of an application to contribute by displaying the web-page on
the server instead of rendering it in the browser.
2.Client Side Rendering — a technique for rendering content in the browser using JavaScript.
3.Single Page Application — is a web application or website that interacts with the web browser by
dynamically rewriting the current web page with new data from the web server, instead of the default
method of the browser loading entire new pages.
4.HTTP endpoints — endpoint is a connection point where data, HTML files, or active server pages are
exposed.
5.HTML DOM (Document Object Model) — The Document Object Model (DOM) is a programming API for
HTML and XML documents. It defines the logical structure of documents and the way a document is
accessed and manipulated.
SERVER SIDE VS CLIENT SIDE RENDERING
Server side rendering: When you open a URL, your browser sends a request to the server asking for the website. The server responds with the full HTML markup of the website, which then appears in the browser.

import requests
response = requests.get("http://api.open-notify.org/this-api-doesnt-exist")
print('response is:', response.status_code)

Output:
response is: 404
API Status Codes
200: Everything went okay, and the result has been returned (if any).
301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
400: The server thinks you made a bad request. This can happen when you don't send along the right data, among other things.
401: The server thinks you're not authenticated. Many APIs require login credentials, so this happens when you don't send the right credentials to access an API.
403: The resource you're trying to access is forbidden: you don't have the right permissions to see it.
404: The resource you tried to access wasn't found on the server.
503: The server is not ready to handle the request.
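The status codes above can be handled in code rather than read off by eye. A minimal sketch (the short explanations are paraphrased from the list above; in real use you would pass `response.status_code` from a `requests` response instead of a plain integer):

```python
# Map the common API status codes above to short explanations.
# Plain integers are used here so the sketch runs offline.
STATUS_MEANINGS = {
    200: "OK - the result has been returned",
    301: "Moved - the server is redirecting you to a different endpoint",
    400: "Bad request - the data you sent was not right",
    401: "Unauthorized - missing or wrong login credentials",
    403: "Forbidden - you lack permission to see this resource",
    404: "Not found - the resource does not exist on the server",
    503: "Unavailable - the server is not ready to handle the request",
}

def describe_status(code):
    """Return a human-readable note for an HTTP status code."""
    return STATUS_MEANINGS.get(code, "Unrecognized status code")

print(describe_status(404))
```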
Meta Info
<!DOCTYPE html>
<html markdown="1">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
</head>
<body>
<h1 class="heading"> My first Web Scraping with Beautiful soup </h1>
<p>Let's scrape the website using python. </p>
</body>
</html>
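The snippet above can be parsed directly with Beautiful Soup (covered in Unit 2). A minimal sketch, holding the HTML as a string so no network request is needed (requires the `bs4` package to be installed):

```python
from bs4 import BeautifulSoup

# The HTML snippet from the slide above, as an in-memory string.
html_doc = """
<!DOCTYPE html>
<html>
<head><meta charset="utf-8" /></head>
<body>
<h1 class="heading">My first Web Scraping with Beautiful soup</h1>
<p>Let's scrape the website using python.</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# Find the <h1> by its class attribute, then read its text.
heading = soup.find("h1", class_="heading").text
paragraph = soup.p.text
print(heading)
print(paragraph)
```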
Understanding and Inspecting the Data(Right click and select inspect in a website)
Steps for Scraping Any Website
Reference: https://data-flair.training/blogs/python-libraries/
PYTHON CODING
# Basic scraping of web pages (HTML)
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-1.php")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title.text
page_title

# Extract Title
title = soup.title
title

# Extract first <h1>(...)</h1> text
first_h1 = soup.select('h1')[0].text
first_h1

# Create all_h1_tags as empty list
all_h1_tags = []

# Set all_h1_tags to all h1 tags of the soup
for element in soup.select('h1'):
    all_h1_tags.append(element.text)
print(all_h1_tags)
Image file
# Extracting all images from the page
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create image_data as empty list
image_data = []

# Extract each image's src and alt and store them in image_data
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    image_data.append({"src": src, "alt": alt})
print(image_data)
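The syllabus for Unit 2 also mentions generating a CSV of scraped data. A minimal sketch of writing records like `image_data` to CSV with the standard library (the `src`/`alt` values below are made up for illustration):

```python
import csv

# Sample records in the same shape the scraping loop above produces;
# these src/alt values are illustrative, not scraped.
image_data = [
    {"src": "/images/logo.png", "alt": "Site logo"},
    {"src": "/images/banner.jpg", "alt": "Banner"},
]

with open("images.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["src", "alt"])
    writer.writeheader()          # column names on the first row
    writer.writerows(image_data)  # one row per scraped record

# Read it back to confirm what was written.
with open("images.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows)
```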
Without Coding, How to Scrape the Data?
• https://www.octoparse.com/tutorial/web-scraping-case-study-scraping-product-information-from-jabongcom
• https://nocodewebscraping.com/web-scraping-for-dummies-tutorial-with-import-io-without-coding/#:~:text=The%20first%20step%20is%20to,APIs%20or%20crawl%20entire%20websites
Useful Links:
https://www.octoparse.com/blog/9-free-web-scrapers-that-you-cannot-miss
XML
• XML stands for eXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to store and transport data
• XML was designed to be self-descriptive
• XML is a W3C Recommendation
Code:
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

Output:
Note
To: Tove
From: Jani
Reminder
Don't forget me this weekend!
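The `<note>` document above can be parsed from Python with the standard library's `xml.etree.ElementTree`:

```python
import xml.etree.ElementTree as ET

# The <note> document from the slide, as an in-memory string.
xml_doc = """
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""

root = ET.fromstring(xml_doc)
print(root.tag)  # note
# Walk the child elements and print tag/text pairs.
for child in root:
    print(child.tag, "->", child.text)
```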
What is XPath?
• XPath (XML Path Language) is a query language for selecting nodes from an XML or HTML document; scraping libraries such as lxml use XPath expressions to address elements.
Understanding URLs
So we look for patterns. In the query string you'll notice a question mark (?) at the beginning and the regular occurrence of ampersands (&) and equal signs (=) within. Let's split it up with that in mind.
?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5
? – Start
Says “hey, from here on is a query string”
Name/Value Pairs
v=DBJ7mBxi8LM
list=FLjtfXY-PoLnBtJTrAVIHR3A
index=5
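The same splitting can be done programmatically with the standard library's `urllib.parse` (the YouTube-style URL is the one from the example above):

```python
from urllib.parse import urlparse, parse_qs

# parse_qs splits the query string into name/value pairs,
# exactly as done by hand above.
url = "https://www.youtube.com/watch?v=DBJ7mBxi8LM&list=FLjtfXY-PoLnBtJTrAVIHR3A&index=5"
parsed = urlparse(url)
params = parse_qs(parsed.query)
print(params["v"])      # ['DBJ7mBxi8LM']
print(params["index"])  # ['5']
```

Note that `parse_qs` returns a list for each name, since a query string may repeat the same name more than once.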
lxml
• lxml is a Python library which allows for easy handling of XML
and HTML files, and can also be used for web scraping.
• lxml provides a very simple and powerful API for parsing XML
and HTML.
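lxml elements expose full XPath 1.0 through their `xpath()` method. So that this sketch runs without installing lxml, it uses the standard library's ElementTree, whose `findall()` accepts a limited subset of XPath (the `<catalog>` document is made up for illustration):

```python
import xml.etree.ElementTree as ET

# A small example document; genres and titles are illustrative.
doc = ET.fromstring("""
<catalog>
  <book genre="fiction"><title>Book A</title></book>
  <book genre="fiction"><title>Book B</title></book>
  <book genre="science"><title>Book C</title></book>
</catalog>
""")

# .//book selects every <book> element anywhere under the root.
all_books = doc.findall(".//book")
# A predicate filters by attribute value.
fiction = doc.findall(".//book[@genre='fiction']")
print(len(all_books), len(fiction))  # 3 2
```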
Basic HTML tags to inspect web pages:
• <html>, <head>, <body> – document structure
• <h1> … <h6> – headings
• <p> – paragraph
• <a href=""> – hyperlink
• <img src=""> – image
• <div>, <span> – generic containers
• <table>, <tr>, <td> – tables