Practical 7 IR

The document contains a Python script for a web crawler that fetches HTML content, saves the robots.txt file, and extracts links from web pages. It includes functions for handling HTTP requests, parsing HTML, and respecting robots.txt rules. The script is designed to crawl a specified URL up to a maximum depth while implementing a delay between requests.

Practical 7

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def get_html(url):
    # Send a GET request with a browser-like User-Agent and return the page HTML.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None

def save_robots_txt(url):
    # Download the site's robots.txt and save it locally.
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    # Read the previously saved robots.txt, or return None if it does not exist.
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None

def extract_links(html, base_url):
    # Parse the HTML and return all anchor hrefs as absolute URLs.
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a', href=True):
        absolute_url = urljoin(base_url, link.get('href'))
        links.append(absolute_url)
    return links

def is_allowed_by_robots(url, robots_content):
    # Check the URL against the robots.txt rules; allow everything if no rules are available.
    if not robots_content:
        return True
    parser = RobotFileParser()
    parser.parse(robots_content.split('\n'))
    return parser.can_fetch('*', url)

def crawl(start_url, max_depth=3, delay=1):
    visited_urls = set()

    def recursive_crawl(url, depth, robots_content):
        # Stop at the depth limit, on revisits, or when robots.txt disallows the URL.
        if depth > max_depth or url in visited_urls or not is_allowed_by_robots(url, robots_content):
            return
        visited_urls.add(url)
        time.sleep(delay)  # politeness delay between requests
        html = get_html(url)
        if html:
            print(f"Crawling {url}")
            links = extract_links(html, url)
            for link in links:
                recursive_crawl(link, depth + 1, robots_content)

    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
    recursive_crawl(start_url, 1, robots_content)
    print("Performed by Raj")

crawl("https://wikipedia.com", max_depth=2, delay=2)

Output:

robots.txt is generated after running the program.
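As a side note, the standard-library RobotFileParser can also fetch and parse robots.txt directly, instead of going through the locally saved file. A minimal sketch, assuming the same start URL used above:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url(urljoin("https://wikipedia.com", "/robots.txt"))  # point the parser at the site's robots.txt
parser.read()                                                    # download and parse it in one step
print(parser.can_fetch("*", "https://wikipedia.com"))            # True if the generic user agent may fetch the URL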
