Practical 7 IR

The document contains a Python script for a web crawler that fetches HTML content, saves the robots.txt file, and extracts links from web pages. It includes functions for handling HTTP requests, parsing HTML, and respecting robots.txt rules. The script is designed to crawl a specified URL up to a maximum depth while implementing a delay between requests.

Practical 7

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def get_html(url):
    # Send a GET request with a browser-like User-Agent and return the page HTML.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None

def save_robots_txt(url):
    # Download the site's robots.txt and save it locally.
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    # Read the previously saved robots.txt, or return None if it does not exist.
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None

def extract_links(html, base_url):
    # Parse the HTML and return all anchor hrefs as absolute URLs.
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a', href=True):
        absolute_url = urljoin(base_url, link.get('href'))
        links.append(absolute_url)
    return links

def is_allowed_by_robots(url, robots_content):
    # Check the URL against the robots.txt rules; allow everything if no rules are available.
    if not robots_content:
        return True
    parser = RobotFileParser()
    parser.parse(robots_content.split('\n'))
    return parser.can_fetch('*', url)

def crawl(start_url, max_depth=3, delay=1):
    visited_urls = set()

    def recursive_crawl(url, depth, robots_content):
        # Stop at the depth limit, on revisits, or when robots.txt disallows the URL.
        if depth > max_depth or url in visited_urls or not is_allowed_by_robots(url, robots_content):
            return
        visited_urls.add(url)
        time.sleep(delay)  # politeness delay between requests
        html = get_html(url)
        if html:
            print(f"Crawling {url}")
            links = extract_links(html, url)
            for link in links:
                recursive_crawl(link, depth + 1, robots_content)

    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
    recursive_crawl(start_url, 1, robots_content)
    print("Performed by Raj")

crawl("https://wikipedia.com", max_depth=2, delay=2)

Output:

robots.txt is generated after running the program.
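As a side note, the standard-library RobotFileParser can also fetch and parse robots.txt directly, instead of going through the locally saved file. A minimal sketch, assuming the same start URL used above:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url(urljoin("https://wikipedia.com", "/robots.txt"))  # point the parser at the site's robots.txt
parser.read()                                                    # download and parse it in one step
print(parser.can_fetch("*", "https://wikipedia.com"))            # True if the generic user agent may fetch the URL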
