Citl Exp 8

Uploaded by

varsha bojja


NAME: LAXMI BODKE - 2022300011

VARSHA BOJJA - 2022300012

RONIT CHINTRATE - 2023301004
ADWAIT SHESH - 2023301016

Problem Statement: Demonstrate the behavior of Web Crawlers/spiders (use XPATH, CSS PATH),
extract information and store it in the database.
Theory: 1. Introduction to Web Crawling

Web Crawlers/Spiders:
• Definition: Web crawlers, also known as spiders or bots, are
automated programs that browse the internet systematically to
index content and gather information.
• Functionality: They navigate through web pages by following
hyperlinks and retrieving content, which can be stored and
processed for various applications, such as search engines, data
mining, and analytics.
• Use Cases: Common uses include indexing for search engines
(like Google), gathering data for research, monitoring changes in
websites, and scraping data for analysis.
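The link-following behavior described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the crawler used in this experiment: the `FAKE_SITE` dictionary, the `example.test` URLs, and the `crawl` and `LinkExtractor` names are hypothetical stand-ins so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": "<p>No links here.</p>",
}


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: visit each reachable URL once, in discovery order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


order = crawl("http://example.test/", FAKE_SITE.get)
print(order)
```

The `seen` set is what keeps the crawler from looping when pages link back to each other. A real crawler would replace `FAKE_SITE.get` with a function that performs an HTTP request.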
2. Web Scraping Techniques
Data Extraction:
• Web scraping involves extracting data from web pages. This can
be achieved using various techniques, with two common methods
being XPATH and CSS selectors.
XPATH:
• Definition: XPATH is a query language used to select nodes from
an XML document. It can also be used to navigate HTML
documents.
• Syntax: XPATH uses a path-like syntax to specify the location of
elements in a document. For example, //div[@class='example']
selects all <div> elements with a class of "example".
• Advantages: XPATH is powerful for complex queries and allows
for precise element selection, including attributes and text content.
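As a small illustration of the //div[@class='example'] query above, the following sketch uses Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPATH and requires well-formed markup. The sample document `DOC` is a made-up assumption; full XPATH engines are available in third-party libraries.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; ElementTree needs valid XML.
DOC = """
<html>
  <body>
    <div class="example">first</div>
    <div class="other">skipped</div>
    <section>
      <div class="example">second</div>
    </section>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# ".//div[@class='example']" is the ElementTree spelling of
# //div[@class='example']: every <div> at any depth whose class
# attribute equals "example".
matches = root.findall(".//div[@class='example']")
texts = [div.text for div in matches]
print(texts)
```

Note that the attribute test is an exact string comparison, so a tag written as class="example big" would not be selected by this query.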
CSS Selectors:
• Definition: CSS selectors are used to select elements in HTML
based on their attributes, types, classes, and IDs.
• Syntax: For example, .example selects all elements with the class
"example", and #uniqueID selects the element with the ID
"uniqueID".

• Advantages: CSS selectors are generally easier to use and
understand, making them suitable for straightforward data
extraction tasks.
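To make concrete what the .example and #uniqueID selectors match, here is a rough sketch that hand-rolls only those two selector forms with the standard-library html.parser. The `SimpleSelector` class, the `select_text` helper, and the sample `HTML` are hypothetical; in practice a library such as BeautifulSoup (whose `select` method is used in the code below) handles the full selector syntax.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collects the text right after the opening tag of each element
    matching a '.class' or '#id' selector (no other syntax is supported)."""

    def __init__(self, selector):
        super().__init__()
        self.kind, self.name = selector[0], selector[1:]
        self.capture = False  # True just after a matching start tag
        self.results = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.kind == ".":
            # A class attribute may hold several space-separated names.
            return self.name in (attrs.get("class") or "").split()
        if self.kind == "#":
            return attrs.get("id") == self.name
        return False

    def handle_starttag(self, tag, attrs):
        self.capture = self._matches(attrs)

    def handle_data(self, data):
        if self.capture and data.strip():
            self.results.append(data.strip())
            self.capture = False


# Hypothetical sample page.
HTML = """
<div id="uniqueID">Only one</div>
<p class="example">First</p>
<p class="example highlight">Second</p>
<p class="other">Ignored</p>
"""


def select_text(selector, html):
    parser = SimpleSelector(selector)
    parser.feed(html)
    return parser.results


print(select_text(".example", HTML))
print(select_text("#uniqueID", HTML))
```

Unlike an exact attribute comparison, a class selector matches any element that lists the class among its names, which is why the element with class="example highlight" is included.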

Code: • euler.py:
import requests
from bs4 import BeautifulSoup
import sqlite3
import matplotlib.pyplot as plt
import os

print("Current working directory:", os.getcwd())

# Set up the database
conn = sqlite3.connect('newProjectEuler.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS problems (id INTEGER PRIMARY KEY, title TEXT, solved_count INTEGER)')

# Iterate through all pages
all_problems = []
for page in range(1, 20):
    url = f'https://projecteuler.net/archives;page={page}'
    print(f"Fetching data from: {url}")
    response = requests.get(url)

    # Check for a successful response
    if response.status_code != 200:
        print(f"Failed to retrieve data from {url}, status code: {response.status_code}")
        continue
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract information from the current page
    page_problems = []
    for row in soup.select('tr'):
        id_column = row.select_one('td.id_column')
        title_column = row.select_one('td:nth-of-type(2) a')
        solved_count_column = row.select_one('td:nth-of-type(3) div.center')

        # Skip header rows and any row missing one of the three cells
        if id_column and title_column and solved_count_column:
            problem_id = int(id_column.text.strip())
            title = title_column.text.strip()
            solved_count = int(solved_count_column.text.strip().replace(',', ''))
            page_problems.append((problem_id, title, solved_count))

    # Append the current page's problems to the total list
    all_problems.extend(page_problems)

    # Insert the extracted data into the database
    c.executemany('INSERT OR IGNORE INTO problems (id, title, solved_count) VALUES (?, ?, ?)', page_problems)
    conn.commit()

# Print the total number of problems extracted
print(f"Total problems extracted: {len(all_problems)}")
print(all_problems)

# Query the data for plotting
c.execute('SELECT id, solved_count FROM problems')
data = c.fetchall()

# Prepare data for plotting
if data:
    ids, solved_counts = zip(*data)

    # Plotting the data
    plt.scatter(ids, solved_counts)
    plt.xscale('linear')
    plt.yscale('log')  # Use a log scale for the y-axis
    plt.xlabel('Problem ID')
    plt.ylabel('Number of Solved Users (Log Scale)')
    plt.title('Number of Users Solved Problems on Project Euler')
    plt.grid(False)
    plt.show()
else:
    print("No data available for plotting.")

# Find the problems solved the most and least
if data:
    most_solved = max(data, key=lambda x: x[1])
    least_solved = min(data, key=lambda x: x[1])
    print(f"Problem with ID {most_solved[0]} has been solved the most with {most_solved[1]} solutions.")
    print(f"Problem with ID {least_solved[0]} has been solved the least with {least_solved[1]} solutions.")

# Close the database connection
conn.close()
print("Data extraction and storage completed.")

• newProjectEuler.py:
import sqlite3

# Connect to the database
conn = sqlite3.connect('newProjectEuler.db')
c = conn.cursor()

# Fetch all rows from the 'problems' table
c.execute('SELECT * FROM problems')
rows = c.fetchall()

# Check if there is any data and print it
if rows:
    print("Data Stored in Database:")
    for row in rows:
        print(row)
else:
    print("No data found in Database.")

# Close the database connection
conn.close()
Output: Webpage: https://projecteuler.net/archives

CSS Selector:
<tr><td class="id_column">1</td><td><a href="problem=1"
title="Published on Friday, 5th October 2001, 06:00 pm">Multiples of 3
or 5</a></td><td><div class="center">1010203</div></td></tr>
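Since the problem statement also mentions XPATH, the sample row above can be addressed with path expressions as well. The sketch below is an illustration under assumptions, not part of the submitted scripts: it parses the row with the standard-library xml.etree.ElementTree (a limited XPATH subset, which works here only because this row happens to be well-formed) and stores the result in an in-memory SQLite database rather than in newProjectEuler.db.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The table row shown above, joined onto one logical string.
ROW = (
    '<tr><td class="id_column">1</td>'
    '<td><a href="problem=1" title="Published on Friday, 5th October 2001, '
    '06:00 pm">Multiples of 3 or 5</a></td>'
    '<td><div class="center">1010203</div></td></tr>'
)

row = ET.fromstring(ROW)

# Approximate path counterparts of the CSS selectors used in euler.py:
#   td.id_column                 -> td[@class='id_column']
#   td:nth-of-type(2) a          -> td[2]/a
#   td:nth-of-type(3) div.center -> td[3]/div[@class='center']
problem_id = int(row.find("td[@class='id_column']").text)
title = row.find("td[2]/a").text
solved_count = int(row.find("td[3]/div[@class='center']").text)

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE problems (id INTEGER PRIMARY KEY, title TEXT, solved_count INTEGER)"
)
record = (problem_id, title, solved_count)

# Inserting twice shows that INSERT OR IGNORE skips a duplicate primary key.
for _ in range(2):
    conn.execute(
        "INSERT OR IGNORE INTO problems (id, title, solved_count) VALUES (?, ?, ?)",
        record,
    )

stored = conn.execute("SELECT id, title, solved_count FROM problems").fetchall()
conn.close()
print(stored)
```

Because id is the primary key, re-running the crawler against the same pages does not create duplicate rows; it also means an updated solved count is ignored rather than overwritten.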

Database Used - SQLite3

Visualization – Python-Matplotlib
• Scatter Plot
Conclusion: By completing this experiment, we learned how to demonstrate the behavior
of Web Crawlers/spiders (using XPATH, CSS PATH), extract information from
web pages, and store it in a database.
