Synopsis WS

This document provides a synopsis report on a project to develop a web scraper using Python that can extract both text and images from websites. It was submitted by three students to fulfill requirements for their B.Tech degree in Information Technology. The report introduces web scraping and outlines the proposed work to build scrapers to collect product reviews from e-commerce sites and images for specified keywords from the internet. It describes the necessary prerequisites, application architecture, and provides screenshots of the final results of scraping reviews and images. The conclusion discusses potential future applications and references several sources on web scraping and related topics.

Uploaded by

Nishit Chaudhary

A

Synopsis Report
On

Web Scraping (Text + Image)
For
partial fulfillment of award of the
B. Tech Degree in Information Technology

Under the Supervision of

Dr. Arun Kumar Singh

Submitted by:

NISHIT CHAUDHARY (1901920130119)

PANKAJ SHARMA (1901920130120)
PIYUSH SHARMA (1901920130121)

Session:

G. L. Bajaj Institute of Technology and Management,

Greater Noida
TABLE OF CONTENTS

1. Introduction
2. Proposed Work
3. Prerequisites
4. Application Architecture
5. Conclusion and Future Scope
6. Final Result
7. References
INTRODUCTION

Web scraping is a technique in which webpages are fetched from the internet and parsed
to understand and extract specific information, much as a human reader would. Web scraping
consists of two parts:

• Web Crawling → Accessing the webpages over the internet and pulling data from them.

• HTML Parsing → Parsing the HTML content of the webpages obtained through web crawling
and then extracting specific information from it.
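
The two parts above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the project's actual code: the function and class names (fetch_html, TitleAndLinkParser, parse_page) and the User-Agent string are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0):
    """Web crawling step: download the raw HTML of one page."""
    # The User-Agent value is an arbitrary placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (example scraper)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleAndLinkParser(HTMLParser):
    """HTML parsing step: collect the page title and every link target."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data.strip()


def parse_page(html):
    """Run the parser over an HTML string and return it with its results."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return parser


# Parsing can be tried on a literal string, without any network access.
sample = (
    "<html><head><title>Phones</title></head>"
    "<body><a href='/p/1'>A</a><a href='/p/2'>B</a></body></html>"
)
page = parse_page(sample)
print(page.title, page.links)
```

Keeping the download and the parsing in separate functions means the parsing logic can be checked on saved HTML before any live website is contacted.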

Hence, web scrapers are applications or bots that automatically send requests to websites and
then extract the desired information from the responses. Let's take an example: how do we
buy a phone online?

1. We first look for a phone with good reviews.
2. We see on which website it is available at the lowest price.
3. We check whether it can be delivered in our area.
4. If everything looks good, we buy the phone.

What if a computer program could do all of this for us? That is essentially what web scrapers
do: they try to understand the webpage content as a human would. Other examples of the
applications of web scraping are:

• Competitive pricing.

• Monitoring by manufacturers of whether retailers are maintaining a minimum price.

• Sentiment analysis of consumers, to learn whether they are happy with the services and
products.

• To aggregate news articles.

• To aggregate Marketing data.

• To gain financial insights from the market.

• To gather data for research.

• To generate marketing leads.

• To collect trending topics for media houses.

And the list goes on.
Figure 1: Web scraping process

PROPOSED WORK

a) For text :
In this document, we'll take the example of buying a phone online further and try to scrape the
reviews from the website about the phone that we are planning to buy. For example, if we open
flipkart.com and search for 'iPhone', the search result will be as follows:
Then, if we click on a product link, it will take us to the following page:

Now, if we scroll down, we will see the following comments posted by customers:

Our end goal is to build a web scraper that collects the reviews of a product from
the internet.
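
As a rough illustration of this goal, the sketch below pulls reviews out of an HTML fragment. The markup and the class names (review, author, rating, comment) are hypothetical and chosen only for this example; a real product page uses different markup, which has to be inspected first.

```python
from html.parser import HTMLParser

# Hypothetical class names used only in this sketch.
REVIEW_CLASS = "review"
FIELD_CLASSES = {"author", "rating", "comment"}


class ReviewParser(HTMLParser):
    """Collect one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # review being filled, or None outside a block
        self._field = None    # field whose text is currently being read
        self._depth = 0       # nesting depth of <div> tags inside the block

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._current is None:
            if tag == "div" and REVIEW_CLASS in classes:
                self._current = {}
                self._depth = 1
            return
        if tag == "div":
            self._depth += 1
        for name in classes:
            if name in FIELD_CLASSES:
                self._field = name
                self._current.setdefault(name, "")

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        # Simplification: a field ends at the next closing tag, so nested
        # markup inside a field is not handled.
        self._field = None
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:
                cleaned = {key: value.strip() for key, value in self._current.items()}
                self.reviews.append(cleaned)
                self._current = None


def extract_reviews(html):
    """Return a list of review dicts found in the HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    return parser.reviews


sample = """
<div class="review"><span class="author">Asha</span>
<span class="rating">5</span><p class="comment">Great phone.</p></div>
<div class="review"><span class="author">Ravi</span>
<span class="rating">3</span><p class="comment">Battery is average.</p></div>
"""
for review in extract_reviews(sample):
    print(review)
```

Each review is returned as a plain dictionary, so it can be shown on a results page or stored in a database without further conversion.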

b) For image :
Our end goal is to build a web scraper that collects the images for a keyword from
the internet.
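
A minimal sketch of this goal is given below, assuming the images appear as ordinary <img> tags in the result page. The search address, the q parameter name and the function names are assumptions made for illustration; they do not describe any particular image search service.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


def build_search_url(base_url, keyword):
    """Build a search URL for a keyword; the 'q' parameter name is an assumption."""
    return f"{base_url}?{urlencode({'q': keyword})}"


class ImageParser(HTMLParser):
    """Collect the absolute URL of every <img> tag on a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip images without a source and inline data: URIs.
        if src and not src.startswith("data:"):
            absolute = urljoin(self.page_url, src)
            # Keep the first occurrence only, preserving page order.
            if absolute not in self.image_urls:
                self.image_urls.append(absolute)


def extract_image_urls(html, page_url):
    """Return the de-duplicated image URLs found in the HTML string."""
    parser = ImageParser(page_url)
    parser.feed(html)
    return parser.image_urls


# example.com addresses are placeholders, not a real image search.
page_url = build_search_url("https://images.example.com/search", "red car")
sample = (
    '<img src="/img/a.jpg">'
    '<img src="https://cdn.example.com/b.png"/>'
    '<img src="data:image/gif;base64,AAAA">'
    '<img src="/img/a.jpg">'
)
print(page_url)
print(extract_image_urls(sample, page_url))
```

Relative paths are resolved against the page address so that every collected URL can be downloaded directly in a later step.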

PREREQUISITES

The things needed before we start building a Python-based web scraper are:
• Python installed.
• A Python IDE (Integrated Development Environment), such as PyCharm, Spyder, or any other
IDE of choice.
• Flask installed (a simple command: pip install flask).
• MongoDB installed (explained later).
• Basic understanding of Python and HTML.
• Basic understanding of Git (download the Git CLI from https://gitforwindows.org/).
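
Before starting, the list above can be checked quickly from Python itself. The sketch below is only a convenience under stated assumptions: pymongo is assumed here as the Python driver for MongoDB (the list names only MongoDB), and check_prerequisites is a hypothetical helper name.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(modules=("flask", "pymongo"), tools=("git",)):
    """Report the Python version and whether each module and tool is available."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


for item, status in check_prerequisites().items():
    print(f"{item}: {status}")
```

A value of False for flask or pymongo means the package still has to be installed with pip; a value of False for git means the Git CLI is not on the PATH.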

APPLICATION ARCHITECTURE

The architecture of the application is:

a) For text :
b) For image :
CONCLUSION AND FUTURE SCOPE

In this project, we built a web scraper from scratch that collects the reviews of products from the
internet and also collects the images for a keyword from the internet, and then deployed it to
the Heroku cloud platform.
It also serves as a step-by-step guide for creating a web scraper, in this case a review scraper,
right from scratch.

Text scrapers are extensively used in the industry today for competitive pricing, market studies,
customer sentiment analysis, etc.

Image scrapers are extensively used in the industry today for collecting a huge number of
images that are used as inputs for training object detection, classification and identification
models.

In the near future, web scraping will be one of the important tools in the lead generation
process. A web scraping tool can carry out market research on a particular product or
service, and it has enormous benefits to offer in the marketing field.

FINAL RESULT

a) For text :
b) For image :
REFERENCES

[1] Renita Crystal Pereira, Vanitha T. "Web Scraping of Social Networks." International
Journal of Innovative Research in Computer and Communication Engineering, vol. 3, pp. 237-239,
Oct. 7, 2018.

[2] Ghazvinian, Holbert, Viswanathan. "SimpleWebScraping." Internet:
https://seanholbert.wordpress.com/2011/07/15/scrappy-simple-webscraping/, Jun. 2015.

[3] Bellarosey. "Crowdsourcing-Definition." Internet:
http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html, Jun. 02, 2006.

[4] Kolari, P. and Joshi, A. "Web mining: research and practice, Computing in Science &
Engineering." IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 2, vol. 6,
no. 4, 2004.

[5] Kengtel, W.; Wagner, M. Proteins 1999, 37, 334-345.

[6] Datahen. "3 Advantages of web scraping for your enterprise." Internet:
https://www.datahen.com/3-advantages-web-scrapingenterprise/, May 17, 2017.

[7] http://resources.distilnetworks.com/h/i/53822104-iswebscraping-illegal-depends-on-whatthe-meaning-of-thewordis-is/181642
