Final Publish Paper

This document discusses using machine learning for web scraping. It begins with definitions of web scraping, text recognition, and text classification. Web scraping extracts structured data from unstructured web pages. Text recognition converts text in images to machine-readable text. Text classification assigns categories to text based on its content. The document then reviews literature on different web scraping techniques, including using machine learning approaches like statistical models, adaptive search, and support vector machines. It focuses on using machine learning for tasks like web crawling, data extraction, text recognition from social media images, and text classification.

Web Scraping Using Machine Learning

WEB SCRAPING USING MACHINE LEARNING
Sakshi Nakate
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India,
[email protected]
Shruti Narkhede
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India,
[email protected]
Sonali Gawade
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India,
[email protected]

Abstract - Web scraping is the process of collecting or extracting information from a particular website. It is a technique to convert unstructured data into structured data, analyze the obtained data, and store it in the required file format. Web scraping is becoming well known because of the large amount of data available on the internet and the need to collect that data without wasting time. Web scraping can be applied to obtain a huge amount of data for better decision making. We can achieve this using the BeautifulSoup tool and other algorithms. The data obtained after web scraping will be processed for Text Recognition and Text Classification using NLP and Classification.

Keywords: Web scraping, unstructured data, data format, text classification, text recognition, NLP, classification

1. Introduction

WEB SCRAPING: Web scraping is the process of extracting data from websites automatically using a software program or script. It involves retrieving data from the HTML source code of a webpage and transforming it into a structured format that can be analyzed or stored for later use. It's important to note that while web scraping can be a powerful tool for data collection, you should always respect the website's terms of service and follow ethical guidelines. Some websites may have restrictions or prohibit scraping their content, so it's crucial to ensure you're acting within legal and ethical boundaries.

TEXT RECOGNITION: Text recognition, also known as Optical Character Recognition (OCR), is the technology that enables computers to recognize and extract text from images or scanned documents. It involves converting the text present in an image or document into machine-readable and editable text. Text recognition finds applications in a wide range of fields, such as digitizing printed documents, extracting information from invoices or receipts, converting scanned books into editable text, automatic license plate recognition, and more. Many programming languages provide OCR libraries or APIs that simplify the implementation of text recognition in your applications, such as Tesseract OCR for Python or Google Cloud Vision API.

TEXT CLASSIFICATION: Text classification is a natural language processing (NLP) task that involves
categorizing or assigning predefined labels or categories to textual data. It aims to automatically classify text documents or snippets into different classes or categories based on their content. Text classification has various applications, including sentiment analysis, spam filtering, topic classification, news categorization, intent detection, and many more. The choice of model and techniques depends on the specific task and the characteristics of the text data you are working with.

2. Literature Survey

1. D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019.
The work in this paper focuses on data analysis, web scraping, and the implementation of Web Scrape, a web scraping software used to scrape e-commerce sites such as Flipkart and Amazon and to analyze product details which aren't available, variation, comments, ratings, etc. The aim of the paper is to extract information from different sources with the help of the web crawler Scrapy, using the programming language Python, version 3.6. A database is created which collects all the unstructured data from various sources and then analyses it through an analytic process of specification, assembling, organizing, cleaning, reanalyzing, applying models and algorithms, and finally providing the desired results. In this paper, the XPath method was used on Reddit to find details of each element of the frequent searches. The main outcome was a user-friendly search interface, indexing, query processing and effective data extraction based on web structure.

2. M. S. Parvez, K. S. A. Tasneem, S. S. Rajendra, and K. R. Bodke, "Analysis Of Different Web Data Extraction Techniques," 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, 2018.
The work in this paper focuses on web scraping, data extraction, web crawlers, and the machine learning approach systems WIEN, WHISK and RAPIER. The web scraping process includes a web crawler and a data extractor. The data extraction techniques involved are: a) human copy and paste; b) HTML parsers, such as the Java library Jsoup and the Python library BeautifulSoup; c) HTTP programming; d) tree-based techniques, i.e. the DOM (Document Object Model), where the techniques that use the DOM are i) addressing elements in the document tree (XPath) and ii) tree edit distance matching algorithms; e) web scrapers, which include three approaches: the regular expression based approach, the logic based approach, and ML approaches. The different ML approaches specified are the statistical ML approach, Adaptive Search, WIEN, RAPIER, WHISK and SRV. According to the survey, the WHISK system is the most advantageous compared to the others.

3. M.S. Akopyan, O.V. Belyaeva, T.P. Plechov, D.Y. Turdakov (2019). Text Recognition on Images from Social Media.
The work in this paper covers text recognition, social networks, image processing and deep neural networks. A text recognition pipeline is provided to address text extraction from images of various quality collected from social media. Input images are categorized into different classes, and then class-specific preprocessing is applied to them for illumination improvement and text localization. Then an OCR engine is used to recognize text. The results are from experiments on a dataset collected from social media. For image preprocessing, Image Resolution Enhancement (IRE) is used before applying the OCR engine. They are based on Deep Neural Networks and use Generative Adversarial Networks with Sub-Pixel
Convolutional Images. The dataset used contains 64554 symbols in 67 images. Images are divided into 4 categories: Demotivators, Certificates, Scanned, Smartphones. To solve classification problems in this paper, neural network (ResNet50, MobileNet) and gradient boosting (GB) approaches are used. The rounded stamps on documents are detected using the Hough Circle Search Algorithm.

4. Y. Su, H. Peng, K. Huang and C. Yang, "Image processing technology for text recognition," 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Kaohsiung, Taiwan, 2019.
The work in this paper covers image processing, optical character recognition and an object detection component. This paper demonstrates how image processing technology can be used in combination with OCR to improve recognition accuracy and the efficiency of extracting text from images. Two software systems are developed and tested: i) A character recognition system applied to cosmetic-related advertising images, whose process includes a) image processing, b) establishing contours and lassoing the region of interest (ROI), c) character recognition. The preprocessing techniques were edge detection, binarization, and erosion and dilation. The edge detection algorithm used was Sobel edge detection, with the Tesseract OCR tool used in combination with Python. ii) A text detection and recognition system for natural scenes, whose process includes a) operating the Raspberry Pi camera and detecting the target object containing text, b) image processing. Here a cascade classifier was used, trained on an image set downloaded from ImageNet. Advertising images and images from the ICDAR Robust Reading Competition were used as test images for this study.

5. W. Zhuo and C. Lili, "The algorithm of text classification based on rough set and support vector machine," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010.
The work covers rough sets, support vector machines and classification. It presents a new algorithm of text classification based on Rough Set and Support Vector Machine. SVM is a tool for solving ML problems based on an optimization method. It has a simple structure and good classification ability, but its processing speed is slow when dealing with a large amount of data. To overcome this bottleneck of SVM, the Theory of Rough Set was introduced. The Theory of Rough Set is a mathematical tool of quantitative analysis which can analyze correlations between pieces of information, needs no prior knowledge, and has a powerful foundation for processing information of high capacity and dimensions. The experiment used Rosetta software to process the initial training set data. The SVM aims to construct an objective function which distinguishes between two classes of patterns as far as possible, taking into account both maximizing the classification margin and minimizing the error. The key of the RS-SVM algorithm is how to delete uncorrelated and unimportant attributes using the attribute reduction algorithm and so decrease the dimensions of SVM training.

3. Proposed Work
3.1 System Architecture
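
A minimal sketch of how the four modules named in this paper (web scraping, storing, text recognition, text classification) might be chained together. It is an illustration under stated assumptions, not the authors' implementation: the standard-library HTMLParser stands in for BeautifulSoup, the `ocr` argument is a caller-supplied hook because no OCR engine is assumed to be installed, and the `KEYWORDS` table is a hypothetical stand-in for a trained classifier. The names `PageScraper`, `classify` and `run_pipeline` are invented for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical keyword lists; a trained classifier would replace this table.
KEYWORDS = {
    "sports": {"match", "team", "goal"},
    "finance": {"stock", "market", "bank"},
}


class PageScraper(HTMLParser):
    """Collects <p> text and <img> sources (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.images = []
        self._buf = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Collapse runs of whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None


def classify(text):
    """Return the category whose keywords occur most often, or 'unknown'."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    scores = {label: sum(w in kws for w in words)
              for label, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def run_pipeline(html, ocr=lambda src: ""):
    """Scrape, recognize text in images, classify, and store as CSV text."""
    scraper = PageScraper()
    scraper.feed(html)
    scraper.close()
    texts = [("page", p) for p in scraper.paragraphs]
    # ocr maps an image source to recognized text; the default recognizes nothing.
    texts += [("image", t) for t in map(ocr, scraper.images) if t]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source", "text", "category"])
    for source, text in texts:
        writer.writerow([source, text, classify(text)])
    return out.getvalue()
```

Calling `run_pipeline(html)` on a page's HTML returns CSV text with one row per scraped paragraph. Passing an `ocr` function that wraps an OCR engine adds rows for text found in images. The stages are kept as separate functions so that each stand-in can be swapped for a real parser, OCR engine or trained model without touching the others.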


Fig – 1: Proposed System Architecture

4. Conclusion

The proposed extraction model is capable of extracting the data and storing it in the required format. The stored or saved data can be used for obtaining confidential data. It contains modules for web scraping, storing, text recognition and text classification. First, data is extracted and stored in the required format. It is then passed on for text recognition and text classification, and finally the overall saved data is provided to the server using a chatbot as the platform.

References

[1] D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019.

[2] M. S. Parvez, K. S. A. Tasneem, S. S. Rajendra, and K. R. Bodke, "Analysis Of Different Web Data Extraction Techniques," 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, 2018.

[3] M.S. Akopyan, O.V. Belyaeva, T.P. Plechov, D.Y. Turdakov (2019). Text Recognition on Images from Social Media.

[4] Y. Su, H. Peng, K. Huang and C. Yang, "Image processing technology for text recognition," 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Kaohsiung, Taiwan, 2019.

[5] W. Zhuo and C. Lili, "The algorithm of text classification based on rough set and support vector machine," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010.
