Web Scraping Using Machine Learning

WEB SCRAPING USING MACHINE LEARNING
Sakshi Nakate
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India
[email protected]
Shruti Narkhede
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India
[email protected]
Sonali Gawade
Student, Department of Information Technology,
Pune University, Baramati, Pune, Maharashtra, India
[email protected]

Abstract - Web scraping is the process of collecting or extracting information from a particular website. It is a technique for converting unstructured data into structured data, analyzing the obtained data, and storing it in the required file format. Web scraping is becoming well known because of the large amount of data available on the internet and the need to collect that data without wasting time. Web scraping can be applied to obtain a huge amount of data for better decision making. We can achieve this using the BeautifulSoup tool and other algorithms. The data obtained after web scraping will be processed for Text Recognition and Text Classification using NLP and classification.

Keywords: Web scraping, unstructured data, data format, text classification, text recognition, NLP, classification

1. Introduction

WEB SCRAPING: Web scraping is the process of extracting data from websites automatically using a software program or script. It involves retrieving data from the HTML source code of a web page and transforming it into a structured format that can be analyzed or stored for later use. While web scraping can be a powerful tool for data collection, you should always respect the website's terms of service and follow ethical guidelines. Some websites restrict or prohibit scraping of their content, so it is crucial to ensure you are acting within legal and ethical boundaries.

TEXT RECOGNITION: Text recognition, also known as Optical Character Recognition (OCR), is the technology that enables computers to recognize and extract text from images or scanned documents. It involves converting the text present in an image or document into machine-readable and editable text. Text recognition finds applications in a wide range of fields, such as digitizing printed documents, extracting information from invoices or receipts, converting scanned books into editable text, and automatic license plate recognition. Many programming languages provide OCR libraries or APIs that simplify the implementation of text recognition in your applications, such as Tesseract OCR (usable from Python) or the Google Cloud Vision API.
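
To make the web scraping definition in this section concrete, the following minimal sketch parses a small HTML fragment and writes it out as CSV, that is, it turns markup into a structured format. It is only an illustration and not the authors' implementation: the abstract names BeautifulSoup, whereas this sketch swaps in Python's standard-library html.parser so that it runs without extra packages. The sample HTML, the class names product, name and price, and the helper names ProductParser and scrape_to_csv are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would first download this HTML.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Phone</span><span class="price">199</span></li>
  <li class="product"><span class="name">Laptop</span><span class="price">899</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> into rows."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records, one dict per product
        self._field = None   # field name of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.rows.append({})              # start a new record
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls                 # remember which field we are inside

    def handle_data(self, data):
        # Only keep text that sits inside one of the tracked <span> elements.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_to_csv(html: str) -> str:
    """Parse the HTML and return the extracted records as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()
```

Calling scrape_to_csv(SAMPLE_HTML) returns a header line followed by one row per product. A site's terms of service still apply before any real page is fetched and parsed this way.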

TEXT CLASSIFICATION: Text classification is a natural language processing (NLP) task that involves


categorizing or assigning predefined labels or categories to textual data. It aims to automatically classify text documents or snippets into different classes or categories based on their content. Text classification has various applications, including sentiment analysis, spam filtering, topic classification, news categorization, and intent detection. The choice of model and techniques depends on the specific task and the characteristics of the text data you are working with.

2. Literature Survey

1. D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019.
This paper focuses on data analysis, web scraping, and the implementation of Web Scrape, a web scraping software used to scrape e-commerce sites such as Flipkart and Amazon and to analyze product details which aren't available, variation, comments, ratings, etc. The aim of the paper is to extract information from different sources with the help of the web crawler Scrapy, using Python version 3.6. A database is created which collects all the unstructured data from various sources. The data is then analyzed through an analytic process of specification, assembling, organizing, cleaning, reanalyzing, and applying models and algorithms, which finally provides the desired results. In this paper, the XPath method was used on Reddit to find details of each element of the frequent searches. The main outcomes were a user-friendly search interface, indexing, query processing, and effective data extraction based on web structure.

2. M. S. Parvez, K. S. A. Tasneem, S. S. Rajendra, and K. R. Bodke, "Analysis Of Different Web Data Extraction Techniques," 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, 2018.
This paper focuses on web scraping, data extraction, web crawlers, and the machine learning approach systems WIEN, WHISK, and RAPIER. The web scraping process comprises a web crawler and a data extractor. The data extraction techniques covered are:
a) Human copy and paste.
b) HTML parsers: the Java library Jsoup and the Python library BeautifulSoup.
c) HTTP programming.
d) Tree-based techniques, i.e. the DOM (Document Object Model). The techniques that use the DOM are i) addressing elements in the document tree (XPath) and ii) tree edit distance matching algorithms.
e) Web scrapers, which include three approaches: the regular expression based approach, the logic based approach, and ML approaches.
The ML approaches specified are the statistical ML approach, adaptive search, WIEN, RAPIER, WHISK, and SRV. According to the survey, the WHISK system is the most advantageous compared to the others.

3. M.S. Akopyan, O.V. Belyaeva, T.P. Plechov, D.Y. Turdakov (2019). Text Recognition on Images from Social Media.
This paper covers text recognition, social networks, image processing, and deep neural networks. A text recognition pipeline is provided to address text extraction from images of various quality collected from social media. Input images are categorized into different classes, and then class-specific preprocessing is applied to them for illumination improvement and text localization. An OCR engine is then used to recognize the text. The results come from experiments on a dataset collected from social media. For image preprocessing, Image Resolution Enhancement (IRE) is used before applying the OCR engine. The enhancement methods are based on Deep Neural Networks and use Generative Adversarial Networks with Sub-Pixel


Convolutional Images. The dataset used contains 64554 symbols in 67 images. Images are divided into 4 categories: Demotivators, Certificates, Scanned, and Smartphones. To solve the classification problems in this paper, neural network (ResNet50, MobileNet) and gradient boosting (GB) approaches are used. Rounded stamps on documents are detected using the Hough Circle Search Algorithm.

4. Y. Su, H. Peng, K. Huang and C. Yang, "Image processing technology for text recognition," 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Kaohsiung, Taiwan, 2019.
This paper covers image processing, optical character recognition, and an object detection component. It demonstrates how image processing technology can be used in combination with OCR to improve recognition accuracy and the efficiency of extracting text from images. Two software systems are developed and tested:
i) A character recognition system applied to cosmetic-related advertising images, whose process includes a) image processing, b) establishing contours and lassoing the region of interest (ROI), and c) character recognition. The preprocessing techniques were edge detection, binarization, erosion, and dilation. The edge detection algorithm used was Sobel edge detection, with the Tesseract OCR tool used in combination with Python.
ii) A text detection and recognition system for natural scenes, whose process includes a) operating the Raspberry Pi camera and detecting the target object containing text, and b) image processing. Here a cascade classifier was used, trained on an image set downloaded from ImageNet.
Advertising images and images from the ICDAR Robust Reading Competition were used as test images for this study.

5. W. Zhuo and C. Lili, "The algorithm of text classification based on rough set and support vector machine," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010.
This work covers rough sets, support vector machines, and classification. It presents a new algorithm for text classification based on Rough Set and Support Vector Machine (RS-SVM). SVM is a tool for solving machine learning problems based on an optimization method. It has a simple structure and good classification ability, but its processing speed is slow when dealing with a large amount of data. To overcome this bottleneck of SVM, Rough Set theory was introduced. Rough Set theory is a mathematical tool for quantitative analysis that can analyze correlations within the information without needing any prior knowledge, and it provides a powerful foundation for processing information of high capacity and dimensionality. The experiment used Rossta software to process the initial training set data. The SVM aims to construct an objective function that separates two classes of patterns as far as possible, while maximizing the classification margin and minimizing the error. The key to the RS-SVM algorithm is how to delete uncorrelated and unimportant attributes through attribute reduction, and thereby decrease the dimensionality of SVM training.

3. Proposed Work
3.1 System Architecture
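
The paper describes a system made of four modules: web scraping, storing, text recognition, and text classification. The following is a minimal sketch of how such stages might be chained, not the authors' implementation. The Record fields, the function names, and the two stub stages are hypothetical stand-ins: a real system would plug an OCR engine and a trained classifier into the recognize and classify parameters.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass
class Record:
    source: str            # hypothetical identifier of the scraped page or image
    text: str = ""         # text taken directly from the page
    image_text: str = ""   # text recovered from an image by the recognition stage
    label: str = ""        # category assigned by the classification stage


def run_pipeline(
    scraped: Iterable[Record],
    recognize: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    """Chain the stages and return the stored result as a JSON string."""
    stored: List[Record] = list(scraped)            # storing stage (in memory here)
    for rec in stored:
        rec.image_text = recognize(rec.source)      # text recognition stage
        combined = f"{rec.text} {rec.image_text}".strip()
        rec.label = classify(combined)              # text classification stage
    return json.dumps([asdict(r) for r in stored])


# Stand-ins so the sketch runs without an OCR engine or a trained model.
def stub_recognize(source: str) -> str:
    return "discount offer" if source.endswith(".png") else ""


def stub_classify(text: str) -> str:
    return "promotion" if "offer" in text.lower() else "other"


result = json.loads(run_pipeline(
    [Record("banner.png", "Summer sale"), Record("about.html", "Company history")],
    stub_recognize,
    stub_classify,
))
```

Passing the recognition and classification stages in as callables keeps the pipeline independent of any particular OCR tool or model, so each module can be replaced or tested on its own.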


Fig – 1: Proposed System Architecture

4. Conclusion

The proposed extraction model is capable of extracting data and storing it in the required format. The stored data can be used for obtaining confidential data. It contains the modules web scraping, storing, text recognition, and text classification. First, data is extracted and stored in the required format. It is then passed on for text recognition and text classification, and finally the overall saved data is provided to the server using a chatbot as the platform.

References

[1] D. M. Thomas and S. Mathur, "Data Analysis by Web Scraping using Python," 2019 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019.

[2] M. S. Parvez, K. S. A. Tasneem, S. S. Rajendra, and K. R. Bodke, "Analysis Of Different Web Data Extraction Techniques," 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai, 2018.

[3] M.S. Akopyan, O.V. Belyaeva, T.P. Plechov, D.Y. Turdakov (2019). Text Recognition on Images from Social Media.

[4] Y. Su, H. Peng, K. Huang and C. Yang, "Image processing technology for text recognition," 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Kaohsiung, Taiwan, 2019.

[5] W. Zhuo and C. Lili, "The algorithm of text classification based on rough set and support vector machine," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010.
