Final Publish Paper
Abstract - Web scraping is the process of collecting or extracting information from a particular website. It is a technique for converting unstructured data into structured data, analyzing the obtained data, and storing it in the required file format. Web scraping is becoming well known due to the large amount of data available on the internet and the need to collect that data without wasting time. Web scraping can be applied to obtain a huge amount of data for better decision making. We can achieve this using the BeautifulSoup tool and other algorithms. The data obtained after web scraping will be processed for Text Recognition and Text Classification using NLP and classification.

Keywords: Web scraping, unstructured data, data format, text classification, text recognition, NLP, classification

1. Introduction

WEB SCRAPING: Web scraping is the process of extracting data from websites automatically using a software program or script. It involves retrieving data from the HTML source code of a webpage and transforming it into a structured format that can be analyzed or stored for later use. It is important to note that while web scraping can be a powerful tool for data collection, you should always respect the website's terms of service and follow ethical guidelines. Some websites may have restrictions or prohibit scraping their content, so it is crucial to ensure you are acting within legal and ethical boundaries.

TEXT RECOGNITION: Text recognition, also known as Optical Character Recognition (OCR), is the technology that enables computers to recognize and extract text from images or scanned documents. It involves converting the text present in an image or document into machine-readable and editable text. Text recognition finds applications in a wide range of fields, such as digitizing printed documents, extracting information from invoices or receipts, converting scanned books into editable text, automatic license plate recognition, and more. Many programming languages provide OCR libraries or APIs that simplify the implementation of text recognition in your applications, such as Tesseract OCR for Python or the Google Cloud Vision API.
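As a minimal sketch of this OCR step in Python, assuming the pytesseract wrapper and Pillow are installed and Tesseract itself is available on the system; the clean_ocr_text helper is a hypothetical post-processing function added here for illustration, not part of any cited system:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Collapse runs of whitespace and drop blank lines from raw OCR output."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)

def extract_text(image_path: str) -> str:
    """Recognize the text in an image file and return a cleaned-up string."""
    from PIL import Image   # pip install pillow
    import pytesseract      # pip install pytesseract (needs a system Tesseract)
    raw = pytesseract.image_to_string(Image.open(image_path))
    return clean_ocr_text(raw)
```

In this sketch the OCR call is kept behind a single function so the rest of a pipeline only sees plain text, regardless of whether the backend is Tesseract or a cloud OCR API.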
TEXT CLASSIFICATION: Text classification is a natural language processing (NLP) task that involves assigning predefined categories or labels to a piece of text.
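The idea can be illustrated with a toy bag-of-words Naive Bayes classifier in plain Python; real systems would typically use a library such as scikit-learn, and the training examples below are invented purely for illustration:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Toy multinomial Naive Bayes over whitespace-tokenized text."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter(labels)       # label -> document count
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihood with add-one (Laplace) smoothing
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data, purely for illustration.
clf = NaiveBayesTextClassifier().fit(
    ["cheap pills buy now", "limited offer buy cheap",
     "meeting at noon", "project report attached"],
    ["spam", "spam", "ham", "ham"],
)
print(clf.predict("buy cheap pills"))   # prints "spam"
```

Laplace smoothing is what keeps a word unseen during training from zeroing out a whole class score; without it, log(0) would dominate the sum.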
Convolutional Images. The dataset used contains 64554 symbols in 67 images. The images are divided into 4 categories - Demotivators, Certificates, Scanned, Smartphones. To solve the classification problems in this paper, Neural Network (ResNet50, MobileNet) and gradient boosting (GB) approaches are used. The rounded stamps on documents are detected using the Hough Circle Search Algorithm.

4. Y. Su, H. Peng, K. Huang and C. Yang, "Image processing technology for text recognition," 2019 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Kaohsiung, Taiwan, 2019.
The work in this paper covers image processing, optical character recognition, and an object detection component. The paper demonstrates how image processing technology can be used in combination with OCR to improve recognition accuracy and the efficiency of extracting text from images. Two software systems are developed and tested: i) a character recognition system applied to cosmetic-related advertising images, whose process comprises a) image processing, b) establishing contours and lassoing the region of interest (ROI), and c) character recognition. The preprocessing techniques were edge detection, binarization, and erosion and dilation. The edge detection algorithm used was Sobel edge detection, with the Tesseract OCR tool used in combination with Python. ii) A text detection and recognition system for natural scenes, whose process comprises a) operating the Raspberry Pi camera and detecting the target object containing text, and b) image processing. Here a Cascade classifier was used, trained on an image set downloaded from ImageNET. Advertising images and images from the ICDAR Robust Reading Competition were used as test images for this study.

5. W. Zhuo and C. Lili, "The algorithm of text classification based on rough set and support vector machine," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010.
The work demonstrates rough sets, support vector machines, and classification. It presents a new algorithm for text classification based on Rough Set theory and the Support Vector Machine. The SVM is a tool for solving machine learning problems based on optimization methods. It has a simple structure and good classification ability, but its processing speed is slow when dealing with large amounts of data. To overcome this bottleneck of the SVM, the Theory of Rough Sets was introduced. Rough Set theory is a mathematical tool for quantitative analysis that can analyze correlations within information without requiring any prior knowledge, and it provides a powerful foundation for processing information of high capacity and dimensionality. The experiment used the Rossta software to process the initial training set data. The SVM aims to construct an objective function that distinguishes between the two pattern classes as far as possible, while maximizing the classification margin and minimizing the error. The key to the RS-SVM algorithm is deleting uncorrelated and unimportant attributes via the attribute reduction algorithm, thereby decreasing the dimensionality of the SVM training.

3. Proposed Work
3.1 System Architecture
References