
Conference Paper · March 2023
DOI: 10.1109/WiDS-PSU57071.2023.00025


Web Scraping for Data Analytics: A BeautifulSoup Implementation

Ayat Abodayeh, Reem Hejazi, Ward Najjar, Leena Shihadeh (219410254@psu.edu.sa), and Dr. Rabia Latif
College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia

Abstract— Web scraping is an essential tool for automating the data-gathering process for big data applications. There are many implementations of web scraping, but barely any of them are based on Python’s BeautifulSoup library. Therefore, this paper aims at creating a web scraper that gathers data from any website and then analyzes the data accordingly. For results and analysis, the web scraper has been implemented on the Amazon website, where it collects a product’s name, price, number of reviews, rating, and link. We further highlighted the web scraper’s capabilities by assimilating the results into an interface that integrates data visualization techniques to analyze the gathered results. The web scraper proved to be efficient upon execution: it scraped five pages, analyzed them, and visualized the information in approximately ten seconds. The limitations of this implementation mainly revolve around applying it to specific product names rather than generic ones and extracting only the specific information we wanted from the resulting products. Moreover, BeautifulSoup cannot extract all the data available without compromising on speed. Further studies can be done by researchers who wish to reuse this implementation and modify it according to the data they want to extract, the analysis they wish to perform, and the website they wish to scrape. The implementation can thus be helpful to developers who are novices in the web scraping field or to researchers who wish to reuse the code for small data analytics projects.

Keywords—web scraping, BeautifulSoup, data gathering, data analytics, data visualization

I. INTRODUCTION

There is a potent need to access and analyze large amounts of data. Data harvesting and analytics are always at the heart of any research effort. Most people would directly copy and paste the available information, but this does not apply to immensely large websites or projects that require big data. It can also be a waste of human labor and time.

To address this issue, web scraping has been introduced. Web scraping automates data extraction from web pages, which proves to be fast and effective. Data can be extracted using software services or self-built programs that run on websites with numerous levels of navigation. Web scraping is typically used by individuals and businesses that want to take advantage of the vast amount of freely available web data to make better decisions [1]. This is especially useful when websites contain information that cannot be copied and pasted. With web scraping, data can be retrieved in any form according to the context it is needed in. After retrieval, the data can be transformed into the preferable format depending on the incentive of the application.

There exists a lack of papers that implement a web scraper using Python’s BeautifulSoup4 (the 2022 version) while depicting its capabilities through an interface, which can be troublesome for people who are new to the field. Thus, in this paper, we have developed a web scraper that is based on Python’s existing libraries, mainly the BeautifulSoup library, while performing data analysis and visualization on the results of the scraper. To implement the web scraper, we targeted the Amazon website, from which we extracted certain information that is of interest to the customer (name, price, rating, link, and number of reviews) about an arbitrary product that the user specifies. The proposed method then analyzed the data gathered from the web scraper by using a graphical interface and data visualization techniques based on PySimpleGUI and Matplotlib, respectively, which displayed the ten products with the cheapest prices and the ten with the highest ratings and then graphed the price frequencies as well as the number of reviews per rating. The output has indeed proven the efficiency of the proposed web scraping implementation, which can be used in small data analytics projects that do not require a massive amount of extracted data.

II. LITERATURE REVIEW

Many existing papers have delved into the field of web scraping through a myriad of methods. One such paper was by Färholt, who conducted a controlled experiment to develop a web scraping technique using JavaScript’s Puppeteer library while remaining undetectable to the targeted websites [2]. Fruitfully, one of the algorithms the study experimented with achieved this aim on websites adopting semi-security mechanisms (honeypots and activity logging) by evading them. Many factors have distinguished the implementation of Puppeteer in this study. First, it was revealed that it is feasible to control computer performance using its built-in methods. Other than its built-in functions, Puppeteer allows visiting websites within the library itself,
which is useful in the context of the study. Despite Python being an obvious option, the researcher contended that it would be difficult to implement and that future studies could adopt it as the root of their web scraping endeavor. The results could be of importance to those who want to develop websites that are more secure against malicious users, as well as to researchers who wish to gather data without facing security breaches.

Another study, conducted by Chaulagain, Pandey et al. [3], focused on developing a web scraping tool that can handle massive data requests dynamically, which they then tested against Amazon’s web server. The researchers integrated Selenium and the WebDriver API to automate web pages per desired state; accordingly, they parsed text files and extracted data from HTML (hypertext markup language) content using the HTMLParser and Requests libraries in Python. As for the architecture, an elastic compute cloud of web services was chosen for increased flexibility and virtual scalability. The uniqueness of using a cloud-based distributed Selenium scraper lies in the fact that it can run infinite parallel instances in the cloud. Selenium, despite its process delay, stands out among other tools due to its assimilation into major online enterprises, support of multiple browsers and programming languages, emulation of human behavior over the web, and compatibility with dynamic web pages. This makes the proposed solution a viable option for big data applications.

Similarly, a study by Yannikos, Heeger, and Brockmeyer explored illegal products supplied by three of the largest marketplaces over the dark web [4]. The study expounded on the benefits of web scraping in the field of cybersecurity. The researchers used Selenium, as [3] did, to cater to dynamic content. They also used the Requests library in Python to dispatch HTML requests. By contrast, they used SOCKS proxy support to connect with the TOR browser and RabbitMQ to handle the URLs (uniform resource locators) of different marketplaces.

To solve the supply chain management problem in Japan and optimize retail distribution, Le and Pishva employed web scraping along with Google’s API (application programming interface) service [5]. They scraped pieces of electronic data using the programming language Ruby, specifically its Nokogiri library. Ruby’s benefits can be reaped in web scraping for its high performance and HTTP (hypertext transfer protocol) parser libraries, and Nokogiri is especially helpful for supporting many document encodings and rapid web page analysis. It can search documents through XPath or CSS3 selectors. This, as a result, allows the web scraping tool to cover many online sources. The website targeted in this research was the Navitime website, since it aligns with the paper’s aim and includes a list of convenience stores and gas stations in Japan.

Many other papers have used web scraping in different contexts. Thomas and Mathur [6] utilized a Python-based approach combining BeautifulSoup and Scrapy. Scrapy is a web-crawling framework that uses an application programming interface (API) to extract data and allows developers to write crawlers. Here, BeautifulSoup is used to parse HTML responses in Scrapy callbacks. The paper crawls data stored from the social networking site Reddit, utilizing the XPath method. The results showed effective data extraction based on the structure of the website, form submission analysis, and new submission plan. Another study, by Bradley and James [7], provided a tutorial for scraping online data using the R statistical language, which can perform scraping, statistical analysis, and visualization. The software has many packages with functions made by R’s open-source community. The tutorial had four steps: downloading the web page, extracting information from the web page through code that specifies the location and type of the information collected, and storing the extracted information. The information was stored in vectors. Overall, the method can help in scraping websites, but websites that display information in unusual formats might not be easily scraped. Finally, Gunawan, Rahmatulloh, Darmawan, and Firdaus [8] compared various methods of web scraping using the Java programming language. They compared performance from the perspective of regular expressions (regex), the HTML DOM, and XPath, using process time, memory usage, and data consumption as parameters. According to the results, regex consumes the least amount of memory, while the HTML DOM requires the least time and the smallest data consumption compared to the rest.

Overall, it can be concluded that in terms of efficiency and performance, Ruby is the most appropriate programming language for web scraping, as the study by [5] has proven. Unfortunately, the language does not bolster machine learning for big data applications, hence making the usage of Selenium along with an HTML parser a better option, as demonstrated by [3] and [4]. In terms of extraction, using the HTML DOM would be a better option if there is a need for optimal time and data consumption [8]. If minimum memory utilization is desired, using regex would be the best option [8]. Finally, if someone wants to perform web scraping dynamically while remaining undetected, or to test the security of web applications, the best option would be JavaScript’s open-source library Puppeteer [2].

III. PROPOSED METHODOLOGY

There is a limitation within the existing literature in that the web scrapers proposed did not use BeautifulSoup4; this is especially significant since an implementation of this library can help in small projects based on data analytics. Furthermore, a BeautifulSoup4 web scraper can help researchers who are new to the field by deepening their understanding of web scraping’s intrinsic functionality. Thus, we have proposed a methodology that explores web scraping using BeautifulSoup4 in Python along with the Requests library for sending HTML/DOM (document object model) requests; after that, we extended the capabilities of BeautifulSoup by transforming the gathered results into information that can be of significance to the user, adopting data analytics and visualization via Matplotlib graphs and an interface.

The algorithm for our proposed web scraper is shown in Figure 1. As illustrated, the input for the scraper and the
interface generator is the user’s desired input. Consequently, the output will be an analysis of the scraper’s results shown in an interface. To carry out this algorithm, we have proposed the following: an extract function that gathers the preferred elements from the HTML results and stores them in a list; a scrape function that handles HTTP requests and uses BeautifulSoup4 to search through division documents and return a Pandas data frame of the results; a visualization function that creates graphs for the required data in the preferred manner, based on Matplotlib; and an interface function that displays the overall analysis as well as the created graphs.

Figure 1: Our Web Scraper Algorithm

IV. IMPLEMENTATION

The proposed web scraper can be implemented on any website. For this research, we implemented the web scraper on the Amazon website. As aforementioned, the PySimpleGUI library has been used for the interface and the Matplotlib library to visualize the gathered data using a bar chart and a histogram. Finally, the interface displays the results of the ten lowest-priced products as well as the ten products that have received the highest ratings, which could typically accommodate the user’s needs. Each row in the table contains a hyperlink to the product the user may want to check out.

The implementation of the proposed method is slightly similar to that of [3] and [4], in that we have used the Requests library in Python to dispatch HTML requests. However, our web scraping project relies solely on the BeautifulSoup library, whereas those studies incorporated Selenium for web browser automation. Finally, in terms of parsing, the proposed implementation made use of the finding by [8] that extracting data using the HTML DOM is more efficient in terms of processing time and data consumption.

The flowchart of the proposed Amazon web scraper implementation is shown in Figure 2. First, the user has to enter the specific name of the product. This induces the scraper to send a request to the Amazon website via the Mozilla Firefox web browser. If the HTTP status code is equal to 200, the product has been found; otherwise, the scraper terminates and gives an error. The scraper then finds all the resulting divs (division documents) and searches for the information we specified in the results. Consequently, the results are displayed via the graphical interface and Matplotlib graphs.

Figure 2: Implementation Flowchart

There are two program files in the proposed implementation; we have made them accessible on GitHub¹. The AmazonScraper.py file includes the main class for scraping as well as the functions for visualizing the data through graphs. The Gui.py file includes the main class for the graphical user interface, which is contingent on AmazonScraper. Through AmazonScraper’s scrape function, we send a request to the Amazon server with the product’s name and the page number; then we start parsing using Python’s BeautifulSoup4 and html5lib libraries. The resulting divs have the data component type s-search-result², so we filtered the elements based on that. The program then loops over the results. In the loop body, we call the extract function, which finds the product’s name, link, rating, number of reviews, and price and stores them in a Pandas data frame. For this function, the proposed method referred to Amazon’s DOM and gathered the HTML element types and class names for the information we need. The last function is the visualize function, which plots two graphs: one for the price frequencies and another for the number of reviews per rating.

In the Gui.py file, the proposed method used the PySimpleGUI library to create a simple graphical user interface that calls the AmazonScraper class and outlines the layout for the interface. It also contains a while loop to control the flow of the program, keep track of events, and call their proper functions.

¹ https://github.com/rem2718/AmazonScraper
² Based on Amazon’s current HTML.
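To make the scrape → extract → analyze flow described above concrete, the following is a minimal sketch, not the code from the GitHub repository. The `s-search-result` filter follows the markup described above; the URL pattern, the User-Agent string, the function and column names (`fetch_page`, `extract`, `top_products`), and the other Amazon class names (`a-text-normal`, `a-offscreen`, `a-icon-alt`, `a-link-normal`) are our illustrative assumptions and will need updating whenever Amazon changes its DOM:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # present the request as a browser's


def fetch_page(product, page=1):
    """Request one Amazon search-results page; return its HTML, or None
    on a non-200 status code (the error path in the flowchart)."""
    url = f"https://www.amazon.com/s?k={product.replace(' ', '+')}&page={page}"
    response = requests.get(url, headers=HEADERS)
    return response.text if response.status_code == 200 else None


def extract(html, parser="html5lib"):
    """Filter the result divs by their data-component-type and collect the
    fields of interest into a Pandas data frame (missing fields become None)."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for div in soup.find_all("div", attrs={"data-component-type": "s-search-result"}):
        name = div.find("span", class_="a-text-normal")
        price = div.find("span", class_="a-offscreen")   # e.g. "$9.99"
        rating = div.find("span", class_="a-icon-alt")   # e.g. "4.5 out of 5 stars"
        link = div.find("a", class_="a-link-normal")
        rows.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
            "rating": rating.get_text(strip=True) if rating else None,
            "link": link.get("href") if link else None,
        })
    return pd.DataFrame(rows)


def top_products(df):
    """Return the ten cheapest and the ten best-rated products, i.e. the
    two tables shown in the interface."""
    numeric = df.assign(
        price_value=pd.to_numeric(
            df["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"),
        rating_value=pd.to_numeric(df["rating"].str.split().str[0], errors="coerce"),
    )
    return numeric.nsmallest(10, "price_value"), numeric.nlargest(10, "rating_value")
```

A visualize step would then pass the numeric prices to `matplotlib.pyplot.hist` and the per-rating review counts to `matplotlib.pyplot.bar`, mirroring the two graphs described above.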
V. RESULTS

In order to test the proposed web scraper, the code was run twice. The first run checked the results for the New Apple iPhone 14 Pro Max, as shown in Figure 3. The second checked the results for the Raspberry Pi 4, as shown in Figure 4. The implementation successfully displayed the data we wanted to analyze on the interface, which is shown in the histogram and bar charts in Figures 3 and 4. The total run time for scraping five pages, analyzing the data, visualizing the information via Matplotlib, and displaying the results was approximately 10.9 seconds. The web scraper rummaged through five different pages of results on Amazon for both products.

Figure 3: Graphical Interface Output for Apple iPhone 14 Pro Max

Figure 4: Graphical Interface Output for Raspberry Pi 4

VI. DISCUSSION

There are multiple reasons behind the selection of Python, and its BeautifulSoup4 library, as the tool to implement the proposed web scraper. The proposed static web scraper’s incentive is to gather a small scale of data from a web page, eliminating the need for Selenium. Ruby can prove to be a practical option for implementation, but Python’s variety of libraries, by contrast, enables us to enhance the visualization of the results according to our preference. Python is also a faster, more popular, and more efficient option to program with as opposed to JavaScript, as [2] has recommended as well. It is crucial to note that the web scraping method to use is heavily dependent on the type of data we wish to extract from an arbitrary web page as well as the scale we are targeting.

The BeautifulSoup library has many strengths associated with its usage. It constructs a parse tree from the HTML and has straightforward methods for flexibly “traversing, searching, and modifying a parse tree” [9]. Furthermore, it relieves the programmer from the burden of encoding, since it handles it, a trait similar to that of the Nokogiri gem in Ruby [5].

BeautifulSoup supports backends that provide HTML parsing algorithms, namely html.parser, lxml, and html5lib. Html.parser is a built-in Python algorithm that is slower than its counterparts, while lxml is C-based and difficult to install [10]. Thus, for our project, we have used the html5lib parser, which is efficient and written in Python.

The results of the proposed method have successfully depicted the strengths of BeautifulSoup, which efficiently traversed the HTML parse tree based on the information we specified in the proposed algorithm. To further highlight them to interested users, we displayed the results in an interface, outlining the flexibility of incorporating such results in different ways for different projects. Our research aimed at analyzing the extracted data and presenting valuable information regarding the price and rating frequencies that can be of interest to the customer; this has proven to be successful thanks to the Matplotlib library’s numerous functionalities.

VII. LIMITATIONS

The proposed implementation has some limitations associated with it. First, we have made it such that it only works when the product’s specific name is written as listed on Amazon. Also, it scrapes only some information about the products; no details or seller names are extracted. Furthermore, the library used for the interface is a simple demonstration, so we encourage implementing the proposed methodology using a different library for the interface. Lastly, if Amazon changes anything in their DOM, this code might not work; it needs to be maintained. Therefore, we recommend reusing this code for small projects in which the researchers can make some alterations to the functions to suit their incentive.

We recommend that further studies reuse this code on other websites to test its efficiency or to extract information based on generic names. To reuse the code, the programmer must make changes according to the chosen website’s HTML. Furthermore, they must alter the elements based on what they wish to extract from the website and maintain those changes in the visualization and interface functions. Graphs different from the ones we have incorporated can also be used, since Matplotlib supports many visualization techniques. Seaborn can alternatively be used if the programmer does not wish to use Matplotlib. We also recommend integrating other ways of presenting the information to highlight the strengths of BeautifulSoup. Finally, a future study that compares the implementation of a BeautifulSoup4 web scraper with a JavaScript implementation, both to analyze the scraped data and to display the results, would be an interesting contribution.

VIII. CONCLUSION AND FUTURE WORK

In conclusion, a web scraper is a tool used for crawling databases and extracting data. In this project, we developed a web scraper that uses the BeautifulSoup4 library in Python because it is fast and efficient in gathering the required data. This is essential in the field of data analytics, since there exists a lack of papers that explore the strengths of this library through an application of data analytics and visualization. Our proposed web scraper can work on any website by incorporating the appropriate changes. For our implementation, we tested the web scraper on the Amazon website. It sends a request to Amazon’s servers to search for a product based on its name. Then it displays the ten cheapest products and the ten products with the highest ratings, and graphs the price frequencies as well as the number of reviews per rating. The results have proven successful, as demonstrated by the interface, and the scraper managed to go over five pages and analyze the data according to our preference in approximately ten seconds. This can be of use in projects that want to gather data from websites in an efficient manner. We have some recommendations based on the limitations. The first is to make the tool display products using generic names if the scraper is reused on Amazon. We also recommend reusing the scraper on other websites and analyzing the data in ways different from the ones we have proposed in this paper. Finally, we recommend conducting research that compares the efficiency of a BeautifulSoup4 web scraper against a JavaScript implementation, since it would be useful in the context of data-gathering techniques.

REFERENCES

[1] M. Perez, “What is Web Scraping and What is it Used For?,” ParseHub, Aug. 06, 2019. https://www.parsehub.com
[2] F. Färholt, “Less Detectable Web Scraping Techniques,” Bachelor Thesis, Linnaeus University, Faculty of Technology, Department of Computer Science and Media Technology (CM), 2021.
[3] R. S. Chaulagain, S. Pandey, S. R. Basnet, and S. Shakya, “Cloud Based Web Scraping for Big Data Applications,” 2017 IEEE International Conference on Smart Cloud (SmartCloud), pp. 138–143, Nov. 2017, doi: 10.1109/smartcloud.2017.28.
[4] Y. Yannikos, J. Heeger, and M. Brockmeyer, “An Analysis Framework for Product Prices and Supplies in Darknet Marketplaces,” Proceedings of the 14th International Conference on Availability, Reliability and Security, Aug. 2019, doi: 10.1145/3339252.3341485.
[5] Q. T. Le and D. Pishva, “Application of web scraping and Google API service to optimize convenience stores’ distribution,” 2015 17th International Conference on Advanced Communication Technology (ICACT), pp. 478–482, Aug. 2015, doi: 10.1109/ICACT.2015.7224841.
[6] D. M. Thomas and S. Mathur, “Data Analysis by Web Scraping using Python,” 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019, pp. 450–454, doi: 10.1109/ICECA.2019.8822022.
[7] A. Bradley and R. J. James, “Web scraping using R,” Advances in Methods and Practices in Psychological Science, vol. 2, no. 3, pp. 264–270, 2019.
[8] R. Gunawan, A. Rahmatulloh, I. Darmawan, and F. Firdaus, “Comparison of web scraping techniques: Regular expression, HTML DOM and XPath,” Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (IcoIESE 2018), 2019.
[9] L. Richardson, “Beautiful Soup,” Crummy, 2020. https://www.crummy.com/software/BeautifulSoup/
[10] Scrapfly, “Web Scraping with Python and BeautifulSoup,” ScrapFly, Jan. 03, 2022. https://scrapfly.io/blog/web-scraping-with-python-beautifulsoup/
