Web Scraping for Data Analytics: A BeautifulSoup Implementation
5 authors, including:
Rabia Latif
Prince Sultan University
All content following this page was uploaded by Rabia Latif on 12 January 2024.
Abstract— Web scraping is an essential tool for automating the data-gathering process for big data applications. There are many implementations of web scraping, but few of them are based on Python’s BeautifulSoup library. Therefore, this paper aims to create a web scraper that gathers data from any website and then analyzes the data accordingly. For results and analysis, the web scraper was implemented on the Amazon website, where it collects a product’s name, price, number of reviews, rating, and link. We further highlight the web scraper’s capabilities by feeding the results into an interface that integrates data visualization techniques to analyze the results gathered. The web scraper proved efficient upon execution: it scraped five pages, analyzed them, and visualized the information in approximately ten seconds. The limitations of this implementation mainly revolve around applying it to specific product names rather than generic ones and extracting only the specific information that we wanted from the resulting products. Moreover, BeautifulSoup cannot extract all the available data if speed is not to be compromised. Further studies can be done by researchers who wish to reuse this implementation and modify it according to the data they want to extract, the analysis they wish to perform, and the website they wish to scrape. The implementation can thus be helpful to developers who are novices in the web scraping field or to researchers who wish to reuse the code for small data analytics projects.

Keywords—web scraping, BeautifulSoup, data gathering, data analytics, data visualization

I. INTRODUCTION
There is a strong need to access and analyze large amounts of data. Data harvesting and analytics are at the heart of any research effort. Most people would directly copy and paste the available information, but this does not scale to very large websites or to projects that require big data. It can also be a waste of human labor and time.

To address this issue, web scraping has been introduced. Web scraping automates data extraction from web pages, which proves to be fast and effective. Data can be extracted using software services or self-built programs that run on websites with numerous levels of navigation. Web scraping is typically used by individuals and businesses that want to take advantage of the vast amount of freely available web data to make better decisions [1]. This is especially useful when websites contain information that cannot be copied and pasted. With web scraping, data can be retrieved in any form according to the context in which they are needed. After retrieval, the data can be transformed into the preferred format depending on the purpose of the application.

There is a lack of papers that implement a web scraper using Python’s BeautifulSoup4 (the 2022 version) while depicting its capabilities through an interface, which can be troublesome for people who are new to the field. Thus, in this paper, we have developed a web scraper based on Python’s existing libraries, mainly the BeautifulSoup library, while performing data analysis and visualization on the results of the scraper. To implement the web scraper, we targeted the Amazon website, from which we extracted information of interest to the customer (name, price, rating, link, and number of reviews) about an arbitrary product that the user specifies. The proposed method then analyzed the data gathered by the web scraper using a graphical interface and data visualization techniques based on PySimpleGUI and Matplotlib, respectively, which displayed the ten cheapest products and the ten highest-rated products and then graphed price frequencies as well as the number of reviews per rating. The output demonstrates the efficiency of the proposed web scraping implementation, which can be used in small data analytics projects that do not require a massive amount of extracted data.

II. LITERATURE REVIEW
Many existing papers have explored web scraping through a variety of methods. One such paper is by Färholt, who conducted a controlled experiment to develop a web scraping technique using the JavaScript library Puppeteer while remaining undetectable to the websites being targeted [2]. One of the algorithms the study experimented with did achieve this aim on websites adopting semi-security mechanisms (honeypots and activity logging) by evading them. Several factors distinguished the implementation of Puppeteer in this study. First, it was revealed that it is feasible to control computer performance using its built-in methods. Other than its built-in functions, Puppeteer allows visiting websites within the library itself,
1 https://github.com/rem2718/AmazonScraper
2 Based on Amazon’s current HTML
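
The analysis that the Introduction describes (the ten cheapest products, the ten highest-rated products, price frequencies, and the number of reviews per rating) can be illustrated with a short sketch in plain Python. This is not the authors' code from the repository in footnote 1: the record fields, the sample values, the function names, the tie-breaking rule, and the bin width of 25 are all hypothetical and serve only to make the steps concrete.

```python
from collections import Counter, defaultdict

# Hypothetical records shaped like the fields the paper says the scraper
# collects (name, price, rating, number of reviews, link). Values are made up.
products = [
    {"name": "Item A", "price": 35.0, "rating": 4.5, "reviews": 120, "link": "https://example.com/a"},
    {"name": "Item B", "price": 55.0, "rating": 4.0, "reviews": 80, "link": "https://example.com/b"},
    {"name": "Item C", "price": 42.5, "rating": 4.5, "reviews": 40, "link": "https://example.com/c"},
    {"name": "Item D", "price": 75.0, "rating": 3.5, "reviews": 10, "link": "https://example.com/d"},
]


def cheapest(records, n=10):
    """Return the n records with the lowest price."""
    return sorted(records, key=lambda r: r["price"])[:n]


def highest_rated(records, n=10):
    """Return the n records with the highest rating.

    Ties are broken by review count, which is an assumption of this sketch.
    """
    return sorted(records, key=lambda r: (r["rating"], r["reviews"]), reverse=True)[:n]


def price_frequencies(records, bin_width=25):
    """Count how many prices fall into each [k*bin_width, (k+1)*bin_width) bin.

    Keys are the lower edge of each bin; bin_width=25 is an arbitrary choice.
    """
    counts = Counter(int(r["price"] // bin_width) * bin_width for r in records)
    return dict(sorted(counts.items()))


def reviews_per_rating(records):
    """Sum the number of reviews for each distinct rating value."""
    totals = defaultdict(int)
    for r in records:
        totals[r["rating"]] += r["reviews"]
    return dict(sorted(totals.items()))
```

With the sample data, `price_frequencies(products)` groups the four prices into bins starting at 25, 50, and 75. The dictionaries returned by the last two functions could then be handed to a plotting library such as Matplotlib to draw the charts.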
V. RESULTS
To test the proposed web scraper, the code was run twice. The first run checked the results for the New Apple iPhone 14 Pro Max, as shown in Figure 3. The second checked the results for the Raspberry Pi 4, as shown in Figure 4. The implementation successfully displayed the data we wanted to analyze on the interface, as shown in the histogram and bar charts in Figures 3 and 4. The total run time for scraping five pages, analyzing the data, visualizing the information via Matplotlib, and displaying the results was approximately 10.9 seconds. The web scraper went through five different pages of results on Amazon for both products.

Figure 3: Graphical Interface Output for Apple iPhone 14 Pro Max

Figure 4: Graphical Interface Output for Raspberry Pi 4

VI. DISCUSSION
There are multiple reasons behind the selection of Python and its BeautifulSoup4 library as the tools to implement the proposed web scraper. The purpose of the proposed static web scraper is to gather a small amount of data from a web page, eliminating the need for Selenium. Ruby can prove to be a practical option for implementation, but Python’s variety of libraries, by contrast, enables us to enhance the visualization of the results according to our preference. Python is also a faster, more popular, and more efficient option to program with as opposed to JavaScript, which [2] has recommended as well. It is crucial to note that the web scraping method to use is heavily dependent on the type of data we wish to extract from an arbitrary web page as well as the scale we are targeting.

The BeautifulSoup library has many strengths. It constructs a parse tree from the HTML and has straightforward methods for flexibly “traversing, searching, and modifying a parse tree” [9]. Furthermore, it relieves the programmer from the burden of encoding since it handles it, a trait similar to that of the Nokogiri gem in Ruby [5].

BeautifulSoup supports several backends that provide HTML parsing algorithms: html.parser, lxml, and html5lib. html.parser is Python’s built-in parser and is slower than lxml, while lxml is C-based and difficult to install [10]. Thus, for our project, we have used the html5lib parser, which is written in Python.

The results of the proposed method have successfully depicted the strengths of BeautifulSoup, which efficiently traversed the HTML parse tree based on the information we specified in the proposed algorithm. To further highlight them to interested users, we displayed the results in an interface, outlining the flexibility of incorporating such results in different ways for different projects. Our research aimed at analyzing the extracted data and presenting valuable information regarding the price and rating frequencies that can be of interest to the customer; this proved successful thanks to the Matplotlib library’s numerous functionalities.

VII. LIMITATIONS
The proposed implementation has some limitations. First, it works only when the user writes the specific name of the product as listed on Amazon. Also, it scrapes only some information about the products; no details or seller names are extracted. Furthermore, the library used for the interface serves as a simple demonstration, so we encourage implementing the proposed methodology using a different library for the interface. Lastly, if Amazon changes anything in its DOM, this code might not work; it needs to be maintained. Therefore, we recommend reusing this code for small projects in which the researchers can make some alterations to the functions to suit their purpose.

We recommend that further studies reuse this code on other websites to test its efficiency or to extract information based on generic names. To reuse the code, the programmer must make changes according to the chosen website’s HTML. Furthermore, they must alter the elements based on what they wish to extract from the website and carry those changes through to the visualization and interface functions. Graphs different from the ones we have incorporated can also be used since Matplotlib supports many visualization techniques. Seaborn can alternatively be used if the programmer does not wish to use Matplotlib. We also recommend integrating other ways of presenting information to highlight the strengths of BeautifulSoup. Finally, a future study that compares a BeautifulSoup4 web scraper, including the analysis of the scraped data and the display of the results, with a JavaScript implementation would be an interesting contribution.

VIII. CONCLUSION AND FUTURE WORK
In conclusion, a web scraper is a tool used for crawling web pages and extracting data. In this project, we developed a web scraper that uses the BeautifulSoup4 library in Python because it is fast and efficient in gathering the required data. This is essential in the field of data analytics since there exists a lack of papers that explore the strengths of this library
through an application of data analytics and visualization.

Our proposed web scraper can work on any website once the corresponding changes are incorporated. For our implementation, we tested our web scraper on the Amazon website. It sends a request to Amazon servers to search for a product based on its name. Then it displays the ten cheapest products and the ten products with the highest ratings and graphs the price frequencies as well as the number of reviews per rating. The results have proven to be successful as demonstrated by the interface, and the scraper managed to go over five pages and analyze the data according to our preference in approximately ten seconds. This can be of use in projects that want to gather data from websites in an efficient manner. We have some recommendations based on the limitations. The first recommendation is to make the tool display products using generic names if the scraper is reused on Amazon. We also recommend reusing the scraper on other websites and analyzing the data with methods different from the ones we have proposed in this paper. Finally, we recommend conducting research that compares the efficiency of a BeautifulSoup4 web scraper against a JavaScript implementation since it would be useful in the context of data-gathering techniques.

REFERENCES
[1] M. Perez, “What is Web Scraping and What is it Used For?,” ParseHub, Aug. 06, 2019. https://www.parsehub.com
[2] F. Färholt, “Less Detectable Web Scraping Techniques,” Bachelor Thesis, Linnaeus University, Faculty of Technology, Department of computer science and media technology (CM), 2021.
[3] R. S. Chaulagain, S. Pandey, S. R. Basnet, and S. Shakya, “Cloud Based Web Scraping for Big Data Applications,” 2017 IEEE International Conference on Smart Cloud (SmartCloud), pp. 138–143, Nov. 2017, doi: 10.1109/smartcloud.2017.28.
[4] Y. Yannikos, J. Heeger, and M. Brockmeyer, “An Analysis Framework for Product Prices and Supplies in Darknet Marketplaces,” Proceedings of the 14th International Conference on Availability, Reliability and Security, Aug. 2019, doi: 10.1145/3339252.3341485.
[5] Q. T. Le and D. Pishva, “Application of web scraping and Google API service to optimize convenience stores’ distribution,” 2015 17th International Conference on Advanced Communication Technology (ICACT), pp. 478–482, Aug. 2015, doi: 10.1109/ICACT.2015.7224841.
[6] D. M. Thomas and S. Mathur, “Data Analysis by Web Scraping using Python,” 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019, pp. 450–454, doi: 10.1109/ICECA.2019.8822022.
[7] A. Bradley and R. J. James, “Web scraping using R,” Advances in Methods and Practices in Psychological Science, vol. 2, no. 3, pp. 264–270, 2019.
[8] R. Gunawan, A. Rahmatulloh, I. Darmawan, and F. Firdaus, “Comparison of web scraping techniques: Regular expression, HTML DOM and XPath,” Proceedings of the 2018 International Conference on Industrial Enterprise and System Engineering (IcoIESE 2018), 2019.
[9] L. Richardson, “Beautiful Soup,” Crummy, 2020. https://www.crummy.com/software/BeautifulSoup/
[10] Scrapfly, “Web Scraping with Python and BeautifulSoup,” ScrapFly, Jan. 03, 2022. https://scrapfly.io/blog/web-scraping-with-python-beautifulsoup/