Web Scraping With Python and Selenium: Sarah Fatima, Shaik Luqmaan Nuha Abdul Rasheed
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 23, Issue 3, Ser. II (May – June 2021), PP 01-05
www.iosrjournals.org
Abstract:
In this paper, we design a method for retrieving web information using Selenium and a Python script.
Selenium is used to automate web browser interaction, and the Python script produces the block-based
structure. Because web information is mostly unstructured, the proposed work helps to organize the
unstructured data and make it usable by various data analysis techniques. It also focuses on ways in which
data can be persisted and used from websites for which APIs are not available.
Keywords: Web scraping; Selenium; Information Extraction; HTML Parsing; Web data retrieval.
---------------------------------------------------------------------------------------------------------------------------------------
Date of Submission: 02-06-2021 Date of Acceptance: 15-06-2021
---------------------------------------------------------------------------------------------------------------------------------------
I. Introduction
Web scraping is the process of extracting information from the World Wide Web (WWW). It is accomplished by
writing automated scripts that request data from the desired web server and retrieve it using different
parsing techniques [1]. Scraping helps to transform unstructured HTML data into structured formats such as
CSV files and spreadsheets. Because web data changes frequently, an easy-to-use language like Python, which
accepts dynamic inputs, can be highly productive: code changes are easily made to keep up with the speed of
web updates. Python's wide collection of libraries, such as requests, pandas, csv, and Selenium's webdriver,
eases the process of fetching URLs and pulling information out of web pages, and of building scrapers that
can hop from one domain to another, gather information, and store that information for later use. To
automate web browser interaction, we use Selenium, an open-source tool with a single interface that can
mimic human browsing behavior. In addition, numpy and pandas are used to process the data [2]. With this
implementation, web data is transformed into structured blocks; the block-based structure is obtained by
using a Python script with Selenium. The proposed experimental work covers parsing the HTML code,
installation of Python and Selenium, Python scripting and interpretation, and structural extraction of web
information. The evolving needs of internet and social media services require various techniques for the
extraction of web data. Because web information is mostly unstructured, the proposed work helps to organize
the unstructured data and make it usable by various data analysis techniques.
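As an illustration of turning unstructured HTML into a structured format, the following is a minimal sketch that uses only the Python standard library. It is not the script used in this work: the class name TableRowParser, the function html_table_to_csv, and the sample markup are assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical sample markup, standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>product</th><th>rating</th></tr>
  <tr><td>Kettle</td><td>4</td></tr>
  <tr><td>Toaster, 2-slice</td><td>5</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_HTML))
# product,rating
# Kettle,4
# "Toaster, 2-slice",5
```

The csv module quotes the cell that contains a comma, so the structured output remains unambiguous even when the scraped text includes the delimiter.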
● Observing customer sentiment by scraping customer feedback and reviews of different businesses, since
visiting each website manually can be cumbersome.
human behavior and ease the extraction of large data sets and images, we have created one script to perform
required scraping.
Description of work
The research work is developed in Python, using HTML parsing, and runs on the Anaconda platform. The script
is supported by the Selenium library [5]. The site used for scraping contains instances of unstructured data
with and without pagination. The experimental work proceeds in the following steps:
A. Installation of Python
B. Importing the Selenium web driver and the requests and csv libraries
C. Execution of script using Python
D. Persisting the generated structured data in the database.
When the script is run, an instance of the Chrome web driver is started and the required output CSV files
are initialized. The scraper then locates the data on the target page by element id or XPath and starts
collecting it; the collected data is written to the output files by a CSV writer, with the header row
written first. The script is designed to throw errors in case of a timeout or if the element id or XPath is
missing.
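The run described above might be sketched as follows. This is a hedged outline under stated assumptions, not the authors' script: the function names write_rows and scrape, the example URL, and the XPath expressions are hypothetical, and a local Chrome installation is assumed. The Selenium calls themselves (webdriver.Chrome, WebDriverWait, expected_conditions, find_element) belong to the Selenium Python bindings [5].

```python
import csv


def write_rows(path, header, rows):
    """Write a header line followed by the scraped rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def scrape(url, row_xpath, field_xpaths, timeout=10):
    """Return one list of strings per element matching `row_xpath`.

    Raises Selenium's TimeoutException if no row appears within `timeout`
    seconds, and NoSuchElementException if a field XPath is missing in a row.
    """
    # Imported here so that write_rows can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome is installed locally
    try:
        driver.get(url)
        # Wait until at least one row element is present, or time out.
        rows = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.XPATH, row_xpath))
        )
        # Field XPaths are evaluated relative to each row element.
        return [
            [row.find_element(By.XPATH, xpath).text for xpath in field_xpaths]
            for row in rows
        ]
    finally:
        driver.quit()  # always close the browser, even after an error


# Hypothetical usage; the URL and XPaths are placeholders, not a real target:
#   data = scrape("https://example.com/reviews",
#                 "//div[@class='review']", [".//h3", ".//p"])
#   write_rows("reviews.csv", ["title", "text"], data)
```

Letting the Selenium exceptions propagate matches the behavior described above, where a timeout or a missing element stops the run instead of silently producing an incomplete file.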
The data files obtained from web scraping can then be used for machine learning and data analytics
techniques.
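Step D, persisting the structured data in a database, could look like the sketch below. The paper does not name the database that was used, so SQLite (through Python's built-in sqlite3 module) is an assumption here, as are the function name load_csv_into_sqlite, the table name scraped_items, and the sample data.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(csv_file, connection, table="scraped_items"):
    """Copy rows from an open CSV file (first line = header) into a table.

    Column names come from the CSV header and are stored as TEXT. The header
    and table name are inserted into the SQL text, so they must be trusted.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    with connection:  # commits on success, rolls back on error
        connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        connection.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader
        )


# Hypothetical scraper output, held in memory instead of a file on disk.
sample = io.StringIO("title,rating\nKettle,4\nToaster,5\n")
conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(sample, conn)
rows = conn.execute(
    "SELECT title, rating FROM scraped_items ORDER BY rowid"
).fetchall()
print(rows)  # [('Kettle', '4'), ('Toaster', '5')]
```

Row values are passed as query parameters instead of being formatted into the SQL string, which keeps scraped text from being interpreted as SQL.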
IV. Discussion
When we start web scraping, we notice how much we take for granted the simple things that browsers
do for us. The web browser is a very handy tool for generating and sending information packets from our
computer and for interpreting the data we receive back from the server as text, images, etc. The internet
includes many types of data sources that serve as a goldmine of interesting material. Unfortunately, the
current unstructured nature of the web makes it difficult to access or export this data. Modern browsers are
brilliant at showcasing visuals, displaying motion, and laying out websites in a pleasing style, but in most
situations they do not offer a capability to export their data. So web scraping, instead of viewing the web
page through the browser's interface, gathers the underlying data with a script. Nowadays many websites
provide an API that gives access to their data, but most websites either do not provide an API or do not
expose the functionality we require. In such cases, building a web scraper to gather the data can be handy.
V. Conclusion
Web scraping can be useful for gathering different types of data from websites, for either business
or personal purposes, and there are many ways to do it. But it is also necessary to be mindful of the burden
that a web scraper puts on the website, because irresponsible web scraping can have consequences. Consider
running a script over the first 100 pages of a site in quick succession: this would be an aggressive
scraper, and it would place an unreasonable strain on the website's servers, perhaps disrupting their
operation. Web scraping also violates the terms and conditions of certain websites; in such cases the
website is likely to take action against you.
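One way to reduce the load a scraper places on a site is to honor the site's robots.txt rules and to pause between requests. The sketch below illustrates that idea and is not part of the original script: the function names allowed_urls and fetch_politely, the user agent string, and the two-second default delay are assumptions for the example.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="example-scraper"):
    """Keep only the URLs that the given robots.txt lines allow."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # lines of an already downloaded robots.txt
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive calls.

    `fetch` is any callable that downloads one page; `sleep` can be replaced
    in tests so that no real waiting takes place.
    """
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause is needed before the first request
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With a robots.txt that contains "User-agent: *" and "Disallow: /private/", allowed_urls drops every URL under /private/, and fetch_politely then visits the remaining pages one at a time with the chosen delay between them.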
References
[1]. Anand V. Saurkar, Kedar G. Pathare, Shweta A. Gode, “An overview of web scraping techniques and tools” International Journal
on Future Revolution in Computer Science and Communication Engineering, April 2018.
[2]. Jiahao Wu, “Web Scraping using Python: Step by step guide” ResearchGate publications (2019).
[3]. Matthew Russell, “Using python for web scraping,” No Starch Press, 2012.
[4]. Ryan Mitchell, “Web scraping with Python,” O’Reilly Media, 2015.
[5]. Selenium Library: https://pypi.org/project/selenium/.
Sarah Fatima, et. al. “Web Scraping with Python and Selenium.” IOSR Journal of Computer Engineering
(IOSR-JCE), 23(3), 2021, pp. 01-05.