Python Web Scraping Introduction
Web scraping is an automated process of extracting information from the web. This chapter will give you an in-depth
idea of web scraping, its comparison with web crawling, and why you should opt for web scraping. You will also
learn about the components and working of a web scraper.
Why do we scrape the web, and how do we get the data? The answer to the first question is ‘data’. Data is
indispensable for any programmer, and the basic requirement of every programming project is a large amount of
useful data.
The answer to the second question is a bit tricky, because there are lots of ways to get data. In general, we may get
data from a database, a data file or other sources. But what if we need a large amount of data that is available
online? One way to get such data is to manually search (clicking away in a web browser) and save
(copy-pasting into a spreadsheet or file) the required data. This method is quite tedious and time consuming.
Web scraping, also called web data mining or web harvesting, is the process of constructing an agent which
can extract, parse, download and organize useful information from the web automatically. In other words, we can
say that instead of manually saving the data from websites, the web scraping software will automatically load and
extract data from multiple websites as per our requirement.
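To make the idea concrete, here is a minimal sketch of the kind of extraction a scraper automates. It uses only Python's standard-library `html.parser`; the HTML snippet and the choice of `<h2>` tags are illustrative assumptions, not from any real site.

```python
from html.parser import HTMLParser

# A minimal sketch: collect the text of every <h2> heading from an HTML page.
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# Example HTML (an assumed placeholder for a downloaded page)
html = "<html><body><h2>Price: $10</h2><h2>Price: $12</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(html)
print(scraper.headings)  # → ['Price: $10', 'Price: $12']
```

A real scraper would first download the page over HTTP and would typically use a dedicated library such as BeautifulSoup, but the extraction step is essentially this.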
Web crawling is basically used to index the information on a page using bots, also known as crawlers. It is also
called indexing. On the other hand, web scraping is an automated way of extracting information using bots, also
known as scrapers. It is also called data extraction.
To understand the difference between these two terms, let us look at the comparison given hereunder −
Web Crawling − Refers to downloading and storing the contents of a large number of websites. Mostly done on a
large scale.
Web Scraping − Refers to extracting individual data elements from a website by using a site-specific structure.
Can be implemented at any scale.
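The contrast above can be sketched in code: a crawling-style task discovers links to follow, while a scraping-style task pulls one specific data element using site-specific structure. The page snippet, CSS class name and regular expression below are assumed examples.

```python
import re
from html.parser import HTMLParser

# An assumed example page fragment
page = '<a href="/p/1">Widget</a> <span class="price">$9.99</span>'

# Crawling-style task: discover links that a crawler would enqueue and visit.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # → ['/p/1']

# Scraping-style task: extract one specific data element via site-specific markup.
price = re.search(r'class="price">([^<]+)<', page).group(1)
print(price)  # → $9.99
```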
E-commerce Websites − Web scrapers can collect data specifically related to the price of a particular
product from various e-commerce websites for comparison.
Content Aggregators − Web scraping is used widely by content aggregators like news aggregators and
job aggregators for providing updated data to their users.
Marketing and Sales Campaigns − Web scrapers can be used to get data like emails, phone numbers
etc. for sales and marketing campaigns.
Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools like SEMrush, Majestic
etc. to tell businesses how they rank for search keywords that matter to them.
Data for Machine Learning Projects − Retrieval of data for machine learning projects often depends upon
web scraping.
Data for Research − Researchers can collect useful data for their research work, saving time through this
automated process.
Web Crawler Module
A very necessary component of a web scraper, the web crawler module, is used to navigate the target website by
making HTTP or HTTPS requests to the URLs. The crawler downloads the unstructured data (HTML contents) and
passes it to the extractor, the next module.
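A sketch of the crawler module's job, using only the standard-library `urllib.request`. The URL and the User-Agent string are placeholders; the actual network call is shown only as a comment so the snippet runs offline.

```python
from urllib.request import Request, urlopen  # urlopen shown below but not called

# Build an HTTP(S) request for a target URL (both values are assumed examples).
url = "https://example.com/page.html"
request = Request(url, headers={"User-Agent": "my-scraper/0.1"})

# In a real crawler you would now fetch the raw, unstructured HTML contents:
#   html = urlopen(request).read().decode("utf-8")
# and pass `html` on to the extractor module.
print(request.full_url)
print(request.get_header("User-agent"))  # urllib stores header names capitalized
```

Setting an identifying User-Agent is common practice; many sites also publish a robots.txt file that a polite crawler should respect.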
Extractor
The extractor processes the fetched HTML content and extracts the data into a semi-structured format. It is also
called a parser module and uses different parsing techniques such as regular expressions, HTML parsing, DOM
parsing or artificial intelligence for its functioning.
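As an illustration of the regular-expression technique mentioned above, the sketch below turns raw HTML into semi-structured records (a list of dictionaries). The HTML layout and field names are assumed examples.

```python
import re

# Assumed example of raw HTML handed over by the crawler module
raw_html = """
<div class="item"><b>Book A</b><i>$15</i></div>
<div class="item"><b>Book B</b><i>$22</i></div>
"""

# Extract (name, price) pairs and shape them into semi-structured records
records = [
    {"name": name, "price": price}
    for name, price in re.findall(r"<b>([^<]+)</b><i>([^<]+)</i>", raw_html)
]
print(records)  # → [{'name': 'Book A', 'price': '$15'}, {'name': 'Book B', 'price': '$22'}]
```

Regular expressions are brittle against markup changes; for real sites an HTML or DOM parser is usually the more robust of the techniques listed.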
Storage Module
After extracting the data, we need to store it as per our requirement. The storage module will output the data in a
standard format, such as JSON or CSV, that can be saved to a file or loaded into a database.
We can understand the working of a web scraper in a few simple steps.
In this step, a web scraper will download the requested contents from multiple web pages.
The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper will parse and extract
structured data from the downloaded contents.
After all these steps are successfully done, the web scraper will analyze the data thus obtained.
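The steps above can be sketched as a single pipeline. Here `fetch()` is a stub standing in for a real HTTP download so the example runs offline; the URL, markup and field names are assumed.

```python
import json
import re

def fetch(url):
    # Step 1: download the page contents (stubbed with assumed example HTML)
    return "<h1>Widget</h1><span>$9.99</span>"

def extract(html):
    # Step 2: parse the unstructured HTML into structured data
    name = re.search(r"<h1>([^<]+)</h1>", html).group(1)
    price = re.search(r"<span>([^<]+)</span>", html).group(1)
    return {"name": name, "price": price}

def store(record):
    # Step 3: serialize the result for storage or later analysis
    return json.dumps(record)

result = store(extract(fetch("https://example.com")))
print(result)  # → {"name": "Widget", "price": "$9.99"}
```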