Legality of Web Scraping
With Python, we can scrape any website or particular elements of a web page, but do you have any idea whether doing so is legal or not? Before scraping any website, we must know about the legality of web scraping. This chapter explains the concepts related to the legality of web scraping.
Introduction
Generally, if you are going to use the scraped data for personal use, then there may not be any problem. But if you are going to republish that data, then before doing so you should make a download request to the owner or do some background research about the policies as well as about the data you are going to scrape.
Analyzing robots.txt
Actually, most publishers allow programmers to crawl their websites to some extent. In other words, publishers want only specific portions of their websites to be crawled. To define this, websites put rules stating which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.
robots.txt is a human-readable file used to identify the portions of a website that crawlers are allowed, as well as not allowed, to scrape. There is no single enforced format for the robots.txt file, and the publishers of a website can modify it as per their needs. We can check the robots.txt file of a particular website by appending a slash and robots.txt to the URL of that website. For example, if we want to check it for Google.com, then we need to type
https://www.google.com/robots.txt and we will get something as follows −
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
and so on...
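As a quick aside, we can also fetch this file from Python instead of typing the URL into a browser. The following is a minimal sketch; it assumes the third-party requests library is installed (pip install requests), but any HTTP client would do −
# A minimal sketch, assuming the requests library is installed (pip install requests).
import requests

def fetch_robots_txt(base_url):
    # robots.txt always lives at the root of the site, e.g. https://www.google.com/robots.txt
    response = requests.get(base_url.rstrip('/') + '/robots.txt', timeout=10)
    response.raise_for_status()   # fail loudly if the file could not be fetched
    return response.text

print(fetch_robots_txt('https://www.google.com'))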
Some of the most common rules that are defined in a website’s robots.txt file are as follows −
User-agent: BadCrawler
Disallow: /
The above rule asks any crawler identifying itself with the BadCrawler user agent not to crawl the website at all.
User-agent: *
Crawl-delay: 5
Disallow: /trap
The above rules ask crawlers of all user agents to wait 5 seconds between download requests, which helps avoid overloading the server. The /trap link is there to catch malicious crawlers that follow disallowed links. There are many more rules that the publisher of a website can define as per their requirements, and some other things worth checking before crawling a site are discussed in the following sections.
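Before moving on, note that we can also check robots.txt rules programmatically rather than reading them by hand. The sketch below uses Python's standard urllib.robotparser module; the user agent and URLs are only illustrations −
# A minimal sketch using Python's standard library module urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.google.com/robots.txt')
rp.read()   # downloads and parses the robots.txt file

# can_fetch() tells us whether the given user agent may crawl the given URL
print(rp.can_fetch('*', 'https://www.google.com/search'))          # /search is disallowed
print(rp.can_fetch('*', 'https://www.google.com/search/about'))    # /search/about is allowed

# crawl_delay() returns the Crawl-delay value for a user agent, if one is defined
print(rp.crawl_delay('*'))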
Analyzing Sitemap Files
What are you supposed to do if you want to crawl a website for updated information? You could crawl every web page to get that updated information, but this would increase the server traffic of that particular website. That is why websites provide sitemap files, which help crawlers locate updated content without needing to crawl every web page. The sitemap standard is defined at http://www.sitemaps.org/protocol.html. For example, the robots.txt file of https://www.microsoft.com lists sitemaps as follows −
Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml
Sitemap: https://www.microsoft.com/en-us/store/locations/store-locationssitemap.xml
The above content shows that a sitemap lists the URLs of a website and further allows a webmaster to specify additional information about each URL, such as its last updated date, how often its content changes, and its importance in relation to other URLs.
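To see what a sitemap contains, we can download it and pull out the listed URLs. The sketch below is one simple way to do this, again assuming the requests library; the sitemap URL is taken from the Microsoft example above −
# A minimal sketch, assuming the requests library is installed (pip install requests).
import re
import requests

def get_sitemap_urls(sitemap_url):
    # Download the sitemap XML and extract everything between <loc> tags,
    # which is where the sitemap protocol stores the listed URLs.
    xml = requests.get(sitemap_url, timeout=10).text
    return re.findall(r'<loc>(.*?)</loc>', xml)

urls = get_sitemap_urls('https://www.microsoft.com/learning/sitemap.xml')
print(len(urls), 'URLs found')
print(urls[:5])   # show the first few entries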
What is the Size of the Website?
Does the size of a website, i.e. the number of its web pages, affect the way we crawl? Certainly yes. If we have only a few web pages to crawl, then efficiency is not a serious issue, but if a website has millions of web pages, for example Microsoft.com, then downloading each web page sequentially would take several months, and efficiency becomes a serious concern.
Which Technology is Used by the Website?
Another important question is whether the technology used by a website affects the way we crawl. Yes, it does. But how can we check which technology a website uses? There is a Python library named builtwith with the help of which we can find out the technologies used by a website.
Example
In this example we are going to check the technology used by the website https://authoraditiagarwal.com with the help of the Python library builtwith. But before using this library, we need to install it as follows −
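pip install builtwith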
Now, with the help of the following simple lines of code, we can check the technology used by a particular website −
In [1]: import builtwith
In [2]: builtwith.parse('http://authoraditiagarwal.com')
Out[2]:
{'blogs': ['PHP', 'WordPress'],
'cms': ['WordPress'],
'ecommerce': ['WooCommerce'],
'font-scripts': ['Font Awesome'],
'javascript-frameworks': ['jQuery'],
'programming-languages': ['PHP'],
'web-servers': ['Apache']}
Who is the Owner of the Website?
The owner of the website also matters, because if the owner is known for blocking crawlers, then crawlers must be careful while scraping data from the website. There is a protocol named Whois with the help of which we can find out who owns a website.
Example
In this example we are going to check the owner of a website, say microsoft.com, with the help of Whois. Here we use the python-whois package (one common choice for Whois lookups in Python), and before using this library, we need to install it as follows −
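pip install python-whois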
Now, with the help of the following simple lines of code, we can check the owner of a particular website −
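The following is a minimal sketch, assuming the python-whois package mentioned above; the fields printed here (registrar, name servers, expiration date, e-mail contacts) are just a few of those such a lookup typically returns −
# A minimal sketch, assuming the python-whois package is installed (pip install python-whois).
import whois

# whois.whois() queries the Whois records for the given domain and
# returns an object whose attributes hold the registration details.
record = whois.whois('microsoft.com')

print(record.registrar)        # the company the domain is registered with
print(record.name_servers)     # name servers serving the domain
print(record.expiration_date)  # when the registration expires
print(record.emails)           # contact e-mail addresses, if published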