In computer science, web scraping means extracting data from websites. This technique transforms unstructured data on the web into structured data.
The most common web scraping tools in Python 3 are −
- urllib (called urllib2 in Python 2)
- Requests
- BeautifulSoup
- Lxml
- Selenium
- MechanicalSoup
urllib − This package ships with Python, so no installation is needed (in Python 2 the equivalent module was urllib2). It is used for opening URLs. Its urlopen() function fetches a URL over different protocols (HTTP, FTP, etc.).
Example code
from urllib.request import urlopen

my_html = urlopen("https://www.tutorialspoint.com/")
print(my_html.read())
Output
b'<!DOCTYPE html>\r\n <!--[if IE 8]> <html class="ie ie8"> <![endif]--> \r\n<!--[if IE 9]> <html class="ie ie9"> <![endif]-->\r\n<!--[if gt IE 9]><!--> \r\n<html lang="en-US"> <!--<![endif]--> \r\n<head>\r\n<!-- Basic --> \r\n<meta charset="utf-8"> \r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title> \r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/> \r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n <link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n <script src="/questions/js/jquery.min.js"> </script> \r\n<script src="/questions/js/fontawesome.js"> </script>\r\n <script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js"> </script>\r\n </head>\r\n <body>\r\n <!-- Start of Body Content --> \r\n <div class="mui-appbar-home">\r\n <div class="mui-container">\r\n <div class="tp-primary-header mui-top-home">\r\n <a href="https://www.tutorialspoint.com/index.htm" target="_blank" title="TutorialsPoint - Home"> <i class="fa fa-home"> </i><span>Home</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-qa">\r\n <a href="https://www.tutorialspoint.com/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i> <span> Q/A</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tools">\r\n <a href="https://www.tutorialspoint.com/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools"> <i class="fa fa-cogs"></i><span>Tools</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-coding-ground">\r\n <a href="https://www.tutorialspoint.com/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal"> <i class="fa fa-code"> </i> <span> Coding Ground </span> </a> \r\n </div>\r\n <div class="tp-primary-header mui-top-current-affairs">\r\n <a href="https://www.tutorialspoint.com/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe"> </i><span>Current Affairs</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-upsc">\r\n <a href="https://www.tutorialspoint.com/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tutors">\r\n <a href="https://www.tutorialspoint.com/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect"> <i class="fa fa-user"> </i> <span>Online Tutors</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-examples">\r\n ….
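The output above is a bytes object (note the b'...' prefix), because urlopen() returns the raw response body. The following is a minimal sketch of decoding it to text; the helper name fetch_text and the sample data: URL are assumptions made for illustration. The data: URL is used so the sketch runs without network access and also shows that urlopen() is not limited to HTTP.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL with urlopen() and return the body as text (str)."""
    # The response object can be used as a context manager, which closes it.
    with urlopen(url) as response:
        raw_bytes = response.read()  # always bytes, like the b'...' output
        # Use the charset announced in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    return raw_bytes.decode(charset)


# Hypothetical sample: a data: URL keeps this sketch runnable offline.
# An http(s) or ftp URL goes through the same urlopen() call.
sample_url = "data:text/html;charset=utf-8," + quote("<title>Hello</title>")
print(fetch_text(sample_url))
```

For a real page, calling fetch_text("https://www.tutorialspoint.com/") should return the same HTML as a str instead of bytes, assuming the server is reachable.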
Requests − This module is not preinstalled; we install it by running the following command in the command prompt. Requests sends HTTP/1.1 requests.
pip install requests
Example
import requests

# get URL
my_req = requests.get('https://www.tutorialspoint.com/')
print(my_req.encoding)
print(my_req.status_code)
print(my_req.elapsed)
print(my_req.url)
print(my_req.history)
print(my_req.headers['Content-Type'])
Output
UTF-8
200
0:00:00.205727
https://www.tutorialspoint.com/
[]
text/html; charset=UTF-8
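To see what Requests would send before anything goes over the network, a request can be prepared without being sent. This is a minimal sketch; the query parameter q and the User-Agent value are hypothetical and only there for illustration.

```python
import requests

# Describe the request without sending it, so nothing touches the network.
request = requests.Request(
    "GET",
    "https://www.tutorialspoint.com/",
    params={"q": "python"},                    # hypothetical query parameter
    headers={"User-Agent": "my-scraper/0.1"},  # hypothetical client name
)

with requests.Session() as session:
    # prepare_request() merges the session defaults with the request.
    prepared = session.prepare_request(request)
    # To actually send it, with a timeout and an error on 4xx/5xx responses:
    #     response = session.send(prepared, timeout=10)
    #     response.raise_for_status()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing a timeout and calling raise_for_status(), as in the commented lines, is a common precaution when scraping, since a request without a timeout can wait indefinitely.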
BeautifulSoup − This is a parsing library that can work with different parsers. Its default parser, html.parser, comes from Python's standard library. It builds a parse tree that is used to extract data from an HTML page.
To install this module, we run the following command in the command prompt.
pip install beautifulsoup4
Example
from bs4 import BeautifulSoup
# importing requests
import requests

# get URL
my_req = requests.get("https://www.tutorialspoint.com/")
my_data = my_req.text
# name the parser explicitly to avoid a "no parser specified" warning
my_soup = BeautifulSoup(my_data, "html.parser")
for my_link in my_soup.find_all('a'):
    print(my_link.get('href'))
Output
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/questions/index.php
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs/index.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/programming_examples/
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/articles/
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/videotutorials/index.htm
https://store.tutorialspoint.com
https://www.tutorialspoint.com/html_online_training/index.asp
https://www.tutorialspoint.com/css_online_training/index.asp
https://www.tutorialspoint.com/3d_animation_online_training/index.asp
https://www.tutorialspoint.com/swift_4_online_training/index.asp
https://www.tutorialspoint.com/blockchain_online_training/index.asp
https://www.tutorialspoint.com/reactjs_online_training/index.asp
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/computer_fundamentals/index.htm
https://www.tutorialspoint.com/compiler_design/index.htm
https://www.tutorialspoint.com/operating_system/index.htm
https://www.tutorialspoint.com/data_structures_algorithms/index.htm
https://www.tutorialspoint.com/dbms/index.htm
https://www.tutorialspoint.com/data_communication_computer_network/index.htm
https://www.tutorialspoint.com/academic_tutorials.htm
https://www.tutorialspoint.com/html/index.htm
https://www.tutorialspoint.com/css/index.htm
https://www.tutorialspoint.com/javascript/index.htm
https://www.tutorialspoint.com/php/index.htm
https://www.tutorialspoint.com/angular4/index.htm
https://www.tutorialspoint.com/mysql/index.htm
https://www.tutorialspoint.com/web_development_tutorials.htm
https://www.tutorialspoint.com/cprogramming/index.htm
https://www.tutorialspoint.com/cplusplus/index.htm
https://www.tutorialspoint.com/java8/index.htm
https://www.tutorialspoint.com/python/index.htm
https://www.tutorialspoint.com/scala/index.htm
https://www.tutorialspoint.com/csharp/index.htm
https://www.tutorialspoint.com/computer_programming_tutorials.htm
https://www.tutorialspoint.com/java8/index.htm
https://www.tutorialspoint.com/jdbc/index.htm
https://www.tutorialspoint.com/servlets/index.htm
https://www.tutorialspoint.com/spring/index.htm
https://www.tutorialspoint.com/hibernate/index.htm
https://www.tutorialspoint.com/swing/index.htm
https://www.tutorialspoint.com/java_technology_tutorials.htm
https://www.tutorialspoint.com/android/index.htm
https://www.tutorialspoint.com/swift/index.htm
https://www.tutorialspoint.com/ios/index.htm
https://www.tutorialspoint.com/kotlin/index.htm
https://www.tutorialspoint.com/react_native/index.htm
https://www.tutorialspoint.com/xamarin/index.htm
https://www.tutorialspoint.com/mobile_development_tutorials.htm
https://www.tutorialspoint.com/mongodb/index.htm
https://www.tutorialspoint.com/plsql/index.htm
https://www.tutorialspoint.com/sql/index.htm
https://www.tutorialspoint.com/db2/index.htm
https://www.tutorialspoint.com/mysql/index.htm
https://www.tutorialspoint.com/memcached/index.htm
https://www.tutorialspoint.com/database_tutorials.htm
https://www.tutorialspoint.com/asp.net/index.htm
https://www.tutorialspoint.com/entity_framework/index.htm
https://www.tutorialspoint.com/vb.net/index.htm
https://www.tutorialspoint.com/ms_project/index.htm
https://www.tutorialspoint.com/excel/index.htm
https://www.tutorialspoint.com/word/index.htm
https://www.tutorialspoint.com/microsoft_technologies_tutorials.htm
https://www.tutorialspoint.com/big_data_analytics/index.htm
https://www.tutorialspoint.com/hadoop/index.htm
https://www.tutorialspoint.com/sas/index.htm
https://www.tutorialspoint.com/qlikview/index.htm
https://www.tutorialspoint.com/power_bi/index.htm
https://www.tutorialspoint.com/tableau/index.htm
https://www.tutorialspoint.com/big_data_tutorials.htm
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/coding_platform_for_websites.htm
https://www.tutorialspoint.com/developers_best_practices/index.htm
https://www.tutorialspoint.com/effective_resume_writing.htm
https://www.tutorialspoint.com/computer_glossary.htm
https://www.tutorialspoint.com/computer_whoiswho.htm
https://www.tutorialspoint.com/questions_and_answers.htm
https://www.tutorialspoint.com/multi_language_tutorials.htm
https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8
https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer
https://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09
/about/index.htm
/about/about_team.htm
/about/about_careers.htm
/about/about_privacy.htm
/about/about_terms_of_use.htm
https://www.tutorialspoint.com/articles/
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/free_web_graphics.htm
https://www.tutorialspoint.com/online_file_conversion.htm
https://www.tutorialspoint.com/shared-tutorials.php
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorialspoint.com/free_online_whiteboard.htm
https://www.tutorialspoint.com
https://www.facebook.com/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
https://www.twitter.com/tutorialspoint
https://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
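The list above contains duplicates and relative links such as /about/index.htm. One possible way to clean this up is sketched below: each href is resolved against the page URL with urljoin() and repeated links are skipped. The function name extract_links and the sample HTML are assumptions for illustration, not part of the original example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return every <a href> in html as an absolute URL, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    links = []
    for anchor in soup.find_all("a", href=True):  # skip <a> tags with no href
        absolute = urljoin(base_url, anchor["href"])
        if absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page fragment, modelled on the links printed above.
sample_html = """
<a href="https://www.tutorialspoint.com/index.htm">Home</a>
<a href="/about/index.htm">About</a>
<a href="/about/index.htm">About (again)</a>
<a name="top">No href here</a>
"""

for link in extract_links(sample_html, "https://www.tutorialspoint.com/"):
    print(link)
```

In the earlier example, the same function could be called as extract_links(my_req.text, my_req.url).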
Lxml − This is a high-performance, production-quality HTML and XML parsing library. If we want high quality and maximum speed, this is the library to use. It has many modules with which we can extract data from a website.
To install it, we run the following in the command prompt
pip install lxml
Example
from lxml import etree

my_root_elem = etree.Element('html')
etree.SubElement(my_root_elem, 'head')
etree.SubElement(my_root_elem, 'title')
etree.SubElement(my_root_elem, 'body')
print(etree.tostring(my_root_elem, pretty_print=True).decode("utf-8"))
Output
<html>
  <head/>
  <title/>
  <body/>
</html>
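The example above builds a tree; for scraping, the more common direction is parsing existing HTML and pulling data out of it. The following is a minimal sketch using lxml.html and XPath on a hypothetical page fragment (the sample HTML is an assumption made for illustration).

```python
from lxml import html

# Hypothetical page fragment used only for this sketch.
sample_page = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/about/index.htm">About</a>
    <a href="/about/contact_us.htm">Contact</a>
  </body>
</html>
"""

tree = html.fromstring(sample_page)

# XPath expressions select text or attribute values from the parsed tree.
title = tree.xpath("//title/text()")[0]
hrefs = tree.xpath("//a/@href")

print(title)
print(hrefs)
```

With a downloaded page, the string passed to html.fromstring() would be the response body, for example my_req.text from the Requests example.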
Selenium − This is a browser automation tool, also known as WebDriver. When we use a website, we sometimes have to wait for content to appear, for example after clicking a button or scrolling the page. Selenium is needed in these situations because it drives a real browser.
To install Selenium, we use this command
pip install selenium
Example
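from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# path to the ChromeDriver executable on your machine
my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
# Selenium 4 takes the driver path through a Service object
# (the older executable_path argument has been removed)
my_browser = webdriver.Chrome(service=Service(my_path_to_chromedriver))
my_url = 'https://www.tutorialspoint.com/'
my_browser.get(my_url)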
Output
A Chrome browser window opens and loads the page at my_url.
MechanicalSoup − This is another Python library for automating interaction with websites. It automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It does not execute JavaScript.
To install it, we can use the following command
pip install MechanicalSoup
Example
import mechanicalsoup

my_browser = mechanicalsoup.StatefulBrowser()
my_value = my_browser.open("https://www.tutorialspoint.com/")
print(my_value)
my_val = my_browser.get_url()
print(my_val)
my_va = my_browser.follow_link("forms")
print(my_va)
my_value1 = my_browser.get_url()
print(my_value1)