Python | Extract URL from HTML using lxml
Last Updated : 19 Jul, 2019
Link extraction is a very common task when parsing HTML; for any general-purpose web crawler it is one of the most important functions to perform. Of all the Python libraries available, lxml is one of the best to work with. As explained in this article, lxml provides a number of helper functions for extracting links.
lxml installation -
lxml is a Python binding for the C libraries libxslt and libxml2. So, while keeping a Python interface, it is a very fast HTML and XML parsing library. The C libraries also need to be installed for it to work. For installation instructions, follow this link.
Command to install -
sudo apt-get install python-lxml or
pip install lxml
What is lxml?
It is designed specifically for parsing HTML and therefore comes with an html module. An HTML string can be parsed easily with the fromstring() function, which returns a parsed element; calling iterlinks() on that element yields all the links in the document.
The iterlinks() method yields tuples with four items -
element : the parsed node (the anchor tag) the link was extracted from. If only the link itself is of interest, this can be ignored.
attr : the attribute the link came from, usually simply 'href'.
link : the actual URL extracted from the anchor tag.
pos : the numeric index of the anchor tag within the document.
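The fields above can be seen by looping over iterlinks() directly; this small sketch (the HTML snippet and URLs are made up for illustration) prints one tuple per link found:

```python
from lxml import html

# a fragment with two anchor tags (invented example data)
doc = html.fromstring(
    'hi <a href="/world">geeks</a> and <a href="/python">more</a>'
)

# iterlinks() yields (element, attribute, link, pos) for every link
for element, attribute, link, pos in doc.iterlinks():
    print(element.tag, attribute, link)
```

Each iteration gives the anchor element itself, the attribute name ('href' here), and the extracted URL.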
Code #1 :
Python3
# importing library
from lxml import html

# parsing the HTML string
string_document = html.fromstring('hi <a href="/world">geeks</a>')

# list of (element, attribute, link, pos) tuples
link = list(string_document.iterlinks())

# Link length
print("Length of the link : ", len(link))
Output :
Length of the link : 1
Code #2 : Retrieving the iterlinks() tuple
Python3
# unpacking the first (and only) tuple
(element, attribute, link, position) = link[0]

print("attribute : ", attribute)
print("\nlink : ", link)
print("\nposition : ", position)
Output :
attribute : 'href'
link : '/world'
position : 0
Working -
An ElementTree is built up when lxml parses the HTML. An ElementTree is a tree structure with parent and child nodes. Each node in the tree represents an HTML tag and holds all of that tag's attributes. Once created, the tree can be iterated over to find elements, such as anchor or link tags. While the lxml.html module contains only HTML-specific functions for creating and iterating a tree, the lxml.etree module contains the core tree handling code.
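As a small sketch of this tree structure (the HTML fragment below is invented for illustration), iter() walks every node in document order, and getparent() exposes each tag's parent:

```python
from lxml import html

# a tiny fragment: a div containing a paragraph containing an anchor
tree = html.fromstring('<div><p>hello <a href="/x">link</a></p></div>')

# iter() visits every node in document order; each node is one HTML tag
print([node.tag for node in tree.iter()])   # ['div', 'p', 'a']

# every node keeps a reference to its parent in the tree
anchor = tree.find('.//a')
print(anchor.getparent().tag)               # 'p'
```

This parent/child navigation is what makes it possible to locate an anchor tag and then inspect its surrounding context.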
HTML parsing from files -
Instead of using the fromstring() function to parse an HTML string, the parse() function can be called with a filename or a URL - like html.parse('http://the/url') or html.parse('/path/to/filename'). The result is the same as if the URL or file were first loaded into a string and then passed to fromstring().
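A minimal sketch of file-based parsing (the file here is a temporary one created just for the demo, not a real path): parse() returns an ElementTree, and getroot() gives the root element on which iterlinks() can be called as before:

```python
import os
import tempfile

import lxml.html

# write a small HTML file so parse() has something to read
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write('<html><body><a href="/docs">docs</a></body></html>')
    path = f.name

# parse() returns an ElementTree; getroot() gives the <html> element
tree = lxml.html.parse(path)
root = tree.getroot()
print([link for (_el, _attr, link, _pos) in root.iterlinks()])   # ['/docs']

os.remove(path)
```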
Code #3 : ElementTree working
Python3
import requests
import lxml.html
# requesting url
web_response = requests.get('https://www.geeksforgeeks.org/')
# building the element tree
element_tree = lxml.html.fromstring(web_response.text)
tree_title_element = element_tree.xpath('//title')[0]
print("Tag title : ", tree_title_element.tag)
print("\nText title :", tree_title_element.text_content())
print("\nhtml title :", lxml.html.tostring(tree_title_element))
print("\ntitle tag:", tree_title_element.tag)
print("\nParent's tag title:", tree_title_element.getparent().tag)
Output :
Tag title : title
Text title : GeeksforGeeks | A computer science portal for geeks
html title : b'GeeksforGeeks | A computer science portal for geeks\r\n'
title tag: title
Parent's tag title: head
Using requests to scrape -
requests is a Python library used to scrape websites. It requests the URL from the web server using the get() method with the URL as a parameter and, in return, gives a Response object. This object includes details about the request and the response. To read the web content, the response.text attribute is used (note that it is an attribute, not a method); it holds the content sent back by the web server for the request.
Code #4 : Requesting web server
Python3
import requests
web_response = requests.get('https://www.geeksforgeeks.org/')
print("Response from web server : \n", web_response.text)
Output :
It will generate a huge script, of which only a sample is added here.
Response from web server :
<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<<!-->
<html lang="en-US" prefix="og: http://ogp.me/ns#" >
...
...
...