Parsing and Processing URL using Python - Regex
Last Updated: 04 Nov, 2025
Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (Regex) in Python. For example:
Input: https://www.geeksforgeeks.org/courses
Output:
Protocol: https
Hostname: geeksforgeeks.org
Let’s explore different methods to parse and process a URL in Python using Regex.
Using re.findall() to Extract Protocol and Hostname
The re.findall() method scans the entire string and returns all non-overlapping matches of the given pattern as a list. When the pattern contains a capturing group, the list holds the text captured by that group rather than the whole match.
Python
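import re
s = 'https://www.geeksforgeeks.org/'
# Protocol: the word characters immediately before "://"
p = re.findall(r'(\w+)://', s)
print(p)
# Hostname: the text after "://www." up to the next "/"
h = re.findall(r'://www\.([\w\-.]+)', s)
print(h)
Output:
['https']
['geeksforgeeks.org']
Explanation:
- (\w+):// captures the protocol part before ://.
- ://www\.([\w\-.]+) captures the hostname after the www. prefix. The dot after www is escaped so it matches a literal dot, and the character class allows letters, digits, underscores, dots, and hyphens.
- re.findall() returns all matching parts as a list.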
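The hostname pattern above requires a www. prefix, so it finds nothing in a URL that omits it. As a minimal sketch, assuming we want the same hostname either way, the prefix can be wrapped in an optional non-capturing group. The second sample URL and the variable names are illustrative assumptions, not part of the original example.

```python
import re

# Sample URLs chosen only for illustration; the second one has no "www." prefix.
urls = [
    'https://www.geeksforgeeks.org/',
    'https://geeksforgeeks.org/courses',
]

# (?:www\.)? is an optional, non-capturing group: "www." is skipped when
# present and not required when absent, and it adds no entry to the result.
pattern = re.compile(r'(\w+)://(?:www\.)?([\w\-.]+)')

results = [pattern.findall(u) for u in urls]
for r in results:
    print(r)
# [('https', 'geeksforgeeks.org')]
# [('https', 'geeksforgeeks.org')]
```

Because the pattern has two capturing groups, each match comes back as a (protocol, hostname) tuple.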
When a URL may contain a port number, we can extend the regex with an optional group using the '?' quantifier. The port is then captured only when it is present, and the pattern still matches URLs that have no port.
Python
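import re
s = 'file://localhost:4040/abc_file'
# Protocol: the word characters before "://"
p = re.findall(r'(\w+)://', s)
print(p)
# Hostname only: stops at the first character outside [\w\-.], here ":"
h = re.findall(r'://([\w\-.]+)', s)
print(h)
# Hostname plus an optional ":port" group
hp = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(hp)
Output:
['file']
['localhost']
[('localhost', ':4040', '4040')]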
Explanation:
- (\w+):// captures the protocol (file) and ://([\w\-.]+) captures the hostname (localhost).
- (:(\d+))? captures the port number after a colon, if it exists.
- ? makes the port group optional, ensuring it works for URLs with or without ports.
- Here, the tuple represents (hostname, :port_with_colon, port_number).
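If the colon-prefixed element is not needed, one possible variation is to make the outer port group non-capturing with (?:...), so the tuple holds only the hostname and the port digits. The sketch below also runs the pattern on a URL without a port, which is an added sample for illustration, to show what re.findall() returns when the optional group does not take part in the match.

```python
import re

# (?::(\d+))? is optional and non-capturing on the outside, so only the
# digits inside it are returned, without the leading colon.
pattern = re.compile(r'://([\w\-.]+)(?::(\d+))?')

with_port = pattern.findall('file://localhost:4040/abc_file')
without_port = pattern.findall('file://localhost/abc_file')  # illustrative URL

print(with_port)     # [('localhost', '4040')]
print(without_port)  # [('localhost', '')]
```

A group that does not participate in the match is reported as an empty string, so a missing port shows up as '' rather than raising an error.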
A single pattern can also extract the protocol, domain, filename, and file extension together. This is useful for structured URLs where each part follows a predictable pattern, such as a domain followed directly by a file name.
Python
import re
s = 'http://www.example.com/index.html'
# Four groups: protocol, domain, filename, extension
res = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(res)
Output:
[('http', 'www.example.com', 'index', 'html')]
Explanation:
- (\w+):// captures the protocol (http) and ([\w\-.]+) captures the domain (www.example.com).
- (\w+) captures the filename (index) and (\w+) after the dot captures the file extension (html).
- re.findall() returns all matching tuples containing these groups.
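Positional tuples become harder to read as the number of groups grows. As a hedged sketch, the same pattern can be written with named groups and used with re.search(), which returns a single match object. The group names and the parse_url helper are hypothetical names introduced here for illustration.

```python
import re

# Same pattern as above, with each group given a descriptive name.
URL_PATTERN = re.compile(
    r'(?P<protocol>\w+)://'
    r'(?P<domain>[\w\-.]+)/'
    r'(?P<filename>\w+)\.(?P<extension>\w+)'
)

def parse_url(url):
    """Return a dict of URL parts, or None if the URL does not fit the pattern."""
    m = URL_PATTERN.search(url)
    return m.groupdict() if m else None

parsed = parse_url('http://www.example.com/index.html')
print(parsed)
# {'protocol': 'http', 'domain': 'www.example.com', 'filename': 'index', 'extension': 'html'}

# A URL with no file name after the slash does not match this pattern.
no_file = parse_url('http://www.example.com/')
print(no_file)  # None
```

The None result highlights a limitation of this pattern: it assumes exactly one file name with an extension directly after the domain, so deeper paths or extensionless URLs would need a more permissive expression.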