Parsing and Processing URL using Python - Regex

Last Updated : 04 Nov, 2025

Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (Regex) in Python. For example:

Input: https://www.geeksforgeeks.org/courses
Output:
Protocol: https
Hostname: geeksforgeeks.org

Let’s explore different methods to parse and process a URL in Python using Regex.

Using re.findall() to Extract Protocol and Hostname

The re.findall() method scans the entire string and returns all non-overlapping matches of the pattern as a list. When the pattern contains a single capturing group, the list holds the text captured by that group rather than the whole match.

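As a brief illustration of how the return value of re.findall() depends on the number of capturing groups, the sketch below runs three patterns over a made-up string. The sample text and variable names are assumptions chosen for demonstration only.

```python
import re

# Hypothetical sample text containing two URLs (not from the article).
text = 'see http://a.example and ftp://b.example'

# No capturing group: each list item is the whole match.
whole = re.findall(r'\w+://', text)

# One capturing group: each item is the text of that group.
schemes = re.findall(r'(\w+)://', text)

# Two capturing groups: each item is a tuple with one entry per group.
pairs = re.findall(r'(\w+)://([\w\-.]+)', text)

print(whole)    # ['http://', 'ftp://']
print(schemes)  # ['http', 'ftp']
print(pairs)    # [('http', 'a.example'), ('ftp', 'b.example')]
```
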
Python

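import re

s = 'https://www.geeksforgeeks.org/'

# Protocol: the word characters immediately before '://'
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname: skip the literal 'www.' prefix and capture what follows
h = re.findall(r'://www\.([\w\-.]+)', s)
print(h)

Output

['https']
['geeksforgeeks.org']

Explanation:

(\w+):// captures the protocol part before ://.
://www\.([\w\-.]+) matches the literal www. prefix and captures the hostname, which may contain letters, digits, underscores, dots, or hyphens.
Because each pattern has one capturing group, re.findall() returns a list of the captured strings.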
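
The hostname pattern above only matches URLs whose host begins with www. A minimal sketch of one way to make that prefix optional is shown here; the helper name get_hostname and the second and third test strings are assumptions for illustration, not part of the original example.

```python
import re

def get_hostname(url):
    """Return the hostname without a leading 'www.', or None if no match."""
    # (?:www\.)? optionally consumes a literal 'www.' without capturing it.
    m = re.search(r'://(?:www\.)?([\w\-.]+)', url)
    return m.group(1) if m else None

print(get_hostname('https://www.geeksforgeeks.org/'))  # geeksforgeeks.org
print(get_hostname('https://docs.python.org/3/'))      # docs.python.org
print(get_hostname('not a url'))                       # None
```

re.search() is used instead of re.findall() because only the first hostname is wanted, and a missing match can then be reported as None rather than an empty list.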

Using re.findall() to Extract Port Number (if Present)

When a URL may contain a port number, we can wrap the port part of the pattern in a group and follow it with the '?' quantifier. The group then matches the port when it is present and is skipped otherwise.

Python

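import re

s = 'file://localhost:4040/abc_file'

# Protocol before '://'
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname only; the character class excludes ':' so the match stops before the port
h = re.findall(r'://([\w\-.]+)', s)
print(h)

# Hostname plus an optional ':port' group
hp = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(hp)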

Output

['file']
['localhost']
[('localhost', ':4040', '4040')]

Explanation:

(\w+):// captures the protocol (file) and ://([\w\-.]+) captures the hostname (localhost).
(:(\d+))? matches a colon followed by digits; the outer group captures the colon with the port, and the inner group captures the digits alone.
? makes the outer group optional, so the pattern works for URLs with or without a port.
Here, the tuple represents (hostname, :port_with_colon, port_number). For a URL without a port, the last two items are empty strings.
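
The tuple above carries the port twice, once with the colon and once without. As a sketch of an alternative, a non-capturing group and named groups can return the port alone as an integer. The names HOST_PORT and host_and_port, and the second test URL, are hypothetical and introduced only for this example.

```python
import re

# (?: ... ) groups the colon and digits without capturing the colon;
# only the digits land in the named group 'port'.
HOST_PORT = re.compile(r'://(?P<host>[\w\-.]+)(?::(?P<port>\d+))?')

def host_and_port(url):
    """Return (host, port) with port as an int, or None when it is absent."""
    m = HOST_PORT.search(url)
    if m is None:
        return None
    port = m.group('port')  # None when the optional group did not match
    return (m.group('host'), int(port) if port is not None else None)

print(host_and_port('file://localhost:4040/abc_file'))  # ('localhost', 4040)
print(host_and_port('https://www.example.com/index'))   # ('www.example.com', None)
```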

Using re.findall() to Extract Full URL Components

This approach extracts the protocol, domain, filename, and file extension with a single pattern. It is useful for structured URLs where each part follows a predictable pattern, such as a path made of one file name with an extension (/index.html).

Python

import re  
s = 'http://www.example.com/index.html'
res = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(res)

Output

[('http', 'www.example.com', 'index', 'html')]

Explanation:

(\w+):// captures the protocol (http) and ([\w\-.]+) captures the domain (www.example.com).
/(\w+) captures the filename (index) after the slash, and \.(\w+) captures the file extension (html) after a literal dot.
With four capturing groups, re.findall() returns one tuple per match.
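
The pattern above expects a path of the form /name.ext. For the components named in the introduction (protocol, hostname, port, and path), one possible combined pattern is sketched below. The names URL_RE and parse_url are hypothetical, and the pattern is a simplification: it does not handle user credentials, IPv6 hosts, or query strings.

```python
import re

# Adjacent string literals are concatenated into one pattern.
URL_RE = re.compile(
    r'(?P<protocol>\w+)://'   # scheme before ://
    r'(?P<host>[\w\-.]+)'     # hostname
    r'(?::(?P<port>\d+))?'    # optional :port, digits only are captured
    r'(?P<path>/[^?#\s]*)?'   # optional path, up to ?, #, or whitespace
)

def parse_url(url):
    """Return a dict of URL parts, or None if the string does not match."""
    m = URL_RE.match(url)
    return m.groupdict() if m else None

print(parse_url('https://www.geeksforgeeks.org/courses'))
# {'protocol': 'https', 'host': 'www.geeksforgeeks.org', 'port': None, 'path': '/courses'}
print(parse_url('file://localhost:4040/abc_file'))
# {'protocol': 'file', 'host': 'localhost', 'port': '4040', 'path': '/abc_file'}
```

Groups that do not take part in the match, such as a missing port, appear as None in the dictionary.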

sangy987
