
Parsing and Processing URL using Python - Regex

Last Updated : 04 Nov, 2025

Given a URL, the task is to extract key components such as the protocol, hostname, port number, and path using regular expressions (Regex) in Python. For example:

Input: https://www.geeksforgeeks.org/courses
Output:
Protocol: https
Hostname: geeksforgeeks.org

Let’s explore different methods to parse and process a URL in Python using Regex.

Using re.findall() to Extract Protocol and Hostname

The re.findall() method scans the entire string and returns all non-overlapping matches of the given pattern as a list. When the pattern contains a capturing group, the list holds the text captured by that group rather than the whole match.

Python
import re  
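s = 'https://www.geeksforgeeks.org/'

# Protocol: the word characters immediately before '://'
p = re.findall(r'(\w+)://', s)
print(p)

# Hostname: the run of word characters, dots, and hyphens after '://www.'
h = re.findall(r'://www\.([\w\-.]+)', s)
print(h)

Output
['https']
['geeksforgeeks.org']

Explanation:

  • (\w+):// captures the protocol part before ://.
  • ://www\.([\w\-.]+) captures the hostname after the www. prefix; it may contain letters, digits, underscores, dots, or hyphens. The dot in www\. is escaped so that it matches only a literal dot.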
  • re.findall() returns all matching parts as a list.
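
The hostname pattern above expects the hostname to begin with www. As a minimal sketch of one way to relax that, the prefix can be wrapped in an optional non-capturing group. The helper name protocol_and_host is hypothetical and used only for illustration; it is not part of the original example.

```python
import re

# Hypothetical helper for illustration: returns (protocol, hostname) or None.
def protocol_and_host(url):
    # (?:www\.)? skips a leading "www." when present, without capturing it
    m = re.search(r'(\w+)://(?:www\.)?([\w\-.]+)', url)
    return (m.group(1), m.group(2)) if m else None

print(protocol_and_host('https://www.geeksforgeeks.org/'))     # ('https', 'geeksforgeeks.org')
print(protocol_and_host('https://geeksforgeeks.org/courses'))  # ('https', 'geeksforgeeks.org')
```

re.search() is used here instead of re.findall() because only the first URL-like match is needed, and the match object gives direct access to each group.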

Using re.findall() to Extract Port Number (if Present)

When a URL may contain a port number, we can extend the regex with an optional group using the '?' quantifier, so that the port is captured only when it is present.

Python
import re  
s = 'file://localhost:4040/abc_file'

p = re.findall(r'(\w+)://', s)
print(p)

h = re.findall(r'://([\w\-.]+)', s)
print(h)

hp = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(hp)

Output
['file']
['localhost']
[('localhost', ':4040', '4040')]

Explanation:

  • (\w+):// captures the protocol (file) and ://([\w\-.]+) captures the hostname (localhost).
  • (:(\d+))? captures the port number after a colon, if it exists.
  • ? makes the port group optional, so the pattern works for URLs with or without ports; when no port is present, re.findall() returns empty strings for the two port groups.
  • Here, the tuple represents (hostname, :port_with_colon, port_number).
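
The extra ':4040' element in the tuple comes from the outer capturing group. A possible variation, sketched below, uses a non-capturing group (?:...) for the colon and named groups for the parts that matter. The function name host_and_port and the group names host and port are assumptions made for this illustration.

```python
import re

# Hypothetical helper for illustration: returns (hostname, port) where
# port is an int, or None when the URL has no port.
def host_and_port(url):
    # (?:...) groups ":<digits>" without adding an extra captured element
    m = re.search(r'://(?P<host>[\w\-.]+)(?::(?P<port>\d+))?', url)
    if m is None:
        return None
    port = m.group('port')
    return (m.group('host'), int(port) if port else None)

print(host_and_port('file://localhost:4040/abc_file'))     # ('localhost', 4040)
print(host_and_port('https://www.geeksforgeeks.org/'))     # ('www.geeksforgeeks.org', None)
```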

Using re.findall() to Extract Full URL Components

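This approach extracts the protocol, domain, filename, and file extension with a single pattern. It is useful for structured URLs where each part follows a predictable pattern; as written, it expects the filename (name.ext) to follow the domain directly.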

Python
import re  
s = 'http://www.example.com/index.html'
res = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(res)

Output
[('http', 'www.example.com', 'index', 'html')]

Explanation:

  • (\w+):// captures the protocol (http) and ([\w\-.]+) captures the domain (www.example.com).
  • (\w+) captures the filename (index) and (\w+) after the dot captures the file extension (html).
  • re.findall() returns a list of tuples, one per match, each containing these four groups.
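
The patterns in this article can also be combined into one expression that tolerates a missing port or path. The following is a rough sketch under that assumption, not a complete URL grammar: the names URL_RE and parse_url are hypothetical, and query strings, fragments, and user credentials are deliberately ignored.

```python
import re

# Hypothetical combined pattern for illustration only.
URL_RE = re.compile(
    r'(?P<protocol>\w+)://'      # protocol before '://'
    r'(?P<host>[\w\-.]+)'        # hostname: word characters, dots, hyphens
    r'(?::(?P<port>\d+))?'       # optional ':<digits>' port
    r'(?P<path>/[^\s?#]*)?'      # optional path, up to '?', '#', or whitespace
)

def parse_url(url):
    # match() anchors at the start of the string; returns a dict or None
    m = URL_RE.match(url)
    return m.groupdict() if m else None

print(parse_url('http://www.example.com/index.html'))
print(parse_url('file://localhost:4040/abc_file'))
```

Groups that do not take part in the match (for example, port in the first URL) appear as None in the returned dictionary.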
