Python | Grouping similar substrings in list
Last Updated : 11 Apr, 2023
Sometimes an application needs to group strings that share a common prefix so that further processing can be done per group. This kind of grouping is useful in areas such as machine learning and web development. Let's discuss certain ways in which this can be done.
Method #1 : Using lambda + itertools.groupby() + split()
The combination of the above three functions helps us achieve the task. The split method is key, as it defines the separator on which the grouping is performed. The groupby function groups consecutive elements that share the same key, which is why the list is sorted first.
Step-by-step approach:
- Import the groupby function from the itertools module.
- Initialize a list of strings test_list with some elements.
- Sort the test_list in ascending order using the sort() method. This is necessary for grouping later.
- Print the original test_list.
- Use a list comprehension to iterate over the groups of elements in test_list grouped by the first substring before the _ character.
- In the groupby() call, test_list is the iterable, and the lambda function lambda a: a.split('_')[0] is the key function: it returns the substring before the first _ character in each element of the list. This key is used to group the elements.
- Convert each group into a list and append it to the result list res.
- Print the result list res.
Below is the implementation of the above approach:
Python3
# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + split()
from itertools import groupby
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# sort list
# essential for grouping
test_list.sort()
# printing the original list
print ("The original list is : " + str(test_list))
# using lambda + itertools.groupby() + split()
# group similar substrings
res = [list(i) for j, i in groupby(test_list,
                                   lambda a: a.split('_')[0])]
# printing result
print ("The grouped list is : " + str(res))
Output:
The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
Time complexity: O(nlogn), where n is the length of the input list.
Auxiliary space: O(n), where n is the length of the input list.
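The sort step matters because groupby() only merges adjacent items whose keys are equal. The short sketch below contrasts unsorted and sorted input; the helper name group_runs is made up for this illustration and is not part of itertools.

```python
from itertools import groupby

def group_runs(items, sep="_"):
    # groupby() merges only *adjacent* items that share the same key
    return [list(g) for _, g in groupby(items, key=lambda s: s.split(sep)[0])]

data = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# Without sorting, equal prefixes are not adjacent, so every item
# ends up in its own group.
unsorted_groups = group_runs(data)

# After sorting, strings with the same prefix sit next to each other.
sorted_groups = group_runs(sorted(data))

print(unsorted_groups)
print(sorted_groups)
```

With the unsorted list every string lands in a separate one-element group, while the sorted list yields the three expected groups.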
Method #2 : Using lambda + itertools.groupby() + partition()
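The same task can also be performed by replacing the split function with the partition function. partition() stops at the first occurrence of the separator and returns a fixed three-part tuple instead of building a list of every field, so it can do slightly less work when a string contains several separators.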
Python3
# Python3 code to demonstrate
# group similar substrings
# using lambda + itertools.groupby() + partition()
from itertools import groupby
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# sort list
# essential for grouping
test_list.sort()
# printing the original list
print ("The original list is : " + str(test_list))
# using lambda + itertools.groupby() + partition()
# group similar substrings
res = [list(i) for j, i in groupby(test_list,
                                   lambda a: a.partition('_')[0])]
# printing result
print ("The grouped list is : " + str(res))
Output:
The original list is : ['coder_2', 'coder_3', 'geek_1', 'geek_4', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
Time complexity: O(n log n) (due to sorting the list).
Auxiliary space: O(n) (for creating the result list "res").
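To make the difference between the two string methods concrete, the following small sketch compares what split() and partition() return. The sample strings are invented for illustration only.

```python
s = 'geek_1_extra'

# split() cuts at every separator and returns a list of all fields
split_parts = s.split('_')

# partition() cuts only at the first separator and always returns
# a 3-tuple: (text before, separator, text after)
partition_parts = s.partition('_')

# When the separator is missing, both still give the whole string
# as the "prefix", so neither raises an error.
no_sep = 'geek'
split_prefix = no_sep.split('_')[0]
partition_result = no_sep.partition('_')

print(split_parts)       # ['geek', '1', 'extra']
print(partition_parts)   # ('geek', '_', '1_extra')
print(split_prefix)      # geek
print(partition_result)  # ('geek', '', '')
```

For the key function used here only index 0 is needed, so both methods produce the same groups.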
Method #3 : Using index() and find() methods
Python3
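# Python3 code to demonstrate
# group similar substrings
# using index() and find()
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
print("The original List is : " + str(test_list))
# collect the unique prefixes (text before the first underscore)
x = []
for i in test_list:
    x.append(i[:i.index("_")])
x = list(set(x))
# gather every string that begins with each prefix
res = []
for i in x:
    a = []
    for j in test_list:
        # find() returns 0 only when the match is at the very start;
        # adding "_" keeps 'pro' from also matching 'program_1'
        if j.find(i + "_") == 0:
            a.append(j)
    res.append(a)
# printing result
print("The grouped list is : " + str(res))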
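Output:
The original List is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]
The order of the groups may differ between runs, because the unique prefixes pass through a set, which has no guaranteed order.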
Time complexity: O(n^2), where 'n' is the length of the input list 'test_list'.
Auxiliary space: O(n), where 'n' is the length of the input list 'test_list'.
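If a stable group order is wanted, one option is to deduplicate the prefixes with dict.fromkeys() instead of set(). This is a variation on the method above, not the original approach; it relies on dictionaries preserving insertion order (guaranteed from Python 3.7).

```python
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# dict.fromkeys() drops duplicates but keeps first-seen order
prefixes = list(dict.fromkeys(s[:s.index('_')] for s in test_list))

# find(...) == 0 means the prefix (plus separator) is at the start
res = [[s for s in test_list if s.find(p + '_') == 0] for p in prefixes]

print(prefixes)
print(res)
```

The groups then appear in the order in which each prefix first occurs in the input.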
Method #4 : Using startswith()
Python3
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# printing the original list
print("The original list is : " + str(test_list))
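# unique prefixes (text before the first underscore)
prefixes = set(item[:item.index("_")] for item in test_list)
# using startswith in a list comprehension; the trailing "_" stops
# 'pro' from also matching a string such as 'program_1'
res = [[item for item in test_list if item.startswith(prefix + "_")]
       for prefix in prefixes]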
# printing result
print("The grouped list is : " + str(res))
#This code is contributed by Edula Vinay Kumar Reddy
Output:
The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['pro_3'], ['geek_1', 'geek_4']]
Time Complexity: O(n*k), where n is the length of test_list and k is the number of unique prefixes, because the list is scanned once to collect the prefixes and then once more for every prefix. In the worst case, where every string has its own prefix, this is O(n^2).
Auxiliary Space: O(n), for the set of unique prefixes and the grouped list.
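One caveat with startswith() is that a bare prefix also matches longer prefixes that begin with the same letters. The sketch below uses an invented list to show the difference between testing the prefix alone and testing the prefix together with the separator.

```python
# hypothetical input where one prefix ('pro') is the start of another ('program')
items = ['pro_3', 'program_1', 'pro_7']

# testing the bare prefix also picks up 'program_1'
loose = [s for s in items if s.startswith('pro')]

# including the separator restricts the match to the exact prefix
strict = [s for s in items if s.startswith('pro' + '_')]

print(loose)   # ['pro_3', 'program_1', 'pro_7']
print(strict)  # ['pro_3', 'pro_7']
```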
Method #5: Using a dictionary to group similar substrings
Use a dictionary to group the substrings that have the same prefix. The keys of the dictionary will be the prefixes, and the values will be lists containing the substrings with that prefix. Here's an example implementation:
Python3
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
grouped = {}
for s in test_list:
    # the prefix is the text before the first underscore
    prefix = s.split('_')[0]
    # start a new group the first time a prefix is seen
    if prefix not in grouped:
        grouped[prefix] = []
    grouped[prefix].append(s)
res = list(grouped.values())
print(res)
Output:
[['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]
Time complexity: O(n*k), where n is the length of the input list and k is the maximum length of a string, since split() scans each string once.
Auxiliary space: O(n*k) in the worst case: the value lists together hold each of the n strings exactly once, and the dictionary stores up to n distinct prefixes of up to k characters each.
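The "create the list if the key is missing" check can be left to collections.defaultdict from the standard library. The following is a minimal sketch of the same dictionary grouping written that way, offered as an alternative formulation rather than a replacement for the method above.

```python
from collections import defaultdict

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# defaultdict(list) creates an empty list the first time a key is used
grouped = defaultdict(list)
for s in test_list:
    # maxsplit=1 stops splitting after the first underscore
    grouped[s.split('_', 1)[0]].append(s)

res = list(grouped.values())
print(res)
```

Because dictionaries keep insertion order, the groups appear in the order in which each prefix is first seen.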
Method #6: Using a loop and a dictionary
Step-by-step approach:
- Initialize the list of strings.
- Create an empty dictionary to store the groups.
- Iterate over each string in the list.
- Extract the substring before the underscore using the split() method.
- Check if the key exists in the dictionary. If it does, append the string to the list under the key. If it doesn't, create a new list with the string under the key.
- Convert the dictionary to a list of lists using the values() method.
- Print the original list and the grouped list.
Python3
# initializing list
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# creating an empty dictionary
d = {}
# iterating over each string in the list
for s in test_list:
    # extracting the substring before the underscore
    key = s.split('_')[0]
    # adding the string to the dictionary under the key
    if key in d:
        d[key].append(s)
    else:
        d[key] = [s]
# converting the dictionary to a list of lists
res = list(d.values())
# printing the original list
print("The original list is : " + str(test_list))
# printing the result
print("The grouped list is : " + str(res))
Output:
The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['geek_1', 'geek_4'], ['coder_2', 'coder_3'], ['pro_3']]
Time complexity: O(n), where n is the number of strings in the list, treating the length of each string as bounded. The loop visits each string once, and dictionary lookups and insertions take constant time on average.
Auxiliary space: O(n), since the dictionary holds each of the n strings in exactly one group, plus one key per distinct prefix.
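The loop-and-dictionary steps can also be wrapped in a small reusable function. The sketch below is one possible packaging: the function name group_by_prefix and its sep parameter are illustrative choices, and dict.setdefault() stands in for the explicit if/else.

```python
def group_by_prefix(strings, sep='_'):
    """Return a dict mapping each prefix to the strings that carry it."""
    groups = {}
    for s in strings:
        # partition() gives the text before the first separator at index 0;
        # setdefault() returns the existing list or installs a new empty one
        groups.setdefault(s.partition(sep)[0], []).append(s)
    return groups

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
groups = group_by_prefix(test_list)

print(groups)
print(list(groups.values()))
```

Returning the dictionary rather than only its values keeps the prefix available to the caller; list(groups.values()) gives the same list of lists as before.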
Method #7: Using numpy
Algorithm :
- Initialize the input list test_list.
- Get the unique prefixes from the input list using np.unique.
- Group the elements in test_list by prefix using a list comprehension.
- Print the grouped list res.
Python3
import numpy as np
test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
# printing the original list
print("The original list is : " + str(test_list))
# Get unique prefixes
prefixes = np.unique([item.split('_')[0] for item in test_list])
# Group elements by prefix
# the trailing "_" stops 'pro' from also matching a string such as 'program_1'
res = [[item for item in test_list if item.startswith(prefix + '_')]
       for prefix in prefixes]
# printing result
print("The grouped list is : " + str(res))
#This code is contributed by Jyothi pinjala.
Output:
The original list is : ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']
The grouped list is : [['coder_2', 'coder_3'], ['geek_1', 'geek_4'], ['pro_3']]
Time complexity:
The np.unique function returns the sorted unique values, which typically costs O(n log n).
The nested list comprehension that builds res scans test_list once per unique prefix, which is O(n*k) for k unique prefixes and O(n^2) in the worst case, where n is the length of the input list.
Therefore, the overall time complexity of the algorithm is O(n^2) in the worst case.
Auxiliary Space:
The space complexity of the algorithm is O(n), for the array of unique prefixes and the grouped list.
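The repeated scan can be avoided by asking np.unique for its inverse indices. With return_inverse=True it also returns, for every input element, the position of that element's value in the sorted unique array, which can be used directly as a group index. This is a sketch of an alternative, not the method shown above.

```python
import numpy as np

test_list = ['geek_1', 'coder_2', 'geek_4', 'coder_3', 'pro_3']

# prefix of every element, in input order
keys = [item.split('_')[0] for item in test_list]

# prefixes: sorted unique prefixes
# inverse:  for each element, the index of its prefix within `prefixes`
prefixes, inverse = np.unique(keys, return_inverse=True)

# one empty group per unique prefix, filled in a single pass
res = [[] for _ in prefixes]
for item, idx in zip(test_list, inverse):
    res[idx].append(item)

print(res)
```

After the call to np.unique, the grouping itself is a single pass over the list, and the groups come out ordered by sorted prefix.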