
Similarity Metrics of Strings – Python

Last Updated : 17 Jan, 2025

In Python, we often need to measure the similarity between two strings. For example, given the strings “geeks” and “geeky”, we might want to know how closely they match, whether for comparing user inputs or for finding duplicate entries. Let’s explore different methods to compute string similarity.

Using SequenceMatcher() from difflib

The SequenceMatcher class in the difflib module measures string similarity by finding the blocks of characters that two strings share and returning a ratio based on the total length of those blocks.

Python
from difflib import SequenceMatcher

s1 = "geeks"
s2 = "geeky"

# Calculating similarity ratio
res = SequenceMatcher(None, s1, s2).ratio()
print(res)  

Output
0.8

Explanation:

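  • SequenceMatcher() finds the matching blocks of characters shared by the two strings.
  • The ratio() method returns 2 * M / T, a float between 0 and 1, where M is the number of matched characters and T is the combined length of both strings. Here M = 4 (“geek”) and T = 10, which gives 0.8.
  • It is part of the standard library, so it needs no installation and works well for general string similarity tasks.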
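To make the ratio less of a black box, the following minimal sketch recomputes it by hand from get_matching_blocks(). The variable names matched and manual_ratio are chosen here for illustration only.

```python
from difflib import SequenceMatcher

s1 = "geeks"
s2 = "geeky"

matcher = SequenceMatcher(None, s1, s2)

# Each matching block is an (a, b, size) triple. The final block is a
# zero-length sentinel, so summing the sizes gives the match count M.
matched = sum(block.size for block in matcher.get_matching_blocks())

# difflib documents ratio() as 2.0 * M / T, where T is the combined length.
manual_ratio = 2.0 * matched / (len(s1) + len(s2))

print(matched)          # number of matched characters
print(manual_ratio)     # ratio computed by hand
print(matcher.ratio())  # ratio reported by difflib
```

Both ratio lines print the same value, which shows that the score depends only on how many characters fall inside matching blocks and on the total length of the two strings.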

Let’s explore some more methods and see how we can find similarity metrics of strings.

Using Levenshtein distance (edit distance)

Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another.

Python
import Levenshtein

s1 = "geeks"
s2 = "geeky"

# Calculating similarity ratio
res = Levenshtein.ratio(s1, s2)
print(res)  

Explanation:

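  • Levenshtein.ratio() returns a normalized similarity score between 0 and 1 that is derived from the edit distance, where 1 means the strings are identical.
  • Levenshtein is a third-party package, so it must be installed (for example, with pip install Levenshtein) before it can be imported.
  • Edit-based scoring suits typos and small spelling variations, because every inserted, deleted, or changed character lowers the score.
  • This method is widely used in text processing. The classic dynamic-programming algorithm takes time proportional to the product of the two string lengths, so it works best on short to moderate strings.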
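For readers who cannot install the package, the edit distance itself can be sketched in pure Python with the classic dynamic-programming recurrence. The helper levenshtein_distance below is a hypothetical name introduced for illustration; it is not the library's implementation, and it returns the raw edit count rather than the normalized score that Levenshtein.ratio() reports.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("geeks", "geeky"))     # 1
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Keeping only two rows of the table at a time holds memory proportional to the length of one string, while the running time is still proportional to the product of both lengths.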

Using Jaccard similarity

Jaccard similarity is the size of the intersection of two sets divided by the size of their union. For strings, the sets here are the unique characters of each string.

Python
s1 = "geeks"
s2 = "geeky"

# Converting strings to sets of characters
set1 = set(s1)
set2 = set(s2)

# Calculating Jaccard similarity
res = len(set1 & set2) / len(set1 | set2)
print(res) 

Output
0.6

Explanation:

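  • The strings are converted into sets of characters, so “geeks” becomes {g, e, k, s} and “geeky” becomes {g, e, k, y}.
  • The similarity is the size of the intersection (3) divided by the size of the union (5), which gives 0.6.
  • Because sets discard order and repetition, this method compares only which characters appear. It is easy to implement, but it cannot tell anagrams apart.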
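The limitation is easy to see once the calculation is wrapped in a function. The sketch below uses a hypothetical helper named jaccard_similarity; returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero, not a universal convention.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of unique characters in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard_similarity("geeks", "geeky"))    # 0.6
print(jaccard_similarity("listen", "silent"))  # 1.0, anagrams share every character
print(jaccard_similarity("aab", "ab"))         # 1.0, repeated characters are ignored
```

If order matters for the task, one of the edit-based or position-based metrics in this article is a better fit.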

Using Cosine similarity

Cosine similarity measures the angle between two vectors in a multidimensional space, where each string is represented as a vector of character counts.

Python
from collections import Counter
from math import sqrt

s1 = "geeks"
s2 = "geeky"

# Convert strings to character frequency vectors
vec1 = Counter(s1)
vec2 = Counter(s2)

# Calculating cosine similarity
dot_product = sum(vec1[ch] * vec2[ch] for ch in vec1)
magnitude1 = sqrt(sum(count ** 2 for count in vec1.values()))
magnitude2 = sqrt(sum(count ** 2 for count in vec2.values()))
res = dot_product / (magnitude1 * magnitude2)
print(res)  

Output
0.857142857142857

Explanation:

  • The strings are represented as frequency vectors using the Counter class.
  • The dot product and magnitudes of the vectors are used to compute the similarity.
  • Unlike Jaccard similarity, this method takes into account how often each character occurs, but it still ignores the order of the characters.
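The same calculation can be packaged as a reusable function, which also leaves room to handle empty input. In this sketch, cosine_similarity is a hypothetical helper name, and returning 0.0 when either string is empty is an assumption made here, since the cosine of a zero vector is undefined.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the character-count vectors of a and b."""
    vec_a, vec_b = Counter(a), Counter(b)
    # A Counter returns 0 for missing keys, so characters absent from b add nothing.
    dot = sum(count * vec_b[ch] for ch, count in vec_a.items())
    norm_sq_a = sum(count * count for count in vec_a.values())
    norm_sq_b = sum(count * count for count in vec_b.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        # Assumption: an empty string has no direction, so report no similarity.
        return 0.0
    return dot / sqrt(norm_sq_a * norm_sq_b)


print(cosine_similarity("geeks", "geeky"))    # roughly 0.857 (6 / 7)
print(cosine_similarity("listen", "silent"))  # 1.0, identical character counts
print(cosine_similarity("abc", "xyz"))        # 0.0, no characters in common
```

Taking a single square root of the product of the squared norms is a small design choice that avoids multiplying two rounded square roots together.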

Using Hamming distance

Hamming distance measures the number of differing characters at corresponding positions in two strings of equal length.

Python
s1 = "geeks"
s2 = "geeky"

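# Hamming distance is only defined for strings of equal length
if len(s1) != len(s2):
    raise ValueError("Strings must be of equal length")

# Counting the positions at which the characters differ
res = sum(c1 != c2 for c1, c2 in zip(s1, s2))
print(res)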

Output
1

Explanation:

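  • The zip() function pairs the characters found at the same position in both strings.
  • A generator expression compares each pair, and sum() counts the mismatches because each True counts as 1.
  • Hamming distance is defined only for strings of equal length, so the lengths are checked before counting. The comparison itself takes a single pass over the strings.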
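Hamming distance is a count rather than a score between 0 and 1. One way to put it on the same scale as the other metrics is to divide by the string length, as in the sketch below. The helper names hamming_distance and hamming_similarity are hypothetical, and treating two empty strings as fully similar is an assumption made here.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Strings must be of equal length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Share of positions that match, as a float between 0 and 1."""
    distance = hamming_distance(a, b)  # also validates the lengths
    if not a:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return 1 - distance / len(a)


print(hamming_distance("geeks", "geeky"))    # 1
print(hamming_similarity("geeks", "geeky"))  # 0.8
```

Raising ValueError for unequal lengths keeps the return type numeric, so callers never have to check whether they received an error message instead of a distance.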

