Hi! Let’s do a tutorial on Levenshtein Distance in Python. We will be using Python 3.8.10. Let’s go! ⚡🔥✨
The Levenshtein Distance measures the difference between two string sequences. It is the minimum number of edits needed to change or transform one string into the other. It is named after mathematician Vladimir Levenshtein who did a lot of research in field in the 1960s.
As a specific metric, the Levenshtein Distance has a wide range of applications. In computer science it is used in spell checking and spell correction applications, NLP systems and OCR. In linguistics it is used to compare different languages for research purposes.
Our aim here is to write a simple function that will calculate the Levenshtein Distance and thereby allow us to know the minimum number of operations needed to convert one word to another. By “operations” we mean insertion, deletion or replacement of a single character. Here is our code:
def leven(x, y):
n = len(x)
m = len(y)
A = [[i + j for j in range(m + 1)] for i in range(n + 1)]
for i in range(n):
for j in range(m):
A[i + 1][j + 1] = min(A[i][j + 1] + 1, # insert
A[i + 1][j] + 1, # delete
A[i][j] + int(x[i] != y[j])) # replace
return A[n][m]
print(leven("brap","rap"))
print(leven("trial","try"))
print(leven("horse","force"))
print(leven("rose","erode"))
#output
#1
#3
#2
#2
Let’s explain what is happening here:
- First we define the function leven() which takes two parameters (x and y) and calculates the Levenshtein Distance between them i.e. the minimum insertions, deletions and replacements needed to transform x into y.
- We construct the 2D array A[i,j] which gives the distance between the prefix of x of length i and the prefix of y of length j. We start with A[0,j] = j and A[i,0] = i.
- When i or j is greater than or equal to 1 there are three possibilities for the remaining letters of the prefix: either xi is deleted, yj is inserted or, if they aren’t the same, xi is replaced by yj
- When the loop is complete A[n][m] will always contain the minimum amount of operations needed to transform x into y.
When the code executes, we should have the Distance printed to the screen as an integer. We used 4 examples and we can easily see that this algorithm works:
- brap to rap: delete the b (1 operation)
- trial to try: replace the i with y, delete a and delete l (3 operations)
- horse to force: replace the h with f, replace s with c (2 operations)
- rose to erode: add an e at the beginning, replace the s with d (2 operations)
Of course there are other ways to do this but this method is quite short and efficient. You can find a fantastic breakdown of the theory behind the Levenshtein Distance HERE and HERE.
Thanks for reading, hopefully this helped. Good luck! 👌👌👌