New in PHP 8.5: Levenshtein Comparison for UTF-8 Strings

PHP 8.5 adds a new function for calculating the Levenshtein distance between strings — now with proper UTF-8 support.
PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.
If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
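As a quick illustration of that definition, here is a minimal sketch using the classic ASCII-only pair "kitten" and "sitting". The words are chosen here for illustration only. Three edits are needed: two substitutions and one insertion.

```php
<?php
// kitten -> sitten  (substitute "k" with "s")
// sitten -> sittin  (substitute "e" with "i")
// sittin -> sitting (insert "g")
$distance = levenshtein('kitten', 'sitting');
var_dump($distance); // int(3)
```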
For example, because levenshtein() compares bytes and ö takes two bytes in UTF-8, the following code returns 2 instead of the correct result, 1:
var_dump(levenshtein('göthe', 'gothe'));
There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.
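To make the pure PHP workaround concrete, here is a minimal sketch that computes the distance over Unicode code points rather than bytes. The function name codepoint_levenshtein() is hypothetical (it is not part of PHP), and the sketch assumes the mbstring extension is available and that both inputs are valid UTF-8.

```php
<?php
// Hypothetical helper, not a PHP built-in: Levenshtein distance over
// Unicode code points instead of bytes. Requires the mbstring extension.
function codepoint_levenshtein(string $a, string $b): int
{
    $a = mb_str_split($a, 1, 'UTF-8');
    $b = mb_str_split($b, 1, 'UTF-8');
    $lenB = count($b);

    // $prev[$j] holds the distance between the part of $a processed so far
    // and the first $j characters of $b. Start with the empty prefix of $a.
    $prev = range(0, $lenB);

    foreach ($a as $i => $charA) {
        $curr = [$i + 1];
        foreach ($b as $j => $charB) {
            $cost = ($charA === $charB) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1, // deletion
                $curr[$j] + 1,     // insertion
                $prev[$j] + $cost  // substitution
            );
        }
        $prev = $curr;
    }

    return $prev[$lenB];
}

$distance = codepoint_levenshtein("g\u{00f6}the", 'gothe');
var_dump($distance); // int(1)
```

Because it runs entirely in userland, a loop like this is likely to be noticeably slower than the built-in function on long strings. It also compares code points, so it still treats a letter followed by a combining mark as two separate characters.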
The new grapheme_levenshtein() function in PHP 8.5, part of the intl extension, returns the correct result, 1, for the same two strings.
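A minimal sketch of the new call, assuming PHP 8.5 or later with the intl extension loaded. The function_exists() guard is added here only so the snippet does not fail with a fatal error on older versions.

```php
<?php
$distance = null;

// grapheme_levenshtein() only exists in PHP 8.5+ with ext-intl,
// so check for it before calling.
if (function_exists('grapheme_levenshtein')) {
    $distance = grapheme_levenshtein("g\u{00f6}the", 'gothe');
    var_dump($distance); // int(1), as described in this article
} else {
    echo "grapheme_levenshtein() is not available on this PHP build\n";
}
```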
Grapheme-Based Comparison
What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining accent (U+0301). In PHP, you can write these as:
$string1 = "\u{00e9}";
$string2 = "\u{0065}\u{0301}";
Even though these strings are technically different at the byte level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0, meaning no difference.
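The difference between the byte view and the grapheme view can be checked directly. This is a small sketch, assuming the intl extension for the grapheme_* functions and PHP 8.5 or later for grapheme_levenshtein(). The guards are there so it still runs where those are missing.

```php
<?php
$string1 = "\u{00e9}";         // precomposed é
$string2 = "\u{0065}\u{0301}"; // e followed by a combining acute accent

var_dump($string1 === $string2); // bool(false): the byte sequences differ
var_dump(strlen($string1));      // int(2): two bytes in UTF-8
var_dump(strlen($string2));      // int(3): one byte plus two bytes

if (function_exists('grapheme_strlen')) {
    // Both strings consist of exactly one grapheme cluster.
    var_dump(grapheme_strlen($string1)); // int(1)
    var_dump(grapheme_strlen($string2)); // int(1)
}

if (function_exists('grapheme_levenshtein')) {
    // Expected to be 0, per the description above.
    var_dump(grapheme_levenshtein($string1, $string2));
}
```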
This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.
Just for fun: what do you think the original levenshtein() function will return for the example above?
var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));