This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
What is the time complexity for calculating Levenshtein distance? For example, is it O(n*m) where n and m are the lengths of the two strings being compared? I think this information would be very helpful to include for any implementations described in the article. -- Showeropera ( talk) 13:32, 14 January 2016 (UTC)
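For reference, the standard dynamic-programming approach is O(n*m) in both time and space: it fills one (m+1) × (n+1) table cell per character pair. A sketch in Python (the function name and the operation counter are mine, added purely for illustration):

```python
def levenshtein_counted(s, t):
    """Wagner-Fischer dynamic program; returns (distance, inner-loop count)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and the first j chars of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    ops = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ops += 1                     # exactly m*n of these
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n], ops
```

On "kitten"/"sitting" this performs 6*7 = 42 inner steps, so the running time is Θ(m*n) regardless of the strings' contents.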
"if s[i-1] = t[j-1]" should have been "if s[i] = t[j]", since s and t are declared as "char s[1..m], char t[1..n]", and the loops go from 1 to m and 1 to n. The i-1 and j-1 offsets are needed in implementations in languages where string character indexing starts at 0 (e.g., C). Fixed. --Anon
I don't get where the following has appeared from?
int cost = (s[i-1] == t[j-1]) ? 0 : 1; d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost )
...in the pseudo-code, it is suggested that the code is actually...
if( s[i-1] == t[j-1] ) d[i,j] = d[i-1,j-1]; else d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+1 )
I tried the latter and it works, so why do all of the other implementations I've seen use the former code??? -- Lee. — Preceding unsigned comment added by 31.53.116.129 ( talk) 17:34, 25 November 2011 (UTC)
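Both formulations compute identical tables: adjacent cells in the distance matrix differ by at most 1, so whenever the characters match, d[i-1][j-1] can never exceed d[i-1][j]+1 or d[i][j-1]+1, and the min collapses to the diagonal. A quick check in Python (the two helper names are my own, for comparison only):

```python
def lev_cost(s, t):
    """The 'cost' formulation: always take a three-way min."""
    m, n = len(s), len(t)
    d = [[max(i, j) if min(i, j) == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def lev_branch(s, t):
    """The if/else formulation: copy the diagonal on a character match."""
    m, n = len(s), len(t)
    d = [[max(i, j) if min(i, j) == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + 1)
    return d[m][n]
```

So the former is not wrong, merely redundant on matches; some authors prefer it because the recurrence is uniform.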
Levenshtein's paper was published in 1966, not 1965. cf. V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics – Doklady 10, pp. 707-710, 1966. ralian 03:15, 11 April 2007 (UTC)
This page should link to the competing utilities such as wdiff, since the Levenshtein distance and diff, wdiff, sdiff, etc. all address the same set of problems (i.e., what is the difference between two strings/files).
Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?
Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array out of bound errors. It was easy enough to fix when I looked at what it was actually doing, but is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.
I agree that the array indexing is confusing. Adding offsets doesn't change the algorithm, it's just another way to show it. How it is at the moment is not especially useful given how most languages lack features as nice as "declare int d[0..lenStr1, 0..lenStr2]"; "declare int d[size1,size2]" is much more common. More importantly, though, you can express code using the first type with the second, but not the other way around, which makes it more practical for pseudocode. Jheriko 05:41, 8 March 2006 (UTC)
I agree it may make sense for the pseudocode to be indexed this way, but could we at least put a comment to indicate it is the ith or jth character? The array syntax implies it is an index, not the character number. James Van Boxtel ( talk) 20:12, 28 May 2010 (UTC)
This pseudocode is garbage and needs to be cleaned up — Preceding unsigned comment added by 24.8.146.185 ( talk) 06:53, 26 January 2013 (UTC)
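For readers tripped up by the mixed indexing: the same pseudocode transcribed into a 0-based language keeps d[i][j] meaning "first i characters of s vs first j characters of t", and the -1 offsets appear only at the character comparisons. A hedged sketch (my own transliteration, not the article's code):

```python
def levenshtein(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):        # i counts characters of s, 1..m
        for j in range(1, n + 1):    # j counts characters of t, 1..n
            # the i-th character of s sits at 0-based position i-1
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]
```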
Why is the Haskell code so complicated? Isn't this enough?
editDistance :: Eq a => [a] -> [a] -> Int
editDistance s [] = length s
editDistance [] t = length t
editDistance (s:ss) (t:ts) = minimum [ (if s == t then 0 else 1) + editDistance ss ts, 1 + editDistance ss (t:ts), 1 + editDistance (s:ss) ts ]
Okay, I replaced the original code below with the shorter code above. Jirka
min3 :: Int -> Int -> Int -> Int
min3 x y z = min x (min y z)

cost :: Char -> Char -> Int
cost a b | a == b    = 0
         | otherwise = 1

sumCosts :: String -> Char -> Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a

editDistance :: String -> String -> Int
editDistance [] [] = 0
editDistance s [] = sumCosts s '-'
editDistance [] t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts) (cost s '-' + editDistance ss (t:ts)) (cost '-' t + editDistance (s:ss) ts)
Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?
-- Daren Brantley 04:15, 16 July 2005 (UTC)
f x + f x

Haskell will evaluate the function at that value twice. Compare that to the following:

let v = f x in v + v

where v is created as a reference to a function call. The function is called on the first evaluation of v, and once it is evaluated, subsequent references to it will not need to call the function again. So the first example evaluates f x twice, while the second only evaluates it once. AquitaneHungerForce ( talk) 12:27, 29 September 2023 (UTC)

If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. -- 132.198.104.164 21:38, 18 July 2005 (UTC)
I think the current Haskell implementation is wrong. The Levenshtein algorithm should use no more than O(n²) time, and this one is exponential. It's possible to write a memoizing implementation, but Haskell doesn't do it automatically. (The Ruby one is wrong, too, and probably some others.)
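Right: Haskell shares the value of a named binding, but it does not cache calls by argument, so the naive recursion really is exponential. Explicit memoization has to be added by the programmer. For illustration, the same fix expressed in Python (an illustrative sketch, not code from the article):

```python
from functools import lru_cache

def edit_distance(s, t):
    @lru_cache(maxsize=None)          # memoize on (i, j) suffix positions
    def go(i, j):
        if i == len(s):
            return len(t) - j         # insert the rest of t
        if j == len(t):
            return len(s) - i         # delete the rest of s
        cost = 0 if s[i] == t[j] else 1
        return min(go(i + 1, j) + 1,          # deletion
                   go(i, j + 1) + 1,          # insertion
                   go(i + 1, j + 1) + cost)   # substitution or match
    return go(0, 0)
```

With the cache there are only (len(s)+1) * (len(t)+1) distinct states, so the recursion becomes O(m*n) instead of exponential.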
An O(n²) version:
distance str1 str2 =
  let (l1, l2) = (length str1, length str2)
      istr1 = (0, undefined) : zip [1..] str1
      istr2 = (0, undefined) : zip [1..] str2
      table = array ((0, 0), (l1, l2))
                    [ ((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2 ]
      value 0 _ j _ = j
      value i _ 0 _ = i
      value i c1 j c2 = minimum [ table ! (i-1, j) + 1
                                , table ! (i, j-1) + 1
                                , table ! (i-1, j-1) + cost ]
        where cost = if c1 == c2 then 0 else 1
  in table ! (l1, l2)
The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:
distance str1 str2 = last $ elems $ foldl update table
    [ (i,j) | i <- [1..length str1], j <- [1..length str2] ]
  where
    table = initial (length str1, length str2)
    update table (i,j) = table // [((i,j), value)]
      where value = minimum [ table ! (i-1, j)   + 1    -- deletion
                            , table ! (i,   j-1) + 1    -- insertion
                            , table ! (i-1, j-1) + cost -- substitution
                            ]
              where cost = if str1!!(i-1) == str2!!(j-1) then 0 else 1
    initial (b1,b2) = array ((0,0),(b1,b2)) [ ((i,j), value (i,j)) | i <- [0..b1], j <- [0..b2] ]
      where value (0,j) = j
            value (i,0) = i
            value (i,j) = 0
Here is an O(n) space version, using iteration over an initial row. This is much faster (with GHC; have not tried others) since it uses seq to force each cell as it is produced:

distance :: String -> String -> Int
distance s1 s2 = iter s1 s2 [0..length s2]
  where
    iter (c:cs) s2 row@(e:es) = iter cs s2 (e' : rest e' c s2 row)
      where e' = e + 1
    iter [] _ row = last row
    iter _ _ _ = error "iter (distance): unexpected arguments"
    rest e c (c2:c2s) (e1:es@(e2:es')) = seq k (k : rest k c c2s es)
      where k = (min (e1 + if c == c2 then 0 else 1) $ min (e+1) (e2+1))
    rest _ _ [] _ = []
    rest _ _ _ _ = error "rest (distance): unexpected arguments"
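The same row-at-a-time idea translates directly into an imperative two-row version with O(min(m, n)) space; a Python sketch of the idea (my own transliteration, not the code above):

```python
def levenshtein_two_rows(s, t):
    """Keep only the previous and current rows of the DP table."""
    if len(s) < len(t):
        s, t = t, s                        # iterate rows over the longer string
    prev = list(range(len(t) + 1))         # row 0: distance from the empty prefix
    for i, cs in enumerate(s, 1):
        cur = [i]                          # column 0: i deletions
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]
```

Time stays O(m*n); only the memory drops to two rows.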
-- Abhay Parvate 05:33, 4 July 2006 (UTC)
I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.-- SarekOfVulcan 21:22, 3 November 2005 (UTC)
I can transform "kitten" into "sitting" in two steps:
Can someone explain why this is not acceptable? Are we talking single-character edits here? -- Doradus 17:48, 16 January 2006 (UTC)
I don't believe the implementations are relevant to the article. Some of them are even wrong - and they certainly aren't more illustrative than the pseudo-code. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, Actionscript, Perl, Ocaml and Lua. I vote for removing all of them from the article; if anyone finds these useful, just create a page (like www.99-bottles-of-beer.net) with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)
Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas ( talk to me - I'll listen) 16:49, 12 July 2006 (UTC)
I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)
Potential license compatibility downstream is no reason to delete material, even if the material is source code. That's like saying there should be no programming books at WikiBooks released under the GFDL.
If there's a wider proposal to remove all source code from WikiPedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of the implementations is lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.
The implementations provide useful, working examples of the pseudocode for readers. -- 71.192.61.193 01:54, 13 July 2006 (UTC)
Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.
I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to things like "album", "film", "computer" and not flavors of programming language? -- 71.254.12.146 00:55, 17 July 2006 (UTC)
That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. -- 69.54.29.23 15:01, 17 July 2006 (UTC)
Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code are so irrelevant. 65.100.248.229 01:43, 27 July 2006 (UTC)
The number of implementations is a known problem. What do you think of the proposal?
I think the technicalities of the haskell code are extremely relevant, but the Haskell implementation is disputed.
I don't see any sign of the page's history material removed. -- 72.92.129.85 03:20, 27 July 2006 (UTC)
I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudo code and a supporting implementation in a well known language such as C. Can anyone cite me a good standard algorithm text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations); if such an article were to exist in an encyclopedia then its purpose would be to describe existing implementations, not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)
I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the VisualBasic implementation? -- 71.169.128.40 23:28, 18 September 2006 (UTC)
There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm the links should be kept. I remember there was a link to code for python and perl. These links should be restored! -- 148.6.178.137 07:54, 20 September 2006 (UTC)
I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous VisualBasic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. -- 69.54.29.23 12:41, 20 September 2006 (UTC)
Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. -- SarekOfVulcan 21:09, 9 October 2006 (UTC)
The size of the rendered page is about 44kB. If it gets too large, the VFP implementation should probably be first to go. -- 71.161.222.65 22:21, 23 October 2006 (UTC)
I think this article should include as many implementations as possible. I came here via Google search: 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless of whether the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.
Rather than adding more, can we link to an existing page with more solutions? I agree with @Dcoetzee that it was confusing to catch the change of index from 0 to 1. Here's the page I eventually found that helped me solve my problem. wikibooks:Algorithm_Implementation/Strings/Levenshtein_distance - User:SomeAnonymousContributor
I removed the Haskell example, for two reasons:
If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.
For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. -- Quuxplusone 01:34, 4 December 2006 (UTC)
Using the Common Lisp implementation described in the article, I get notably wrong results:
[11]> (levenshtein-distance "cow" "cow")
3
Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. -- FOo 01:11, 28 August 2006 (UTC)
Thanks for reporting this. I've deleted until it can be fixed. It wasn't a very elegant implementation anyway, combined with it having been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. -- 71.161.221.31 03:55, 28 August 2006 (UTC)
Can someone please explain the difference between DTW and Levenshtein distance?
The algorithms look almost identical. Perhaps a translation of both into C would clarify this?
I removed the implementations and replaced with a link even though we already had one. It seems they were ignoring it anyway. I didn't even look at the source code to see how good the implementations were... I'm not sure I care. I'm not against people sharing their implementations, even multiple implementations per language, especially if one happens to use O(m) space instead of O(mn), or if there is some advantage to a particular implementation over another. If two implementations in the article were compared and contrasted, with benefits and disadvantages for both, that might be encyclopedic and useful for a more in depth investigation of the calculation of this metric. But all I saw was just code. And I'm not even sure it was good code at that. Root4( one) 03:17, 28 September 2007 (UTC)
In the upper and lower bounds section, point 5 seems to be merely re-stating point 1. I have taken the liberty of removing point 5, as it is the more complicated of the two. If this is wrong, feel free to correct me. -- Gnack ( talk) 05:21, 7 July 2008 (UTC)
Could someone please extend the Levenshtein automaton article, as another possible implementation of this string metric? The current stub version contains next to nothing. For the particular case when we do not need a transition graph (a memory-consuming n*m sized matrix) but only a scalar result, the FSM approach looks more appealing than the dynamic-programming one. —Preceding unsigned comment added by 91.78.191.197 ( talk) 13:17, 28 July 2008 (UTC)
The second para of the article currently says this "is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers." Shouldn't that be "spell *correctors*"? A spell checker need only check for exact matches, whereas a spell corrector needs to find the closest match(es). Mcswell ( talk) 13:41, 13 October 2008 (UTC)
The layout of this article is a total failure. Rather than slapping the pseudocode in the reader's face (go and figure it out), it should provide an introduction and explain the ideas behind the algorithm first. -- Yecril ( talk) 13:01, 25 November 2009 (UTC)
There was an article about Levenshtein distance in Dr. Dobbs Magazine: s. http://drdobbs.com/architecture-and-design/184408746?pgno=2 Maybe someone who knows how to include citations in the article could add this. — Preceding unsigned comment added by 84.128.62.142 ( talk) 16:49, 2 July 2011 (UTC)
The section Relationship with other edit distance metrics states that the length of the longest common subsequence gives an edit distance metric given addition and deletion as the allowable edit operations. This seems incorrect. If two strings have lengths m and n with a longest common subsequence of length r, then the edit distance should be m + n - 2r. Am I making some silly mistake? Otherwise this should be fixed. Augurar ( talk) 02:01, 5 January 2012 (UTC)
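The relation in question can be checked directly: with only insertions and deletions allowed, the distance between strings of lengths m and n whose longest common subsequence has length r is m + n - 2r (delete everything in s outside the LCS, insert everything in t outside it). A sketch in Python (helper names are my own):

```python
def lcs_len(s, t):
    """Length of the longest common subsequence, standard O(m*n) DP."""
    m, n = len(s), len(t)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

def indel_distance(s, t):
    """Edit distance when only insertion and deletion are allowed."""
    return len(s) + len(t) - 2 * lcs_len(s, t)
```

Note that for "kitten"/"sitting" this gives 5, while the Levenshtein distance (with substitutions) is 3: substitutions can only lower the distance.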
When the OR is removed from this page, what remains is a rather trivial modification of Levenshtein distance. QVVERTYVS ( hm?) 21:21, 25 November 2013 (UTC)
Does this variation of edit distance really deserve an article of its own? -- Sigmundur ( talk) 13:34, 31 December 2014 (UTC)
The start index of the string is 1 but the start index of the 2D int array is 0. This is an inconsistency that can cause confusion, especially when translated into a programming language. Most programming languages stick with either 1 or 0 but not both. — Preceding unsigned comment added by Blackraven36 ( talk • contribs) 18:19, 15 March 2015 (UTC)
There is a problem in 4.3 pseudocode: line v1[i] = i+1; causes writing outside the array boundary when s.Length > t.Length+1. — Preceding unsigned comment added by 95.80.245.102 ( talk) 12:47, 3 July 2015 (UTC)
I'm not sure why we're presenting a (presumably) fully working C implementation of the naïve recursive algorithm. No serious source presents this algorithm because it's so extremely inefficient. Anyone who wants to try it can derive it from the mathematical definition if they want. QVVERTYVS ( hm?) 12:29, 14 October 2015 (UTC)
The calculation of 'cost' is not branching. The indicator function is only a boolean evaluation expression; the if ... else statement is redundant and confusing. Please see http://c2.com/cgi/wiki?ReturnBooleanEvaluations in WikiWikiWeb. So the recursive C implementation source code is the following --- Cedar101 ( talk) 02:13, 20 October 2015 (UTC)
// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
// base case: empty strings
if (len_s == 0) return len_t;
if (len_t == 0) return len_s;
// check mismatch of last characters of the strings
bool mismatch = (s[len_s-1] != t[len_t-1]); // The bool type requires C++ or modern C (C99 or above) (#include <stdbool.h>)
int cost = (int)mismatch;
// return minimum of delete char from s, delete char from t, and delete char from both
return minimum(LevenshteinDistance(s, len_s - 1, t, len_t ) + 1,
LevenshteinDistance(s, len_s , t, len_t - 1) + 1,
LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
Operators such as += or bizarre typecasts just make the code more difficult for WP's readers. Glrx ( talk) 16:17, 20 October 2015 (UTC)

// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
// base case: empty strings
if (len_s == 0) return len_t;
if (len_t == 0) return len_s;
// return minimum(std::min) of delete char from s, delete char from t, and delete char from both
return min({LevenshteinDistance(s, len_s-1, t, len_t ) + 1,
LevenshteinDistance(s, len_s , t, len_t-1) + 1,
LevenshteinDistance(s, len_s-1, t, len_t-1) + mismatch_cost(s, len_s, t, len_t)});
}
// The indicator function: check (and count) mismatch of last characters of the strings
int mismatch_cost(string s, int len_s, string t, int len_t) {
return s[len_s-1] != t[len_t-1];
}
int mismatch_cost = (s[len_s-1] != t[len_t-1]) ? 1 : 0;

I see statements such as p++; p+=2; (x)?y:z; as confusing to the reader, and I will bludgeon dirty typecasts.

Please, prior to entering code into this article, take a look at its history. Many code examples have been introduced, gone through changes, and removed, according to the whims of the editors at the time. I have little doubt that the present examples will go through the same process. On top of that, there is a plethora of other articles on this subject such as edit distance, Jaro-Winkler distance, Damerau-Levenshtein distance, Needleman–Wunsch algorithm, string metric, Longest common subsequence problem, Approximate string matching, etcetera, which explain basically the same thing over and over again, often with their own code examples. It would be nice if we could repeat ourselves less often on this topic - but how to achieve that? Rp ( talk) 11:12, 23 October 2015 (UTC)
The article claims that
"A more efficient method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][] where d[i][j] is the distance between the first i characters of string s and the first j characters of string t. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]. While this technique is significantly faster, it will consume len_s * len_t more memory than the straightforward recursive implementation.”
Does anybody have any reference for the last statement, because I think it’s not right. Consider the huge stack space occupied by a recursive implementation…
Xrisk ( talk) 19:53, 29 October 2015 (UTC)
I have implemented this method with unexpected results: the distance obtained is asymmetrical and gives fairly different results than the straightforward implementation. I will read the article and check if the implementation on this article is correct. -- Farru ES ( talk) 18:38, 4 October 2016 (UTC)
The current page shows:

function LevenshteinDistance(char s[1..m], char t[1..n]):
    // create two work vectors of integer distances
    declare int v0[n + 1]
    declare int v1[n + 1]

I believe this should be:

function LevenshteinDistance(char s[1..m], char t[1..n]):
    // create two work vectors of integer distances
    declare int v0[n + 1]
    declare int v1[m + 1]
-- 165.225.81.1 ( talk) 17:06, 5 March 2019 (UTC)
There is an implied loop in the statement
// copy v1 (current row) to v0 (previous row) for next iteration
swap v0 with v1
This loop just slows the algorithm down, and for no purpose. If the arrays were declared with an inner index then a 'swap' would consist of switching the inner index from 0 to 1 or 1 to 0. I'm not going to clutter the existing code on the page, but for coders searching for an algorithm this should be noted. — Preceding unsigned comment added by 165.225.81.1 ( talk • contribs) 17:24, 5 March 2019 (UTC)
The existing "naïve" implementation was in my opinion a good deal more naïve than it had to be. I've gone ahead and replaced the code sample with a slight variation on the algorithm that makes it a good deal faster. Of course the point of the naïve implementation was not to be fast, but to be easy to understand. It is my opinion that the new version is easier to understand than the previous version.
The change to the algorithm is to not branch when looking at identical characters. The previous algorithm would at every step branch into three calls: one for deletion, one for insertion, and one for both replacement and no action, with a cost calculated from the characters at the read heads. The new version first checks the values at the read heads; if they are the same, it moves both heads forward (equivalent to no action), only branching if they differ into the three possible moves (insertion, deletion, replacement).
This works properly since lev(a·s, a·t) = lev(s, t) whenever both strings begin with the same character a; thus when calculating lev(a·s, a·t), no branching is needed, simply calculate lev(s, t).
This is faster because it branches far less frequently. Every place this algorithm branches the old one would too, but there are some places the old one branched that this one does not. If you would like to verify this, simply compare a very long string with itself on both versions. The new version is nearly instant, since it is linear on equal strings, while the old version struggles to handle strings reaching even 100 characters, since it is exponential on equal strings.
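A quick Python transliteration of that idea (a sketch of the technique, not the Haskell from the article): the loop consumes any common prefix without branching, so equal strings cost one linear pass, and the three-way recursion only fires at mismatches.

```python
def lev_skip(s, t):
    """Naive recursion, but skip the common prefix before branching."""
    # advance past identical leading characters: lev(a+s, a+t) == lev(s, t)
    while s and t and s[0] == t[0]:
        s, t = s[1:], t[1:]
    if not s:
        return len(t)
    if not t:
        return len(s)
    return 1 + min(lev_skip(s[1:], t),       # deletion
                   lev_skip(s, t[1:]),       # insertion
                   lev_skip(s[1:], t[1:]))   # substitution
```

On mismatch-heavy inputs this is still exponential, like the original naive recursion; the win is confined to shared stretches of the strings.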
This is, in my opinion, easier to understand because it separates the logic for performing steps from simply moving through the string. The cost calculation in the old version was a little too abstract, and having both no action and replacement occupy the same call was a little confusing.
Aside from the change of algorithm there are a few other changes. The big change is that the code is now in Haskell instead of C; the biggest reason for this change is that I am more comfortable writing high-quality example Haskell than I am doing so in C. If there is a really good case that C should be used instead, then someone confident in C should replace this code with C code that implements the new algorithm sensibly. The other reason is that Haskell is by nature a recursive language, making recursive algorithms more clearly expressed in Haskell than in C. I've avoided using the more opaque Haskell syntax like pattern guards, instead opting for if then else, to make this as approachable as possible to persons without Haskell knowledge.
The other change is that the algorithm now moves left to right instead of right to left. This is a product of the differences between Haskell lists and C arrays. I don't think this change impacts clarity.
AquitaneHungerForce ( talk) 16:49, 8 March 2020 (UTC)
The definition presented in the introduction (the smallest number of insertions, deletions, and subsitutions needed to transform one string to another) is perfect as a definition and should be the definition used in this article. (With one improvement: The concept of an insertion should be clarified so that it explicitly includes additions to the beginning or end of a string.)
Then it should be demonstrated that this in fact defines a metric on the space of finite strings from a fixed alphabet: identity of indiscernibles, symmetry, and the triangle inequality.
The reason is that this definition is very intuitive, and that's what a definition should be.
Then the "definition" that the article presents in the section titled Definition should instead be presented as a theorem, and that theorem should be proved in the article. 47.44.96.195 ( talk) 02:38, 27 October 2020 (UTC)
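The three metric axioms (distance zero exactly for equal strings, symmetry, and the triangle inequality) can at least be sanity-checked by brute force over all short strings on a small alphabet. This is of course not a proof, just a check; the helper names below are my own:

```python
from itertools import product

def lev(s, t):
    """Single-row Levenshtein DP."""
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev = d
        d = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[j] = min(prev[j] + 1, d[j - 1] + 1, prev[j - 1] + cost)
    return d[n]

def check_metric(strings):
    """Brute-force the metric axioms over every pair and triple."""
    for a, b in product(strings, repeat=2):
        assert lev(a, b) == lev(b, a)              # symmetry
        assert (lev(a, b) == 0) == (a == b)        # identity of indiscernibles
    for a, b, c in product(strings, repeat=3):
        assert lev(a, c) <= lev(a, b) + lev(b, c)  # triangle inequality
    return True
```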
I wanted to add a reducer-based Clojure implementation because reducers are a new functional programming concept ("new" as in "not around back in 2007 when the article was started"), and it is an additional way of doing it (i.e. not iterative or recursive). It is less efficient computationally, but also terser, so I feel it merits adding. However, I saw the note not to add implementations and to come to the talk page, so I'll pop it here and see what others think:
(defn levenshtein [a b]
  (peek
    (reduce
      (fn [last this]
        (reduce
          (fn [row [diagonal above other]]
            (let [update (if (= other this)
                           diagonal
                           (inc (min diagonal above (peek row))))]
              (conj row update)))
          [(inc (first last))]
          (map vector last (next last) b)))
      (map #(identity %2) (cons nil b) (range))
      a)))
-- 139.133.214.105 ( talk) 11:38, 8 August 2022 (UTC)
See https://stackoverflow.com/a/4057585/1709587 and http://www.berghel.net/publications/asm/asm.php, for instance.
"Where the earlier algorithm of Wagner and Fisher [6] was implemented in time O(n²), Ukkonen's algorithm runs in O(s * n) time, for edit distance s and string length n."
Obviously in the worst case s=n (so if you want to characterise the worst-case time complexity purely in terms of n, it's still O(n²)) but the point is that with the right algorithm, the time complexity approaches linear as the strings being compared approach being identical. This way of characterising time complexity of edit distance calculations is fairly normal - see also how the Myers diff algorithm (for calculating diffs between strings, with allowable edits being only insertions and deletions) is characterised (both typically and in the title of Myers' paper) as an O(nd) algorithm, where d is edit distance. ExplodingCabbage ( talk) 09:14, 23 November 2023 (UTC)
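To illustrate the O(s·n) behaviour: one way to get it is a banded DP that only fills cells with |i - j| ≤ k (exact whenever the true distance is at most k), retried with a doubling threshold. The sketch below is my own illustration of the banding idea, not Ukkonen's exact algorithm:

```python
def _within(s, t, k):
    """Banded DP: return the exact distance if it is <= k, else None."""
    m, n = len(s), len(t)
    if abs(m - n) > k:
        return None                    # the length gap alone already exceeds k
    INF = k + 1                        # safe stand-in for out-of-band cells
    prev = {j: j for j in range(min(n, k) + 1)}   # row 0, band only
    for i in range(1, m + 1):
        cur = {}
        lo, hi = max(0, i - k), min(n, i + k)
        if lo == 0:
            cur[0] = i
            lo = 1
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev.get(j - 1, INF) + cost,  # substitution / match
                         prev.get(j, INF) + 1,         # deletion
                         cur.get(j - 1, INF) + 1)      # insertion
        prev = cur
    d = prev.get(n, INF)
    return d if d <= k else None

def banded_levenshtein(s, t):
    """Double the threshold until it covers the true distance s.

    Each pass costs O(k * n), and the thresholds sum geometrically,
    so the total work is O(s * n) for final distance s.
    """
    k = 1
    while True:
        d = _within(s, t, k)
        if d is not None:
            return d
        k *= 2
```

For nearly identical strings the band stays narrow and the total work approaches linear, which is the point made above.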
It is currently stated that the upper bound is at most the length of the longer string. It is however also stated that if the strings have the same size, the Hamming distance is an upper bound on the Levenshtein distance. This implies that the upper bound for the Levenshtein distance between two strings of different lengths can be calculated by adding the length difference to the Hamming distance between the shorter string and the substring of the longer string that starts at index 0 and has the same length as the shorter string. Should this be explicitly stated instead? Ratsatchel ( talk) 17:11, 21 June 2024 (UTC)
![]() | This article is rated C-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||
|
What is the time complexity for calculating Levenshtein distance? For example, is it O(n*m) where n and m are the lengths of the two strings being compared? I think this information would be very helpful to include for any implementations described in the article. -- Showeropera ( talk) 13:32, 14 January 2016 (UTC)
"if s[i-1] = t[j-1]" should have been "if s[i] = t[j]", since s and t are declared as "char s[1..m], char t[1..n]", and the loops go from 1 to m and 1 to n. i-1 and j-1 are needed in implementations in other languages where string character indexing starts at 0 (e.g C). Fixed. --Anon
I don't get where the following has appeared from?
int cost = (s[i-1] == t[j-1) ? 0 : 1; d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost )
...in the pseudo-code, it is suggested that the code is actually...
if( s[i-1] == t[j-1] ) d[i,j] = d[i-1,j-1]; else d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+1 )
I tried the latter and it works, so why do all of the other implementations I've seen use the former code??? -- Lee. — Preceding unsigned comment added by 31.53.116.129 ( talk) 17:34, 25 November 2011 (UTC)
Levenshtein's paper was published in 1966, not 1965. cf. V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physice – Doklady 10, pp. 707-710, 1966. ralian 03:15, 11 April 2007 (UTC)
This page should link to the competing utilities such as wdiff since the Levenshtein distance and diff, wdiff, sdiff, etc all address the same set of problems (i.e., what is the difference between two strings/files)
Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?
Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array out of bound errors. It was easy enough to fix when I looked at what it was actually doing, but is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.
I agree that the array indexing is confusing. Adding offsets doesn't change the algorithm; it's just another way to show it. How it is at the moment is not especially useful given how most languages lack features as nice as: "declare int d[0..lenStr1, 0..lenStr2]"; "declare int d[size1,size2]" is much more common. More importantly, though, you can express code using the first type with the second, but not the other way around, which makes it more practical for pseudocode. Jheriko 05:41, 8 March 2006 (UTC)
I agree it may make sense for the pseudocode to be indexed this way, but could we at least put a comment to indicate it is the ith or jth character? The array syntax implies it is an index, not the character number. James Van Boxtel ( talk) 20:12, 28 May 2010 (UTC)
This pseudocode is garbage and needs to be cleaned up — Preceding unsigned comment added by 24.8.146.185 ( talk) 06:53, 26 January 2013 (UTC)
Why is the Haskell code so complicated? Isn't this enough?
editDistance :: Eq a => [a] -> [a] -> Int
editDistance s [] = length s
editDistance [] t = length t
editDistance (s:ss) (t:ts) =
  minimum [ (if s == t then 0 else 1) + editDistance ss ts
          , 1 + editDistance ss (t:ts)
          , 1 + editDistance (s:ss) ts ]
Okay, I replaced the original code below with the shorter code above. Jirka
min3 :: Int -> Int -> Int -> Int
min3 x y z = min x (min y z)

cost :: Char -> Char -> Int
cost a b | a == b    = 0
         | otherwise = 1

sumCosts :: String -> Char -> Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a

editDistance :: String -> String -> Int
editDistance [] [] = 0
editDistance s [] = sumCosts s '-'
editDistance [] t = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
                                  (cost s '-' + editDistance ss (t:ts))
                                  (cost '-' t + editDistance (s:ss) ts)
Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?
-- Daren Brantley 04:15, 16 July 2005 (UTC)
Given f x + f x, Haskell will evaluate the function at that value twice. Compare that to the following: let v = f x in v + v, where v is created as a reference to a function call. The function is called on the first evaluation of v, and once it is evaluated, subsequent references to it will not need to call the function again. So the first example evaluates f x twice, while the second only evaluates it once. AquitaneHungerForce ( talk) 12:27, 29 September 2023 (UTC)
If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. -- 132.198.104.164 21:38, 18 July 2005 (UTC)
I think the current Haskell implementation is wrong. The Levenshtein algorithm should use no more than O(n²) time, and this is exponential. It's possible to write a memoizing implementation, but Haskell doesn't do it automatically. (The Ruby one is wrong, too, and probably some others.)
An O(n²) version:
distance str1 str2 =
  let (l1, l2) = (length str1, length str2)
      istr1 = (0, undefined) : zip [1..] str1
      istr2 = (0, undefined) : zip [1..] str2
      table = array ((0, 0), (l1, l2))
                    [ ((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2 ]
      value 0 _ j _ = j
      value i _ 0 _ = i
      value i c1 j c2 = minimum [ table ! (i-1, j) + 1
                                , table ! (i, j-1) + 1
                                , table ! (i-1, j-1) + cost ]
        where cost = if c1 == c2 then 0 else 1
  in table ! (l1, l2)
The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:
distance str1 str2 =
  last $ elems $ foldl update table
    [ (i,j) | i <- [1..length str1], j <- [1..length str2] ]
  where
    table = initial (length str1, length str2)
    update table (i,j) = table // [((i,j), value)]
      where
        value = minimum [ table ! (i-1, j)   + 1     -- deletion
                        , table ! (i,   j-1) + 1     -- insertion
                        , table ! (i-1, j-1) + cost  -- substitution
                        ]
          where cost = if str1 !! (i-1) == str2 !! (j-1) then 0 else 1
    initial (b1,b2) = array ((0,0),(b1,b2))
                            [ ((i,j), value (i,j)) | i <- [0..b1], j <- [0..b2] ]
      where
        value (0,j) = j
        value (i,0) = i
        value (i,j) = 0
Here is an O(n) version, using iteration over an initial row. This is much faster (with GHC; have not tried others) since seq forces each cell as it is produced:
distance :: String -> String -> Int
distance s1 s2 = iter s1 s2 [0..length s2]
  where
    iter (c:cs) s2 row@(e:es) = iter cs s2 (e' : rest e' c s2 row)
      where e' = e + 1
    iter [] _ row = last row
    iter _ _ _ = error "iter (distance): unexpected arguments"
    rest e c (c2:c2s) (e1:es@(e2:es')) = seq k (k : rest k c c2s es)
      where k = min (e1 + if c == c2 then 0 else 1) $ min (e+1) (e2+1)
    rest _ _ [] _ = []
    rest _ _ _ _ = error "rest (distance): unexpected arguments"
-- Abhay Parvate 05:33, 4 July 2006 (UTC)
I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.-- SarekOfVulcan 21:22, 3 November 2005 (UTC)
I can transform "kitten" into "sitting" in two steps:
Can someone explain why this is not acceptable? Are we talking single-character edits here? -- Doradus 17:48, 16 January 2006 (UTC)
I don't believe the implementations are relevant to the article. Some of them are even wrong - and they certainly aren't more illustrative than the pseudo-code. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, Actionscript, Perl, Ocaml and Lua. I vote for removing all of them from the article; if anyone find these useful, just create a page (like www.99-bottles-of-beer.net) with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)
Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas ( talk to me - I'll listen) 16:49, 12 July 2006 (UTC)
I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)
Potential license compatibility downstream is no reason to delete material, even if the material is source code. That's like saying there should be no programming books at WikiBooks released under the GFDL.
If there's a wider proposal to remove all source code from WikiPedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of implementations are lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.
The implementations provide useful, working examples of the pseudocode for readers. -- 71.192.61.193 01:54, 13 July 2006 (UTC)
Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.
I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to things like "album", "film", "computer" and not flavors of programming language? -- 71.254.12.146 00:55, 17 July 2006 (UTC)
That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. -- 69.54.29.23 15:01, 17 July 2006 (UTC)
Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code is so irrelevant. 65.100.248.229 01:43, 27 July 2006 (UTC)
The number of implementations are a known problem. What do you think of the proposal?
I think the technicalities of the haskell code are extremely relevant, but the Haskell implementation is disputed.
I don't see any sign of the page's history material removed. -- 72.92.129.85 03:20, 27 July 2006 (UTC)
I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudocode and a supporting implementation in a well-known language such as C. Can anyone cite me a good standard algorithm text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations); if such an article were to exist in an encyclopedia then its purpose would be to describe existing implementations, not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)
I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the VisualBasic implementation? -- 71.169.128.40 23:28, 18 September 2006 (UTC)
There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm the links should be kept. I remember there was a link to code for python and perl. These links should be restored! -- 148.6.178.137 07:54, 20 September 2006 (UTC)
I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous VisualBasic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. -- 69.54.29.23 12:41, 20 September 2006 (UTC)
Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. -- Sar e kOfVulcan 21:09, 9 October 2006 (UTC)
The size of the rendered page is about 44kB. If it gets too large, the VFP implementation should probably be first to go. -- 71.161.222.65 22:21, 23 October 2006 (UTC)
I think this article should include as many implementations as possible. I came here via Google search: 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless if the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.
Rather than adding more, can we link to an existing page with more solutions? I agree with @Dcoetzee that it was confusing to catch the change of index from 0 to 1. Here's the page I eventually found that helped me solve my problem. wikibooks:Algorithm_Implementation/Strings/Levenshtein_distance - User:SomeAnonymousContributor
I removed the Haskell example, for two reasons:
If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.
For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. -- Quuxplusone 01:34, 4 December 2006 (UTC)
Using the Common Lisp implementation described in the article, I get notably wrong results:
[11]> (levenshtein-distance "cow" "cow") 3
Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. -- FOo 01:11, 28 August 2006 (UTC)
Thanks for reporting this. I've deleted until it can be fixed. It wasn't a very elegant implementation anyway, combined with it having been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. -- 71.161.221.31 03:55, 28 August 2006 (UTC)
Can someone please explain the difference between DTW and Levenshtein distance?
The algorithms look almost identical. Perhaps a translation of both into C would clarify this?
I removed the implementations and replaced them with a link, even though we already had one. It seems they were ignoring it anyway. I didn't even look at the source code to see how good the implementations were... I'm not sure I care. I'm not against people sharing their implementations, even multiple implementations per language, especially if one happens to use O(m) space instead of O(mn), or if there is some advantage to a particular implementation over another. If two implementations in the article were compared and contrasted, with benefits and disadvantages for both, that might be encyclopedic and useful for a more in-depth investigation of the calculation of this metric. But all I saw was just code. And I'm not even sure it was good code at that. Root4( one) 03:17, 28 September 2007 (UTC)
In the upper and lower bounds section, point 5 seems to be merely re-stating point 1. I have taken the liberty of removing point 5, as it is the more complicated of the two. If this is wrong, feel free to correct me. -- Gnack ( talk) 05:21, 7 July 2008 (UTC)
Could someone please extend Levenshtein automaton article, as other possible implementation of this string metric? Current stub version is merely nothing. For the particular case when we do not need a transition graph (memory consuming n*m sized matrix) but scalar result only, FSM approach looks more appealing than dynamic one. —Preceding unsigned comment added by 91.78.191.197 ( talk) 13:17, 28 July 2008 (UTC)
The second para of the article currently says this "is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers." Shouldn't that be "spell *correctors*"? A spell checker need only check for exact matches, whereas a spell corrector needs to find the closest match(es). Mcswell ( talk) 13:41, 13 October 2008 (UTC)
The layout of this article is a total failure. Rather than slapping the pseudocode in the reader's face, go and figure it out, it should provide an introduction and explain the ideas behind the algorithm first. -- Yecril ( talk) 13:01, 25 November 2009 (UTC)
There was an article about Levenshtein distance in Dr. Dobbs Magazine: s. http://drdobbs.com/architecture-and-design/184408746?pgno=2 Maybe someone who knows how to include citations in the article could add this. — Preceding unsigned comment added by 84.128.62.142 ( talk) 16:49, 2 July 2011 (UTC)
The section Relationship with other edit distance metrics states that the length of the longest common subsequence gives an edit distance metric given addition and deletion as the allowable edit operations. This seems incorrect. If two strings have lengths m and n with a longest common subsequence of r symbols, then the edit distance should be m + n - 2r. Am I making some silly mistake? Otherwise this should be fixed. Augurar ( talk) 02:01, 5 January 2012 (UTC)
When the OR is removed from this page, what remains is a rather trivial modification of Levenshtein distance. QVVERTYVS ( hm?) 21:21, 25 November 2013 (UTC)
Does this variation of edit distance really deserve an article of its own? -- Sigmundur ( talk) 13:34, 31 December 2014 (UTC)
The start index of the string is 1 but the start index of the 2D int array is 0. This is an inconsistency that can cause confusion, especially when translated into a programming language. Most programming languages stick with either 1 or 0 but not both. — Preceding unsigned comment added by Blackraven36 ( talk • contribs) 18:19, 15 March 2015 (UTC)
There is a problem in 4.3 pseudocode: line v1[i] = i+1; causes writing outside the array boundary when s.Length > t.Length+1. — Preceding unsigned comment added by 95.80.245.102 ( talk) 12:47, 3 July 2015 (UTC)
I'm not sure why we're presenting a (presumably) fully working C implementation of the naïve recursive algorithm. No serious source presents this algorithm because it's so extremely inefficient. Anyone who wants to try it can derive it from the mathematical definition if they want. QVVERTYVS ( hm?) 12:29, 14 October 2015 (UTC)
The calculation of 'cost' is not branching. The indicator function is only a boolean evaluation expression; the if ... else statement is redundant and confusing. Please see http://c2.com/cgi/wiki?ReturnBooleanEvaluations in WikiWikiWeb. So the recursive C implementation source code is the following -- Cedar101 ( talk) 02:13, 20 October 2015 (UTC)
// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
  // base case: empty strings
  if (len_s == 0) return len_t;
  if (len_t == 0) return len_s;
  // check mismatch of last characters of the strings
  bool mismatch = (s[len_s-1] != t[len_t-1]); // The bool type requires C++ or modern C (C99 or above) (#include <stdbool.h>)
  int cost = (int)mismatch;
  // return minimum of delete char from s, delete char from t, and delete char from both
  return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                 LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                 LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
+= or bizarre typecasts just make the code more difficult for WP's readers. Glrx ( talk) 16:17, 20 October 2015 (UTC)
// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
  // base case: empty strings
  if (len_s == 0) return len_t;
  if (len_t == 0) return len_s;
  // return minimum (std::min) of delete char from s, delete char from t, and delete char from both
  return min({LevenshteinDistance(s, len_s-1, t, len_t  ) + 1,
              LevenshteinDistance(s, len_s  , t, len_t-1) + 1,
              LevenshteinDistance(s, len_s-1, t, len_t-1) + mismatch_cost(s, len_s, t, len_t)});
}

// The indicator function: check (and count) mismatch of last characters of the strings
int mismatch_cost(string s, int len_s, string t, int len_t) {
  return s[len_s-1] != t[len_t-1];
}
int mismatch_cost = (s[len_s-1] != t[len_t-1]) ? 1 : 0;
p++; p+=2; (x)?y:z; ) as confusing to the reader, and I will bludgeon dirty typecasts.
Please, prior to entering code into this article, take a look at its history. Many code examples have been introduced, gone through changes, and removed, according to the whims of the editors at the time. I have little doubt that the present examples will go through the same process. On top of that, there is a plethora of other articles on this subject, such as edit distance, Jaro-Winkler distance, Damerau-Levenshtein distance, Needleman–Wunsch algorithm, string metric, Longest common subsequence problem, and Approximate string matching, which explain basically the same thing over and over again, often with their own code examples. It would be nice if we could repeat ourselves less often on this topic - but how to achieve that? Rp ( talk) 11:12, 23 October 2015 (UTC)
The article claims that
"A more efficient method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][] where d[i][j] is the distance between the first i characters of string s and the first j characters of string t. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]. While this technique is significantly faster, it will consume len_s * len_t more memory than the straightforward recursive implementation.”
Does anybody have any reference for the last statement, because I think it’s not right. Consider the huge stack space occupied by a recursive implementation…
Xrisk ( talk) 19:53, 29 October 2015 (UTC)
I have implemented this method with unexpected results: the distance obtained is asymmetrical and gives fairly different results than the straightforward implementation. I will read the article and check if the implementation on this article is correct. -- Farru ES ( talk) 18:38, 4 October 2016 (UTC)
The current page shows:
function LevenshteinDistance(char s[1..m], char t[1..n]):
// create two work vectors of integer distances
declare int v0[n + 1]
declare int v1[n + 1]
I believe this should be:
function LevenshteinDistance(char s[1..m], char t[1..n]):
// create two work vectors of integer distances
declare int v0[n + 1]
declare int v1[m + 1]
-- 165.225.81.1 ( talk) 17:06, 5 March 2019 (UTC)
There is an implied loop in the statement
// copy v1 (current row) to v0 (previous row) for next iteration
swap v0 with v1
This loop just slows the algorithm down, and for no purpose. If the arrays were declared with an inner index then a 'swap' would consist of switching the inner index from 0 to 1 or 1 to 0. I'm not going to clutter the existing code on the page, but for coders searching for an algorithm this should be noted. — Preceding unsigned comment added by 165.225.81.1 ( talk • contribs) 17:24, 5 March 2019 (UTC)
The existing "naïve" implementation was in my opinion a good deal more naïve than it had to be. I've gone ahead and replaced the code sample with a slight variation on the algorithm that makes it a good deal faster. Of course the point of the naïve implementation was not to be fast, but to be easy to understand. It is my opinion that the new version is easier to understand than the previous version.
The change to the algorithm is to not branch when looking at identical characters. The previous algorithm would at every step branch into three calls, one for deletion, one for insertion, and one for both replacement and no action with a cost calculated based on the characters at the read heads. The new version first checks the values read heads, then if they are the same, move both heads forward (equivalent to no action), only branching if they are different into the three possible moves (insertion, deletion, replacement).
This works properly since lev(x:a, x:b) = lev(a, b) whenever the two strings begin with the same character; thus when calculating lev(x:a, x:b), no branching is needed: simply calculate lev(a, b).
This is faster because it branches far less frequently. Every place this algorithm branches the old one would too, but there are some places the old one branched that this one does not. If you would like to verify this, simply compare a very long string with itself on both versions. The new version is nearly instant, since it is linear on equal strings, while the old version struggles to handle strings reaching even 100 characters, since it is exponential on equal strings.
This is, in my opinion, easier to understand because it separates the logic for performing steps off from simply moving through the string. The cost calculation in the old version was a little too abstract, and having both no action and replacement occupy the same call was a little confusing.
Aside from the change of algorithm there are a few other changes. The big change is that the code is now in Haskell instead of C; the biggest reason for this change is that I am more comfortable writing high-quality example Haskell than I am doing so in C. If there is a real good case that C should be used instead, then someone confident in C should replace this code with C code that implements the new algorithm sensibly. The other reason is that Haskell is in its nature a recursive language, making recursive algorithms more clearly expressed in Haskell than in C. I've avoided using the more opaque Haskell syntax like pattern guards, instead opting for if then else, to make this as approachable as possible to persons without Haskell knowledge.
The other change is that the algorithm now moves left to right instead of right to left. This is a product of the differences between Haskell lists and C arrays. I don't think this change impacts clarity.
AquitaneHungerForce ( talk) 16:49, 8 March 2020 (UTC)
The definition presented in the introduction (the smallest number of insertions, deletions, and substitutions needed to transform one string to another) is perfect as a definition and should be the definition used in this article. (With one improvement: The concept of an insertion should be clarified so that it explicitly includes additions to the beginning or end of a string.)
Then it should be demonstrated that this in fact defines a metric on the space of finite strings from a fixed alphabet: identity of indiscernibles, symmetry, and the triangle inequality.
The reason is that this definition is very intuitive, and that's what a definition should be.
Then the "definition" that the article presents in the section titled Definition should instead be presented as a theorem, and that theorem should be proved in the article. 47.44.96.195 ( talk) 02:38, 27 October 2020 (UTC)
I wanted to add a reducer-based Clojure implementation because reducers are a new functional programming concept ("new" as in "not around back in 2007 when the article was started"), and is an additional way of doing it (i.e. not iterative or recursive). It is less efficient computationally, but also terser, so I feel it merits adding. However, I saw the note not to add implementations and to come to the talk page, so I'll pop it here and see what others think:
(defn levenshtein [a b]
  (peek
    (reduce
      (fn [last this]
        (reduce
          (fn [row [diagonal above other]]
            (let [update (if (= other this)
                           diagonal
                           (inc (min diagonal above (peek row))))]
              (conj row update)))
          [(inc (first last))]
          (map vector last (next last) b)))
      (map #(identity %2) (cons nil b) (range))
      a)))
-- 139.133.214.105 ( talk) 11:38, 8 August 2022 (UTC)
See https://stackoverflow.com/a/4057585/1709587 and http://www.berghel.net/publications/asm/asm.php, for instance.
"Where the earlier algorithm of Wagner and Fisher [6] was implemented in time O(n²), Ukkonen's algorithm runs in O(s * n) time, for edit distance s and string length n."
Obviously in the worst case s=n (so if you want to characterise the worst-case time complexity purely in terms of n, it's still O(n²)) but the point is that with the right algorithm, the time complexity approaches linear as the strings being compared approach being identical. This way of characterising time complexity of edit distance calculations is fairly normal - see also how the Myers diff algorithm (for calculating diffs between strings, with allowable edits being only insertions and deletions) is characterised (both typically and in the title of Myers' paper) as an O(nd) algorithm, where d is edit distance. ExplodingCabbage ( talk) 09:14, 23 November 2023 (UTC)
It is currently stated that the upper bound is at most the length of the longer string. It is however also stated that if the strings have the same size, the Hamming distance is an upper bound on the Levenshtein distance. This implies that the upper bound for the Levenshtein distance between two strings of different lengths can be calculated by adding the length difference to the Hamming distance between the shorter string and the substring of the longer string that starts at index 0 and has the same length as the shorter string. Should this be explicitly stated instead? Ratsatchel ( talk) 17:11, 21 June 2024 (UTC)