![]() | This article is rated Start-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||
|
|
|
.. should mention that this method is not recommended for professional use cases. I don't want to be a passenger of a plane which runs software merged some times in the past from different branches and contains old code due to a hash collision... cheers 82.83.133.92 ( talk) 20:18, 3 July 2015 (UTC)
Since I am activally using and contributing to this page (I am the one who posted the diff algorithm) to create a generic history tool I think Zer000 is right it lay out sucks.... for that reason I will undertake writting a new one once I start I will post a link to the draft page here for comment. (unsigned by Sarek1024, 2006-05-14)
C
array). What is left is the syntax of the for-loops (range
) and some stuff in bt_all
.Does anybody know a tight bound on the maximal number of longest common subsequences? (As in less than , ). If you take for example the strings XYXYXY...XY and YXYXYX...YX, there are LCSs, right? For XYZXYZ...XYZ and ZYXZYX...ZYX there are . We have that is maximal for . Is a bound? -- Nils Grimsmo 17:56, 13 August 2006 (UTC)
Hi folks, I dunno if it's just an artifact of an earlier version or an artifact of my wayward brain, but is "LCP" a typo for LCS in the Python code and there abouts? Also, this Python doesn't run as is, right? The matrix C isn't initialized. Isn't the LCS algorithm understood to take two sequences as its input? -- babbage 09:00, 27 October 2006 (UTC)
C = [[0] * (m+1) for i in xrange(n+1)]
Currently the notation is not consistent between the definition of the recurrence and the pseudo-code (my fault). Which should be used of X1..m and X[1..m]
? Or maybe something else? Or does it not matter that two different notations are used? Klem fra
Nils Grimsmo
14:29, 10 December 2006 (UTC)
For two strings and , the Levenshtein distance (edit distance) is
I don't think this is true. For example, if we take the words RHYTHM and ALGORITHM, then so the Levenshtein distance should become , but it should be 6.
I think the following equation is wrong.
The correct one should be
Isn't it? —The preceding unsigned comment was added by 140.115.213.121 ( talk) 04:59, 2 February 2007 (UTC).
Sorry, the original one is correct. I misunderstood the equation. —140.115.213.121.
I think the article could really do with an example demonstrating the LCS between two strings. I'm not sure I understand it well enough to add one, though... Tjwood 10:18, 5 March 2007 (UTC)
The worked example is very good, nice and visual. I notice it describes the use of a traceback system to save storing a big table of subsequences. In the languages I program in, the subsequences and even sets of subsequences would be objects, so in the untweaked version of the scheme (without the traceback) the individual cells of the table would be single pointers only, many of them to the same objects. (For every cell that inherits its set of longest sequences only from the left or only from above, no new object would need to be created at all. That is, any of the mismatch cells where the cell above or left also have different-lengthed longest subsequences to pass on. I think that would be the majority of cells when you get into comparing large and contrasting sequences. In the case of a mismatch cell which inherits from left and above, no new subsequence objects are generated, only two lists of them combined.) So does this discount the value of the traceback approach, and should it be mentioned?
59.101.175.101 ( talk) 10:57, 29 October 2011 (UTC)
The only advantage I can see identity integers having over hash functions is that they don't have collisions, at the cost of taking longer to calculate. The article currently makes it sound like they have more advantages.. eg:
Yet any hash function will yield the same hash value for two lines that are the same, else it's not a hash function. So the above point seems fairly moot..
As above.
Identity integers also lose this property:
So unless I'm misunderstanding something, should the article be changed? (I don't fully understand how the function works yet, so in the case that I'm wrong I'd rather not change/make a mess of things =P, and also have never worked in perl/c++)
Themania 05:20, 20 April 2007 (UTC)
Also I forgot to mention the largest advantage of hash values - you can calculate them just once for a file, and then compare that file against other files. Very useful!
Themania 05:24, 20 April 2007 (UTC)
In traceback since R is {} , we can omit it in the first if .
Under "Complexity" is the following equation:
L is not previously defined. But, more to the point, I've had calculus II, and I can't make sense of the upper limit of that summation. Perhaps it could be put in words, as well, for us mere mortals.-- Christopher King ( talk) 21:36, 10 January 2009 (UTC)
Someone added the note on the right to the start of the computer programming examples:
Should the code examples and their discussion be moved to that article? That article has no introduction, just code, so the material in this article could serve as an introduction to that article.-- Christopher King ( talk) 23:23, 2 February 2009 (UTC)
Why don't the examples of "longest common sequence" actually have a common sequence?
How about:
MYDOG YOURDOG! —Preceding unsigned comment added by 58.106.110.238 ( talk) 08:49, 10 June 2009 (UTC)
The pseudo code in the "Reading out an LCS" section doesn't work.
function backTrace(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i = 0 or j = 0 return "" else if X[i-1] = Y[j-1] return backTrace(C, X, Y, i-1, j-1) + X[i-1] else if C[i,j-1] > C[i-1,j] return backTrace(C, X, Y, i, j-1) else return backTrace(C, X, Y, i-1, j)
Notice what happens when you take i and j to be 1,1. You get to the else if clause and compare X[0] to Y[0]. But the function definition shows X and Y to have ranges of 1..m/1..n, so the 0 is out of range.
Mbcook ( talk) 18:38, 12 June 2009 (UTC)
The statement The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) and LCS(X, Yn – 1) is not correct. The correct statement of the second property is The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) or LCS(X, Yn – 1). The statement that LCS(Xm – 1, Y) and LCS(X, Yn – 1) will both produce an LCS, as the example suggests, is false.
Take for example X = AGT and Y = ATC. LCS(Xm – 1, Y) = A and LCS(X, Yn – 1) = AT. Clearly AT is the longest common subsequence, not A. Thus, LCS(X, Y) = the longest sequences of LCS(Xm – 1, Y) or LCS(X, Yn – 1).
The current example of the second property is at best misleading.
Shannon Pattison ( talk) 20:31, 19 August 2009 (UTC)notpattison
Is the Equation really right? If in the third Case both Sequences have the same size? What did the max-function returns then? — Preceding unsigned comment added by 85.180.137.191 ( talk) 07:51, 28 August 2011 (UTC)
I think mentioning the LCS (aka MLCS or k-LCS) on more than two strings (NP-hard) is missing in this article. Jens ( talk) 10:34, 4 April 2012 (UTC)
The link in the intro links to http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence but no such header there, and not really any good examples of the difference. 129.241.124.103 ( talk) 21:09, 10 July 2013 (UTC)
Why has someone added a requirement for reference in the code sequence? The whole sequence is a paraphrase (very useful!) of what is explained above, and can be desk checked by any competent programmer. "unsourced material can be challenged and removed". This is not a controversial statement like the "sons of Nixons are involved in the My Lai massacre" which requires the backing of "reliable sources". It is more like "on a clear day, from the bell tower of the Notre Dame you can see the Effel tower". — Preceding unsigned comment added by 80.100.243.19 ( talk) 14:20, 13 April 2014 (UTC)
"For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor"
I don't see how the Method of Four Russians is applicable here to this problem. Method of Four Russians requires the contents of the matrix to come from a bounded alphabet, where as the contents of the matrix here are unbounded integers (even if the two input strings come from a bounded alphabet). — Preceding unsigned comment added by 71.198.24.3 ( talk) 06:33, 21 January 2019 (UTC)
There is a paper by Ukkonen [1] and a paper by Myers [2] that present efficient algorithms for bounded D = N - L. Maybe the reason they are missing from this page is that they are primarily phrased as being about edit distance where deletion and insertion cost 1 and substitution is not allowed (or costs 2 if you prefer, so slightly different than Levenshtein distance). However, it's clear that those algorithm also apply to computing the LCS, and it's even explicitly stated in at least one of the papers.
The paper by Myers exposes some interesting results:
The paper by Ukkonen looks like a subset of what is in Myers paper (point 1) but the result applies to a more general setting (or at least is written in such a way that it looks like the result applies to a more general setting).
The paper by Myers is often cited in the context of file diffing algorithms.
Aureooms ( talk) 15:45, 26 May 2021 (UTC)
References
It is written: "The prefixes of X are ...", which uses an undefined notation. I assume the author meant "... ..." instead, do you agree? — MFH: Talk 01:22, 27 October 2021 (UTC)
I didn't see any of the LCS variants elsewhere on Wikipedia and wondered if this article would be a good place to add a section about them. Three variants I'm aware of are:
This paper has an overview of the three above:
CLCS in particular is described here with an O(n^2 · m^2 · r) solution:
And a more recent CLCS solution in O(n · m · r) is described here:
Thanks!
Catleeball ( talk) 19:16, 9 July 2023 (UTC)
This section doesn't mention data differencing or delta encoding. Isn't the longest common subsequence also related to this method of compression? Jarble ( talk) 22:53, 5 March 2024 (UTC)
![]() | This article is rated Start-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||
|
|
|
.. should mention that this method is not recommended for professional use cases. I don't want to be a passenger of a plane which runs software merged some times in the past from different branches and contains old code due to a hash collision... cheers 82.83.133.92 ( talk) 20:18, 3 July 2015 (UTC)
Since I am activally using and contributing to this page (I am the one who posted the diff algorithm) to create a generic history tool I think Zer000 is right it lay out sucks.... for that reason I will undertake writting a new one once I start I will post a link to the draft page here for comment. (unsigned by Sarek1024, 2006-05-14)
C
array). What is left is the syntax of the for-loops (range
) and some stuff in bt_all
.Does anybody know a tight bound on the maximal number of longest common subsequences? (As in less than , ). If you take for example the strings XYXYXY...XY and YXYXYX...YX, there are LCSs, right? For XYZXYZ...XYZ and ZYXZYX...ZYX there are . We have that is maximal for . Is a bound? -- Nils Grimsmo 17:56, 13 August 2006 (UTC)
Hi folks, I dunno if it's just an artifact of an earlier version or an artifact of my wayward brain, but is "LCP" a typo for LCS in the Python code and there abouts? Also, this Python doesn't run as is, right? The matrix C isn't initialized. Isn't the LCS algorithm understood to take two sequences as its input? -- babbage 09:00, 27 October 2006 (UTC)
C = [[0] * (m+1) for i in xrange(n+1)]
Currently the notation is not consistent between the definition of the recurrence and the pseudo-code (my fault). Which should be used of X1..m and X[1..m]
? Or maybe something else? Or does it not matter that two different notations are used? Klem fra
Nils Grimsmo
14:29, 10 December 2006 (UTC)
For two strings and , the Levenshtein distance (edit distance) is
I don't think this is true. For example, if we take the words RHYTHM and ALGORITHM, then so the Levenshtein distance should become , but it should be 6.
I think the following equation is wrong.
The correct one should be
Isn't it? —The preceding unsigned comment was added by 140.115.213.121 ( talk) 04:59, 2 February 2007 (UTC).
Sorry, the original one is correct. I misunderstood the equation. —140.115.213.121.
I think the article could really do with an example demonstrating the LCS between two strings. I'm not sure I understand it well enough to add one, though... Tjwood 10:18, 5 March 2007 (UTC)
The worked example is very good, nice and visual. I notice it describes the use of a traceback system to save storing a big table of subsequences. In the languages I program in, the subsequences and even sets of subsequences would be objects, so in the untweaked version of the scheme (without the traceback) the individual cells of the table would be single pointers only, many of them to the same objects. (For every cell that inherits its set of longest sequences only from the left or only from above, no new object would need to be created at all. That is, any of the mismatch cells where the cell above or left also have different-lengthed longest subsequences to pass on. I think that would be the majority of cells when you get into comparing large and contrasting sequences. In the case of a mismatch cell which inherits from left and above, no new subsequence objects are generated, only two lists of them combined.) So does this discount the value of the traceback approach, and should it be mentioned?
59.101.175.101 ( talk) 10:57, 29 October 2011 (UTC)
The only advantage I can see identity integers having over hash functions is that they don't have collisions, at the cost of taking longer to calculate. The article currently makes it sound like they have more advantages.. eg:
Yet any hash function will yield the same hash value for two lines that are the same, else it's not a hash function. So the above point seems fairly moot..
As above.
Identity integers also lose this property:
So unless I'm misunderstanding something, should the article be changed? (I don't fully understand how the function works yet, so in the case that I'm wrong I'd rather not change/make a mess of things =P, and also have never worked in perl/c++)
Themania 05:20, 20 April 2007 (UTC)
Also I forgot to mention the largest advantage of hash values - you can calculate them just once for a file, and then compare that file against other files. Very useful!
Themania 05:24, 20 April 2007 (UTC)
In traceback since R is {} , we can omit it in the first if .
Under "Complexity" is the following equation:
L is not previously defined. But, more to the point, I've had calculus II, and I can't make sense of the upper limit of that summation. Perhaps it could be put in words, as well, for us mere mortals.-- Christopher King ( talk) 21:36, 10 January 2009 (UTC)
Someone added the note on the right to the start of the computer programming examples:
Should the code examples and their discussion be moved to that article? That article has no introduction, just code, so the material in this article could serve as an introduction to that article.-- Christopher King ( talk) 23:23, 2 February 2009 (UTC)
Why don't the examples of "longest common sequence" actually have a common sequence?
How about:
MYDOG YOURDOG! —Preceding unsigned comment added by 58.106.110.238 ( talk) 08:49, 10 June 2009 (UTC)
The pseudo code in the "Reading out an LCS" section doesn't work.
function backTrace(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i = 0 or j = 0 return "" else if X[i-1] = Y[j-1] return backTrace(C, X, Y, i-1, j-1) + X[i-1] else if C[i,j-1] > C[i-1,j] return backTrace(C, X, Y, i, j-1) else return backTrace(C, X, Y, i-1, j)
Notice what happens when you take i and j to be 1,1. You get to the else if clause and compare X[0] to Y[0]. But the function definition shows X and Y to have ranges of 1..m/1..n, so the 0 is out of range.
Mbcook ( talk) 18:38, 12 June 2009 (UTC)
The statement The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) and LCS(X, Yn – 1) is not correct. The correct statement of the second property is The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) or LCS(X, Yn – 1). The statement that LCS(Xm – 1, Y) and LCS(X, Yn – 1) will both produce an LCS, as the example suggests, is false.
Take for example X = AGT and Y = ATC. LCS(Xm – 1, Y) = A and LCS(X, Yn – 1) = AT. Clearly AT is the longest common subsequence, not A. Thus, LCS(X, Y) = the longest sequences of LCS(Xm – 1, Y) or LCS(X, Yn – 1).
The current example of the second property is at best misleading.
Shannon Pattison ( talk) 20:31, 19 August 2009 (UTC)notpattison
Is the Equation really right? If in the third Case both Sequences have the same size? What did the max-function returns then? — Preceding unsigned comment added by 85.180.137.191 ( talk) 07:51, 28 August 2011 (UTC)
I think mentioning the LCS (aka MLCS or k-LCS) on more than two strings (NP-hard) is missing in this article. Jens ( talk) 10:34, 4 April 2012 (UTC)
The link in the intro links to http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence but no such header there, and not really any good examples of the difference. 129.241.124.103 ( talk) 21:09, 10 July 2013 (UTC)
Why has someone added a requirement for reference in the code sequence? The whole sequence is a paraphrase (very useful!) of what is explained above, and can be desk checked by any competent programmer. "unsourced material can be challenged and removed". This is not a controversial statement like the "sons of Nixons are involved in the My Lai massacre" which requires the backing of "reliable sources". It is more like "on a clear day, from the bell tower of the Notre Dame you can see the Effel tower". — Preceding unsigned comment added by 80.100.243.19 ( talk) 14:20, 13 April 2014 (UTC)
"For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor"
I don't see how the Method of Four Russians is applicable here to this problem. Method of Four Russians requires the contents of the matrix to come from a bounded alphabet, where as the contents of the matrix here are unbounded integers (even if the two input strings come from a bounded alphabet). — Preceding unsigned comment added by 71.198.24.3 ( talk) 06:33, 21 January 2019 (UTC)
There is a paper by Ukkonen [1] and a paper by Myers [2] that present efficient algorithms for bounded D = N - L. Maybe the reason they are missing from this page is that they are primarily phrased as being about edit distance where deletion and insertion cost 1 and substitution is not allowed (or costs 2 if you prefer, so slightly different than Levenshtein distance). However, it's clear that those algorithm also apply to computing the LCS, and it's even explicitly stated in at least one of the papers.
The paper by Myers exposes some interesting results:
The paper by Ukkonen looks like a subset of what is in Myers paper (point 1) but the result applies to a more general setting (or at least is written in such a way that it looks like the result applies to a more general setting).
The paper by Myers is often cited in the context of file diffing algorithms.
Aureooms ( talk) 15:45, 26 May 2021 (UTC)
References
It is written: "The prefixes of X are ...", which uses an undefined notation. I assume the author meant "... ..." instead, do you agree? — MFH: Talk 01:22, 27 October 2021 (UTC)
I didn't see any of the LCS variants elsewhere on Wikipedia and wondered if this article would be a good place to add a section about them. Three variants I'm aware of are:
This paper has an overview of the three above:
CLCS in particular is described here with an O(n^2 · m^2 · r) solution:
And a more recent CLCS solution in O(n · m · r) is described here:
Thanks!
Catleeball ( talk) 19:16, 9 July 2023 (UTC)
This section doesn't mention data differencing or delta encoding. Isn't the longest common subsequence also related to this method of compression? Jarble ( talk) 22:53, 5 March 2024 (UTC)