This article is rated Start-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||||||||||||||||||||||||||
|
Well, we have an implentation which does not produce correct results using two compilers. And the original poster does not perform the shift correctly. Do not use this code.
151.196.6.139 ( talk)noloader
I'm not native to the C programming language and I don't have time to start sifting through the code in this article to develop a pseudocode representation of this algorithm.
Can someone who knows C better than I please read through and put the code into something cross-platform?
-- Oddb411 13:01, 17 September 2006 (UTC)
Many programmers ( WikiWikiWeb:MeaningfulName) now recommend "Always, always, always use good, unabbreviated, correctly-spelled meaningful names." Is there some more meaningful name we could use instead of "hpos"Â ?
In the Boyer-Moore algorithm article, the "hpos" variable tracks the position where a copy of the needle might possibly start in the haystack, although we generally look at some later position in the haystack. (What would be a more meaningful name for this?)
In this BoyerâMooreâHorspool algorithm article, the "hpos" variable tracks the actual letter in the haystack that we look at. (What would be a more meaningful name for this?)
(The difference is precisely
BMH_hpos = BM_hpos + npos
).
I think doing it the same way in both articles would be less confusing. Which way (the current BM way, or the current BMH way) would make clearer, easier-to-understand code?
-- 68.0.120.35 02:32, 25 March 2007 (UTC)
The article says "For instance a 32 byte needle ending in "z" searching through a 255 byte haystack which doesn't have a 'z' byte in it would take 3 byte comparisons."
I think it meant 7 byte comparisons, since each one skips 32 bytes until there's less that 32 bytes remaining.
Can anybody confirm? âPreceding unsigned comment added by 155.210.218.53 ( talk) 18:21, 18 January 2008 (UTC)
- Yes, I concur, and will make the change. (And rather than use base-2 numbers, I'll use base-10 -- it gets the same point across w/ more-obvious arithmetic.)
not-just-yeti (
talk) 20:00, 3 July 2019 (UTC)
- Whoops, wait, I take that back -- that'd only be true if the needle were *entirely* "z"s, which is a bit *too* optimistic to include. You still get good average-case performance (skipping half the needle's length on average, each time) for haystacks whose letter-distribution matches the needle (besides 'z') and the needle contains no duplicate letters. âŚBut I don't have a citation, and it's a bit of a mouthful.
- I still think we should better characterize the example; "up to 244 comparisons for a string of length 255" feels a bit unfair to the algorithms true typical-good-case performance.
not-just-yeti ( talk) 20:11, 3 July 2019 (UTC)
I might be wrong, but it looks like space can be saved in the bad character skip table by using a hash of the character instead of its actual value. In the case of a collision, the result will just be a smaller shift.
Not a particularly useful idea when the table's only 256 chars long, but it would allow for much better storage requirements if you were using, say, a 32-bit character set. In a case like that, probably only a small fraction of the character set would be seen, and so chances of collision would be low for a reasonably sized hash. CountingPine ( talk) 08:27, 17 July 2009 (UTC)
How is this related to KMP? If anything, the other heuristic in Boyer-Moore (which is not in this algorithm) is closely related to KMP's table (e.g. the compute_prefix code in http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm is exactly the pre-processing in KMP). Raduberinde ( talk) 23:36, 17 August 2010 (UTC)
From http://code.activestate.com/recipes/117223-boyer-moore-horspool-string-searching/
# bmh.py
#
# An implementation of Boyer-Moore-Horspool string searching.
#
# This code is Public Domain.
#
def BoyerMooreHorspool(pattern, text):
m = len(pattern)
n = len(text)
if m > n: return -1
skip = []
for k in range(256): skip.append(m)
for k in range(m - 1): skipord(patternk])] = m - k - 1
skip = tuple(skip)
k = m - 1
while k < n:
j = m - 1; i = k
while j >= 0 and texti == patternj]:
j -= 1; i -= 1
if j == -1: return i + 1
k += skipord(textk])]
return -1
if __name__ == '__main__':
text = "this is the string to search in"
pattern = "the"
s = BoyerMooreHorspool(pattern, text)
print 'Text:',text
print 'Pattern:',pattern
if s > -1:
print 'Pattern \"' + pattern + '\" found at position',s
## end of http://code.activestate.com/recipes/117223/ }}}
(code is released under the PSF license ( http://www.opensource.org/licenses/Python-2.0) <-- What? It makes absolutely no sense to apply a more restrictive license to public domain source code. We can do whatever we want with it, PSF be damned.
Wouldn't there be a benefit in pointing out real live examples, e.g. https://github.com/ggreer/the_silver_searcher, to substantiate the usefulness of this algorithm?
I saw earlier the talk page that the algorithm itself was not implemented correctly. I am too lazy to verify the correctness of this code, so I'll leave it to others, but in the above example you have an implementation that does work (might have bugs though as any algo/sw).
There you have a truly super-fast grep tool using this algorithm for substring searches. â Preceding unsigned comment added by 192.75.88.232 ( talk) 19:49, 25 June 2013 (UTC)
It took me 2+ years to optimize and explore this beautiful and simple algorithm, finally the gods of searching helped me to reveal the FASTEST function for searching a block of memory into another block, the so called MEMMEM, given that Windows OS lacks this important function and seeing how *nix world has got nothing worthy enough (except some 2007-2008 empty talks in newsgroups) I think it is time for change, the role of the successor is played by 'Railgun_Sekireigan_Bari'.
Just saw that my BMH order 2/12 link in 'External links' is removed, what is the problem? â Preceding unsigned comment added by Sanmayce ( talk ⢠contribs) 17:45, 19 December 2013 (UTC)
Raita's is apparently an optimization of BMH. QVVERTYVS ( hm?) 15:08, 15 March 2015 (UTC)
I translated this from the Python version that was previously on this page, but it doesn't match Horspool's pseudocode very closely. It also contains some bugs that I discovered when implementing it (corner cases). QVVERTYVS ( hm?) 10:26, 6 June 2015 (UTC)
" A very fast substring search algorithm"; Daniel M. Sunday; Communications of the ACM; August 1990
The Sunday variant is slightly more efficient than Horspool.
Thierry Lecroq covers the three versions presented by Sunday. " Exact String Matching Algorithms" â Preceding unsigned comment added by SirWumpus ( talk ⢠contribs) 16:37, 5 November 2015 (UTC)
The pseudo code in the article
function preprocess(pattern) T â new table of 256 integers for i from 0 to 256 exclusive T[i] â length(pattern) for i from 0 to length(pattern) - 1 exclusive T[pattern[i]] â length(pattern) - 1 - i return T function search(needle, haystack) T â preprocess(needle) skip â 0 while length(haystack) - skip ⼠length(needle) i â length(needle) - 1 while haystack[skip + i] = needle[i] if i = 0 return skip i â i - 1 skip â skip + T[haystack[skip + length(needle) - 1]] return not-found
hangs at "skip = 2" when the haystack is "ADBBCCBDCDCC" and the needle is "ABC".
When you add a line like
T[pattern[length(pattern) - 1]] â 1
, then it works fine.
â Preceding unsigned comment added by Aunkrig ( talk ⢠contribs) 08:48, 4 October 2017 (UTC)
from 0 to length(pattern) - 1 exclusive
... so it seems wrong to belittle that choice in that comment.
The first operation in the while loop is to compare, somehow, the entire match candidate site with the pattern. If the comparison fails, the next operation is to get haystack[skip + length(needle) - 1]
to use as table index into T; this is the rightmost character in the candidate site.
No matter which end you start at, it's quite likely that the first character compared will fail to match, so it's better to start by reading the rightmost one, since you're going to need it anyway. I.e. the first 'if' below will usually fail, and the code will already have hlast
in a register for the table index of T.
Factoring the entire string compare into the 'same' function makes the algorithm a little clearer overall, but obscures this detail.
function search(needle, haystack) T â preprocess(needle) nn1 â len(needle) - 1 nlast â needle[nn1] skip â 0 while length(haystack) - skip ⼠length(needle) hlast â haystack[skip + nn1] if hlast == nlast if same(haystack[skip:], needle, nn1) return skip skip â skip + T[hlast] return not-found
Greg.smith.tor.ca ( talk) 16:42, 9 July 2023 (UTC)
This article is rated Start-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||||||||||||||||||||||||||||||
|
Well, we have an implentation which does not produce correct results using two compilers. And the original poster does not perform the shift correctly. Do not use this code.
151.196.6.139 ( talk)noloader
I'm not native to the C programming language and I don't have time to start sifting through the code in this article to develop a pseudocode representation of this algorithm.
Can someone who knows C better than I please read through and put the code into something cross-platform?
-- Oddb411 13:01, 17 September 2006 (UTC)
Many programmers ( WikiWikiWeb:MeaningfulName) now recommend "Always, always, always use good, unabbreviated, correctly-spelled meaningful names." Is there some more meaningful name we could use instead of "hpos"Â ?
In the Boyer-Moore algorithm article, the "hpos" variable tracks the position where a copy of the needle might possibly start in the haystack, although we generally look at some later position in the haystack. (What would be a more meaningful name for this?)
In this BoyerâMooreâHorspool algorithm article, the "hpos" variable tracks the actual letter in the haystack that we look at. (What would be a more meaningful name for this?)
(The difference is precisely
BMH_hpos = BM_hpos + npos
).
I think doing it the same way in both articles would be less confusing. Which way (the current BM way, or the current BMH way) would make clearer, easier-to-understand code?
-- 68.0.120.35 02:32, 25 March 2007 (UTC)
The article says "For instance a 32 byte needle ending in "z" searching through a 255 byte haystack which doesn't have a 'z' byte in it would take 3 byte comparisons."
I think it meant 7 byte comparisons, since each one skips 32 bytes until there's less that 32 bytes remaining.
Can anybody confirm? âPreceding unsigned comment added by 155.210.218.53 ( talk) 18:21, 18 January 2008 (UTC)
- Yes, I concur, and will make the change. (And rather than use base-2 numbers, I'll use base-10 -- it gets the same point across w/ more-obvious arithmetic.)
not-just-yeti (
talk) 20:00, 3 July 2019 (UTC)
- Whoops, wait, I take that back -- that'd only be true if the needle were *entirely* "z"s, which is a bit *too* optimistic to include. You still get good average-case performance (skipping half the needle's length on average, each time) for haystacks whose letter-distribution matches the needle (besides 'z') and the needle contains no duplicate letters. âŚBut I don't have a citation, and it's a bit of a mouthful.
- I still think we should better characterize the example; "up to 244 comparisons for a string of length 255" feels a bit unfair to the algorithms true typical-good-case performance.
not-just-yeti ( talk) 20:11, 3 July 2019 (UTC)
I might be wrong, but it looks like space can be saved in the bad character skip table by using a hash of the character instead of its actual value. In the case of a collision, the result will just be a smaller shift.
Not a particularly useful idea when the table's only 256 chars long, but it would allow for much better storage requirements if you were using, say, a 32-bit character set. In a case like that, probably only a small fraction of the character set would be seen, and so chances of collision would be low for a reasonably sized hash. CountingPine ( talk) 08:27, 17 July 2009 (UTC)
How is this related to KMP? If anything, the other heuristic in Boyer-Moore (which is not in this algorithm) is closely related to KMP's table (e.g. the compute_prefix code in http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm is exactly the pre-processing in KMP). Raduberinde ( talk) 23:36, 17 August 2010 (UTC)
From http://code.activestate.com/recipes/117223-boyer-moore-horspool-string-searching/
# bmh.py
#
# An implementation of Boyer-Moore-Horspool string searching.
#
# This code is Public Domain.
#
def BoyerMooreHorspool(pattern, text):
m = len(pattern)
n = len(text)
if m > n: return -1
skip = []
for k in range(256): skip.append(m)
for k in range(m - 1): skipord(patternk])] = m - k - 1
skip = tuple(skip)
k = m - 1
while k < n:
j = m - 1; i = k
while j >= 0 and texti == patternj]:
j -= 1; i -= 1
if j == -1: return i + 1
k += skipord(textk])]
return -1
if __name__ == '__main__':
text = "this is the string to search in"
pattern = "the"
s = BoyerMooreHorspool(pattern, text)
print 'Text:',text
print 'Pattern:',pattern
if s > -1:
print 'Pattern \"' + pattern + '\" found at position',s
## end of http://code.activestate.com/recipes/117223/ }}}
(code is released under the PSF license ( http://www.opensource.org/licenses/Python-2.0) <-- What? It makes absolutely no sense to apply a more restrictive license to public domain source code. We can do whatever we want with it, PSF be damned.
Wouldn't there be a benefit in pointing out real live examples, e.g. https://github.com/ggreer/the_silver_searcher, to substantiate the usefulness of this algorithm?
I saw earlier the talk page that the algorithm itself was not implemented correctly. I am too lazy to verify the correctness of this code, so I'll leave it to others, but in the above example you have an implementation that does work (might have bugs though as any algo/sw).
There you have a truly super-fast grep tool using this algorithm for substring searches. â Preceding unsigned comment added by 192.75.88.232 ( talk) 19:49, 25 June 2013 (UTC)
It took me 2+ years to optimize and explore this beautiful and simple algorithm, finally the gods of searching helped me to reveal the FASTEST function for searching a block of memory into another block, the so called MEMMEM, given that Windows OS lacks this important function and seeing how *nix world has got nothing worthy enough (except some 2007-2008 empty talks in newsgroups) I think it is time for change, the role of the successor is played by 'Railgun_Sekireigan_Bari'.
Just saw that my BMH order 2/12 link in 'External links' is removed, what is the problem? â Preceding unsigned comment added by Sanmayce ( talk ⢠contribs) 17:45, 19 December 2013 (UTC)
Raita's is apparently an optimization of BMH. QVVERTYVS ( hm?) 15:08, 15 March 2015 (UTC)
I translated this from the Python version that was previously on this page, but it doesn't match Horspool's pseudocode very closely. It also contains some bugs that I discovered when implementing it (corner cases). QVVERTYVS ( hm?) 10:26, 6 June 2015 (UTC)
" A very fast substring search algorithm"; Daniel M. Sunday; Communications of the ACM; August 1990
The Sunday variant is slightly more efficient than Horspool.
Thierry Lecroq covers the three versions presented by Sunday. " Exact String Matching Algorithms" â Preceding unsigned comment added by SirWumpus ( talk ⢠contribs) 16:37, 5 November 2015 (UTC)
The pseudo code in the article
function preprocess(pattern) T â new table of 256 integers for i from 0 to 256 exclusive T[i] â length(pattern) for i from 0 to length(pattern) - 1 exclusive T[pattern[i]] â length(pattern) - 1 - i return T function search(needle, haystack) T â preprocess(needle) skip â 0 while length(haystack) - skip ⼠length(needle) i â length(needle) - 1 while haystack[skip + i] = needle[i] if i = 0 return skip i â i - 1 skip â skip + T[haystack[skip + length(needle) - 1]] return not-found
hangs at "skip = 2" when the haystack is "ADBBCCBDCDCC" and the needle is "ABC".
When you add a line like
T[pattern[length(pattern) - 1]] â 1
, then it works fine.
â Preceding unsigned comment added by Aunkrig ( talk ⢠contribs) 08:48, 4 October 2017 (UTC)
from 0 to length(pattern) - 1 exclusive
... so it seems wrong to belittle that choice in that comment.
The first operation in the while loop is to compare, somehow, the entire match candidate site with the pattern. If the comparison fails, the next operation is to get haystack[skip + length(needle) - 1]
to use as table index into T; this is the rightmost character in the candidate site.
No matter which end you start at, it's quite likely that the first character compared will fail to match, so it's better to start by reading the rightmost one, since you're going to need it anyway. I.e. the first 'if' below will usually fail, and the code will already have hlast
in a register for the table index of T.
Factoring the entire string compare into the 'same' function makes the algorithm a little clearer overall, but obscures this detail.
function search(needle, haystack) T â preprocess(needle) nn1 â len(needle) - 1 nlast â needle[nn1] skip â 0 while length(haystack) - skip ⼠length(needle) hlast â haystack[skip + nn1] if hlast == nlast if same(haystack[skip:], needle, nn1) return skip skip â skip + T[hlast] return not-found
Greg.smith.tor.ca ( talk) 16:42, 9 July 2023 (UTC)