This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
What is important for a hash function is that it gives the same result on every run, so shuffling the table is not allowed. -- Bastie ( talk) 08:53, 4 June 2023 (UTC)
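For illustration, the fixed-table requirement can be sketched in Python. The table-building step below is a hypothetical, deterministic choice of permutation (Pearson's paper uses a hand-picked table); the point is only that T must be the same constant on every run, or identical inputs would hash differently.

```python
# Sketch of the classic 8-bit Pearson hash. The 256-entry permutation
# table T must be a fixed constant: if it were reshuffled between runs,
# the same input would yield a different hash each time.

def make_table():
    # A fixed, reproducible permutation of 0..255 (illustrative only;
    # not Pearson's published table).
    t = list(range(256))
    for i in range(255, 0, -1):
        j = (i * 167 + 13) % (i + 1)  # deterministic "shuffle" step
        t[i], t[j] = t[j], t[i]
    return t

T = make_table()

def pearson8(data: bytes) -> int:
    # One table lookup per input byte; the running state h is 8 bits.
    h = 0
    for b in data:
        h = T[h ^ b]
    return h
```

Because T is built deterministically, `pearson8(b"abc")` gives the same value in every run, which is the property the comment above insists on.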
Wikipedia is not the right place to share code. It is not the right place to debug code. It is the right place to quote code examples from reliable sources. See wp:rs and wp:or, two of the relevant policies. David Spector ( talk) 18:21, 4 June 2023 (UTC)
I've removed the Python code snippet for the following reasons:
Oli Filth( talk| contribs) 17:36, 11 September 2009 (UTC)
I have a bit of a conflict here, as I know that Pearson's Hash benchmarks well for many uses, from my own testing. So the practical impact is quite significant, and worth noting. However, as I understand, Wikipedia "no original research" policy frowns on linking back to your own articles, so I have NOT linked back to my own writings. Which also seems to make writing a weblog article to document my own benchmarks as not-quite-suitable. So how to make this all "legitimate"? pbannister ( talk) 21:50, 17 January 2010 (UTC)
Write a good article/blog post on it and wait for somebody else to consider it to be a good source for this article. That's all you can do. 82.33.61.156 ( talk) 16:00, 18 January 2010 (UTC)
The article talks about a function that accepts any kind of bytes, but the C code example is for null-terminated strings. I suggest a function phash(char* buffer, int length, unsigned hash). —Preceding unsigned comment added by 89.212.91.6 ( talk) 14:00, 25 January 2010 (UTC)
I second that the C code looks dubious, for two reasons. First, it claims to return a 64-bit value, but unless I am mistaken it returns an 8-bit char. I also think that line 32 (h = 0) is incorrect: a collision found for a single byte of output will continue to generate a collision no matter how many bytes of output are generated. Simply removing this line would remove the problem.
74.205.213.237 ( talk) 20:24, 20 October 2013 (UTC)
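The collision-propagation objection can be demonstrated with a small sketch. In this simplified variant (hypothetical table, not the article's C code), the running state is reset to the same constant before each output byte; the degeneracy is then total: every output byte is identical, so a collision on one byte is automatically a collision on all of them.

```python
# Demonstrates the point above: if the state is reset to the same constant
# (here 0) before producing each output byte, each byte is computed by an
# identical process, so widening the output adds no collision resistance.

def make_table():
    # Fixed, reproducible permutation of 0..255 (illustrative only).
    t = list(range(256))
    for i in range(255, 0, -1):
        j = (i * 167 + 13) % (i + 1)
        t[i], t[j] = t[j], t[i]
    return t

T = make_table()

def hash_reset_to_zero(data: bytes, nbytes: int) -> list:
    # Flawed variant: state starts at 0 for every output byte.
    out = []
    for _ in range(nbytes):
        h = 0
        for b in data:
            h = T[h ^ b]
        out.append(h)
    return out
```

Running this shows all output bytes come out equal, which is why the per-position state must differ (e.g. via distinct seeds), as discussed elsewhere on this page.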
The C implementation does not initialize the retval variable before use. Hence the value returned is undefined. 203.129.23.7 ( talk) 10:28, 8 May 2022 (UTC)
You probably did not read the comment I put on the deletion. I did not delete it due to the incorrectness and obvious lack of testing. I deleted it because of WP:OR. It needs to stay out of WP until it has been published and accepted by the programming community. David Spector ( talk) 14:42, 9 May 2022 (UTC)
It is easy to extend PH to 16 bits or more. Simply break the large number up into bytes, then PH each byte, using a seed which appears in the program as a random constant for each byte number; concatenate the results to yield a hash value that is as large as the input domain. As usual, the result may be reduced to any range using bit-masking or mod.
David Spector (talk) 18:07, 19 April 2010 (UTC)
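One reading of the scheme described above, sketched in Python: for each output byte position, run the ordinary 8-bit Pearson pass over the input, starting from a fixed per-position seed constant, then concatenate the resulting bytes. The table and the seed values below are hypothetical placeholders, not anything from the article or Pearson's paper.

```python
# Hypothetical sketch of the widening extension: one 8-bit Pearson pass
# per output byte, each starting from a different fixed seed constant,
# with the per-position bytes concatenated into a wider value.

def make_table():
    # Fixed, reproducible permutation of 0..255 (illustrative only).
    t = list(range(256))
    for i in range(255, 0, -1):
        j = (i * 167 + 13) % (i + 1)
        t[i], t[j] = t[j], t[i]
    return t

T = make_table()
SEEDS = [17, 101, 59, 228]  # arbitrary fixed per-byte-position constants

def pearson_wide(data: bytes, nbytes: int = 2) -> int:
    out = 0
    for k in range(nbytes):
        h = SEEDS[k]            # per-position seed, constant across runs
        for b in data:
            h = T[h ^ b]
        out = (out << 8) | h    # concatenate the per-position bytes
    return out
```

With nbytes=2 this gives a 16-bit value; as the comment notes, the result can then be reduced to any smaller range with a mask or mod.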
Seems to me there might be a more efficient notion for extending the Pearson hash to 16 bits or more, at the same cost (or nearly so) as the 8-bit hash.
No attempt at formal proof, and exclusion of original research puts this outside the scope of the Wikipedia article.
pbannister ( talk) 00:25, 20 April 2010 (UTC)
If you think about it, for C strings, all you need to generalize this algorithm is to cast (char*) to (ushort*), (ulong*) or (unsigned long long*), pad the whole string with zeros to avoid any sneaky SIGSEGVs, plus a little bitwise arithmetic. FYI. — Preceding unsigned comment added by 217.9.91.108 ( talk) 22:32, 31 January 2013 (UTC)
There are several opinionated statements throughout this article:
These are just some of the examples. It may be worth rewording these statements to remove potential bias. Also, the single citation seems rather weak to me, personally, but then I'm not willing to buy the article. If the article does support the listed benefits, it would at least be reassuring to add a reference to it there ("but it offers these benefits") -- otherwise the claims seem unverified. - 66.87.75.174 ( talk) 19:28, 15 May 2013 (UTC)
How useful is an 8-bit hash? :-) But humor aside, I get your point. I have made some revisions in response to the legitimate concern you expressed. But I don't believe that the remark 'excellent algorithm' is objectionable. It is a value judgement, but it's one based on a lot of experience in programming - I have decades - and before I wrote xPear16, I spent a fair amount of time simply studying the logic, so I could develop some deep confidence in it, which I did. I will take the time to explain my 'excellent' assessment rather than withdrawing that particular comment. I think it belongs in the article. Note: I'm not praising my program; I'm praising Pearson's wonderful little algorithm, there. It's some worthy niftiness.
Pearson is excellent. It's elegant and praiseworthy for three reasons. (1) It's remarkably simple, (2) it's fast (compared to other hashing algorithms) and (3) it offers excellent randomness in the values output. In computer science, in general, the first two traits are naturally admired. Simple code tends to be solid and reliable, and it can be incredibly efficient. (Binary search is very simple, but wonderfully fast. This is similar. How much praise could we offer binary search? Plenty.) Speed is also generally desirable. In hashing, the third trait (randomness) is very valuable, if not essential. The greater the randomness, the less chance of a collision.
In a 64-bit number space, close to perfect randomness - what Pearson provides - makes hash collisions so unlikely that there is no practical need to test for them. That is NOT the case with 32-bit hashes. (I'm talking here from extensive practical experience, not theoretical grounds. I use hashes a lot.) Given perfectly random 32-bit hashes, and 'only' about a million values generated, the reality is that one should expect a collision. Even if the chance is (apparently) only one in some thousands, there is almost bound to be a collision. That is the well-known 'birthdays in common' issue. Generally, with a million items, about one collision does occur, even though only about 1/4,000th of the number space is used. A million out of four billion numbers: good chance of collision. It seems contrary to reason, but it happens more often than not.
However, in the realm of 64-bit numbers, it's a very different story. (Given the same data set.) The values used are then so sparse that the probability of a collision becomes infinitesimal.
Trusting that the set of values generated will be collision-free depends (very much) on the randomness of the hash values output. One needs something very good, to make testing for collisions unnecessary, in practical terms. In 64-bit numbers, one could call the chance of collision the chance of fairy dust colliding. (Not likely.) 64-bit numbers are virtually ideal for hashing, because essentially the collision problem goes away. (If.)
Pearson's algorithm is excellent that way. Excellent randomness implies (in 64-bit numbers) no need to test for collisions, unless one is hashing many billions of items, at which point testing might become necessary. No need to test for collisions also helps in terms of speed, of course. The argument is simple. Making the number space four billion times larger makes the chance of a collision four billion times smaller. (Such that it's no longer an issue.)
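The 32-bit versus 64-bit argument above can be checked with the standard birthday approximation; this is a back-of-the-envelope computation about any well-distributed hash, not a property of Pearson hashing specifically.

```python
import math

def collision_probability(n_items: float, space_bits: int) -> float:
    # Birthday approximation:
    # P(at least one collision) ~= 1 - exp(-n(n-1) / (2N))
    N = 2.0 ** space_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * N))

# A million perfectly random hashes:
p32 = collision_probability(1e6, 32)  # 32-bit space: collision near-certain
p64 = collision_probability(1e6, 64)  # 64-bit space: collision negligible
```

The computation bears out the general shape of the claim: at a million items, a collision in a 32-bit space is to be expected, while in a 64-bit space it is vanishingly unlikely.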
In summary: from Pearson's speedy, simple, little algorithm comes - hash-wise - that third very nice and desirable property: randomness. The values the program shown generates seem to be perfectly random, which means that it seems to be optimal that way. It's like the perfect hashing scheme, if you like. The credit isn't due to me; it's due to Mr Pearson, for inventing it. I just figured out how to make it more useful.
In any event, I would say that the merits of the algorithm (considerable) make 'excellent' an okay comment. That's not 'bias'. How to say this better? It's as good as hashing gets. ??? :-)
For two solid practical reasons, I believe Pearson's algorithm deserves the compliment I offered and the unbiased description 'excellent'. The third reason - simplicity - is aesthetic. It's pleasing intellectually that it's so simple.
I will add a wee bit of personal history that is relevant. Years ago I wrote my own hashing routine; it became very useful to me, and it was excellent in terms of randomness (thus very usable), but rather slow, because it involved a lot of multiplying and adding. I eventually found a much faster routine, called SuperFastHash. I found it very good - much faster than mine, really speedy - but I was able to detect some subtle non-randomness in its output. (Not optimal. Nonetheless I extended it from 32 to 64 bits, and began using it.)
xPear16 is not only faster than SuperFastHash, but it also provides appreciably more random output. Checking digits was part of reaching that conclusion. It is literally the best - the fastest, with the most random output - hashing method that I am aware of. Now, prior to discovering Pearson, in trying to improve my own 'ugly' (brute force) way, that probably had the CPU running rather hot, I investigated a good many hashing schemes (at least a dozen), and speed-compared them all. They are mostly all complicated, logic wise; thus all were appreciably slower than SuperFastHash. But more complex to what benefit? I didn't see it. In practical terms, I don't much care what's 'under the hood', I just want it to go. I wanted very random; I also wanted really fast. At the end of all my investigating, and analyzing, and testing, I ended up with Pearson, and what I have personally devised from it.
My faith in Pearson's beautiful original algorithm is deep. (Deep enough to warrant this little essay on the subject.) I think it's the winner. The best, in three different respects. Incredibly, beautifully simple; really fast (I've pigged it down with the multiple scans, but it's still faster than the fastest hashing method I found and was using). And best of all: first-class output. If admiring an algorithm that is truly worthy makes me biased...so be it. Pearson's nifty hashing scheme will stand on its own merits, no matter what I say about it. :-) Cheers. Stellar-TO ( talk) 14:15, 23 May 2013 (UTC)
I was very dismayed to see my contribution deleted. I shouldn't have to, but let me argue it again. There is limited and questionable utility in an (only) 8-bit hash generator. It's lovely in principle, and in the case of the Pearson hash, it even warrants the term elegant. But it's of limited and questionable practical utility. However, with a bit of additional logic wrapped around it, Pearson's algorithm can be *extremely* useful. Many of my programs are riddled with it, and rely heavily on it. That is why my programs tend to be really fast. Reduce most things to numbers, then it's all just number-crunching. Which means they use an extended - my - form of Pearson's method. A very fast, quality (very random) 64-bit/16-byte hash generator *is* really useful. With a bit of tweaking, the code I've contributed could generate *very* long Pearson hashes; even 256 bytes or 1,024 bits long: possible, with the same code, slightly tweaked. It would do it fast, and no need to worry about any repeating values or integer overflow.
To the anonymous IP yonk who stupidly removed the section of the article that explains in clear terms *how* it can be easily done: the article is now back to showing how very *useful* Pearson's fast, excellent little algorithm is. His method has so many advantages. Given that I've had much experience using it, and have never seen even a hint of a problem, it's undoubtedly the only hashing method I will ever use. But only writing that C program allowed me to realize how simple and easy it was to generate long hashes, using his method. Without a *clear* explanation, how do people understand? The best way is to show them some actual *logic*. Then they can soon go: ahah. And maybe begin using it.
Yes, there are other places I could put it. In fact, I'm sure the web pages devoted to Pearson hashing are legion. But how dumb was your reason - "the Internet is big" - for objecting to my putting that code here, when it gave a clear illustration of why the Pearson algorithm, in being so simple, is great? I say that's downright *dense*, pally. Stellar-TO ( talk) 13:12, 16 September 2013 (UTC)
Does anyone know of the Pearson algorithm having been analyzed as a possible Cryptographic hash function? It is an obvious choice as the basis for a Hash table implementation, but I haven't seen its properties promoted for Cryptography. Even if it is cryptographically excellent in its native 8 bits, in practice longer extensions must be used, and therefore analyzed. (The great speed of the Pearson algorithm can also be a drawback in some cryptographic applications, but an advantage in others.) David Spector ( talk) 21:55, 12 April 2014 (UTC)
I propose deleting the Example implementations section. While it is nice to have actual running code in WP, that is not part of WP's mandate; there may be other websites that provide sample PH code that has been tested professionally. We don't know who wrote the current code, whether it was tested other than by the author, or where the code comes from, since it contains no references. WP prohibits including unpublished material; see WP:OR. In addition, WP requires references to be to reliable sources, see WP:RS. I believe it is better to publish nothing than to publish code that may be incorrect or may not work on all versions of available compilers, interpreters, and OSs.
Finally, WP is an encyclopedia (see WP:Wikipedia is an encyclopedia); it is not a place to publish sample source code in multiple languages, whether tested or untested, unless the code was published already in a reliable source and is known not to be copyrighted. David Spector ( talk) 21:23, 14 June 2023 (UTC)