This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1
Hi Pfunk42! I just saw your major overhaul of hash function and "merging" of hash algorithm. Very nice work! You beat me to it. I put up those merging notices but never got around to note on the talk pages that I was going to do that work in some days. But I am happy you did it since it saved me the work and you did it so well. I added a picture instead, hope you like it. -- David Göthberg 02:31, 5 September 2005 (UTC)
The article says that hash functions are surjective. Is that true? They're certainly not injective, but my understanding is that "surjective" means that "All elements of the image domain are actually 'used' by the mapping," to quote a posting by Peter Simons from the monoton-devel mailing list. [1]
So a hash function could be surjective, but I don't think it has to be.
But this is so far from my areas of expertise that I don't want to just change it; I may be way offbase. -- Elysdir 00:10, 17 December 2005 (UTC)
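For what it's worth, the question is easy to probe experimentally: a hash function maps *into* its output range, but nothing forces it to cover the whole range. A minimal sketch (the function and the inputs are invented purely for illustration):

```python
# A toy hash with 8 possible outputs; names are illustrative only.
def toy_hash(s: str) -> int:
    # Sum the character codes, double, and reduce mod 8: the doubling
    # forces every output to be even, so odd buckets are never produced.
    return (2 * sum(map(ord, s))) % 8

outputs = {toy_hash(w) for w in ["apple", "pear", "plum", "fig", "kiwi", "melon"]}
# Every output is even; 1, 3, 5, 7 are unreachable, so this function
# maps into {0,...,7} without being surjective onto it.
```

So a hash function *may* be surjective, but the property is not required, which matches the doubt raised above.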
What's a stock hash function? Isn't it the case that once an algorithm is published it becomes stock? What's significant about a stock function the way it is described in this article? Just a little bit confused. -- MegaHasher
Ok. maybe i should have discussed it before throwing it in there, but this is a widely used term for a hash function that is one-to-one. a google search for "randomization function" reveals that most of the results use the term as i describe.
True, a bijective function whose domain equals its range represents a permutation, but calling any one-to-one hash function a "permutation" is highly misleading.
It might be worth creating a separate "Randomization Function" page (with redirect from "Mix Function"), but i'm not sure. Comments welcome.-- pfunk42 06:12, 2 January 2006 (UTC)
IMHO, Wikipedia already has too many semi-related pages about random functions. In my mind, permutations used for hashing have less strength than permutations used for pseudo-randomness. Even with lots of rounds these permutations probably would not be too good for pseudo-randomness, because the period is too short when there are already fast RNGs with very long periods. -- MegaHasher 06:33, 2 January 2006 (UTC)
Many uses of randomization functions, especially those which use them under that name, have no "period" requirement/concern. In fact, periodicity concerns might inappropriately narrow the space of possible functions. Other uses (such as pseudorandom number generation) might call for functions satisfying the requirements of a "randomization function" as well as other, stronger requirements. It would seem to me appropriate to disambiguate such classes of requirements in an article. Here's one thing i believe is a useful observation: a random function is an idealization of a hash function; a random permutation is an idealization of a randomization function. -- pfunk42 13:08, 2 January 2006 (UTC)
I admit perhaps the original intro is too technical. Hopefully my edit is more to the level of a beginner programmer. -- MegaHasher 03:44, 25 January 2006 (UTC)
I appreciate your effort, but even so the opening sentence basically says "a hash function is something used to create a hash value". I got to this page by searching for "hash value", which redirected me here. In other words, I still don't know what "hash" means, except with respect to meat dishes. I'm going attempt to redress that. ShawnVW 18:52, 2 March 2006 (UTC)
I think the intro is good (concise, simple, accurate, and to the point). There is one small inaccuracy, I believe. The input to the hash function is not necessarily variable length. Here are a few definitions (from a Google search of "define hash function") that might be worth consideration as part of the intro:
Web definitions
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. ... en.wikipedia.org/wiki/Hash_function
(Hash functions) take a large amount of data and create a small fingerprint based on that data.
www.informit.com/guides/content.aspx
I like aspects of the first one, because it includes "possibly," and I like aspects of the second one because it is easy to picture and understand for someone not familiar with this type of information. Maybe the best of both could be incorporated?
--Mark Nadon 8/13/2013
It seems the intro has become longer than the main text. Clean-up is again needed. - MegaHasher 23:50, 5 April 2006 (UTC)
The Intro is incorrect. A well designed hash function is not necessarily one-way nor unforgeable. A well designed hash function only needs to distribute input keys well across all buckets. You might be able to prove that a hash function that does a good job of spreading values across buckets is also one-way or non-forgeable, but those are not initial design constraints. There are "niche" applications in which one-way or non-forgeable attributes are important, but it is not a design constraint for hash functions in general. Chuck Simmons 20:06, 14 April 2006 (UTC)
Say I have a hash (I know the hashing function and it has no salt), and I know my cleartext is between 12 and 18 bytes long and contains only ascii a-z \. 0-9 Would it be possible to write a function to generate all the possible cleartexts that fall under this particular category with this hash? Doodle77 03:24, 28 March 2006 (UTC)
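In principle yes: with no salt you can enumerate the whole candidate space and keep every cleartext that matches. In practice, 38 symbols at lengths 12 to 18 is roughly 38^12 ≈ 10^19 candidates at the low end, which is out of brute-force reach. A hedged sketch of the idea (MD5 is just a stand-in for whatever unsalted hash is in play; the toy length 2 is only so the loop terminates):

```python
import hashlib
from itertools import product

# The alphabet from the question: a-z, backslash, dot, digits (38 symbols).
ALPHABET = "abcdefghijklmnopqrstuvwxyz\\.0123456789"

def find_preimages(target_hex: str, length: int, alphabet: str = ALPHABET):
    """Yield every string of the given length whose MD5 matches target_hex.

    This answers the question in principle; at lengths 12-18 the candidate
    count (38**12 to 38**18) makes the search computationally infeasible.
    """
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        if hashlib.md5(candidate.encode()).hexdigest() == target_hex:
            yield candidate

# Toy demonstration with length 2 instead of 12-18 (only 38**2 candidates):
target = hashlib.md5(b"ab").hexdigest()
matches = list(find_preimages(target, 2))
```

So the function is easy to write; it is the size of the search space, not the code, that protects the cleartext.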
-- pfunk42 12:36, 23 April 2006 (UTC)
"Contrary to an assumption often voiced by people who don't follow the industry carefully,"
That seems a bit too opinionated to be in accord with WP:NPOV and WP:AWW, all the more so that it's uncited. I made the statement a bit more neutral and verifiable, but I still ask that a source be given for this claim, or otherwise it should be removed entirely. The claim regarding the noisy room is also uncited, so I'm requesting a source for that as well. -- Ultimus 17:07, 1 July 2006 (UTC)
Two issues are in this section:
Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences.
This sentence is way too vague. Could someone please elucidate which algorithms and give some perspective on how they work? I assume one could also bring up the broader topic of "algorithms for analog signal identification", but I really have no knowledge about the topic. Poli 10:38, 12 Jun 2005 (UTC)
A Google search led me to
I suggest we move most of this information on audio identification to the acoustic fingerprint article. -- 75.37.227.177 14:55, 18 July 2007 (UTC)
Musicbrainz indeed is an open source system as the article describes; however Musicbrainz does not do audio hashing. It depends on a commercial system to do that— Trevor Caira 11:09, 29 December 2006 (UTC)
How do we know it isn't using human recognition [2]? -- 75.37.227.177 14:55, 18 July 2007 (UTC)
The article claims that a merge with One-way security is being considered. As far as I can tell there is no discussion of that here and no proposed reason why the merge is a good idea. I don't see why that merge makes sense. So, I'm going to remove the merge tags. Feel free to add them back and explain why you think it would be a good idea. -- SamHartman 21:22, 3 October 2007 (UTC)
If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).
This seems awkward and opaque to me...
The only semantic difference I can see in this text seems to suggest that for "strongly" collision free hash functions, even if message x = y, H(x) != H(y)! Which doesn't sound much like a function to me...
Looking around, I found (at http://www.cs.berkeley.edu/~jordan/courses/174-spring02/notes/note14.pdf) the following:
2. A hash function h(x) is said to be weakly collision-free if given a message x1 it is hard to find another message x2 such that h(x1) = h(x2). 3. A hash function h(x) is said to be strongly collision-free if it is hard to find any pair of messages x1, x2 such that h(x1) = h(x2).
This is still not too clear to me.
Weakly collision free seems to me to mean: Given any specific message x, it is computationally infeasible to find a message y such that H(x) = H(y), but if you wish to, you can craft two messages together (x and y), such that H(x) = H(y).
That is, if you're NOT a hacker trying to create a forged message (which is hard), but rather a social engineer trying to create TWO messages which can be interchanged to create plausible deniability (which is easier), you can do it with a Weakly collision free hashing function.
Strongly collision free seems to me to mean: Even though in a large domain of messages there will clearly be a large number of collisions, It is computationally infeasible to find ANY two messages x and y such that x != y and H(x) = H(y), even if you are trying to craft the two together.
Sigh... I'm still not sure how to phrase this in a way which is clear to the casual reader, even assuming that I have correctly interpreted this. Anyone else care to give it a try?
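One way to make the distinction concrete: "weakly collision-free" (second-preimage resistance) fixes x first and asks for a matching y, while "strongly collision-free" lets the attacker craft *both* messages, which the birthday paradox makes much easier. A toy sketch with a deliberately truncated hash (the truncation is mine, purely so the effect is visible):

```python
import hashlib

def weak_hash(msg: bytes) -> str:
    # Truncating SHA-256 to 16 bits (4 hex digits) makes collisions cheap.
    return hashlib.sha256(msg).hexdigest()[:4]

# "Strong" collision search: find ANY pair x != y with H(x) = H(y).
# With a 16-bit output the birthday paradox predicts success after about
# 2**8 = 256 tries, far fewer than the ~2**16 tries a fixed-target
# (second-preimage) attack would need on average.
seen = {}
collision = None
for i in range(70000):            # > 2**16, so success is guaranteed
    msg = b"message-%d" % i
    h = weak_hash(msg)
    if h in seen:
        collision = (seen[h], msg)
        break
    seen[h] = msg
```

This is exactly the "social engineer crafting TWO messages" scenario described above: neither message was fixed in advance, which is why the attack is so much cheaper.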
I agree with Nahaj. However, I think the distinction between "weakly collision-free hash function" vs. "strongly collision-free hash function" only matters when we are talking about cryptographic hash functions -- right? So I see no need to even mention that distinction in the hash function article -- please delegate it to the cryptographic hash function article.
I agree this is only an issue for cryptographic hashing. There are similar issues that come up in adversarial hashing. For example if someone can find a large group of messages that all fall into the same hash bucket they can make your performance suffer. However I think that collisions resistance is enough to make something a cryptographic hash function. SamHartman 21:33, 3 October 2007 (UTC)
I have restored the qualification "usually a single integer that may serve as an index into an array" for the result of a hash function. To my knowledge, this property is what distinguishes hash functions (in the proper sense of the term) from other related but distinct concepts such as checksums, crypto hash functions, fingerprints, etc.. For instance, a function that returns a 128-bit value cannot be called a hash function.
Perhaps some people would rather define "hash function" as a generic term that covers all those concepts. But then we would be left without a name for the original sense ('a function that is suitable for "hashing" scattered data into a table'). All the best, -- Jorge Stolfi ( talk) 18:41, 21 November 2008 (UTC)
This article seems to have some overlap in function and content with Hash_table#Choosing_a_good_hash_function. Also, the latter characterizes the most important test for a hash function as being the avalanche condition, which seems a bit dubious to me - could an expert take a look at this? Dcoetzee 10:49, 24 November 2008 (UTC)
This article had an illustration that was actually showing a checksum or fingerprint function, rather than a hash function proper (see at right). The illustration has been copied to the checksum and fingerprint (computing) articles. Would someone provide an illustration for hashing in the proper sense? (I don't have any SVG editor at hand). Thanks, -- Jorge Stolfi ( talk) 19:06, 21 November 2008 (UTC)
Quote: "See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself." It would be informative to know who on earth this Jenkins is, before he is alluded to as 'the big man himself'. :) -- Egregius 23:11, 29 Mar 2005 (UTC)
The section Geometric hashing looks like a description of Vector quantization. Vector quantisation collects groups (not just pairs) of data points that are close together (in typically 2- or more dimension space) so they can be represented with small errors by a single vector e.g. the centroid of the data group. Lossy compression is achieved by describing the data set using the reduced set of vectors. An example application is image compression from 24-bit RGB color to an optimum indexed 256-colour palette. In contrast, hashing seeks to group data in its own space that is constructed with the intention that data points that are actually close should not be kept together. Cuddlyable3 ( talk) 11:22, 19 December 2008 (UTC)
The example code given for a trivial hash function doesn't seem to be the best way to explain this; whilst it is a completely valid section of code, it is probably a bit obscure to anyone who doesn't know C and/or C++. An example in some form of pseudocode might make more sense? The parts that may be too obscure to someone unfamiliar with the language are
char*
declaring a pointer to one character within a string of text, the conditional
if(g=h&0xf0000000)
both changing the value of g and checking this value, and
*p!='\0'
catching the end of the string of text. -- Elder pegasus ( talk) 22:32, 6 March 2009 (UTC)
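For what it's worth, the fragments quoted look like the classic PJW/ELF string hash. A pseudocode-style rendering in Python, with the obscure C idioms spelled out in comments (this is a paraphrase of what the snippet appears to do, not the article's code verbatim):

```python
def elf_hash(s: str) -> int:
    """Python rendering of the classic PJW/ELF string hash.

    The C original's pointer walk (while *p != '\\0') becomes a plain
    for-loop, and the assign-and-test idiom is split into two steps.
    """
    h = 0
    for ch in s:                      # walk the string one character at a time
        h = (h << 4) + ord(ch)        # shift previous bits up, mix in new char
        g = h & 0xF0000000            # the top 4 bits about to overflow 32 bits
        if g:                         # in C: if (g = h & 0xf0000000) assigns AND tests
            h ^= g >> 24              # fold the overflowing bits back in lower down
        h &= ~g                       # clear the top 4 bits, keeping h in 32 bits
    return h
```

Something in this style might serve the article better than raw C, since each of the three confusing idioms gets its own commented line.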
There seems to be some confusion about the "modulo the table size" bit. A hash function used for a hash table with N buckets should return an integer in 0..N-1. That is, the "mod" operation, if it is done at all, is technically inside the hash function, not outside it. The whole discussion about what makes a good hash function assumes this view. As said further on in the article, one way to get a hash function for a given size N is to take a hash function h with a much larger range (such as 2^32) and then compute h(k) mod N. But that is not the only way, and it may easily give a very poor hash function. If N is a power of two, the mod-N trick gives the low-order bits of h(k), which, for some popular hash functions, are known to be the worst ones. There are many other ways of reducing a large hash value to a smaller one, but they cannot be analyzed or recommended separately from the raw hash function. This is another reason to consider the mod-N operation a part of the hash function, and to avoid mentioning it outside of that section. All the best, -- Jorge Stolfi ( talk) 01:57, 18 April 2009 (UTC)
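The point about power-of-two mod-N keeping only the low-order bits can be seen directly. A sketch comparing plain mod reduction with multiplicative (Fibonacci) hashing, which takes the *high* bits of a product so every input bit can influence the slot (the constant is 2^64/phi; the pathological inputs are invented for illustration):

```python
def index_mod(h: int, n: int) -> int:
    # With a power-of-two n, this keeps only the low-order bits of h.
    return h % n

def index_fibonacci(h: int, bits: int) -> int:
    # Multiplicative (Fibonacci) hashing: multiply by 2**64/phi and take
    # the HIGH bits, so upper bits of h also influence the slot choice.
    return ((h * 11400714819323198485) & (2**64 - 1)) >> (64 - bits)

# A raw hash whose low 8 bits are always zero: h = 256*k.
raw = [256 * k for k in range(16)]
mod_slots = {index_mod(h, 8) for h in raw}        # collapses to a single slot
fib_slots = {index_fibonacci(h, 3) for h in raw}  # spreads across slots
```

This illustrates why, as argued above, the reduction step cannot be judged separately from the raw hash function: the same raw values give a useless table index under one reduction and a usable one under another.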
From the two pages it seem like hash function and fingerprint are the same thing. Could someone explain the difference? Thanks, — sligocki ( talk) 18:15, 12 November 2009 (UTC)
I'm writing to ask that editors interested in hashing consider taking a look at MurmurHash. Recently, a particular editor with a track record of aggressive and hasty deletionism started cutting the article down and removing key references, turning it into a stub. He removed the entire summary paragraph, repeatedly tried to remove the algorithm, and has taken out almost all of the links to ports in various languages. This lowers the value of the article and I'm obviously very unhappy about it, but I don't see further discussion with him as likely to be productive. Instead, I'd like to get more people involved so that a genuine consensus can be formed. 208.80.104.2 ( talk) 13:53, 11 September 2009 (UTC)
I've inherited the article, which I've recreated under guidance from the admins who initially opposed it. It's back on the chopping block, so please chime in: Wikipedia:Articles for deletion/MurmurHash (2nd nomination). Phil Spectre ( talk) 00:35, 15 November 2009 (UTC)
The article does not mention universal hash functions which are used in applications such as Cuckoo Hashing. It would be nice if they did.
Dr Tom Conway ( talk) —Preceding comment was added at 09:30, 31 March 2008 (UTC)
I agree. But I'm biased being one of the authors of universal hash functions (Wegman), so I won't make the change myself. It's much more than Cuckoo Hashing though. Universal hash functions almost guarantee that for any data set the hash function will work as expected. That is to say the programmer doesn't have to worry about collisions. I would propose to replace this section:
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox).
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-square test.[3]
with something different.
Few Collisions
The odds of a pair of elements colliding under the hash function should be small. If x and y are different it's desirable that the probability of H(x) = H(y) be roughly 1 over the size of the range of the function. That is to say, if you are hashing into a table, the odds of two elements colliding should be no more than 1 over the number of slots in the table. The cost of lookup in the best implementations of tables is roughly the number of elements that collide.
Unfortunately hash functions are many to one mappings. The indexes to the tables we use are smaller than the strings. In a cryptography application the authentication tag is typically much smaller than the string being authenticated. So by definition there is some set of inputs that is all mapped to a single element and those inputs by definition collide. A good hash function can require someone to think carefully about the kinds of values being hashed to make sure that there won't be too many inputs that will all map to the same value under the chosen function.
An alternative to careful thought is universal_hashing. A universal hash function has one or more parameters, whose values are chosen randomly. A universal hash function has the property that for any set of inputs the vast majority of parameter values will distribute the set with few collisions. For example H_{a,b}(x) might equal ((ax+b) mod p) mod q, where p is a prime at least the size of the largest input, a and b are chosen less than p, and q is the number of elements in the table. There are many implementation tricks to make the calculation of such a function fast.
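The family described above can be sketched in a few lines; the prime and table size here are arbitrary choices for illustration, not recommendations:

```python
import random

def make_universal_hash(p: int, q: int, rng=random.Random(1)):
    """Draw one member H_{a,b}(x) = ((a*x + b) mod p) mod q from the family.

    p must be a prime at least as large as any key; q is the table size.
    This only sketches the construction described in the comment above.
    """
    a = rng.randrange(1, p)   # a and b are the randomly chosen parameters
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % q

h = make_universal_hash(p=2_147_483_647, q=64)   # 2**31 - 1 is prime
slots = [h(k) for k in range(1000)]              # hash 1000 consecutive keys
```

Once a and b are drawn, the member behaves like any fixed hash function; the universality guarantee is about the random draw, not about any single member.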
The current write up refers to perfect hashing, which can only be used once the entire set of possibly hashed values has been determined, and such functions tend to be too slow to compute for most examples. There is no reason to bother with a chi-square test when using universal functions; the odds of getting a high value on such a test are provably infinitesimal. —Preceding unsigned comment added by MarkWegman (talk • contribs) 23:00, 19 January 2010 (UTC)
In the Introduction it is written that a hash function is a mathematical function. Therefore I miss a proper mathematical definition of a hash function. For example something like:
A hash function is a polynomial-time algorithm H which maps a bit string of arbitrary length to a bit string of a certain length. In formula: H : {0,1}* → {0,1}^n.
Now you could distinguish certain kinds and properties of hash functions in a more formal way. For example like:
I read above that someone would prefer to put such properties in the cryptographic hash function article. But the book I am currently reading (Introduction to Modern Cryptography by Katz and Lindell) describes cryptographic hash functions as functions which fulfill certain properties, but defines these properties for ('noncryptographic') hash functions in general. Anyhow, at the moment these things aren't described in detail in either Hash_function or Cryptographic_hash_function.
-- Wohingenau ( talk) 13:54, 26 March 2010 (UTC)
The link at the bottom of the page to the 'Bernstein hash' page goes to http://fr.wikipedia.org/wiki/Table_de_hachage#Fonction_de_Hachage , which is probably undesirable on the English version of Wikipedia. I did a quick search for 'Bernstein hash' on the English Wikipedia and didn't find a specific page about the algorithm, otherwise I would have fixed the link myself. Either the link should be removed, or the content in the French wiki article should be copied over. 202.20.20.202 ( talk) 05:15, 24 May 2010 (UTC)
Is there a connection between hashes and Cutter numbers (Library Science)? The aim of Cutters, as far as I can tell, is to map points from a large sparse space (eg names of people) to a small space (eg 3 decimal digits). That sounds like hashing. Cutters have the additional property that order is preserved - that is, if name X is alphabetically before name Y, then the Cutter number of X comes before the Cutter number of Y.
An issue that Cutters make use of is that points in the name space are not uniformly distributed. A practical way to generate Cutters for people's names is to use a phone book. Find the page with the name, or where the name would be if it is missing, and use the page number. This takes account of the non-uniform frequency distribution of names.
138.251.194.28 ( talk) 14:43, 23 August 2010 (UTC)
Cutter codes are generated according to a set of rules (i.e., a hash function), although some human judgement is allowed to accommodate very common names; on the other hand, such judgement could be incorporated into the rules of one's Cutter algorithm. Cutter is not a hash function itself, but a kind of hash function, and different libraries may use different Cutter tables in generating their Cutter codes (hashes). Here's an example of a Cutter table: Example Cutter table
However, this Wikipedia page defines hash codes as being of "fixed length", which Cutter codes are not, yet gives no explanation why hash codes must be of fixed length.
How are any of these functions regarded as reliable? Take SHA-1, for example: with a message size of up to 2^64-1 bits, and a digest size of 160 bits, any given digest can have many tens of trillions of messages (roughly 2^57, assuming the algorithm yields high efficiency and even distribution). Compared to the fact that any given message yields exactly one of up to 2^160 digests, there's an incredible discrepancy (like, a billion orders of magnitude) and collisions seem inevitable under the circumstances of even very low-tax, common applications (those applications where message sizes are nowhere near the full possible length). I guess the question in short is: How long can messages be, before the risk of frequent collisions becomes a practical problem for the application? Something like message lengths of 2^8 bits appears to be the point at which collisions become absolutely guaranteed. For many cryptographic hash applications, it doesn't matter if you discover the actual message, if you discover any one of the many tens of trillions of messages which match the digest, you're ready to spoof. So, I am missing something. -- 76.102.243.117 ( talk) 22:25, 7 June 2010 (UTC)
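What the question misses is that the number of preimages per digest doesn't govern how easy collisions are to *find*; the digest width does, through the birthday bound. A rough calculation (the formula is the standard birthday approximation; the message counts are illustrative):

```python
import math

def collision_probability(m: int, n_bits: int) -> float:
    """Approximate probability that m random messages produce at least one
    collision in an n_bits-wide digest: 1 - exp(-m*(m-1) / 2N), N = 2**n_bits.
    """
    n = 2.0 ** n_bits
    return 1.0 - math.exp(-m * (m - 1) / (2.0 * n))

# For a 160-bit digest like SHA-1's, collisions only become likely near
# 2**80 messages, no matter how many preimages each digest has in theory.
p_million = collision_probability(10**6, 160)   # vanishingly small
p_2_80 = collision_probability(2**80, 160)      # roughly 0.39
```

So yes, collisions exist in astronomical numbers, but stumbling on one (or on any spoofing preimage) by brute force still requires on the order of 2^80 hash evaluations for a 160-bit digest, which is why such functions were regarded as reliable for ordinary message sizes. (Structural attacks on SHA-1 itself are a separate matter.)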
A hash function that has no collisions is what we call a lossless coding. Cuddlyable3 ( talk) 23:48, 26 November 2008 (UTC)
This page defines hash codes as being of "fixed length", but offers no explanation of why uniform length is an inherent and necessary trait of hash codes. There are hashing systems (such as Cutter codes) that produce (slightly) varied hash-code lengths, but that satisfy all the functional requirements of a hashing system. I would argue that fixed or uniform length of hash codes is a typical characteristic, but not a required characteristic of hash codes. Anyone know any reason to the contrary? — Preceding unsigned comment added by 76.121.65.17 ( talk) 17:28, 23 November 2012 (UTC)
Yes, Bloom filters use hash functions, but this section isn't helpful. What does "a compact data structure that provides an enclosing approximation to a set of them" even mean? I'd recommend saying something that makes sense here, or deleting the section.
Paul Kube ( talk) 22:54, 4 January 2013 (UTC)
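For what the sentence is presumably trying to say: a Bloom filter answers membership queries with possible false positives but no false negatives, so the set it represents "encloses" the real one. A minimal sketch (the sizes and the SHA-256-based hashing are my own choices, purely illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an 'enclosing approximation' of a set.

    might_contain() can wrongly say yes (false positive) but never
    wrongly says no, which is the property the section alludes to.
    """
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                      # the bit array, as a big int

    def _positions(self, item: str):
        # Derive k bit positions from k independent-ish hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for word in ["cat", "dog", "fish"]:
    bf.add(word)
```

Rewording the section along these lines ("a compact structure whose membership test never misses an added element but may report extras") would probably address the complaint.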
maybe we should mention that any Perfect Hash function cannot output any smaller data sets than the input? :p
if the hashing function is "perfect", the output cannot be smaller than the input... Divinity76 ( talk) 19:08, 19 July 2012 (UTC)
Perfect???: PCR-Hash to texts for Programm Codon Restrict IS rename failure by Polesinskij:
l:=(Length of bytes)
S:=(total plass of bites of all bytes)
U:=(total multiplicat for bites of all bytes)
with sqrt on n=N:
([S/l]=n to sqrt of U)=k
x:=([k(to float)]=n2 to sqrt of U)
total concatenat:l+k+x
second part with_2-sqrt:
beg {z:=l/2; int Y+1; while z>2;} end
beg {k:=2sqrt of U;Y-1; while Y>0;} end
x:=[k(to float)=n to sqrt of U]
total concatenat:l+k+x
— Preceding unsigned comment added by 95.153.189.81 ( talk) 19:48, 1 November 2013 (UTC)
The article uses singular, such as "the data is". — Preceding unsigned comment added by 193.11.93.250 ( talk) 09:10, 5 November 2013 (UTC)
Neither telephone numbers nor license plates have a uniform distribution over their number ranges/values. Therefore, using modulo arithmetic will not have the desired effect. 203.97.96.78 ( talk) 23:07, 27 August 2008 (UTC)
The article states that hash functions are the same thing as checksums, fingerprints, and cryptographic hash functions. Actually these concepts are related but distinct; they serve different purposes and have different requirements. Jorge Stolfi ( talk) 19:58, 23 February 2008 (UTC)
True! Anyone care to edit it? lol, Wikipedia is full of "this" these days. A hash is definitely an extension of the checksum idea, which is an extension of the parity bit idea. This is an emergent field though, so some people do not understand the history of the language; it waits for someone to take authority on the idea, but citations are preferred. DarkShroom ( talk) 10:16, 13 April 2014 (UTC)
Bret Mulvey's analysis referenced in the article shows a CRC-32 implementation where the polynomial used is the most-significant-bit (msb)-first representation of one of the standard CRC-32 implementations. However, his table initialization code is for a lsb-first implementation. (Note that all 256 of his table values will have the most significant 4 bits unset.) His CRC-calculation code is correct for checking the CRC-32 in a protocol where the CRC-32 is appended to the message such that the CRC-32 of the message with the appended CRC-32 is a constant. However, it appears that he didn't append 4 null bytes to his keys (as is effectively done in protocol implementations by zero-ing out the 4 bytes left for the CRC at the end of the message buffer). Note that the worked-through 3-bit CRC example in the Wikipedia article on CRCs includes padding with 3 zero bits.
These two flaws (improperly mixing lsb-first and msb-first implementations, and failing to pad with 32 zero bits) together account for his poor results.
Kmag ( talk) 05:48, 15 December 2013 (UTC)
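To make the mismatch concrete: a reflected (lsb-first) CRC-32 must pair the reflected polynomial constant 0xEDB88320 with right-shifting table and update code; feeding the msb-first constant 0x04C11DB7 into lsb-first table init is the kind of error described above. A self-consistent reflected sketch:

```python
def make_crc32_table():
    # lsb-first (reflected) table for polynomial 0xEDB88320, matching the
    # standard CRC-32 used by zlib, PNG, and Ethernet. Using the msb-first
    # constant here instead is the mixing error described above.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table

TABLE = make_crc32_table()

def crc32(data: bytes) -> int:
    # Reflected update: consume input low-bit-first via right shifts.
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

The standard check value crc32(b"123456789") == 0xCBF43926 is a quick way to verify that table, polynomial, and shift direction all agree, which is exactly the consistency the analysis in question appears to lack.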
Every once in a while, someone confuses functions with algorithms. Functions are mappings, algorithms are abstractions of computer programs. Functions are concepts, algorithms are implementations. So saying that "A hash function is any algorithm that maps ..." has things entirely backwards. Correct would be: "A hash function is a function that maps ...", and "A hash algorithm is an algorithm that implements some hash function". Boute ( talk) 13:45, 27 April 2014 (UTC)
...
Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
...
True, I brought up cryptographic hash functions because that's what I could get a solid definition for. Ignore the cryptography, I'm asking you to find a reliable source that supports your point of view. Wikipedia requires sources for contentious claims. If there's no source then Wikipedia shouldn't state this. -- intgr [talk] 16:34, 5 June 2014 (UTC)
This citation is referencing a Wikipedia book and has no added value — Preceding unsigned comment added by 131.159.218.64 ( talk) 13:06, 16 June 2015 (UTC)
in the case of Unicode characters, the table would have 17×2^16 = 1,114,112 entries.
As the first paragraph of Unicode notes, there are more than 2^16 different Unicode characters, so this sentence is wrong. For purposes of the article's example, it's sufficient (and sufficiently accurate) to change 2^16 to 2^32, reflecting the fixed character width used in UTF-32. (17× that = 73,014,444,032 entries) Enoent ( talk) 17:00, 5 July 2015 (UTC)
What about the history of hash functions? Admittedly hashing has only been a vital part of information security systems since the Computer Age, but there must be some history of the mathematical foundations. — Preceding unsigned comment added by 202.163.119.34 ( talk) 10:01, 11 October 2015 (UTC)
Do you think this website is helpful and should be added in the External links section? http://tools.timodenk.com/?p=hash-function
-- 109.192.64.215 ( talk) 13:56, 31 December 2015 (UTC)
They are not good (failing the Spectral_test with very few domains), but they are small and famous (fast LCG-hash) cores.
//[Linear congruential generator] hash, 4 domains in, 4 domains out.
#define HASHSCALE4 vec4(.1031,.1030,.0973,.1099)
vec4 hash44(vec4 p){p=fract(p*HASHSCALE4);//arbitrarily scaled seesaw modulo remainder, with bias to hide; failing a [Spectral test] in 2d.
p+=dot(p,p.wzxy+19.19);//dot product with domain-swiveled self
return fract(p.zywx*(p.xxyz+p.yzzw));//seesaw modulo remainder of linear polynomial with domain-swiveled selves.
}//
https://www.shadertoy.com/view/4djSRW
//[Linear congruential generator] hash, 4 domains in, 4 domains out; does mod(x,17*17). Apparently intended for small integer types; does not even bother to scale or swivel domains.
vec4 mod289(vec4 x){return x-floor(x*(1./289.))*289.;}//289=17*17;
Ollj ( talk) 07:34, 8 September 2017 (UTC)
The article says "A hash function is any function that can be used to map data of arbitrary size to data of fixed size.", then immediately retorts that checksums are not hash functions. However, checksums satisfy this definition. A perfunctory examination suggests that, possibly, some of the other concepts the article lists: "checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers", also fit the given definition of hash functions. — Preceding unsigned comment added by Ihearthonduras ( talk • contribs) 04:20, 21 March 2018 (UTC)
"Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room." - very interesting, can we have a source - william sharkey
Since nobody knows what this is, what is the point of the sentence?
Given that the previous so-called sentence runs on and on without saying anything, what "remaining characters"?
The opening sentence says: A hash function is any function that can be used to map data of arbitrary size to fixed-size values. And it's cited, though the source is popular rather than scholarly. The definition is a common one, though I'd say it's aped from one source to another. In many cases, the data is already a fixed size. What then? Suppose I have 40,000,000,000 64-bit values, and need to collect some modest set of them meeting some criteria into a table which I need to index quickly to retrieve any particular one. It's ok if the table has empty slots or is even sparse, but must be many orders of magnitude smaller than 2^64, or even 10^10. I don't need a hash function, because the definition says so: my data is already fixed length, all 64-bits.
On the other hand suppose I define a function which always returns 1, regardless of the value or length of the key. According to the definition, that's a hash function. Not only is the hash value fixed length (one bit), but also a fixed value. You may say that it's merely the case that we've constructed a bad (very bad) hash function. But the function is so bad, that it no longer conforms to ordinary notions of what a hash function is. Suppose again that I define an algebraic function, which always returns one. I didn't specify what the effective domain of the function is. I input the value 1, and it returns 1, the square root of 1. It's a square root function by definition, but again does not conform to our ordinary notion of such a function.
For their bountiful effusion, the editors have not clarified what the basic concepts of a hash function are, and the definition is symptomatic of the problem.
A hash function is a mapping. It does two things: first, it yields an index into the space of key values for each key, not necessarily a unique index; i.e. it distributes the key values in some fashion, the more random the better, over the key space.

The key values may all be sequential, like soldiers' dog tag numbers, allocated from some arbitrary starting number serially. If we find a dog tag, and want to look up the soldier's name, how do we find that dog tag in a list? It won't do to try linear or even binary search of the list - the search space is enormous, and the list isn't necessarily constructed in a serially increasing order by dog tag number: someone at the end of recruiting took all the sign-up sheets in a messy stack and entered the tag numbers line by line into the computer. Now we need to construct a referenceable table. We can use the dog tag number itself, so that dog tag number 1,000,000 (the nth dog tag in the list) goes into slot number 1,000,000 in the table. So if a medic finds dog tag number 1,000,000 on the battlefield, he can index slot 1,000,000 in the table, and find that the soldier's name is Joe Blow.

The problem is, there are billions, possibly trillions or more, possible dog tag numbers, and mapping them onto the computer system's disks either takes a lot of disks or is infeasible. So, second: a hash function maps a large state space of key indices into a manageable state space, such that collisions are minimized. Key indices which were distinguishable in the original key space may not be distinguishable in the reduced key space, i.e. there's a collision, and one of the dog tag indices will have to be mapped somewhere else in such a way that it is findable.
If key values were truly random, then we could take any n bits of the key and use it as an index into a table of size 2^n. The "hash" function could be just about anything that selects n bits. But the keys are not random, they're clustered serially, one cluster for each recruiting station. If the clusters are not broken up, when we attempt to map the indices into a reduced table, there will be many collisions, making a reduced table with little more order than a messy stack of papers. So the hash function cannot be just 'anything' - it has to break up the clusters and distribute the key indices over the key space, then map the indices into a practical size table. When we combine both processes into a single operation, like modulo, we confuse ourselves about exactly what happened. (Modulo, BTW, does NOT break up serial clusters of key values, so is a poor choice for hashing dog tag numbers, or any application where serial clustering may occur).
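The cluster effect described above is easy to reproduce. The following sketch (my own illustration, not from the comment) hashes two serial runs of dog-tag-like numbers into a 64-slot table, first by plain modulo, then with a multiplicative (Fibonacci-style) hash whose constant 2654435761 is a commonly used odd approximation of 2^32 divided by the golden ratio:

```python
TABLE_BITS = 6                        # a 2^6 = 64-slot table
TABLE_SIZE = 1 << TABLE_BITS

def bucket_mod(key):
    """Plain modulo reduction: leaves serial clusters intact."""
    return key % TABLE_SIZE

def bucket_fib(key):
    """Multiplicative (Fibonacci) hashing: scramble, then take high bits."""
    h = (key * 2654435761) & 0xFFFFFFFF   # 2654435761 ~ 2^32 / golden ratio
    return h >> (32 - TABLE_BITS)

# Two "recruiting stations" issuing serial dog-tag numbers.
keys = list(range(1_000_000, 1_000_032)) + list(range(5_000_000, 5_000_032))

mod_buckets = {bucket_mod(k) for k in keys}
fib_buckets = {bucket_fib(k) for k in keys}
print(f"modulo uses {len(mod_buckets)} slots, multiplicative uses {len(fib_buckets)}")
```

Both serial runs happen to start at multiples of 64, so plain modulo piles all 64 keys onto the same 32 slots, while the multiplicative step breaks the clusters up first and spreads the keys over most of the table.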
All the rambling and roaming of the article fails to tell us about the one, and then about the other. And it doesn't have any information on collision resolution, or performance of the algorithms. Nor how external hashing, i.e. across physical interfaces like disks, drums, tapes, etc, is handled efficiently.
We need to start again, and we need to start at the top with a notional definition, cited or not. Sbalfour ( talk) 16:54, 20 September 2019 (UTC)
This article is not for these. They have their own articles. I propose deleting them forthwith, since there's already a note at the bottom of the lead redirecting readers to them. Sbalfour ( talk) 18:19, 20 September 2019 (UTC)
I'm going to be BOLD and just do it. Sbalfour ( talk) 18:50, 20 September 2019 (UTC)
I have no idea what the given formula means: everything is undefined, including M, W, m, w, a, x, y, and ha (how's that different from h?). I'm pretty sure it was copied from somewhere, and not the source stated, which uses different notation. So what's the real source? If we knew that, we'd probably have a copyvio. Sbalfour ( talk) 20:05, 21 September 2019 (UTC)
This article says much, but conveys little. It's like a survey with a few "how-to" tidbits thrown in, and little or no analysis of how these functions perform. For example, the three sections: Hashing uniformly distributed data; Hashing data with other distributions; Special-purpose hash functions; essentially say the same thing, that if we know something about the distribution or structure of the data, we can construct a better hash function. How we do that depends on what we know, and that's simply too wide an endeavor to detail here. Sbalfour ( talk) 15:42, 20 September 2019 (UTC)
I've fixed a lot of this with focus on two things: hashing integers, and hashing variable length data. Sbalfour ( talk) 18:58, 22 September 2019 (UTC)
There's nothing in the article about deletions, performance of the algorithms versus data, or analysis (i.e. theoretical performance, best case, average case, worst case, etc). The article is critically anemic. Sbalfour ( talk) 19:00, 22 September 2019 (UTC)
A hash function takes as input a key, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed length, like an integer, or variable length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them.
A hash function may be considered to perform three functions:
A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize collisions. Hash functions rely on favorable probability distributions for their effectiveness, reducing access time to nearly constant. High table loading factors, pathological key sets and poorly designed hash functions can result in access times approaching linear in the number of items in the table. Hash functions can be designed to give best worst-case performance, [Notes 1] good performance under high table loading factors, and in special cases, perfect (collisionless) mapping of keys into hash codes. Implementation is based on parity-preserving bit operations (XOR and ADD), multiply or divide. Implementations include an adjunct collision-resolution method. Sbalfour ( talk) 00:44, 23 September 2019 (UTC)
The text says this: Typically, ... a hash function ... will map several different keys to the same index which could result in collisions. Not could result in a collision, but WILL result in a collision. And no, a hash function will NOT typically map several keys to the same index - it will typically map only ONE key to an index. If there are more than a few collisions, we'll usually choose another hash function. The whole section needs to be cleaned up for conceptual presentation as well as technical diction. Sbalfour ( talk) 19:17, 20 September 2019 (UTC)
Then it says: Hash table implementations include a specific hash function—such as a Jenkins hash or Zobrist hashing—and a hash-table collision resolution scheme—such as coalesced hashing, cuckoo hashing, or hopscotch hashing. This is just mongering, or I don't know what. These are highly specialized hash functions and techniques. The most common hash function is modulo based, even if hashing strings; the most common hash table structure is open addressing; and the most common collision resolution technique, linear probing. These are the simplest, and usually quite good enough; that's why. I think we ought to say these first, and relegate the others to a footnote, if we mention them at all. Sbalfour ( talk) 21:30, 20 September 2019 (UTC)
I think the editor of that section had a singular and rather dogmatic idea of what hash functions and tables actually look like in practice. Sbalfour ( talk) 18:40, 23 September 2019 (UTC)
I propose this:
Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or it can be added to the table in some other location by a specified procedure. That procedure depends on the structure of the hash table: In chained hashing, each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain. Chains may be kept in random order and searched linearly, or in serial order, or as a self-ordering list by frequency to speed up access. In open address hashing, the table is probed starting from the occupied slot in a specified manner, usually by linear probing, quadratic probing, or double hashing until an open slot is located or the entire table is probed (overflow). Searching for the item follows the same procedure until the item is located, an open slot is found or the entire table has been searched (item not in table).
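The open-addressing procedure proposed above can be sketched as a minimal table with linear probing (a rough illustration under my own naming, not text proposed for the article):

```python
# Minimal open-addressing hash table with linear probing, matching the
# insert/search procedure described above. Names and sizes are illustrative.
class ProbingTable:
    def __init__(self, size=8):
        self.slots = [None] * size          # each slot: None or (key, value)
        self.size = size

    def _index(self, key):
        return hash(key) % self.size        # int keys hash deterministically

    def put(self, key, value):
        i = self._index(key)
        for _ in range(self.size):          # probe at most the whole table
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.size         # linear probe: try the next slot
        raise OverflowError("entire table probed: table full")

    def get(self, key):
        i = self._index(key)
        for _ in range(self.size):
            if self.slots[i] is None:
                return None                 # open slot found: item not in table
            if self.slots[i][0] == key:
                return self.slots[i][1]     # item located
            i = (i + 1) % self.size
        return None                         # entire table searched

t = ProbingTable()
t.put(1_000_000, "Joe Blow")
t.put(1_000_008, "Jane Doe")   # collides with 1_000_000 in an 8-slot table
print(t.get(1_000_000), t.get(1_000_008), t.get(42))
```

The second insertion collides at slot 0 and is placed in slot 1 by the probe; lookup follows the same probe sequence, so both records remain findable.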
I've deleted numerous examples from the text for three reasons: programming language code is copyrighted unless the program itself is in the public domain; the examples were uncited and almost certainly original art that would otherwise be copyrightable to the creator; and in the absence of all else, the encyclopedia is not a how-to manual: actual compilable code and elaborate pseudo-code, except when plainly necessary to clarify obscure text, is for programming textbooks, not the encyclopedia. Clearly, we can "invent" self-evident examples such as "1+1=2 is an example of integer arithmetic addition". The best course is to accurately and succinctly describe the algorithm, ... an exercise in technical diction, which is exactly what we want in the encyclopedia. Examples must not be original art, copyrighted, or uncited. Sbalfour ( talk) 15:04, 24 September 2019 (UTC)
What good does it do to speak of Hash function without Hash table? And vise-versa. It's inane. There's no use for a hash function without a hash table (I'm not talking about crypto, checksums, etc, which have their own articles) or a hash table without a hash function. I propose combining both articles into one, and title it Hashing. It's almost a direct insert - no need to merge sections, paragraphs, etc. Sbalfour ( talk) 17:26, 24 September 2019 (UTC)
Article says (obtusely): "Use of hash functions relies on statistical properties of key and function interaction: worst-case behaviour is intolerably bad with a vanishingly small probability, and average-case behaviour can be nearly optimal (minimal collision).[1]" ( FairNPOV ( talk) 17:33, 11 January 2022 (UTC))
Many uses of randomization functions, especially those which use them under that name, have no "period" requirement/concern. In fact, periodicity concerns might inappropriately narrow the space of possible functions. Other uses (such as pseudorandom number generation) might call for functions satisfying the requirements of a "randomization function" as well as other, stronger requirements. It would seem to me appropriate to disambiguate such classes of requirements in an article. Here's one thing i believe is a useful observation: a random function is an idealization of a hash function; a random permutation is an idealization of a randomization function. -- pfunk42 13:08, 2 January 2006 (UTC)
I admit perhaps the original intro is too technical. Hopefully my edit is more to the level of a beginner programmer. -- MegaHasher 03:44, 25 January 2006 (UTC)
I appreciate your effort, but even so the opening sentence basically says "a hash function is something used to create a hash value". I got to this page by searching for "hash value", which redirected me here. In other words, I still don't know what "hash" means, except with respect to meat dishes. I'm going to attempt to redress that. ShawnVW 18:52, 2 March 2006 (UTC)
I think the intro is good (concise, simple, accurate, and to the point). There is one small inaccuracy, I believe. The input to the hash function is not necessarily variable length. Here are a few definitions (from a Google search of "define hash function") that might be worth consideration as part of the intro:
Web definitions
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. ... en.wikipedia.org/wiki/Hash_function
(Hash functions) take a large amount of data and create a small fingerprint based on that data.
www.informit.com/guides/content.aspx
I like aspects of the first one, because it includes "possibly," and I like aspects of the second one because it is easy to picture and understand if someone is not familiar with this type of information. Maybe the best of both could be incorporated?
--Mark Nadon 8/13/2013
It seems the intro has become longer than the main text. Clean-up is again needed. - MegaHasher 23:50, 5 April 2006 (UTC)
The Intro is incorrect. A well designed hash function is not necessarily one-way nor unforgeable. A well designed hash function only needs to distribute input keys well across all buckets. You might be able to prove that a hash function that does a good job of spreading values across buckets is also one-way or non-forgeable, but those are not initial design constraints. There are "niche" applications in which one-way or non-forgeable attributes are important, but it is not a design constraint for hash functions in general. Chuck Simmons 20:06, 14 April 2006 (UTC)
Say I have a hash (I know the hashing function and it has no salt), and I know my cleartext is between 12 and 18 bytes long and contains only ascii a-z \. 0-9 Would it be possible to write a function to generate all the possible cleartexts that fall under this particular category with this hash? Doodle77 03:24, 28 March 2006 (UTC)
-- pfunk42 12:36, 23 April 2006 (UTC)
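In principle the answer to the question above is yes: one can enumerate every string over the allowed alphabet and keep those that hash to the target. The catch is that 38 characters at lengths 12 to 18 is on the order of 38^18 candidates, which is computationally infeasible. A toy-sized sketch of the idea, with a tiny made-up alphabet and short lengths so it finishes, and MD5 standing in for the unspecified hash:

```python
# Brute-force preimage enumeration, shrunk to a toy search space.
# The real question's space (38 chars, lengths 12-18) is far too large.
import hashlib
from itertools import product

alphabet = "ab1"                      # toy stand-in for a-z, backslash, dot, 0-9
target = hashlib.md5(b"ba1").hexdigest()

def preimages(target_digest, min_len, max_len):
    """Yield every string over the alphabet whose hash matches the target."""
    for length in range(min_len, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if hashlib.md5(candidate.encode()).hexdigest() == target_digest:
                yield candidate

print(list(preimages(target, 1, 3)))  # recovers the toy cleartext: ['ba1']
```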
"Contrary to an assumption often voiced by people who don't follow the industry carefully,"
That seems a bit too opinionated to be in accord with WP:NPOV and WP:AWW, all the more so that it's uncited. I made the statement a bit more neutral and verifiable, but I still ask that a source be given for this claim, or otherwise it should be removed entirely. The claim regarding the noisy room is also uncited, so I'm requesting a source for that as well. -- Ultimus 17:07, 1 July 2006 (UTC)
Two issues are in this section:
Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences.
This sentence is way too vague. Could someone please elucidate which algorithms and give some perspective on how they work? I assume that one can bring up a more broad topic that is "algorithms for analog signals identification", too, but I really have no knowledge about the topic. Poli 10:38, 12 Jun 2005 (UTC)
A Google search led me to
I suggest we move most of this information on audio identification to the acoustic fingerprint article. -- 75.37.227.177 14:55, 18 July 2007 (UTC)
Musicbrainz indeed is an open source system as the article describes; however Musicbrainz does not do audio hashing. It depends on a commercial system to do that— Trevor Caira 11:09, 29 December 2006 (UTC)
How do we know it isn't using human recognition [2]? -- 75.37.227.177 14:55, 18 July 2007 (UTC)
The article claims that a merge with One-way security is being considered. As far as I can tell there is no discussion of that here and no proposed reason why the merge is a good idea. I don't see why that merge makes sense. So, I'm going to remove the merge tags. Feel free to add them back and explain why you think it would be a good idea. -- SamHartman 21:22, 3 October 2007 (UTC)
If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).
This seems awkward and opaque to me...
The only semantic difference I can see in this text seems to suggest that for "strongly" collision free hash functions, even if message x = y, H(x) != H(y)! Which doesn't sound much like a function to me...
Looking around, I found (at http://www.cs.berkeley.edu/~jordan/courses/174-spring02/notes/note14.pdf) the following:
2. A hash function h(x) is said to be weakly collision-free if given a message x1 it is hard to find another message x2 such that h(x1) = h(x2). 3. A hash function h(x) is said to be strongly collision-free if it is hard to find any pair of messages x1, x2 such that h(x1) = h(x2).
This is still not too clear to me.
Weakly collision free seems to me to mean: Given any specific message x, it is computationally infeasible to find a message y such that H(x) = H(y), but if you wish to, you can craft two messages together (x and y), such that H(x) = H(y).
That is, if you're NOT a hacker trying to create a forged message (which is hard), but rather a social engineer trying to create TWO messages which can be interchanged to create plausible deniability (which is easier), you can do it with a Weakly collision free hashing function.
Strongly collision free seems to me to mean: Even though in a large domain of messages there will clearly be a large number of collisions, It is computationally infeasible to find ANY two messages x and y such that x != y and H(x) = H(y), even if you are trying to craft the two together.
Sigh... I'm still not sure how to phrase this is a way which is clear to the casual reader, even assuming that I have correctly interpreted this. Anyone else care to give it a try?
I agree with Nahaj. However, I think the distinction between "weakly collision-free hash function" vs. "strongly collision-free hash function" only matters when we are talking about cryptographic hash functions -- right? So I see no need to even mention that distinction in the hash function article -- please delegate it to the cryptographic hash function article.
I agree this is only an issue for cryptographic hashing. There are similar issues that come up in adversarial hashing. For example if someone can find a large group of messages that all fall into the same hash bucket they can make your performance suffer. However I think that collisions resistance is enough to make something a cryptographic hash function. SamHartman 21:33, 3 October 2007 (UTC)
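The weak/strong distinction in the thread above becomes concrete with a birthday search. On a hash truncated to 24 bits (a deliberately weak stand-in, my own illustration), finding any colliding pair takes only on the order of 2^12 attempts, whereas finding a second preimage of one fixed value would take on the order of 2^24:

```python
# Birthday search for a "strong" collision on a deliberately truncated hash.
import hashlib

def h24(msg: bytes) -> bytes:
    """SHA-256 truncated to 24 bits: an intentionally weak illustrative hash."""
    return hashlib.sha256(msg).digest()[:3]

seen = {}
x = y = None
i = 0
while y is None:
    msg = str(i).encode()
    tag = h24(msg)
    if tag in seen:
        x, y = seen[tag], msg        # any colliding pair will do
    else:
        seen[tag] = msg
        i += 1

print(f"collision after about {len(seen)} hashes: {x!r} vs {y!r}")
```

The pigeonhole argument guarantees the loop terminates; the birthday paradox is what makes it terminate after only a few thousand hashes rather than millions.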
I have restored the qualification "usually a single integer that may serve as an index into an array" for the result of a hash function. To my knowledge, this property is what distinguishes hash functions (in the proper sense of the term) from other related but distinct concepts such as checksums, crypto hash functions, fingerprints, etc.. For instance, a function that returns a 128-bit value cannot be called a hash function.
Perhaps some people would rather define "hash function" as a generic term that covers all those concepts. But then we would be left without a name for the original sense ('a function that is suitable for "hashing" scattered data into a table'). All the best, -- Jorge Stolfi ( talk) 18:41, 21 November 2008 (UTC)
This article seems to have some overlap in function and content with Hash_table#Choosing_a_good_hash_function. Also, the latter characterizes the most important test for a hash function as being the avalanche condition, which seems a bit dubious to me - could an expert take a look at this? Dcoetzee 10:49, 24 November 2008 (UTC)
This article had an illustration that was actually showing a checksum or fingerprint function, rather than a hash function proper (see at right). The illustration has been copied to the checksum and fingerprint (computing) articles. Would someone provide an illustration for hashing in the proper sense? (I don't have any SVG editor at hand). Thanks, -- Jorge Stolfi ( talk) 19:06, 21 November 2008 (UTC)
Quote: "See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself." It would be informative to know who on earth this Jenkins is, before he is alluded to as 'the big man himself'. :) -- Egregius 23:11, 29 Mar 2005 (UTC)
The section Geometric hashing looks like a description of Vector quantization. Vector quantisation collects groups (not just pairs) of data points that are close together (in typically 2- or more dimension space) so they can be represented with small errors by a single vector e.g. the centroid of the data group. Lossy compression is achieved by describing the data set using the reduced set of vectors. An example application is image compression from 24-bit RGB color to an optimum indexed 256-colour palette. In contrast, hashing seeks to group data in its own space that is constructed with the intention that data points that are actually close should not be kept together. Cuddlyable3 ( talk) 11:22, 19 December 2008 (UTC)
The example code given for a trivial hash function doesn't seem to be the best way to explain this; whilst it is a completely valid section of code, it is probably a bit obscure to anyone who doesn't know C and/or C++. An example in some form of pseudocode might make more sense? The parts that may be too obscure to someone unfamiliar with the language are "char*" declaring a pointer to one character within a string of text, the conditional "if(g=h&0xf0000000)" both changing the value of g and checking this value, and "*p!='\0'" catching the end of the string of text. -- Elder pegasus ( talk) 22:32, 6 March 2009 (UTC)
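The fragments quoted above appear to come from the classic PJW string hash; assuming that is indeed the source, here is a pseudocode-style rendition in Python with the three obscure parts spelled out:

```python
def pjw_hash(s: str) -> int:
    """PJW-style string hash (assuming the C fragments come from PJW)."""
    h = 0
    for ch in s:                      # walk to the end of the string,
        h = ((h << 4) + ord(ch)) & 0xFFFFFFFF   # as *p != '\0' does in C
        g = h & 0xF0000000            # take the top nibble; the C version
        if g:                         # assigns and tests g in one expression
            h ^= g >> 24              # fold the top nibble back in
            h &= ~g & 0xFFFFFFFF      # and clear it
    return h

print(hex(pjw_hash("hash")))          # prints 0x6e898
```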
There seems to be some confusion about the "modulo the table size" bit. A hash function used for a hash table with N buckets should return an integer in 0..N-1. That is, the "mod" operation, if it is done at all, is technically inside the hash function, not outside it. The whole discussion about what makes a good hash function assumes this view. As said further on in the article, one way to get a hash function for a given size N is to take a hash function h with a much larger range (such as 2^32) and then compute h(k) mod N. But that is not the only way, and it may easily give a very poor hash function. If N is a power of two, the mod-N trick gives the low-order bits of h(k); which, for some popular hash functions, are known to be the worst ones. There are many other ways of reducing a large hash value to a smaller one, but they cannot be analyzed or recommended separately from the raw hash function. This is another reason to consider the mod-N operation a part of the hash function, and to avoid mentioning it outside of that section. All the best, -- Jorge Stolfi ( talk) 01:57, 18 April 2009 (UTC)
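The low-order-bits point is easy to demonstrate. In this sketch (my own, with a deliberately bad made-up raw hash) the raw hash always has zero low bits, so reducing it modulo a power-of-two table size collapses everything onto two slots, while a multiply-then-take-high-bits reduction does not:

```python
# Why "mod N" cannot be judged separately from the raw hash: for a
# power-of-two N, h(k) mod N keeps only the low-order bits of h(k).
def weak_low_bits(k):
    return (k * 8) & 0xFFFFFFFF      # low 3 bits of the output are always zero

N = 16                               # power-of-two table size
used = {weak_low_bits(k) % N for k in range(1000)}
print(sorted(used))                  # the table only ever uses slots 0 and 8

# Scrambling and taking high bits instead spreads the same raw values.
used_high = {((weak_low_bits(k) * 2654435761) & 0xFFFFFFFF) >> 28 for k in range(1000)}
print(len(used_high))
```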
From the two pages it seem like hash function and fingerprint are the same thing. Could someone explain the difference? Thanks, — sligocki ( talk) 18:15, 12 November 2009 (UTC)
I'm writing to ask that editors interested in hashing consider taking a look at MurmurHash. Recently, a particular editor with a track record of aggressive and hasty deletionism started cutting the article down and removing key references, turning it into a stub. He removed the entire summary paragraph, repeatedly tried to remove the algorithm, and has taken out almost all of the links to ports in various languages. This lowers the value of the article and I'm obviously very unhappy about it, but I don't see further discussion with him as likely to be productive. Instead, I'd like to get more people involved so that a genuine consensus can be formed. 208.80.104.2 ( talk) 13:53, 11 September 2009 (UTC)
I've inherited the article, which I've recreated under guidance from the admins who initially opposed it. It's back on the chopping block, so please chime in: Wikipedia:Articles for deletion/MurmurHash (2nd nomination). Phil Spectre ( talk) 00:35, 15 November 2009 (UTC)
The article does not mention universal hash functions, which are used in applications such as Cuckoo Hashing. It would be nice if it did.
Dr Tom Conway ( talk) —Preceding comment was added at 09:30, 31 March 2008 (UTC)
I agree. But I'm biased being one of the authors of Universal Hash Functions, Wegman, so I won't make the change myself. It's much more than Cuckoo Hashing though. Universal Hash Functions almost guarantee that for any data set the hash function will work as expected. That is to say the programmer doesn't have to worry about collisions. I would propose to replace this section:
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox).
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-square test.[3]
with something different.
Few Collisions
The odds of a pair of elements colliding under the hash function should be small. If x and y are different, it's desirable that the probability of H(x) = H(y) be roughly one over the size of the range of the function. That is to say, if you are hashing into a table, the odds of two elements colliding should be no more than one over the number of slots in the table. The cost of lookup in the best implementations of tables is roughly the number of elements that collide.
Unfortunately, hash functions are many-to-one mappings. The indexes to the tables we use are smaller than the strings. In a cryptography application the authentication tag is typically much smaller than the string being authenticated. So by definition there is some set of inputs that is all mapped to a single element, and those inputs by definition collide. A good hash function can require someone to think carefully about the kinds of values being hashed to make sure that there won't be too many inputs that all map to the same value under the chosen function.
An alternative to careful thought is universal hashing. A universal hash function has one or more parameters, whose values are chosen randomly. A universal hash function has the property that for any set of inputs, the vast majority of parameter values will distribute the set with few collisions. For example, H_{a,b}(x) might equal ((ax + b) mod p) mod q, where p is a prime at least the size of the largest input, a and b are chosen at random less than p, and q is the number of slots in the table. There are many implementation tricks to make the calculation of such a function fast.
The current write-up refers to perfect hashing, which can only be used once the entire set of possibly hashed values has been determined, and such functions tend to be too slow to compute for most examples. There is no reason to bother with a chi-square test when using universal functions: the odds of getting a high value on such a test are provably infinitesimal. —Preceding unsigned comment added by MarkWegman (talk • contribs) 23:00, 19 January 2010 (UTC)
In the Introduction it is written that a hash function is a mathematical function, so I was expecting a proper mathematical definition of a hash function. For example something like:
A hash function is a polynomial-time algorithm H which maps a bit string of arbitrary length to a bit string of a certain fixed length n. In formula: H : {0,1}* → {0,1}^n.
Now you could distinguish certain kinds and properties of hash functions in a more formal way.
I read above that someone would prefer to put such properties in the cryptographic hash function article. But the book I am actually reading (Introduction to Modern Cryptography by Katz and Lindell) describes cryptographic hash functions as functions which fulfill certain properties, but defines these properties for ('non-cryptographic') hash functions in general. Anyhow, at the moment these things aren't described in detail in either Hash_function or Cryptographic_hash_function.
-- Wohingenau ( talk) 13:54, 26 March 2010 (UTC)
The link at the bottom of the page to the 'Bernstein hash' page goes to http://fr.wikipedia.org/wiki/Table_de_hachage#Fonction_de_Hachage , which is probably undesirable on the English version of Wikipedia. I did a quick search for 'Bernstein hash' on the English Wikipedia and didn't find a specific page about the algorithm, otherwise I would have fixed the link myself. Either the link should be removed, or the content in the French wiki article should be copied over. 202.20.20.202 ( talk) 05:15, 24 May 2010 (UTC)
Is there a connection between hashes and Cutter numbers (Library Science)? The aim of Cutters, as far as I can tell, is to map points from a large sparse space (eg names of people) to a small space (eg 3 decimal digits). That sounds like hashing. Cutters have the additional property that order is preserved - that is, if name X is alphabetically before name Y, then the Cutter number of X comes before the Cutter number of Y.
An issue that Cutters make use of is that points in the name space are not uniformly distributed. A practical way to generate Cutters for people's names is to use a phone book. Find the page with the name, or where the name would be if it is missing, and use the page number. This takes account of the non-uniform frequency distribution of names.
138.251.194.28 ( talk) 14:43, 23 August 2010 (UTC)
Cutter codes are generated according to a set of rules (i.e., a hash function), although some human judgement is allowed to accommodate very common names; on the other hand, such judgement could be incorporated into the rules of one's Cutter algorithm. Cutter is not a hash function itself, but a kind of hash function, and different libraries may use different Cutter tables in generating their Cutter codes (hashes). Here's an example of a Cutter table: Example Cutter table
However, this Wikipedia page defines hash codes as being of "fixed length", which Cutter codes are not, yet gives no explanation why hash codes must be of fixed length.
How are any of these functions regarded as reliable? Take SHA-1, for example: with a message size of up to 2^64-1 bits and a digest size of 160 bits, any given digest can have an astronomical number of preimages (on average 2^(L-160) for messages of length L bits, assuming the algorithm yields an even distribution). Compared to the fact that any given message yields exactly one of up to 2^160 digests, there's an incredible discrepancy, and collisions seem inevitable under the circumstances of even very low-tax, common applications (those applications where message sizes are nowhere near the full possible length). I guess the question in short is: How long can messages be, before the risk of frequent collisions becomes a practical problem for the application? Message lengths just over 160 bits (certainly by 2^8 bits) are already the point at which collisions become absolutely guaranteed by pigeonhole. For many cryptographic hash applications, it doesn't matter if you discover the actual message; if you discover any one of the many preimages which match the digest, you're ready to spoof. So, I am missing something. -- 76.102.243.117 ( talk) 22:25, 7 June 2010 (UTC)
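The pigeonhole counting in the question above can be made concrete (my own arithmetic, assuming an idealised, perfectly uniform 160-bit digest):

```python
# There are 2**L messages of exactly L bits; spread evenly over the
# 2**160 possible digests, each digest averages 2**(L - 160) preimages.
def avg_preimages(msg_bits, digest_bits=160):
    return 2 ** msg_bits // 2 ** digest_bits

# Collisions are forced by pigeonhole as soon as msg_bits > digest_bits,
# far below the 2**64 - 1 bit maximum message length.
```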
A hash function that has no collisions is what we call a lossless coding. Cuddlyable3 ( talk) 23:48, 26 November 2008 (UTC)
This page defines hash codes as being of "fixed length", but offers no explanation of why uniform length is an inherent and necessary trait of hash codes. There are hashing systems (such as Cutter codes) that produce (slightly) varied hash-code lengths, but that satisfy all the functional requirements of a hashing system. I would argue that fixed or uniform length of hash codes is a typical characteristic, but not a required characteristic of hash codes. Anyone know any reason to the contrary? — Preceding unsigned comment added by 76.121.65.17 ( talk) 17:28, 23 November 2012 (UTC)
Yes, Bloom filters use hash functions, but this section isn't helpful. What does "a compact data structure that provides an enclosing approximation to a set of them" even mean? I'd recommend saying something that makes sense here, or deleting the section.
Paul Kube ( talk) 22:54, 4 January 2013 (UTC)
maybe we should mention that any perfect hash function cannot output a smaller data set than its input? :p
if the hashing function is "perfect", the output cannot be smaller than the input... Divinity76 ( talk) 19:08, 19 July 2012 (UTC)
Perfect???: PCR-Hash to texts for Programm Codon Restrict IS rename failure by Polesinskij:
l:=(Length of bytes)
S:=(total plass of bites of all bytes)
U:=(total multiplicat for bites of all bytes)
with sqrt on n=N:
([S/l]=n to sqrt of U)=k
x:=([k(to float)]=n2 to sqrt of U)
total concatenat:l+k+x
second part with_2-sqrt:
beg {z:=l/2; int Y+1; while z>2;} end
beg {k:=2sqrt of U;Y-1; while Y>0;} end
x:=[k(to float)=n to sqrt of U]
total concatenat:l+k+x
— Preceding unsigned comment added by 95.153.189.81 ( talk) 19:48, 1 November 2013 (UTC)
The article uses singular, such as "the data is". — Preceding unsigned comment added by 193.11.93.250 ( talk) 09:10, 5 November 2013 (UTC)
Neither telephone numbers nor license plates have a uniform distribution over their number ranges/values. Therefore, using modulo arithmetic will not have the desired effect. 203.97.96.78 ( talk) 23:07, 27 August 2008 (UTC)
The article states that hash functions are the same thing as checksums, fingerprints, and cryptographic hash functions. Actually these concepts are related but distinct; they serve different purposes and have different requirements. Jorge Stolfi ( talk) 19:58, 23 February 2008 (UTC)
True! Anyone care to edit it? lol, Wikipedia is full of "this" these days. A hash is definitely an extension of the checksum idea, which is an extension of the parity bit idea. This is an emergent field, though, so some people do not understand the history of the language; it waits for someone to take authority on the idea, but citations are preferred DarkShroom ( talk) 10:16, 13 April 2014 (UTC)
Bret Mulvey's analysis referenced in the article shows a CRC-32 implementation where the polynomial used is the most-significant-bit (msb)-first representation of one of the standard CRC-32 implementations. However, his table initialization code is for a lsb-first implementation. (Note that all 256 of his table values will have the most significant 4 bits unset.) His CRC-calculation code is correct for checking the CRC-32 in a protocol where the CRC-32 is appended to the message such that the CRC-32 of the message with the appended CRC-32 is a constant. However, it appears that he didn't append 4 null bytes to his keys (as is effectively done in protocol implementations by zero-ing out the 4 bytes left for the CRC at the end of the message buffer). Note that the worked-through 3-bit CRC example in the Wikipedia article on CRCs includes padding with 3 zero bits.
These two flaws (improperly mixing lsb-first and msb-first implementations, and failing to pad with 32 zero bits) together account for his poor results.
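For reference, a consistently lsb-first (reflected) table-driven CRC-32 looks like this: a sketch using the 0xEDB88320 polynomial as in zlib/PNG. In this standard table-driven form, no explicit zero-byte padding step is needed, because the formulation folds the augmentation in:

```python
import zlib

# Consistently lsb-first (reflected) table-driven CRC-32,
# polynomial 0xEDB88320, with the usual init/xor-out of 0xFFFFFFFF.
def make_table():
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

TABLE = make_table()

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

This matches zlib's crc32 and produces the standard check value 0xCBF43926 for the input "123456789", which is a quick way to confirm an implementation hasn't mixed msb-first and lsb-first conventions.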
Kmag ( talk) 05:48, 15 December 2013 (UTC)
Every once in a while, someone confuses functions with algorithms. Functions are mappings, algorithms are abstractions of computer programs. Functions are concepts, algorithms are implementations. So saying that "A hash function is any algorithm that maps ..." has things entirely backwards. Correct would be: "A hash function is a function that maps ...", and "A hash algorithm is an algorithm that implements some hash function". Boute ( talk) 13:45, 27 April 2014 (UTC)
...
Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
...
True, I brought up cryptographic hash functions because that's what I could get a solid definition for. Ignore the cryptography, I'm asking you to find a reliable source that supports your point of view. Wikipedia requires sources for contentious claims. If there's no source then Wikipedia shouldn't state this. -- intgr [talk] 16:34, 5 June 2014 (UTC)
This citation is referencing a Wikipedia book and has no added value — Preceding unsigned comment added by 131.159.218.64 ( talk) 13:06, 16 June 2015 (UTC)
in the case of Unicode characters, the table would have 17×2^16 = 1,114,112 entries.
As the first paragraph of Unicode notes, there are more than 2^16 different Unicode characters, so this sentence is wrong. For purposes of the article's example, it's sufficient (and sufficiently accurate) to change 2^16 to 2^32, reflecting the fixed character width used in UTF-32. (17× that = 73,014,444,032 entries) Enoent ( talk) 17:00, 5 July 2015 (UTC)
What about the history of hash functions? Admittedly hashing has only been a vital part of information security systems since the Computer Age, but there must be some history of the mathematical foundations. — Preceding unsigned comment added by 202.163.119.34 ( talk) 10:01, 11 October 2015 (UTC)
Do you think this website is helpful and should be added in the External links section? http://tools.timodenk.com/?p=hash-function
-- 109.192.64.215 ( talk) 13:56, 31 December 2015 (UTC)
They are not good (failing the Spectral test with very few domains), but they are small and famous (fast LCG-hash) cores.
//[Linear congruential generator] hash, 4 domains in, 4 domains out.
#define HASHSCALE4 vec4(1031,.1030,.0973,.1099)
vec4 hash44(vec4 p){p=fract(p*HASHSCALE4);//arbitrarily scaled seesaw modulo remainder, with bias; failing a [Spectral test] in 2d.
p+=dot(p,p.wzxy+19.19);//dot product with domain-swiveled self
return fract(p.zywx*(p.xxyz+p.yzzw));//seesaw modulo remainder of linear polynomial with domain-swiveled selves.
}//
https://www.shadertoy.com/view/4djSRW
//[Linear congruential generator] hash, 4 domains in, 4 domains out; does mod(x,17*17). Apparently intended for small integer types; does not even bother to scale or swivel domains.
vec4 mod289(vec4 x){return x-floor(x*(1./289.))*289.;}//289=17*17;
Ollj ( talk) 07:34, 8 September 2017 (UTC)
The article says "A hash function is any function that can be used to map data of arbitrary size to data of fixed size.", then immediately retorts that checksums are not hash functions. However, checksums satisfy this definition. A perfunctory examination suggests that, possibly, some others of the article's list — "checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers" — also fit the given definition of hash functions. — Preceding unsigned comment added by Ihearthonduras ( talk • contribs) 04:20, 21 March 2018 (UTC)
"Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room." - very interesting, can we have a source - william sharkey
Since nobody knows what this is, what is point of the sentence?
Given that the previous so-called sentence runs on and on without saying anything, what "remaining characters"?
The opening sentence says: A hash function is any function that can be used to map data of arbitrary size to fixed-size values. And it's cited, though the source is popular rather than scholarly. The definition is a common one, though I'd say it's aped from one source to another. In many cases, the data is already a fixed size. What then? Suppose I have 40,000,000,000 64-bit values, and need to collect some modest set of them meeting some criteria into a table which I need to index quickly to retrieve any particular one. It's ok if the table has empty slots or is even sparse, but must be many orders of magnitude smaller than 2^64, or even 10^10. I don't need a hash function, because the definition says so: my data is already fixed length, all 64-bits.
On the other hand suppose I define a function which always returns 1, regardless of the value or length of the key. According to the definition, that's a hash function. Not only is the hash value fixed length (one bit), but also a fixed value. You may say that it's merely the case that we've constructed a bad (very bad) hash function. But the function is so bad, that it no longer conforms to ordinary notions of what a hash function is. Suppose again that I define an algebraic function, which always returns one. I didn't specify what the effective domain of the function is. I input the value 1, and it returns 1, the square root of 1. It's a square root function by definition, but again does not conform to our ordinary notion of such a function.
For their bountiful effusion, the editors have not clarified what the basic concepts of a hash function are, and the definition is symptomatic of the problem.
A hash function is a mapping. It does two things: first, it yields an index into the space of key values for each key, not necessarily a unique index; i.e. it distributes the key values in some fashion, the more random the better, over the key space.
The key values may all be sequential, like soldiers' dog tag numbers, allocated from some arbitrary starting number serially. If we find a dog tag, and want to look up the soldier's name, how do we find that dog tag in a list? It won't do to try linear or even binary search of the list - the search space is enormous, and the list isn't necessarily constructed in a serially increasing order by dog tag number: someone at the end of recruiting took all the sign-up sheets in a messy stack and entered the tag numbers line by line into the computer. Now we need to construct a referenceable table. We can use the dog tag number itself, so that dog tag number 1,000,000 (the nth dog tag in the list) goes into slot number 1,000,000 in the table. So if a medic finds dog tag number 1,000,000 on the battlefield, he can index slot 1,000,000 in the table, and find that the soldier's name is Joe Blow. The problem is, there are billions, possibly trillions or more, possible dog tag numbers, and mapping them onto the computer system's disks either takes a lot of disks or is infeasible.
So, second: a hash function maps a large state space of key indices into a manageable state space, such that collisions are minimized. Key indices which were distinguishable in the original key space may not be distinguishable in the reduced key space, i.e. there's a collision, and one of the dog tag indices will have to be mapped somewhere else in such a way that it is findable.
If key values were truly random, then we could take any n bits of the key and use it as an index into a table of size 2^n. The "hash" function could be just about anything that selects n bits. But the keys are not random, they're clustered serially, one cluster for each recruiting station. If the clusters are not broken up, when we attempt to map the indices into a reduced table, there will be many collisions, making a reduced table with little more order than a messy stack of papers. So the hash function cannot be just 'anything' - it has to break up the clusters and distribute the key indices over the key space, then map the indices into a practical size table. When we combine both processes into a single operation, like modulo, we confuse ourselves about exactly what happened. (modulo, BTW, does NOT break up serial clusters of key values, so is a poor choice for hashing dog tag numbers, or any application where serial clustering may occur).
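The clustering point can be demonstrated concretely (a toy example of my own, 16 slots; the multiplicative constant is Knuth's 2654435769 ≈ 2^32/φ):

```python
# Two serial clusters of keys, as in the dog-tag example, hashed into
# 16 slots by plain modulo versus Fibonacci (multiplicative) hashing.
keys = list(range(5000, 5008)) + list(range(9000, 9008))

def mod_hash(k, bits=4):                  # plain modulo: keeps clusters intact
    return k % (1 << bits)

def fib_hash(k, bits=4):                  # Knuth's constant ~ 2^32 / golden ratio
    return ((k * 2654435769) & 0xFFFFFFFF) >> (32 - bits)

mod_slots = [mod_hash(k) for k in keys]   # both clusters pile onto slots 8..15
fib_slots = [fib_hash(k) for k in keys]   # scattered over the table
```

Under modulo, both serial clusters map onto the identical run of slots 8 through 15, so every key in the second cluster collides with one in the first; the multiplicative hash breaks the runs up.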
All the rambling and roaming of the article fails to tell us about the one, and then about the other. And it doesn't have any information on collision resolution, or performance of the algorithms. Nor how external hashing, i.e. across physical interfaces like disks, drums, tapes, etc, is handled efficiently.
We need to start again, and we need to start at the top with a notional definition, cited or not. Sbalfour ( talk) 16:54, 20 September 2019 (UTC)
This article is not for these. They have their own articles. I propose deleting them forthwith, since there's already a note at the bottom of the lead redirecting readers to them. Sbalfour ( talk) 18:19, 20 September 2019 (UTC)
I'm going to be BOLD and just do it. Sbalfour ( talk) 18:50, 20 September 2019 (UTC)
I have no idea what the given formula means: everything is undefined, including M, W, m, w, a, x, y, and ha (how's that different from h?). I'm pretty sure it was copied from somewhere, and not the source stated, which uses different notation. So what's the real source? If we knew that, we'd probably have a copyvio. Sbalfour ( talk) 20:05, 21 September 2019 (UTC)
This article says much, but conveys little. It's like a survey with a few "how-to" tidbits thrown in, and little or no analysis of how these functions perform. For example, the three sections: Hashing uniformly distributed data; Hashing data with other distributions; Special-purpose hash functions; essentially say the same thing, that if we know something about the distribution or structure of the data, we can construct a better hash function. How we do that depends on what we know, and that's simply too wide an endeavor to detail here. Sbalfour ( talk) 15:42, 20 September 2019 (UTC)
I've fixed a lot of this with focus on two things: hashing integers, and hashing variable length data. Sbalfour ( talk) 18:58, 22 September 2019 (UTC)
There's nothing in the article about deletions, performance of the algorithms versus data, or analysis (i.e. theoretical performance, best case, average case, worst case, etc). The article is critically anemic. Sbalfour ( talk) 19:00, 22 September 2019 (UTC)
A hash function takes as input a key, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed, like an integer, or variable length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them.
A hash function may be considered to perform three functions:
A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize collisions. Hash functions rely on favorable probability distributions for their effectiveness, reducing access time to nearly constant. High table loading factors, pathological key sets, and poorly designed hash functions can result in access times approaching linear in the number of items in the table. Hash functions can be designed to give best worst-case performance, [Notes 1] good performance under high table loading factors, and in special cases, perfect (collisionless) mapping of keys into hash codes. Implementation is based on parity-preserving bit operations (XOR and ADD), multiply, or divide. Implementations include an adjunct collision-resolution method. Sbalfour ( talk) 00:44, 23 September 2019 (UTC)
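As a familiar instance of the XOR-and-multiply style alluded to above, here is FNV-1a, 32-bit variant (an illustration, not taken from the proposed text; the constants are the published FNV offset basis and prime):

```python
# FNV-1a, 32-bit: XOR in a byte, then multiply by the FNV prime mod 2^32.
def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                          # FNV-32 offset basis
    for byte in data:
        h ^= byte                           # mix in one byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # FNV-32 prime, truncated to 32 bits
    return h
```

Both operations are cheap and parity-preserving in the sense discussed: each input bit influences the result, which is what makes such functions fast while still spreading keys.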
The text says this Typically, ... a hash function ... will map several different keys to the same index which could result in collisions. Not could result in a collision, but WILL result in a collision. And no, a hash function will NOT typically map several keys to the same index - it will typically map only ONE key to an index. If there are more than a few collisions, we'll usually choose another hash function. The whole section needs to be cleaned up for conceptual presentation as well as technical diction. Sbalfour ( talk) 19:17, 20 September 2019 (UTC)
Then it says: Hash table implementations include a specific hash function—such as a Jenkins hash or Zobrist hashing—and a hash-table collision resolution scheme—such as coalesced hashing, cuckoo hashing, or hopscotch hashing. This is just mongering, or I don't know what. These are highly specialized hash functions and techniques. The most common hash function is modulo based, even if hashing strings, the most common hash table structure is open addressing, and the most common collision resolution technique, linear probing. These are the simplest, and usually quite good enough, that's why. I think we ought to say these first, and relegate the others to a footnote, if we mention them at all. Sbalfour ( talk) 21:30, 20 September 2019 (UTC)
I think the editor of that section had a singular and rather dogmatic idea of what hash functions and tables actually look like in practice. Sbalfour ( talk) 18:40, 23 September 2019 (UTC)
I propose this:
Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or it can be added to the table in some other location by a specified procedure. That procedure depends on the structure of the hash table: In chained hashing, each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain. Chains may be kept in random order and searched linearly, or in serial order, or as a self-ordering list by frequency to speed up access. In open address hashing, the table is probed starting from the occupied slot in a specified manner, usually by linear probing, quadratic probing, or double hashing until an open slot is located or the entire table is probed (overflow). Searching for the item follows the same procedure until the item is located, an open slot is found or the entire table has been searched (item not in table).
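The open-addressing insert/search procedure described above can be sketched minimally (the names and structure are my own; deletion would additionally require tombstones, omitted here):

```python
# Minimal open-addressing hash table with linear probing.
class LinearProbeTable:
    def __init__(self, size=8):
        self.slots = [None] * size        # each slot holds (key, value) or None

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):  # at most one full sweep of the table
            yield i
            i = (i + 1) % len(self.slots) # linear probing: next slot, wrapping

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return True
        return False                      # every slot probed: overflow

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None               # open slot reached: item not in table
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None                       # entire table searched
```

Search follows exactly the insertion probe sequence, terminating at the item, at an open slot, or after the whole table has been probed, as the proposed text describes.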
I've deleted numerous examples from the text for three reasons: (1) programming language code is copyrighted unless the program itself is in the public domain; (2) the examples were uncited and almost certainly original art that would otherwise be copyrightable to their creators; (3) in the absence of all else, the encyclopedia is not a how-to manual: actual compilable code and elaborate pseudo-code, except when plainly necessary to clarify obscure text, is for programming textbooks, not the encyclopedia. Clearly, we can "invent" self-evident examples such as "1+1=2 is an example of integer arithmetic addition". The best course is to accurately and succinctly describe the algorithm, ... an exercise in technical diction, which is exactly what we want in the encyclopedia. Examples must not be original art, copyrighted, or uncited. Sbalfour ( talk) 15:04, 24 September 2019 (UTC)
What good does it do to speak of Hash function without Hash table? And vice versa. It's inane. There's no use for a hash function without a hash table (I'm not talking about crypto, checksums, etc., which have their own articles), or a hash table without a hash function. I propose combining both articles into one, titled Hashing. It's almost a direct insert - no need to merge sections, paragraphs, etc. Sbalfour ( talk) 17:26, 24 September 2019 (UTC)
Article says (obtusely): "Use of hash functions relies on statistical properties of key and function interaction: worst-case behaviour is intolerably bad with a vanishingly small probability, and average-case behaviour can be nearly optimal (minimal collision).[1]" (
FairNPOV (
talk) 17:33, 11 January 2022 (UTC))