This is the talk page for discussing improvements to the Hash table article. This is not a forum for general discussion of the article's subject.
Article policies
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL
Archives: 1
This level-5 vital article is rated B-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
In the section "Choosing a good hash function", a (presumably integer) "s" is repeatedly referred to. However, as far as I can see, it is never defined anywhere in the article. Although (some) computer scientists might infer s from context, other readers may not. This is bad form. Suggest changing it.
152.3.68.83 ( talk) 18:39, 22 November 2013 (UTC)
Already the first sentence is misleading, if not outright wrong. A hash map associates keys with values (i.e. it maps) and is implemented using a hashtable, but a hashtable itself does not necessarily do any mapping or association of keys and values. Just consider a hashtable in which all you do is store integers: there are no values there. —Preceding unsigned comment added by 91.36.112.7 ( talk) 14:51, 23 January 2008 (UTC)
I have a few small issues with the current Hash table article. First, the pseudo-code is a little convoluted—in that the findSlot(..) function is not 100% obvious. It hides away the linear probe code, which makes reading the set(..) and lookup(..) functions a little confusing on first read. I was just going over it with an intro computer science student and even had to read it twice myself. Josh 08:11, 10 April 2006 (UTC)
There is a line which states:
I'm not really sure that this is accurate. I might agree with calling it expected O(1), or average O(1). But it is not, as far as I can tell, worst-case O(1). In fact, remove is worst-case O(n), as are all hash operations. Josh 08:11, 10 April 2006 (UTC)
I know that there has been some discussion of amortized runtime here, but I'm not really sure that you can use amortized analysis on this function. Please correct me if I am wrong, otherwise, I will change shortly. Further, I am not sure why the article says only possible in linearly probed hash tables... If the remove function is O(1) in linear probe, then why not with a quadratic probe? Josh 08:11, 10 April 2006 (UTC)
"When we say T(N) [is] O(f(N)), we are guaranteeing that the function T(N) grows at a rate no faster than f(N)." (Data Structures and Algorithm Analysis in C++, Third Edition, Mark Allen Weiss, page 44)
It is entirely possible to implement a hash table that makes a collision every time a new value is added. This would be a useless implementation, but a valid one. A hash table can't then be O(1) ever in any way if the above possibility exists.
"Hash tables can be used to implement the insert and contains operations in constant average time...The worst case for hashing generally results from an implementation error." (page 207)
I would suggest that the article should be changed to mention that hash tables are not O(1) by the definition of Big-O notation, but that all good implementations come closer to acting like they are than binary search trees.
199.111.229.133 00:23, 16 October 2007 (UTC)
If, when adding to a table, the hashing function leads to a collision for two particular keys, a and b, then using probing b will be stored somewhere after a. When then looking up the value associated with b, won't the table actually return the value associated with a? How does it know which of the two values to return, and how to get to the index associated with b?
I hope that makes sense. Iae 11:22, 8 June 2006 (UTC)
You know you have hit the end of the list in a (probed) hash table when you hit an empty "not occupied" slot. In theory, one could have an "occupied" bit for each row of the hash table that is initially 0. (In practice, typically each slot that is not occupied begins with a NULL byte. If it *is* occupied, that byte is the first byte of the key.)
I know of 2 very different ways to delete from a hash table: the "deleted bit" method and the "move stuff around" method. (My understanding is that the "move stuff around" method is impossible to implement with "quadratic probing" or "double hashing" hash tables (and probably a few other types). Those hash tables are forced to use the "deleted bit" method.)
Let me try to explain those 2 methods:
"deleted bit" method: works with any kind of hash table. Every row of the hash table (in addition to the key, value pairs) has a "deleted" bit that starts out 0. To delete a record from the hash table, use the key to find it (using find_slot), then set the "deleted" bit to 1. (Some hash tables cram the "deleted" bit and the "occupied" bit into the "key" part of the record, reserving 1 special key to indicate "unoccupied", another special key to indicate "deleted", and any other value to indicate a real occupied "key" ).
"move stuff around" method: only works with linear probing.
The function remove(key) in the article is supposed to describe the "move stuff around" method. How could we make it easier to understand?
Often when the application deletes a record, the following slot is already not occupied. In that case you can wipe out the record by marking that record as not occupied -- overwriting the key with a NULL byte -- and you are done.
Unfortunately, there are a bunch of special cases the application needs to be able to handle, even though they rarely happen. As the article says, "For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record)." The application needs to scan through *all* the records between the record you want to delete, and the next "not occupied" slot, and make sure the above requirement is met. In some cases, you must move all those records up 1 slot. In other cases, you must move some (but not all) of those records.
In yet other cases, you must not move any of them, just mark the deleted slot "not occupied" just like the simple case. (For example, if you want to delete the record in slot 870, and you see that the "key" in slot 871 actually hashes to "871", and slot 872 is "not occupied" -- you must mark slot 870 "not occupied", and leave the record in slot 871 alone).
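To make the special cases above concrete, here is a minimal sketch in C of the "move stuff around" deletion for a linearly probed table. It assumes a global slot array, a table size, and helper functions hash() and find_slot() like the ones discussed elsewhere on this page; all names here are ad hoc, not code from the article.

#include <stddef.h>

#define NUM_SLOTS 1024                        /* table size, for illustration only */

struct slot { int occupied; char key[32]; char value[32]; };
struct slot table[NUM_SLOTS];

extern size_t hash(const char *key);          /* whatever hash the table uses      */
extern size_t find_slot(const char *key);     /* linear probe to the key's slot    */

void remove_key(const char *key) {
    size_t i = find_slot(key);
    if (!table[i].occupied) return;           /* key not present                   */
    table[i].occupied = 0;                    /* open the hole at i                */
    for (size_t j = i;;) {
        j = (j + 1) % NUM_SLOTS;
        if (!table[j].occupied) break;        /* reached end of the cluster: done  */
        size_t k = hash(table[j].key) % NUM_SLOTS;   /* natural position of slot j */
        /* keep slot j in place only if its natural position k lies cyclically in
           (i, j]; otherwise the record must be moved back into the hole at i      */
        int keep = (i <= j) ? (i < k && k <= j) : (i < k || k <= j);
        if (!keep) {
            table[i] = table[j];              /* move the record up                */
            table[j].occupied = 0;            /* the hole moves to j               */
            i = j;
        }
    }
}

In the slot-870 example above, the record in slot 871 hashes to 871, so the "keep" test succeeds and it is left alone, exactly as described.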
Once you understand how it works, please update the article to make it easier to understand for the next person.
-- 68.0.120.35 07:42, 5 March 2007 (UTC)
I wanted to check in here before making huge changes to the article, but one thing I'd find very helpful is a discussion of concrete choices for the hash function itself. Here's an outline of what I'd say:
It's very common to implement hash tables with poor hashing functions. Knuth is largely to blame, advocating the very weak "multiplicative hash" function, and even going so far as to claim that its clustering property is a good thing! (section 6.4, TAoCP vol III, 2e, p. 517). Variants of the multiplicative hash remain popular, as do other linear techniques such as CRC.
Surveying the field, two excellent newer alternatives stand out. For most simple hashing tasks, the Fowler Noll Vo Hash is an excellent performer. It is among the simplest of known hash functions, is quite fast, and has a good distribution. For use in a hash table, the FNV-1a variant is likely the best choice, as it has better (more dispersed) clustering behavior than FNV-1.
For some applications, particularly when keys are long, the newer Jenkins lookup3.c hash function may be a better performer. It achieves better speed by consuming 12 bytes of the input per iteration, as opposed to one byte for FNV. Disadvantages include greater code complexity, a sensitivity to machine endianness (causing potential difficulties when reusing hash values across disparate computers), and the need to pad byte-aligned input to 32 bit words. Do keep in mind, though, that benchmarks showing impressive speed for large blocks of data may not translate to real-world gains. Common usage patterns involve relatively short keys, so the amount of time spent in the hashing function inner-loop may be less relevant than, say, the gains from a compiler being able to automatically inline a simpler hash function.-- Raph Levien 02:32, 5 July 2006 (UTC)
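For reference, a minimal sketch of the FNV-1a variant mentioned above, using the published 32-bit offset basis and prime; this is just the textbook form, not code taken from the article.

#include <stdint.h>
#include <stddef.h>

/* FNV-1a, 32-bit: xor the byte in first, then multiply by the FNV prime */
uint32_t fnv1a_32(const unsigned char *key, size_t len) {
    uint32_t hash = 2166136261u;              /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= key[i];
        hash *= 16777619u;                    /* FNV prime */
    }
    return hash;                              /* reduce to a table index separately */
}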
Some interesting links I came across while searching:
You might also want to check out HSH 11/13 which seems to distribute quite well (as per published graph) and also performs nicely with its handful of code lines.
Followup:
I went ahead and completely rewrote the section. I may have come close to bending the letter of the law on doing original research (measuring avalanche behavior of Jenkins One-at-a-time hash using Bret Mulvey's tools) and NPOV (I advocate particular hash functions and criticize other popular choices), but everything I said is verifiable, and I think the result is the most helpful and reliable advice on choosing hash functions anywhere on the Internet or in dead tree form.
I measured the HSH 11/13 hash using Mulvey's AvalancheTest, and found it slightly inferior to the Jenkins One-at-a-time hash. Herbert Glarner's tests for uniform distribution are not as sensitive as Mulvey's chi-squared tests. Further, HSH 11/13 is quite slow, because of its inner loop.
Obviously, I also changed my mind about FNV, swayed in large part by Mulvey's analysis. It's not a bad choice, being one of the simplest of the "pretty good" choices, but even there the XOR-folding adds a bit more complexity. In any case, I feel like I've provided enough information for people to make their own informed choice.-- Raph Levien 21:31, 12 August 2006 (UTC)
Simplicity and speed are readily measured objectively
There is a caveat here. Experimental measurements of speeds are necessarily done on a "representative sample" of inputs. It may be the case that such or such algorithm performs with varying speed depending on the kind of inputs, and that some sample, representative of one kind of inputs, may not be representative of another kind. I don't think this would happen with usual hash functions on e.g. strings but this may happen in more esoteric cases. David.Monniaux 14:38, 13 September 2006 (UTC)
Problem 1: The Hash Table will be used for storage of student records. You may assume the maximum number of Items will not exceed 240. An item corresponds to a student record. A key is taken to be an ASCII character string of maximum length 32 containing the name of the student. The data is taken to be the 9-digit id and the name of the state of residence of the student in India. A list node contains an Item and a reference to the next node. You will need to implement the classes Item and ListNode with constructors and appropriate data access operations.
2. Write a main() program to test your class HashTable.
3. Input: The name of the input file, containing data items (student records) and data operations in the format given below, and the name of the output file, will be passed to your program on the command line or interactively. Data Items: student-id, student-name, state-of-residence. Example: 200412018, Swati Mishra, Uttar Pradesh (one per line). Data Operations: <operation>: <item>. The <operation> is s, i, or d for search, insert and delete, respectively. The item will have fields: student-name, student-id, and state-of-residence. The fields of an item will be separated by commas, and the operation will be separated from the item by a colon ":" and a space. Example: s: 200211001, Akash Gokhale, Maharashtra (one per line). The data items will be separated from the data operations by a blank line.
4. Output: Your program is to read the input file and populate the Hash Table with student records by repeated Insert() operations. It should then print to the output file the size of the Hash Table and the size of each linked list in the Hash Table. Then it will continue to read each line and execute the appropriate data operations. Following each Insert()/Delete() operation, it will output the size of the Hash Table and the size of each of the linked lists, and for each Search operation, it will output
-- 220.225.53.35 09:18, 11 October 2006
Hi, I've implemented the joaat_hash function which is described in pseudocode on this page in my C program, and encountered an error. Better said, I produced one, since I misunderstood len as the length of the hashtable rather than the length of the key.
Here is my (corrected) function implemented in C; please consider updating the article:
int joaat_hash(char *key, size_t len) // len is the size of the hashtable
{
    unsigned int hash = 0;
    unsigned int i;
    for (i = 0; i < strlen(key); i++) {
        /* [...] as in the article */
    }
    return (hash % len);
}
-- 88.76.141.17 03:43, 2 January 2007 (UTC)
You are right -- the "len" in the pseudocode in the article is the length of the key. (That pseudocode gives the same results as the original C implementation "One-at-a-Time Hash" ub4 one_at_a_time(char *key, ub4 len) http://www.burtleburtle.net/bob/hash/doobs.html , right?)
I think the above implementation gives the same results as the original C implementation for all possible ASCII strings. Unfortunately, the above implementation gives different results for other sorts of data structures cast into byte arrays, when those data structures include a zero byte.
Since the original version gives *better* results for those data structures, and the *same* results for everything else, I think I'll stick with the original version, except use "key_len" to clarify exactly of what it is the length. (Or have I missed something obvious?) -- 68.0.120.35 07:42, 5 March 2007 (UTC)
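For comparison, a minimal C rendering of Jenkins' one-at-a-time hash as published at the URL above, with the parameter named key_len to avoid the confusion discussed here; the reduction modulo the table size is deliberately left out.

#include <stdint.h>
#include <stddef.h>

uint32_t one_at_a_time(const unsigned char *key, size_t key_len) {
    uint32_t hash = 0;
    for (size_t i = 0; i < key_len; i++) {    /* key_len = length of the key, in bytes */
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;                        /* final avalanche steps */
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;                              /* reduce modulo the table size separately */
}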
From the article: "In fact, even a cryptographic hash does not provide protection against an adversary who wishes to degrade hash table performance by choosing keys all hashing to the same bucket."
I thought that one of the criteria for a cryptographic hash function was that it be infeasible to find collisions. Therefore, it would provide defense against such an adversary. Or am I misunderstanding something? Ralphmerridew 21:55, 30 January 2007 (UTC)
What lots of people seem to forget when discussing the need for hash functions that reduce predictable collisions is that they are only needed in hostile environments, where an attacker can control the input data *and* where it is important that proper speed is maintained (e.g. tcp/ip hashtables within a kernel). It is much, much less important in simple programs where a hashtable is just used to store some data that might even be created dynamically by the program, or doesn't occur in the input set the program was designed for. It might be possible to degrade a hashtable that just uses some "id * 1000000007UL & ((1<<n)-1)" for finding the appropriate slot in your face recognition software by feeding it a carefully crafted artificial bitmap pattern. But why care? Trash in, trash out.
A recent edit deleted most of the following paragraph, claiming it was "verbiage":
This paragraph is explaining that *cyptographic* hash functions (a different concept altogether, see lead section) are not necessarily good choices for hash tables, because their only advantage (probabilistically guaranteed good performance even on data submitted by hostile clients) can be obtained at smaller cost by using ordinary (non-crypto) hash functions with secret salt (as discussed above). The recent edit removed this information. But perhaps the wording is not clear and needs to be improved. All the best, -- Jorge Stolfi ( talk) 21:28, 6 February 2010 (UTC)
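A minimal sketch of the idea in that paragraph, assuming the salt is a random value chosen once per table (e.g. from a CSPRNG at creation time) and kept secret from clients; the function and names here are ad hoc, not a recommendation.

#include <stdint.h>
#include <stddef.h>

/* ordinary (non-cryptographic) hash seeded with a per-table secret salt */
uint32_t salted_hash(const unsigned char *key, size_t len, uint32_t salt) {
    uint32_t h = salt;                        /* start from the secret salt */
    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;                       /* FNV-style mixing step */
    }
    return h;
}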
I'm considering merging the open hashing article with this article, as has already been done for closed hashing. For that I think we should rename the section on Chaining to Open hashing and make more clear the already mentioned fact that linked lists are only one way, though perhaps the most popular one, of organizing hash buckets and introduce a mention of buckets allocated in a contiguous memory/disk space that is presently discussed in open hashing. Jyotirmoyb 05:16, 20 February 2007 (UTC)
Makes sense - can't have it both ways - with closed hashing merged (though it's hard to find any content called "closed hashing" in the article) and open hashing in its own article. On the other hand, there is no problem with an article on "open" and "closed" and other types of hashing if there is enough content to justify it. Does Wikipedia have guidelines on article length? -- 121.45.246.110 14:10, 9 April 2007 (UTC)
The remove function will loop indefinitely if the hash table is full, since it only exits when it finds an unoccupied slot. I think adding the following after the j := (j+1) line should fix it:
if j = i exit loop
32.97.110.142 20:36, 11 April 2007 (UTC)
In the discussion of the costs of resizing the hash, there seems to be an ambiguity in the amortized running time analysis:
... If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1. Because the costs of the resizings form a geometric series, the total cost is O(n)....
The n in "2n - 1" is actually the value of the nth element, not the number of elements. The sentence that follows makes it seem that the sum of a geometric series is linear, which is clearly not the case.
Guray9000 00:47, 16 April 2007 (UTC)
The article states:
To see why this is true, suppose a hash table using chaining begins at the minimum size of 1 and is doubled each time it fills above 100%. If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1.
This does not seem right. The expected number of steps it takes to double the table size from (1) to (n>1) at each overfill should be: truncate(LOG2(n-1))+1 (where LOG2 means logarithm base 2 of n). Also, if my math is correct 1 + 2 + 4 + ... + n = n(n+1)/2, but this does not seem the right computation to make: we're not supposed to add the values corresponding to each step but to count them.
If this reasoning is correct the total cost should read O(LN(n)) rather than O(n), which should mean this method scales very well with n.
In that section it is written: "Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table." (*)
Searching in sorted list of keys by using binary search can be done in O(log N). Average running time of binary search is also log N (this is according to http://planetmath.org/encyclopedia/BinarySearch.html). I don't know any better algorithm for non special sorted data. So I think statement (*) isn't true in asymptotic sense.
However if we know that N is bounded by some number we can say that O(log N) = O(1). But this would be wrong... But in some philosophical (or maybe practical) sense we could be right (we have large but fixed maximum amount of RAM, HDD space, maximum amount of hash table entries etc). But this would be not right in asymptotic sense where N not bounded but free.
What I don't understand?
This section is quickly devolving into yet another trivia list. MegaHasher 19:45, 11 July 2007 (UTC)
Are there any references on a data structure that implements a hash table and binary tree at the same time? MegaHasher 20:04, 3 August 2007 (UTC)
This article could be improved with more direct citations. MegaHasher 06:11, 14 August 2007 (UTC)
I am suspicious of the random table based hash function. This seems to be a variation on Pearson's hash function, except there are multiple tables, therefore much worse cache misses. Pearson's hash function is not all that fast to start with. MegaHasher 03:22, 18 August 2007 (UTC)
For single-byte indices and 32 bit hash numbers, the tables are 256*4 = 1024 bytes each. This algorithm is obviously best when the cache size is larger, but also very suitable when a few keys account for most of the lookups, or when the cost of a collision is far greater than the cost of a memory cache miss (such as when hashing into disk records).
By having multiple tables, this algorithm ensures lookups to the same index in the table don't cancel each other out, thereby generating literally perfectly randomly distributed but repeatable hashes. The Wikipedia article on Pearson's hash function indicates it employs a table with the numbers 0 to 255 randomly arranged, and it ultimately selects one of the 256 different values in the table at random. The multi-table algorithm you removed doesn't simply select one of the 256 values in one of the tables; it xors values together to create a far stronger hash. Look at the algorithm properly and you might appreciate it more.
FWIW, the algorithm is proven technology, used to hash > 3 million highly-repetitive financial security keys by the world's leading financial data provider's proprietary security database, handling hundreds of thousands of database inserts per second.
I'll reinstate the algorithm, with the attribution removed as you've (offensively) referred to it as "vanity" material, and I hope you, Mr "MegaHasher", have much better reasons before you remove it again.
I don't consider this to be original "research". It is simple stuff, and important precisely because of that. Too many programmers either don't use hash tables or use poor algorithms because they don't have a feel for or ability to assess mathematical hashes. This may suit the few specialists who write the algorithms and relish the elitist mystique of it all, but it's a big loss for computing as a whole. More generally, I might accept that this belongs on the Hash Function page if you also moved the joaat_hash off this one. You can't have it both ways. I agree with the earlier discussion comments that your posting and defense of Bob Jenkin's algorithm looks very inappropriate on this page, and all the more so for your refusal to accept anything that removes any of the focus from it. —Preceding unsigned comment added by 205.228.104.142 ( talk) 23:59, August 29, 2007 (UTC)
I have seen two separate referenced articles that had measurements that show under separate chaining, an increase of load factor by 1 has the impact of increasing CPU instruction count around 4 to 5%. This is much better than what I would have guessed. The first reference is "Dynamic Hash Tables" by Larson (1988), and the second reference is "Main-memory linear hashing" by Pettersson (1993). Have I completely mis-read these articles? MegaHasher 21:33, 20 August 2007 (UTC)
Can somebody please add this picture to the article? I believe it will help a lot of people to understand the notion of the hash table. -- yanis 12:25, 21 August 2007 (UTC)
Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table. However, the very rare worst-case lookup time can be as bad as O(n).
This is essentially wrong. In a non-buggy implementation, which is when the table is a reasonable size, the table is never too full, and the hash function is working correctly, n collisions cannot occur. WolfKeeper 14:36, 16 October 2007 (UTC)
WolfKeeper: You have no idea what you are talking about. In standard implementations in standard SW like all dynamic languages the linear chaining strategy is open to well-known hash collision attacks, which can be defeated even with randomized seeds. So hash tables need to protect themselves against those O(n) worst-cases, which is e.g. a DOS attack on the web against php, perl, ruby with well-crafted keys. And those O(n) possibilities lead the implementations as those attacks appear in practical web security. ReiniUrban ( talk) 19:12, 2 April 2014 (UTC)
That's wrong for another reason. Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time. Only updates may take O(n) time in the worst case. 23:24, 2 November 2007 (UTC)
The probability of n collisions by sheer chance is p^n where p is a number usually well under 0.8. The chances of say, 200 collisions is 0.8^200 = 4e-20, and on modern computers that would still run quickly. At a million hashes every second that statistically wouldn't happen once in a million years; and that assumes the table is full, and that's a small table by hashing standards. There's more chance of the hardware failing, by about a factor of 10^6. WolfKeeper 14:36, 16 October 2007 (UTC)
I feel like we are talking past each other, because we are talking about several different things but using the same name for all of them:
I agree with Wolfkeeper that, with natural data, known hash tables are so unlikely to ever exhibit O(n) performance that "it just never happens and never would happen before the end of the universe." However, the hash table#Drawbacks section has a few references that imply that known hash tables are vulnerable to a DoS attack involving carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Some of those references seem to recommend fixed secret hash tables, which still have worst-case O(n) performance, but make it practically impossible for an attacker to exploit that possibility unless the attacker knows the secret. However, other references imply that an attacker can learn the secret after an hour of probing, and then use that secret to send carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Those references recommend temporary secret algorithms, which generally guarantee O(1) lookup times even in the worst case under active attack, at the cost of making the worst-case insertion time slower.
I feel this article needs to clearly distinguish between algorithms that have guaranteed O(1) lookup times even when under active attack, vs. algorithms that suffer from O(n) lookup times when under active attack. Is that difference big enough that these two categories of algorithms need to have two separate Wikipedia articles? -- DavidCary ( talk) 16:39, 29 July 2016 (UTC)
I see that elsewhere on this talk page someone says "Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time."
Is "dynamic perfect hashing methods" a better name for the entire category of algorithms that have guaranteed O(1) lookup times even when under active attack -- a category I previously called "temporary secret algorithms" -- a category that includes cuckoo hashing, hopscotch hashing, etc.? -- DavidCary ( talk) 15:52, 10 May 2017 (UTC)
The intro to this article is severely misguided and confusing. A hash table is not the same thing as a "hash map" (which is a Javaism meaning a map/dictionary implemented by a hash table instead of a tree). Hash tables have nothing to do with keys and values -- a hash table only knows that elements have hashes, and that elements are at the index in its array corresponding to the value of the hash function for an element. For example, a hash table can also be used to implement an unordered set. Additionally, the concept of "buckets" is not intrinsic to a hash table. It is a common strategy for resolving collisions but there are others (double hashing, for instance).
What's the correct way to tag a page that has inaccurate information on it?
There are some non-wikipedia ghits that have the correct definition, such as: http://www.sparknotes.com/cs/searching/hashtables/section1.html
However, it's a fairly common misconception that hashes map keys to values, since so often they are used to implement maps or dictionaries. I am fairly sure that Knuth makes this distinction, but I don't have it on hand.
-- 67.180.15.227 ( talk) 16:49, 5 December 2007 (UTC)
I went reading through the article, as I do every few years since it represents my first ever contribution to Wikipedia, and noticed that the short pseudo-code for the deletion algorithm for a linearly probed hash table had been removed. Standing in its place is a reference to the ICI implementation where the algorithm can be gleaned, but not so simply, because it handles a more complex case (set subtraction).
I just wondered why the clearer pseudo-code section was removed?
I confess to a feeling of ownership for the deletion algorithm, since that part of the entry came about after Tim Long and I discovered wikipedia. Some years earlier we'd been using a lot of hash tables, and it had bugged me that I could find no reference to an efficient deletion method. I worked out how to do it (I thought), only to discover that my method didn't always work. Tim debugged it (by adding a missing condition to the boolean expression.) We thought Wikipedia was the perfect place to put the small invention.
In fact, the pseudo-code for all the basic operations has been removed. I don't understand the logic behind that, hence this note. Lukekendall ( talk) 15:08, 28 January 2008 (UTC)
122.106.85.3 ( talk) 00:21, 29 January 2008 (UTC)
This article is terrible - someone needs to make this article clearer for beginners or even intermediates. It barely explains anything at first. How about a simple example instead of jumping into advanced theory??? 129.128.241.131 ( talk) 17:53, 7 February 2008 (UTC)
I feel the same way. Someone (like me) who doesn't know what a hash table is for will not find out quickly. I had to read most of the article before I could figure it out. I would suggest something like this be added, either in the introduction and/or in the very beginning of the contents. (I'm no expert; someone who knows what they're talking about should write it.)
Say an array of available space for storing data has indices like 1, 2, 3, ..., 1000: ARRAY[1], ARRAY[2], etc. However, the keys to the data may be something entirely different, like "John Smith", "Lisa Smith", and "Sam Doe". If they were just put into the array directly, some sort of search method would be necessary to find the desired entry. A hash table is a way of solving this problem, allowing extremely rapid lookup. In the example above, a pseudo-random function is called to assign each of John Smith, Lisa Smith, and Sam Doe to some number between 1 and 1000. If "John Smith" is assigned 873 by the hash function, its data is stored in ARRAY[873]. It can be retrieved immediately on demand by just recomputing the hash function on "John Smith" again to locate the correct index. If the hashing function is well-designed, different keys being sent to the same number is unlikely until the array begins to fill up. At some point, of course, keys will begin to "collide", that is, to be assigned the same array index. The hashing algorithm needs a way to find the right index anyhow.
-- MikeR7 ( talk) 21:07, 4 March 2008 (UTC)
Wow! MikeR7's introduction is great for me. And it would be so for many beginners. Thank you! -- Kang —Preceding unsigned comment added by 59.17.69.101 ( talk) 04:50, 28 August 2009 (UTC) It would be good to add to this an example of how the hash function might be constructed. —Preceding unsigned comment added by Lunedi9 ( talk • contribs) 20:28, 6 November 2010 (UTC)
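As a purely illustrative sketch of the kind of hash-then-index step asked about above; the mixing constant and table size are arbitrary assumptions, not a recommendation.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1000                       /* ARRAY[0..999] in the example above */

/* toy hash: mix the characters of the name, then reduce to an array index */
unsigned toy_index(const char *name) {
    unsigned h = 0;
    for (size_t i = 0; i < strlen(name); i++)
        h = h * 31 + (unsigned char)name[i];
    return h % TABLE_SIZE;
}

int main(void) {
    /* "John Smith" lands in some fixed slot; recomputing the same function
       on lookup finds that same slot again */
    printf("%u\n", toy_index("John Smith"));
    return 0;
}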
I can't access it at the moment, but this paper is probably also relevant to the robin hood hashing: Munro, J. I., and Celis, P. 1986. Techniques for collision resolution in hash tables with open addressing. In Proceedings of 1986 ACM Fall Joint Computer Conference (Dallas, Texas, United States). IEEE Computer Society Press, Los Alamitos, CA, 601-610. —Preceding unsigned comment added by 75.71.67.71 ( talk) 17:02, 15 April 2008 (UTC)
The description of "open hashing" seems quite inaccurate.
That "data indexed by the hash is 'stored externally' to the hash table" may be *required* for this data structure, but does not define it. To store variable length data open addressing may be used in conjunction with storing references to the actual data in the hash table. Open addressing is used synonymously to closed hashing, which results in a contradiction.
The example which follows the first sentence seems at least incomprehensible to me (By the way: Examples should be separated from a general description.) Key-value pairs are added to the group/bucket according to the hash value of the key. (The hash of the key defines the group.)
"The key that identifies a bucket is now put into the hash table (in this case it is as simple as the bucket number)" seems wrong to me. Actually, a reference to the group is 'put into' the hash table. And this is not equal to the key (Hash keys are generally not stored in hash tables, the values are stored).
Separate chaining should be characterized as a form of open hashing.
Sorry for the extensive criticism. This part of the article really confused me...
-- Whzlt ( talk) 03:24, 18 May 2008 (UTC)
The mathematical definition of O(1) is when n tends to infinity. All the demonstrations of O(1) I've seen assume at some point that the size of your data is bounded, or that you don't use your table at more than 80%, etc. But if your data gets really big (say greater than 10^10^10), either your collision rate will increase to 100% (then you tend to O(log n) ), or you'll have to increase your table size. The table size is about proportional to the data size. Then you'll have to compute a greater hashkey to uniquely identify the buckets. As it happens, the size of the hashkey (and the time it takes to compute it) grows with log n, where n is the number of buckets. So the operations on a hashtable are really O(log n), even if it stays reasonable for large data.-- Yitscar ( talk) 08:11, 24 May 2008 (UTC)
Wrong, all this talking about log(n) is fully wrong. Normal, linear search is O(n/2) with worst case O(n). Hash tables - as they are mostly implemented - have O(n2/m) where m is the size of the hash table. That's why crowded hash tables become as slow as normal search. But nowhere O(log n), forget about that! 178.197.232.59 ( talk) 11:16, 22 October 2012 (UTC)
Sorry to be thick, but can we define it please? Thanks. 124.101.249.63 ( talk) 14:01, 17 July 2008 (UTC)
The page points to some page that claims multiplicative hashing has poor clustering. But that page doesn't get Knuth's multiplicative hashing right -- it falls into the trap that many people do of thinking that it works as (k*a) mod m for some a. In fact, multiplicative hashing does fine if you implement it correctly, by taking only the high bits of k*a. —Preceding unsigned comment added by AndrewCMyers ( talk • contribs) 16:45, 13 January 2009 (UTC)
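A minimal sketch of what "taking only the high bits of k*a" means in practice, with the usual golden-ratio-derived constant for 32-bit words; it assumes a table of size 2^p with 1 <= p <= 32.

#include <stdint.h>

/* Knuth-style multiplicative hashing: keep the HIGH bits of k*a,
   not (k*a) mod m */
uint32_t mult_hash(uint32_t k, unsigned p) {  /* table size m = 2^p, 1 <= p <= 32 */
    return (uint32_t)(k * 2654435769u) >> (32 - p);
}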
For the x-th time, 1+2+4+8+...+n = 2n-1, not 2^n-1; you're thinking of 1+2+3+4+...+n. Just put in values for n and you will see this, or do you need a formal proof? Please people, actually read the text you edit. Adrianwn ( talk) 17:22, 26 March 2009 (UTC)
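For the record, when n is a power of two the sum is geometric: <math>1 + 2 + 4 + \cdots + n = \sum_{i=0}^{\log_2 n} 2^i = 2^{\log_2 n + 1} - 1 = 2n - 1</math>.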
This sentence has been removed:
Hash functions need not be pseudo-random, they need only spread the keys as evenly as possible. Moreover collisions are not due to pseudo-randomness of the function. -- Jorge Stolfi ( talk) 05:59, 3 April 2009 (UTC)
The article claimed that
This statement sounds rather biased. What matters is the actual hash function that maps the key to a bucket index. A "raw" hash function that gives good results when taken modulo a prime is OK as long as the table size s is a prime. One can do dynamic resizing with tables of prime size, with little extra cost. The above statement can be turned around to say "using tables whose size is a prime number increases the chance that the hashes will be uniformly distributed, since some popular hash functions are known to perform badly when the table size is a power of two and the hash index is obtained by bit masking." So, how should we rephrase this sentence? -- Jorge Stolfi ( talk) 00:40, 12 April 2009 (UTC)
There is a misunderstanding in the recent edits. The "modulo the table size" step is technically a part of the hash function, not of the hash table algorithm. If the bucket array has size s, the hash function must return a number in 0..s-1. See Talk:Hash function. Briefly, the mod-s trick is not always a good idea, and cannot be discussed separately from the "raw" hash function. More importantly, all discussion about the hash function (desired properties, meaning of "perfect", etc.) assumes that there is no external mod-s step. All the best, -- Jorge Stolfi ( talk) 02:21, 18 April 2009 (UTC)
At work I regularly explain to new developers the implementation of hash tables in our software product. I describe the computation prior to "mod N" as creating a fixed size digest of (some of) the entropy in the (typically larger, sometimes variable length) input value. Often we use a number of digest functions all returning the same type (typically an unsigned 32 bit integer). My developers readily grasp that the range of such digests is too great and must be transformed into an index into the N entry hash array, hence the "mod N" step. Making the size of the array prime simply ensures that all entropy in the digest participates in selection of the hash array position.
The independence of the digest step and the "mod N" step is underscored by the fact that N may change over time as values get added to the hash table, while the chosen digest function does not change. From experience I know that whenever we fail to distinguish clearly these two notions confusion ensues. 67.189.170.148 ( talk) 21:20, 8 August 2012 (UTC) "John Yates" <john@yates-sheets.org>
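A minimal sketch of the two-step structure described above, with ad-hoc names; the digest shown is just a placeholder for whichever function is actually used.

#include <stddef.h>

/* step 1: fixed-size digest of (some of) the entropy in the key */
unsigned digest(const unsigned char *key, size_t key_len) {
    unsigned h = 0;
    for (size_t i = 0; i < key_len; i++)
        h = h * 31 + key[i];                  /* placeholder mixing step */
    return h;
}

/* step 2: reduce the digest to an index; N may change as the table grows,
   while the digest function stays the same */
size_t slot_for(const unsigned char *key, size_t key_len, size_t N) {
    return digest(key, key_len) % N;          /* N: current (prime) table size */
}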
The "Basic algorithm" section does not seem to add to what has aleady been said in the lead section. Considering that the "mod N" trick does not belong here, the section seems redundant. -- Jorge Stolfi ( talk) 03:01, 18 April 2009 (UTC)
Array hashing (separately chained hashing using dynamic arrays to store the buckets) is advantageous only in some limited conditions. If the load factor is low (< 1) and the entries are well-distributed, most buckets have at most one entry, so the cache provides little benefit. Also, the dynamic arrays usually require a "bucket length" field; if there is only one entry, then this extra field takes away the memory saved by omitting the chain links. On the other hand, if the load factor is high (>>1), and the bucket arrays are resized by doubling/halving, then the vacant slots will waste more memory than the chain links would. All the best, -- Jorge Stolfi ( talk) 23:47, 24 April 2009 (UTC)
Dynamic arrays only require a "bucket length" field when they store 32-bit or 64-bit integer keys. It is not required for variable-length null-terminated strings (since strings are length-encoded).
The advantage of the array hash table is that it can scale much more efficiently than a chained hash table, with respect to both time and space. If a very low load factor is used (say 16 million slots and only 1 million keys), and considering that 32-bit integer keys are used, then the array hash table will consume no more space than the chained hash table, while still rivaling its speed. Once more keys are inserted, the array hash table will start to save space over the chained hash table, while retaining high access speed. The chained hash table, on the other hand, will start to consume more memory and slow down --- eventually, performance will get so bad that you will need to resize the chained hash table, further consuming more space and computing resources. This is not the case with the array hash table, which has been shown to scale well as the load factor increases, making it a more practical choice in a dynamic environment.
When variable-length string keys are used (as was the original design of the array hash table), even under low load, the array hash table will consume less space than the chained hash table, because it is cheaper to store a string in a dynamic array, then it is in a node of a linked list (as a result of memory allocation overheads and the requirement of a next-link pointer).
Also, the dynamic arrays are not resized by doubling or halving. Dynamic arrays are grown in an exact-fit manner --- they are only grown by as many bytes as needed, which is both fast and memory efficient.
cheers. —Preceding unsigned comment added by Hetori ( talk • contribs) 23:26, 11 June 2010 (UTC)
#include <stdlib.h>

typedef struct {
    void **array_ptr;           // underlying array
    size_t array_current_size;  // number of elements allocated in `array_ptr`
} array_structure;

// exact-fit growth: the array is enlarged by exactly one slot per insert
int insert_new_element(array_structure *array, void *new_element) {
    void **tmp = realloc(array->array_ptr,
                         (array->array_current_size + 1) * sizeof(void *));
    if (tmp == NULL) {
        return -1;              // allocation failed; the original array is left intact
    }
    array->array_ptr = tmp;
    array->array_ptr[array->array_current_size++] = new_element;
    return 0;
}
{{List_data_structure_comparison}} template table -- as soon as we find a name for it. -- DavidCary ( talk) 07:00, 23 February 2022 (UTC)
Hi, a recent edit claims that hash tables (HT) allow arbitrary insertions and deletions in the O(1) worst-case sense. I am thinking of usual implementations where the table is doubled/halved whenever the load factor gets above/below fixed thresholds. For these implementations the insertion cost is O(n) worst-case and O(1) only in an amortized sense.
Perhaps the editor is thinking of implementations that keep 2 (or 3) arrays and spread out the copying load over many ops? Those would indeed have O(1) worst-case insertion cost in theory, but the overhead seems quite large, and I wonder whether such solutions are really practical. Are they? If not, methinks that the claim of O(1) worst-case in that section would be misleading, since the edited sentence is about practical advantages of HTs. All the best, -- Jorge Stolfi ( talk) 03:30, 27 May 2009 (UTC)
(unindent) Please stop the condescension, it's rude and annoying. Currently, this article contains no analysis whatsoever. The article should describe how hash tables perform under various models of analysis, including both the worst-case model where the hash function is unspecified and the more realistic model where it is a random variable over a sensible (say, uniform) distribution. Because the actual hash function is a parameter specified by programmer, these are useful in establishing bounds on performance and emphasizing the impact of hash function choice on performance. And as you know, the linear worst case performance emerges not from geometric table expansion, but from collisions. I'll add something, and I really hope I won't be facing resistance about it. Dcoetzee 02:19, 28 May 2009 (UTC)
From the lead paragraph of "Choosing a good hash function":
A good hash function is essential for good hash table performance... However, since poor hashing usually degrades hash table performance by a constant factor, and hashing is often a small part of the overall computation, such problems commonly go undetected.
If the latter is true, how can the former be true? If the problems caused by poor hashing commonly go undetected, and the hashing is a small part of the overall performance, it seems that a good hash function can't be called "essential for good performance" but is rather an aesthetic preference at most. One might distinguish between 'adequate' and 'bad' functions, where 'bad' might run slowly and produce uneven distributions, but if your problems are going undetected, I'd say you've at least struck 'adequate', and improving to 'good' would be an exercise in lily-gilding.
But perhaps I'm missing something. -- Fader ( talk) 14:51, 24 July 2009 (UTC)
The following text was recently added to the article:
begin text
An example of a simple hash function with good behavior is:
unsigned long f(char key[], int arrayLength) {
    unsigned long h = 0, v = 401;
    int i;
    if (arrayLength >= 2) {
        h ^= (0xffUL & key[0]);
        h ^= (0xffUL & key[1]) << 8;
        h = ((h + 9) * (h + 2) * v) >> 1;
    }
    for (i = 2; i < arrayLength; i += 2) {
        h ^= (0xffUL & key[i]);
        h ^= (0xffUL & key[i + 1]) << 8;
        h ^= ((h + 9) * (h + 2) * v) >> 1;
    }
    if ((arrayLength & 1)) {
        h ^= (0xffUL & key[arrayLength - 1]);
        h ^= ((h + 9) * (h + 2) * v) >> 1;
    }
    return h % N;
}
This function has the recommendable property that if arrayLength is 2, and N is 2^16, then the function is almost a perfect hash, filling 65531 of 65536 slots.
end text
This example needs more context before it is put in the article. Is there a reference for it? Also the comment below the code is confusing. All the best, -- Jorge Stolfi ( talk) 12:47, 17 February 2010 (UTC)
N is the size of the hash table and arrayLength is the length of the char array, if I'm not mistaken. It still needs more info on hash time and comparisons to other hash functions. I think an example is needed to give the reader a better idea of what a hash function is.
I agree, although I think that the following quote "Poor hashing usually degrades hash table performance by a constant factor,[citation needed] but hashing is often only a small part of the overall computation" was unnecessarily marked as needing a citation, as the truth of the assertion is self-evident. Consider a hash function f such that f(x) = 0 for all objects x. Obviously, this is the worst hash function possible, but it illustrates my point. Clearly, if the keys are comparable, then insertion and lookup will always have time complexity O(log n) rather than O(1). If the keys are not comparable, then insertion will have time complexity O(1) and lookup will have time complexity O(n) (since you can't binary search a list of items without an ordering). —Preceding unsigned comment added by 24.69.148.22 ( talk) 01:39, 11 April 2010 (UTC)
Oh, I see what you're saying. I assumed that "constant factor" meant a constant factor of n, i.e. proportional to the number of items mapped by the hash function. Perhaps the sentence should be edited for clarity? I think that whoever wrote that didn't mean to say that the performance degradation was in O(k*1) = O(1). I think the person meant O(kn) = O(n) for some "constant factor" k > 1. If that is what was meant, I am sure that we could both agree that it's more likely to be true than the claim that degradation is in O(1), which should be rejected prima facie. —Preceding unsigned comment added by 24.69.148.22 ( talk) 06:35, 11 April 2010 (UTC)
My clean-up of the external links section was met with opposition, so let's discuss the individual items (I'll exclude the ones that have been removed again in the meantime):
I think that all these links should be removed, for the reasons given above (maybe except for #5). – Adrian Willenbücher ( talk) 16:39, 23 September 2010 (UTC)
Any objections to me setting up auto-archive on this page? It's getting rather lengthy, and I'd rather set it up to work automatically than keep doing it by hand. me_ and 16:46, 23 September 2010 (UTC)
Currently, this article makes it seem that open addressing is absolute nonsense (except, perhaps, because of caching). Actually it's better not so much because of insertion, deletion etc. speed, but because iteration speed (over the whole table) is very good even with very non-uniform hashes. Thus, it's good for example for a duplicate check on values which you do not insert too often, where the common operation is iterating over all values (for example if you want to iterate over a table of size 100,000 each time you have added 20 elements). Consequently, rehashing can also be faster. qtvali -- 83.178.58.29 ( talk) 22:03, 3 July 2011 (UTC)
"For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree."
This example and comment in the article seem to suggest that a balanced search tree is even slower than a sequential list (linear search). This is absurd! :) — Preceding unsigned comment added by 151.197.168.146 ( talk) 16:15, 9 July 2011 (UTC)
You should define k to be the size of the hash table (the size of the underlying array ) and n to be the number of elements in the hash table, and unless you make the hypothesis that k < n, the space complexity should be O(k+n) rather than simply O(n).
By the way, this page has a more detailed complexity analysis for hash table that is much clearer than the mess on the wiki page
The section needs to justify the use of a hash function: Why is a hash function even needed? Why not, for example, just store with each key the address of (pointer to) the value?
The diagram above shows buckets separate from the keys, but this sentence seems to say each (full) k-v pair is in a bucket.
"Suggests"? Surely that's far too gentle a word. How about "gives" or "tells"?
Given the definition of collisions in the introduction, it's not clear why they are a problem that requires resolution. It's also not clear how uniformity has anything to do with them (apart from the mysterious hash function); it would seem that either two keys point to the same value or they do not.
JKeck ( talk) 17:14, 31 July 2013 (UTC)
2nd problem: keys vs buckets vs entries. Each bucket stores one entry or a linked list of entries. Each entry stores k-v pairs. Both the sentence and the graphics make sense. I have no idea how to improve the wording to make it clearer to confused people. More graphics, probably. 3rd: "Suggests" is a proper word, since the key is not always found at the calculated index. Sometimes the lookup has to search more locations, either in the same array (open addressing) or in a separate bucket list of colliding entries. Also, your fourth complaint does not make much sense. It is very "clear why there is a problem that requires resolution": it depends on the quality of the initially calculated index, which in turn depends on the hash function, the fill factor and the uniformity of the distribution. ReiniUrban ( talk) 16:47, 28 August 2016 (UTC)
I'm with JKeck. The section is unscholarly, and "suggests" is an unscholarly word. Hashing is a time/storage space trade-off: it takes a very long time to search a long list, and it takes a very large amount of storage to address directly with say, a 64-bit key (~10^19 slots). Hashing allows data storage and retrieval in a small and nearly constant amount of time, and storage only fractionally larger than that required for the records + keys themselves. The primary requirements for a hash function are efficiency and uniformity. Since the section on Choosing a hash function doesn't mention efficiency, one may presume that a hash function that takes a geological amount of time to compute won't be a problem, or that in any case, one doesn't need to worry about such things. Perfect hashing is a combinatorial curiosity only, and given the paucity of real information in that section, it ought to just go. I could go on and on, but I'll stop here to more carefully read over the entire article, which sort of just rambles on. Sbalfour ( talk) 20:30, 24 September 2019 (UTC)
Can anyone explain what the following is trying to get at?? The reference seems to be off line.
For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]
Hew Johns ( talk) 17:44, 19 August 2013 (UTC) Okay, I see some of the problem "open addressing" == "closed hashing", i.e. in table storage... Hew Johns ( talk) 20:08, 19 August 2013 (UTC) OK, never mind. After reading the whole thing I realize it is a very poor expression of the idea that a probe or re-hash function must also be uniformly unbiased, rather than clustering about the original hash. Hew Johns ( talk) 21:26, 19 August 2013 (UTC)
"Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are highly optimized as they are used internally to implement namespaces."
This statement is misleading. Perl used a simple and especially insecure but fast form of a Jenkins OOAT hash until 5.6, when it was attacked by collisions against the simple linear chaining. The proposed fix was to use universal hashing, to keep performance and add security against hash collisions. What was implemented instead was randomized seeding on rehashing. This trick was found to be broken with 5.17.6, and so all optimizations were thrown overboard and a rather slow but secure hash function, SipHash, was selected, instead of choosing an optimized and proper hash function (like City or Murmur) and a fast and secure collision strategy. Python chose the same road and changed to SipHash. Perl went back before the 5.18 release to a somewhat faster Jenkins OOAT hash function, but this still didn't solve the problem of the wrong collision resolution. Optimized strategies would have been open addressing, universal hashing or perfect hashing. There are furthermore no other possible optimizations implemented, like using SIMD or HW-assisted crc32 intrinsics on Intel processors, as CRC32 turns out to be the fastest and best hash function for the average case (least number of collisions). See http://blogs.perl.org/users/rurban/2014/04/statistics-for-perl-hash-tables.html
I propose to use the wording "Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e. collision attacks."
Highly optimized and concurrent hashes can be studied with Clojure, the Hopscotch table or Robin Hood. ReiniUrban ( talk) 17:01, 12 April 2014 (UTC)
Perl is not interpreted. — Preceding unsigned comment added by 68.183.92.186 ( talk) 00:17, 14 May 2014 (UTC)
Why link to an non-existent article? — Preceding unsigned comment added by 75.79.3.84 ( talk) 20:09, 15 May 2014 (UTC)
This article needs simple examples up front so that novices can get the general idea. 2602:306:35FA:D540:2487:9F59:8A65:3095 ( talk) 04:44, 27 January 2017 (UTC)
Right now Robin Hood hashing and 2-choice hashing are listed as direct children of "Collision Resolution". They seem to fit the definition of open addressing resolution strategies. Since all other open addressing strategies are listed under the Open Addressing subsection, shouldn't these two techniques be moved there too?
Andreas Lundblad ( talk) 07:13, 10 July 2019 (UTC)
The lead says: Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions...Such collisions must be accommodated in some way. .
We don't live in an ideal world, so starting out with a statement about the ideal is supercilious. The implication is that if a hash function is 'imperfect', like an assembly line making imperfect parts, why don't we fix it so it's 'perfect'? An imperfect hash function might cause hash collisions, but the sense of this statement is vague, and it begs the question: why don't we build one that 'might not' cause hash collisions? Easy enough to postulate, eh? We might lead a novice into actually attempting to do that. I wish him luck.
How about this in place of the whole paragraph: Hash functions are designed to be efficient to compute and minimize collisions, duplicate hash codes resulting from distinct keys. 19 words versus 48. And it's very informative. The lead is not the place to go into what happens in a collision: we may assume, and indeed we find, a whole section talking about just that. Hmmm... or we could add, Hash functions are accompanied by an adjunct collision-resolution method. Now I think we've got a substantial digest. Scholarly diction is an art. Sbalfour ( talk) 21:47, 24 September 2019 (UTC)
Let's try again. The lead says, In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason,.... What reason? It's a dangling participle, with look-thru ambiguity, it's not attached to anything. If we turn the sentence on its head, we get: Hash tables are more efficient in many situations than other things because <reason>. We need to state the reason, then we can say, For that reason,... How about Hash tables provide nearly constant access time and compact storage for large numbers of records, that can't be efficiently stored or accessed by other methods. There's ordered and unordered lists, linked lists, double-linked lists, circularly-linked lists, tries, heaps, stacks, queues, and etc, too numerous to mention in the lead, so don't try. If it matters - and it's worth a paragraph or maybe a sub-sub-section - there we distinguish the situations where we might use, other methods. Sbalfour ( talk) 22:19, 24 September 2019 (UTC)
Robin hood hashing is entirely analogous to the three ways the chains may be ordered in chained hashing: as unordered lists, as serially ordered lists (alphabetically or by some other key-dependent criteria), or as self-ordering lists by probe frequency (i.e. key-independent criteria). So, we reshuffle the bucket sequences for collided keys just as we reshuffle the chains. Yeh? I think we lost sight of the forest for the trees on this one. Sbalfour ( talk) 22:38, 24 September 2019 (UTC)
Would this be a good citation for the better performance of variable-sized data in chaining hash tables? https://www.researchgate.net/profile/Haval_Ahmed2/post/Any_ideas_about_hash_table_implementation/attachment/59d6430bc49f478072eabc10/AS:273806436831232@1442291949107/download/In-memory+hash+tables+for+accumulating+text+vocabularies.pdf ?
First sentence of section "2.3. Hash tables" seems to be useful, but it is not backed up in the article itself.
Recently someone used the "Inside the latency of hash table operations" by Andy Ke as a reference for this article. I feel this is perilously close to becoming a WP:REFLOOP, because many sentences and illustrations in that article are word-for-word identical to (an earlier version of) this "hash table" Wikipedia article. (A few of those sentences are sentences I wrote for Wikipedia). What should we do about this? -- DavidCary ( talk) 23:24, 6 May 2020 (UTC)
I spent several hours documenting a simple example of a hash table ( [11]), only to have it deleted one week later with the edit summary "rm unsourced code, we don't need that here". I agree that it's unsourced, but I disagree with the assessment that we don't need a code example here. I'll never forget my first visit to this article many years ago -- before I had experience with hash tables -- and found the article seriously lacking because it didn't show a single functional example of an actual hash table. I realize that my example may not be acceptable because -- even though I personally tested it and determined that it functions exactly as explained -- it might be considered OR and therefore verboten. Having said that, I don't really care if my example is restored or some other example is used instead, but I would argue that simple functional example is essential to help explain this topic. Lambtron talk 22:50, 7 February 2022 (UTC)
It's not the approach I would have taken, but I can see the appeal of simply deleting my contribution rather than working collaboratively to improve it. For example, I would have used my knowledge of "Wikipedia-mandated" language agnostic coding to translate the code -- a job an expert could probably do in a few minutes -- and left intact the explanation. But that's just me: averse to throwing out the baby with the bathwater. BTW, it's not clear to me why you think this basic example is not appropriate for this article, or why a "scholarly journal article" is required for the simple calculations involved. Frankly, I find it a bit frustrating that you seem opposed to having an easily understood example here when there's such an obvious and compelling need for one. Lambtron talk 16:51, 8 February 2022 (UTC)
straightforward explanation alone cannot be trusted- WP:VERIFYOR WikiLinuz🍁( talk) 20:01, 8 February 2022 (UTC)
Sometime in the past the following chart/image was added to the article:
. There are several problems with this graphic but the biggest is that by its own description it appears to be original research. Per the WP:NOR policy, no original research is allowed in Wikipedia articles. If the poster ( User:Dcoetzee, apparently) would like to update it with a reliable, published source for its specific data, then I will withdraw my objections. Otherwise I will remove it in the near future. — Preceding unsigned comment added by RBarryYoung ( talk • contribs) 14:51, 9 February 2022 (UTC)
Introduction sentence and the claim that Hash tables are implemented as associative arrays is incorrect. Even the linked reference says that associative arrays are implemented as hash tables, not the other way around. Currently associative array and hash table articles are in an infinite loop as each claims they are implemented using the other. 2001:7D0:87E3:100:C528:86D0:1049:5205 ( talk) 06:18, 27 October 2023 (UTC)
In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values.
Where is the hash digest ...
This explanation in the section Hashing by division is unclear because the term hash digest is not explained and does not appear in the equation above it. (Also, it should probably not be capitalized because it is not a complete sentence.) In the following section, Hashing by multiplication, it is the other way around: a term appears in the equation but is not explained.
Gnib Bnil ( talk) 14:55, 5 June 2024 (UTC)
In the section "Choosing a good hash function", a (presumably integer) "s" is repeatedly referred-to. However this is as far as I can see totally undefined in the wikipedia article. Although (some) computer scientists might infer s from context, other readers may not. This is bad form. Suggest to change.
152.3.68.83 ( talk) 18:39, 22 November 2013 (UTC)
Already the first sentence is misleading, if not even wrong. A hash map associates keys with values (i.e. maps) and is implemented using a hashtable, though a hashtable itself does not necessarily do any mapping or association of keys and values. Just consider a hashtable where all you do is to store integers, there are no values here. —Preceding unsigned comment added by 91.36.112.7 ( talk) 14:51, 23 January 2008 (UTC)
I have a few small issues with the current Hash table article. First, the pseudo-code is a little convoluted—in that the findSlot(..) function is not 100% obvious. It hides away the linear probe code, which makes reading the set(..) and lookup(..) functions a little confusing on first read. I was just going over it with an intro computer science student and even had to read it twice myself. Josh 08:11, 10 April 2006 (UTC)
There is a line which states:
I'm not really sure that this is accurate. I might agree wth calling it expected O(1), or average O(1). But it is not, as far as I can tell, worst-case O(1). In fact, remove is worst-case O(n), as in all hash operations. Josh 08:11, 10 April 2006 (UTC)
I know that there has been some discussion of amortized runtime here, but I'm not really sure that you can use amortized analysis on this function. Please correct me if I am wrong, otherwise, I will change shortly. Further, I am not sure why the article says only possible in linearly probed hash tables... If the remove function is O(1) in linear probe, then why not with a quadratic probe? Josh 08:11, 10 April 2006 (UTC)
"When we say T(N) [is] O(f(N)), we are guaranteeing that the function T(N) grows at a rate no faster than f(N)." (Data Structures and Algorithm Analysis in C++, Third Edition, Mark Allen Weiss, page 44)
It is entirely possible to implement a hash table that makes a collision every time a new value is added. This would be a useless implementation, but a valid one. A hash table can't then be O(1) ever in any way if the above possibility exists.
"Hash tables can be used to implement the insert and contains operations in constant average time...The worst case for hashing generally results from an implementation error." (page 207)
I would suggest that the article should be changed to mention that hash tables are not O(1) by the definition of Big-O notation, but that all good implementations come closer to acting like they are than binary search trees.
199.111.229.133 00:23, 16 October 2007 (UTC)
If when adding to a table the hashing function leads to a collision for 2 particular keys, a and b, then using probing b will be stored somewhere after a. When then looking up the value associated with b, won't the table actually return the value associated with a. How does it know which of the two values to return, and how to get to the index associated with b?
I hope that makes sense. Iae 11:22, 8 June 2006 (UTC)
You know you hit the end of the list in a (probed) hash table when you hit an empty "not occupied" slot. In theory, one could have an "occupied" bit for each row of the hash table that is initially 0. (In practice, typically each slot that is not occupied begins with a NULL byte. If it *is* occupied, that byte is the first byte of the key).
I know of 2 very different ways to delete from a hash table: the "deleted bit" method and the "move stuff around" method. (My understanding is that the "move stuff around" method is impossible to implement with "quadratic probing" or "double hashing" hash tables (and probably a few other types). Those hash tables are forced to use the "deleted bit" method.)
Let me try to explain those 2 methods
"deleted bit" method: works with any kind of hash table. Every row of the hash table (in addition to the key, value pairs) has a "deleted" bit that starts out 0. To delete a record from the hash table, use the key to find it (using find_slot), then set the "deleted" bit to 1. (Some hash tables cram the "deleted" bit and the "occupied" bit into the "key" part of the record, reserving 1 special key to indicate "unoccupied", another special key to indicate "deleted", and any other value to indicate a real occupied "key" ).
"move stuff around" method: only works with linear probing.
The function remove(key) in the article is supposed to describe the "move stuff around" method. How could we make it easier to understand?
Often when the application deletes a record, the following slot is already not occupied. In that case you can wipe out the record by marking that record as not occupied -- overwriting the key with a NULL byte -- and you are done.
Unfortunately, there are a bunch of special cases the application needs to be able to handle, even though they rarely happen. As the article says, "For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record)." The application needs to scan through *all* the records between the record you want to delete, and the next "not occupied" slot, and make sure the above requirement is met. In some cases, you must move all those records up 1 slot. In other cases, you must move some (but not all) of those records.
In yet other cases, you must not move any of them, just mark the deleted slot "not occupied" just like the simple case. (For example, if you want to delete the record in slot 870, and you see that the "key" in slot 871 actually hashes to "871", and slot 872 is "not occupied" -- you must mark slot 870 "not occupied", and leave the record in slot 871 alone).
Once you understand how it works, please update the article to make it easier to understand for the next person.
-- 68.0.120.35 07:42, 5 March 2007 (UTC)
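For readers trying to follow the above, here is a minimal sketch in C of the "move stuff around" method for linear probing. The slot/table types and the hash() function are hypothetical, and this is only an illustration of the idea, not the article's exact pseudocode:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal layout, for illustration only. */
struct slot  { int occupied; uint32_t key; uint32_t value; };
struct table { struct slot *slots; size_t num_slots; };

extern uint32_t hash(uint32_t key);   /* whatever hash function the table uses */

/* "Move stuff around" deletion for linear probing: open a hole, walk the
   rest of the cluster, and pull back any entry whose lookup the hole would
   otherwise break; stop at the first unoccupied slot. */
void remove_at(struct table *t, size_t i)        /* i: slot of the entry to delete */
{
    size_t j = i;
    for (;;) {
        t->slots[i].occupied = 0;                /* open a hole at i */
        for (;;) {
            j = (j + 1) % t->num_slots;
            if (!t->slots[j].occupied)
                return;                          /* end of cluster: done */
            size_t k = hash(t->slots[j].key) % t->num_slots;   /* natural slot of entry j */
            /* If k lies cyclically in (i, j], the hole does not sit between this
               entry's natural slot and its current slot, so leave it in place. */
            int hole_harmless = (i <= j) ? (i < k && k <= j) : (i < k || k <= j);
            if (!hole_harmless)
                break;                           /* a lookup would stop at the hole: move it */
        }
        t->slots[i] = t->slots[j];               /* pull the entry back into the hole */
        i = j;                                   /* its old slot becomes the new hole */
    }
}

With this sketch the example given above works out: deleting slot 870 when slot 871 holds a key that hashes to 871 and slot 872 is empty leaves slot 871 untouched and simply marks 870 unoccupied.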
I wanted to check in here before making huge changes to the article, but one thing I'd find very helpful is a discussion of concrete choices for the hash function itself. Here's an outline of what I'd say:
It's very common to implement hash tables with poor hashing functions. Knuth is largely to blame, advocating the very weak "multiplicative hash" function, and even going so far as to claim that its clustering property is a good thing! (section 6.4, TAoCP vol III, 2e, p. 517). Variants of the multiplicative hash remain popular, as do other linear techniques such as CRC.
Surveying the field, two excellent newer alternatives stand out. For most simple hashing tasks, the Fowler Noll Vo Hash is an excellent performer. It is among the simplest of known hash functions, is quite fast, and has a good distribution. For use in a hash table, the FNV-1a variant is likely the best choice, as it has better (more dispersed) clustering behavior than FNV-1.
For some applications, particularly when keys are long, the newer Jenkins lookup3.c hash function may be a better performer. It achieves better speed by consuming 12 bytes of the input per iteration, as opposed to one byte for FNV. Disadvantages include greater code complexity, a sensitivity to machine endianness (causing potential difficulties when reusing hash values across disparate computers), and the need to pad byte-aligned input to 32 bit words. Do keep in mind, though, that benchmarks showing impressive speed for large blocks of data may not translate to real-world gains. Common usage patterns involve relatively short keys, so the amount of time spent in the hashing function inner-loop may be less relevant than, say, the gains from a compiler being able to automatically inline a simpler hash function.-- Raph Levien 02:32, 5 July 2006 (UTC)
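For reference, the FNV-1a variant mentioned above is small enough to show in full. This is the standard 32-bit form, using the published FNV offset basis and prime:

#include <stdint.h>
#include <stddef.h>

/* 32-bit FNV-1a: XOR in each byte first, then multiply by the FNV prime. */
uint32_t fnv1a_32(const unsigned char *data, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];                  /* FNV-1a: XOR before the multiply */
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}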
Some interesting links I came across while searching:
You might also want to check out HSH 11/13 which seems to distribute quite well (as per published graph) and also performs nicely with its handful of code lines.
Followup:
I went ahead and completely rewrote the section. I may have come close to bending the letter of the law on doing original research (measuring avalanche behavior of Jenkins One-at-a-time hash using Bret Mulvey's tools) and NPOV (I advocate particular hash functions and trash criticize other popular choices), but everything I said is verifiable, and I think the result is the most helpful and reliable advice on choosing hash functions anywhere on the Internet or in dead tree form.
I measured the HSH 11/13 hash using Mulvey's AvalancheTest, and found it slightly inferior to the Jenkins One-at-a-time hash. Herbert Glarner's tests for uniform distribution are not as sensitive as Mulvey's chi-squared tests. Further, HSH 11/13 is quite slow, because of its inner loop.
Obviously, I also changed my mind about FNV, swayed in large part by Mulvey's analysis. It's not a bad choice, being one of the simplest of the "pretty good" choices, but even there the XOR-folding adds a bit more complexity. In any case, I feel like I've provided enough information for people to make their own informed choice.-- Raph Levien 21:31, 12 August 2006 (UTC)
Simplicity and speed are readily measured objectively
There is a caveat here. Experimental measurements of speeds are necessarily done on a "representative sample" of inputs. It may be the case that such or such algorithm performs with varying speed depending on the kind of inputs, and that some sample, representative of one kind of inputs, may not be representative of another kind. I don't think this would happen with usual hash functions on e.g. strings but this may happen in more esoteric cases. David.Monniaux 14:38, 13 September 2006 (UTC)
Problem 1: The Hash Table will be used for storage of student records. You may assume the maximum number of Items will not exceed 240. An item corresponds to a student record. A key is taken to be an ASCII character string of maximum length 32 containing name of the student. The data is taken to be 9-digit id and the name of the state of residence of the student in India. A list node contains an Item and a reference to the next node. You will need to implement the classes Item and the ListNode with construct and appropriate data access operations. 2. Write a main() program to test your class HashTable. 3. Input: The name of the input file, containing data items (student records) and data operations in the format given below, and the name of the output file, will be passed to your program on the command line or interactively. Data Items: student-id, student-name, state-of-residence Example. 200412018, Swati Mishra, Uttar Pradesh one per line. Data Operations: <operation> <item> The <operation> is s,i, or d for search, insert and delete, respectively. The item will have fields: student-name, student-id, and state-of-residence. The fields of an item will be separated by commas, and operation will be separated from the item by a colon ”:” and a space. Example. s: 200211001, Akash Gokhale, Maharashtra one per line. The data items will be separated from the data operations by a blank line. 4. Output: Your program is to read the input file and populate the Hash Table with student records by repeated Insert() operations. And then print to the output file, the size of the Hash Table, and the size of each linked list in the Hash Table. Then, it will continue to read each line and execute appropriate data operations. Following each Insert()/Delete() operation, it will output the size of the Hash Table and the size of each of the linked list, and for each Search operation, it will output
-- 220.225.53.35 09:18, 11 October 2006
Hi, I've implemented the joaat_hash function which is described in pseudocode on this page in my C program, and encountered an error. Better said, I produced one, since I misunderstood len as the length of the hashtable, and not as the length of the key.
here is my (correct) function implemented in C, please consider updating the article:
int joaat_hash(char *key, size_t len)   /* len is the size of the hashtable */
{
    unsigned int hash = 0;
    unsigned int i;
    for (i = 0; i < strlen(key); i++) {
        /* per-character mixing steps, as in the article's one-at-a-time pseudocode */
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return (hash % len);
}
-- 88.76.141.17 03:43, 2 January 2007 (UTC)
You are right -- the "len" in the pseudocode in the article is the length of the key. (That pseudocode gives the same results as the original C implementation "One-at-a-Time Hash" ub4 one_at_a_time(char *key, ub4 len) http://www.burtleburtle.net/bob/hash/doobs.html , right?)
I think the above implementation gives the same results as the original C implementation for all possible ASCII strings. Unfortunately, the above implementation gives different results for other sorts of data structures cast into byte arrays, when those data structures include a zero byte.
Since the original version gives *better* results for those data structures, and the *same* results for everything else, I think I'll stick with the original version, except use "key_len" to clarify exactly of what it is the length. (Or have I missed something obvious?) -- 68.0.120.35 07:42, 5 March 2007 (UTC)
From the article: "In fact, even a cryptographic hash does not provide protection against an adversary who wishes to degrade hash table performance by choosing keys all hashing to the same bucket."
I thought that one of the criteria for a cryptographic hash function was that it be infeasible to find collisions. Therefore, it would provide defense against such an adversary. Or am I misunderstanding something? Ralphmerridew 21:55, 30 January 2007 (UTC)
What lots of people seem to forget when discussing the need for hash functions that reduce predictable collisions is, that they are only needed for hostile environments, where an attacker can control the input data *and* where it is important that proper speed is maintained (e.g. tcp/ip hashtables within a kernel). It is much much less important in simple programs where a hashtable is just used to store some data that might even be created dynamically by the program, or doesn't occur in the input set the program was designed for. It might be possible to degrade a hashtable that just uses some "id * 1000000007UL & ((1<<n)-1)" for finding the appropriate slot in your face recognition software by feeding it with a carefully crafted artificial bitmap pattern. but why care? trash-in, trash-out.
A recent edit deleted most of the following paragraph, claiming it was "verbiage":
This paragraph is explaining that *cyptographic* hash functions (a different concept altogether, see lead section) are not necessarily good choices for hash tables, because their only advantage (probabilistically guaranteed good performance even on data submitted by hostile clients) can be obtained at smaller cost by using ordinary (non-crypto) hash functions with secret salt (as discussed above). The recent edit removed this information. But perhaps the wording is not clear and needs to be improved. All the best, -- Jorge Stolfi ( talk) 21:28, 6 February 2010 (UTC)
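To make "ordinary (non-crypto) hash function with secret salt" concrete, here is a minimal sketch in C. The seeding scheme shown (starting an FNV-1a style loop from a per-process random value) is only an illustration of the idea, not the scheme used by any particular language runtime:

#include <stdint.h>
#include <stddef.h>

/* Assumed to be filled once per process from a good random source; an
   attacker who does not know the seed cannot predict bucket indices. */
static uint32_t table_seed;

/* Ordinary (non-cryptographic) hash with a secret salt: the usual mixing
   loop, but seeded with the secret value instead of a fixed constant. */
uint32_t salted_hash(const unsigned char *key, size_t len)
{
    uint32_t h = table_seed ^ 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}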
I'm considering merging the open hashing article with this article, as has already been done for closed hashing. For that I think we should rename the section on Chaining to Open hashing and make more clear the already mentioned fact that linked lists are only one way, though perhaps the most popular one, of organizing hash buckets and introduce a mention of buckets allocated in a contiguous memory/disk space that is presently discussed in open hashing. Jyotirmoyb 05:16, 20 February 2007 (UTC)
Makes sense - can't have it both ways - with closed hashing merged (though it's hard to find any content called "closed hashing" in the article) and open hashing in its own article. On the other hand, there is no problem with an article on "open" and "closed" and other types of hashing if there is enough content to justify. Does Wikipedia have guidelines on article length??-- 121.45.246.110 14:10, 9 April 2007 (UTC)
The remove function will loop indefinitely if the hash table is full, since it only exits when it finds an unoccupied slot. I think adding the following after the j := (j+1) line should fix it:
if j = i exit loop
32.97.110.142 20:36, 11 April 2007 (UTC)
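To make the suggested termination check concrete, here is a sketch in C of a probe loop that stops when it wraps around to its starting slot. The types and hash() are hypothetical, and this is a lookup loop rather than the article's remove() pseudocode, but the "if j = i" idea is the same:

#include <stddef.h>
#include <stdint.h>

#define NOT_FOUND ((size_t)-1)

/* Hypothetical table layout, as in the other sketches on this page. */
struct slot  { int occupied; uint32_t key; };
struct table { struct slot *slots; size_t num_slots; };
extern uint32_t hash(uint32_t key);

/* Linear-probe search that cannot loop forever: it stops at an unoccupied
   slot, at a matching key, or after wrapping around to the starting index. */
size_t find_slot(const struct table *t, uint32_t key)
{
    size_t i = hash(key) % t->num_slots;
    size_t j = i;
    do {
        if (!t->slots[j].occupied || t->slots[j].key == key)
            return j;
        j = (j + 1) % t->num_slots;
    } while (j != i);
    return NOT_FOUND;                  /* table is completely full */
}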
In the discussion of the costs of resizing the hash, there seems to be an ambiguity in the amortized running time analysis:
... If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1. Because the costs of the resizings form a geometric series, the total cost is O(n)....
The n in "2n - 1" is actually the value of the nth element, not the number of elements. The sentence that follows makes it seem that the sum of a geometric series is linear, which is clearly not the case.
Guray9000 00:47, 16 April 2007 (UTC)
The article states:
To see why this is true, suppose a hash table using chaining begins at the minimum size of 1 and is doubled each time it fills above 100%. If in the end it contains n elements, then the total add operations performed for all the resizings is:
1 + 2 + 4 + ... + n = 2n - 1.
This does not seem right. The expected number of steps it takes to double the table size from (1) to (n>1) at each overfill should be: truncate(LOG2(n-1))+1 (whare LOG2 means logarithm base 2 of n). Also, if my math is correct 1 + 2 + 4 + ... + n = n(n+1)/2, but this does not seem the right computation to make: we're not supposed to add the values corresponding to each step but to count them.
If this reasoning is correct the total cost should read O(LN(n)) rather than O(n), which should mean this method scales very well with n.
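For reference, here is the sum written out under the assumption that the table starts at size 1, doubles each time it fills, and ends up holding n elements with n a power of two (n = 2^k):

1 + 2 + 4 + ... + n = 2^(k+1) - 1 = 2n - 1

There are about log2(n) resizings, but the total number of elements copied across all of them is 2n - 1, which is O(n); spread over the n insertions that triggered them, that is O(1) amortized extra work per insertion.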
In that section it is written: "Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table." (*)
Searching in a sorted list of keys by using binary search can be done in O(log N). The average running time of binary search is also log N (this is according to http://planetmath.org/encyclopedia/BinarySearch.html). I don't know any better algorithm for non-special sorted data. So I think statement (*) isn't true in an asymptotic sense.
However, if we know that N is bounded by some number, we can say that O(log N) = O(1). But this would be wrong... But in some philosophical (or maybe practical) sense we could be right (we have a large but fixed maximum amount of RAM, HDD space, maximum number of hash table entries, etc). But this would not be right in the asymptotic sense, where N is not bounded but free.
What am I not understanding?
This section is quickly devolving into yet another trivia list. MegaHasher 19:45, 11 July 2007 (UTC)
Are there any references on a data structure that implements a hash table and binary tree at the same time? MegaHasher 20:04, 3 August 2007 (UTC)
This article could be improved with more direct citations. MegaHasher 06:11, 14 August 2007 (UTC)
I am suspicious of the random table based hash function. This seems to be a variation on Pearson's hash function, except there are multiple tables, therefore much worse cache misses. Pearson's hash function is not all that fast to start with. MegaHasher 03:22, 18 August 2007 (UTC)
For single-byte indices and 32 bit hash numbers, the tables are 256*4 = 1024 bytes each. This algorithm is obviously best when the cache size is larger, but also very suitable when a few keys account for most of the lookups, or when the cost of a collision is far greater than the cost of a memory cache miss (such as when hashing into disk records).
By having multiple tables, this algorithm ensures lookups to the same index in the table don't cancel each other out, thereby generating literally perfectly randomly distributed but repeatable hashes. The Wikipedia article on Pearson's hash function indicates it employs a table with the numbers 0 to 255 randomly arranged, and it ultimately selects one of the 256 different values in the table at random. The multi-table algorithm you removed doesn't simply select one of the 256 values in one of the tables, it XORs values together to create a far stronger hash. Look at the algorithm properly and you might appreciate it more.
FWIW, the algorithm is proven technology, used to hash > 3 million highly-repetitive financial security keys by the world's leading financial data provider's proprietary security database, handling hundreds of thousands of database inserts per second.
I'll reinstate the algorithm, with the attribution removed as you've (offensively) referred to it as "vanity" material, and I hope you, Mr "MegaHasher", have much better reasons before you remove it again.
I don't consider this to be original "research". It is simple stuff, and important precisely because of that. Too many programmers either don't use hash tables or use poor algorithms because they don't have a feel for or ability to assess mathematical hashes. This may suit the few specialists who write the algorithms and relish the elitist mystique of it all, but it's a big loss for computing as a whole. More generally, I might accept that this belongs on the Hash Function page if you also moved the joaat_hash off this one. You can't have it both ways. I agree with the earlier discussion comments that your posting and defense of Bob Jenkin's algorithm looks very inappropriate on this page, and all the more so for your refusal to accept anything that removes any of the focus from it. —Preceding unsigned comment added by 205.228.104.142 ( talk) 23:59, August 29, 2007 (UTC)
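For readers unfamiliar with the technique under discussion, here is a generic sketch in C of a table-based (tabulation-style) hash. It only illustrates the general idea of XORing per-position random table entries together and is not the specific algorithm that was removed from the article:

#include <stdint.h>
#include <stddef.h>

#define NUM_TABLES 8

/* Each table holds 256 random 32-bit values; they are assumed to be filled
   once at startup (e.g. from a seeded PRNG) and never changed afterwards. */
static uint32_t rand_table[NUM_TABLES][256];

/* Tabulation-style hashing: every key byte selects one random word from the
   table assigned to its position, and the selected words are XORed together. */
uint32_t tabulation_hash(const unsigned char *key, size_t key_len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < key_len; i++)
        h ^= rand_table[i % NUM_TABLES][key[i]];
    return h;
}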
I have seen two separate referenced articles that had measurements that show under separate chaining, an increase of load factor by 1 has the impact of increasing CPU instruction count around 4 to 5%. This is much better than what I would have guessed. The first reference is "Dynamic Hash Tables" by Larson (1988), and the second reference is "Main-memory linear hashing" by Pettersson (1993). Have I completely mis-read these articles? MegaHasher 21:33, 20 August 2007 (UTC)
Can somebody please add this picture to the article? I believe it will help a lot of people to understand the notion of the hash table. -- yanis 12:25, 21 August 2007 (UTC)
Like arrays, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table. However, the very rare worst-case lookup time can be as bad as O(n).
This is essentially wrong. In a non-buggy implementation (that is, when the table is a reasonable size, is never too full, and the hash function is working correctly), n collisions cannot occur. WolfKeeper 14:36, 16 October 2007 (UTC)
WolfKeeper: You have no idea what you are talking about. In standard implementations in standard SW like all dynamic languages, the linear chaining strategy is open to well-known hash collision attacks, which succeed even with randomized seeds. So hash tables need to protect themselves against those O(n) worst cases, which enable e.g. a DOS attack on the web against php, perl, ruby with well-crafted keys. And those O(n) possibilities drive the implementations, as those attacks appear in practical web security. ReiniUrban ( talk) 19:12, 2 April 2014 (UTC)
That's wrong for another reason. Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time. Only updates may take O(n) time in the worst case. 23:24, 2 November 2007 (UTC)
The probability of n collisions by sheer chance is p^n where p is a number usually well under 0.8. The chances of say, 200 collisions is 0.8^200 = 4e-20, and on modern computers that would still run quickly. At a million hashes every second that statistically wouldn't happen once in a million years; and that assumes the table is full, and that's a small table by hashing standards. There's more chance of the hardware failing, by about a factor of 10^6. WolfKeeper 14:36, 16 October 2007 (UTC)
I feel like we are talking past each other, because we are talking about several different things but using the same name for all of them:
I agree with Wolfkeeper that, with natural data, known hash tables are so unlikely to ever exhibit O(n) performance that "it just never happens and never would happen before the end of the universe." However, the hash table#Drawbacks section has a few references that imply that known hash tables are vulnerable to a DoS attack involving carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Some of those references seem to recommend fixed secret hash tables, which still have worst-case O(n) performance, but make it practically impossible for an attacker to exploit that possibility unless the attacker knows the secret. However, other references imply that an attacker can learn the secret after an hour of probing, and then use that secret to send carefully crafted unnatural data that tricks the computer into the worst-case O(n) performance. Those references recommend temporary secret algorithms, which generally guarantee O(1) lookup times even in the worst case under active attack, at the cost of making the worst-case insertion time slower.
I feel this article needs to clearly distinguish between algorithms that have guaranteed O(1) lookup times even when under active attack, vs. algorithms that suffer from O(n) lookup times when under active attack. Is that difference big enough that these two categories of algorithms need to have two separate Wikipedia articles? -- DavidCary ( talk) 16:39, 29 July 2016 (UTC)
I see that elsewhere on this talk page someone says "Dynamic perfect hashing methods like cuckoo hashing provide constant lookup time."
Is "dynamic perfect hashing methods" a better name for the entire category of algorithms that have guaranteed O(1) lookup times even when under active attack -- a category I previously called "temporary secret algorithms" -- a category that includes cuckoo hashing, hopscotch hashing, etc.? -- DavidCary ( talk) 15:52, 10 May 2017 (UTC)
The intro to this article is severely misguided and confusing. A hash table is not the same thing as a "hash map" (which is a Javaism meaning a map/dictionary implemented by a hash table instead of a tree). Hash tables have nothing to do with keys and values -- a hash table only knows that elements have hashes, and that elements are at the index in its array corresponding to the value of the hash function for an element. For example, a hash table can also be used to implement an unordered set. Additionally, the concept of "buckets" is not intrinsic to a hash table. It is a common strategy for resolving collisions but there are others (double hashing, for instance).
What's the correct way to tag a page that has inaccurate information on it?
There are some non-wikipedia ghits that have the correct definition, such as: http://www.sparknotes.com/cs/searching/hashtables/section1.html
However, it's a fairly common misconception that hashes map keys to values, since so often they are used to implement maps or dictionaries. I am fairly sure that Knuth makes this distinction, but I don't have it on hand.
-- 67.180.15.227 ( talk) 16:49, 5 December 2007 (UTC)
I went reading through the article, as I do every few years since it represents my first ever contribution to Wikipedia, and noticed that the short pseudo-code for the deletion algorithm for a linearly probed hash table had been removed. Standing in its place is a reference to the ICI implementation where the algorithm can be gleaned, but not so simply, because it's handling a more complex case (set subtraction).
I just wondered why the clearer pseudo-code section was removed?
I confess to a feeling of ownership for the deletion algorithm, since that part of the entry came about after Tim Long and I discovered wikipedia. Some years earlier we'd been using a lot of hash tables, and it had bugged me that I could find no reference to an efficient deletion method. I worked out how to do it (I thought), only to discover that my method didn't always work. Tim debugged it (by adding a missing condition to the boolean expression.) We thought Wikipedia was the perfect place to put the small invention.
In fact, the pseudo-code for all the basic operations has been removed. I don't understand the logic behind that, hence this note. Lukekendall ( talk) 15:08, 28 January 2008 (UTC)
122.106.85.3 ( talk) 00:21, 29 January 2008 (UTC)
This article is terrible - someone needs to make this article clearer for beginners or even intermediates. It barely explains anything at first. How about a simple example instead of jumping into advanced theory??? 129.128.241.131 ( talk) 17:53, 7 February 2008 (UTC)
I feel the same way. Someone (like me) who doesn't know what a hash table is for will not find out quickly. I had to read most of the article before I could figure it out. I would suggest something like this be added, either in the introduction and/or in the very beginning of the contents. (I'm no expert; someone who knows what they're talking about should write it.)
Say an array of available space for storing data has indices like 1, 2, 3, ..., 1000: ARRAY[1], ARRAY[2], etc. However, the keys to the data may be something entirely different, like "John Smith", "Lisa Smith", and "Sam Doe". If they were just put into the array directly, some sort of search method would be necessary to find the desired entry. A hash table is a way of solving this problem, allowing extremely rapid lookup. In the example above, a pseudo-random function is called to assign each of John Smith, Lisa Smith, and Sam Doe to some number between 1 and 1000. If "John Smith" is assigned 873 by the hash function, its data is stored in ARRAY[873]. It can be retrieved immediately on demand by just recomputing the hash function on "John Smith" again to locate the correct index. If the hashing function is well-designed, different keys being sent to the same number is unlikely until the array begins to fill up. At some point, of course, keys will begin to "collide", that is, to be assigned the same array index. The hashing algorithm needs a way to find the right index anyhow.
-- MikeR7 ( talk) 21:07, 4 March 2008 (UTC)
Wow! MikeR7's introduction is great for me. And it would be so for many beginners. Thank you! -- Kang —Preceding unsigned comment added by 59.17.69.101 ( talk) 04:50, 28 August 2009 (UTC) It would be good to add to this an example of how the hash function might be constructed. —Preceding unsigned comment added by Lunedi9 ( talk • contribs) 20:28, 6 November 2010 (UTC)
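Since an example of how such a hash function might be constructed was requested, here is a deliberately simple toy in C that maps a name to one of 1000 slots. The multiplier 31 is an arbitrary choice, and the slot number 873 used in the explanation above is only illustrative (a real function will produce some other index):

#include <stdio.h>

/* Toy string hash: mix in each character, then reduce to a slot index.
   Not production quality; it only illustrates the idea described above. */
unsigned int toy_hash(const char *key, unsigned int table_size)
{
    unsigned int h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;   /* mix each character in */
    return h % table_size;                    /* reduce to a slot index */
}

int main(void)
{
    printf("%u\n", toy_hash("John Smith", 1000));
    printf("%u\n", toy_hash("Lisa Smith", 1000));
    printf("%u\n", toy_hash("Sam Doe",    1000));
    return 0;
}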
I can't access it at the moment, but this paper is probably also relevant to the robin hood hashing: Munro, J. I., and Celis, P. 1986. Techniques for collision resolution in hash tables with open addressing. In Proceedings of 1986 ACM Fall Joint Computer Conference (Dallas, Texas, United States). IEEE Computer Society Press, Los Alamitos, CA, 601-610. —Preceding unsigned comment added by 75.71.67.71 ( talk) 17:02, 15 April 2008 (UTC)
The description of "open hashing" seems quite inaccurate.
That "data indexed by the hash is 'stored externally' to the hash table" may be *required* for this data structure, but does not define it. To store variable length data open addressing may be used in conjunction with storing references to the actual data in the hash table. Open addressing is used synonymously to closed hashing, which results in a contradiction.
The example which follows the first sentence seems at least incomprehensible to me (By the way: Examples should be separated from a general description.) Key-value pairs are added to the group/bucket according to the hash value of the key. (The hash of the key defines the group.)
"The key that identifies a bucket is now put into the hash table (in this case it is as simple as the bucket number)" seems wrong to me. Actually, a reference to the group is 'put into' the hash table. And this is not equal to the key (Hash keys are generally not stored in hash tables, the values are stored).
Separate chaining should be characterized as a form of open hashing.
Sorry for the extensive criticism. This part of the article really confused me...
-- Whzlt ( talk) 03:24, 18 May 2008 (UTC)
The mathematical definition of O(1) is when n tends to infinity. All the demonstrations of O(1) I've seen assume at some point that the size of your data is bounded, or that you don't use your table at more than 80%, etc. But if your data gets really big (say greater than 10^10^10), either your collision rate will increase to 100% (then you tend to O(log n) ), or you'll have to increase your table size. The table size is about proportional to the data size. Then you'll have to compute a greater hashkey to uniquely identify the buckets. As it happens, the size of the hashkey (and the time it takes to compute it) grows with log n, where n is the number of buckets. So the operations on a hashtable are really O(log n), even if it stays reasonable for large data.-- Yitscar ( talk) 08:11, 24 May 2008 (UTC)
Wrong, all this talking about log(n) is fully wrong. Normal, linear search is O(n/2) with worst case O(n). Hash tables - as they are mostly implemented - have O(n/(2m)) where m is the size of the hash table. That's why crowded hash tables become as slow as normal search. But nowhere O(log n), forget about that! 178.197.232.59 ( talk) 11:16, 22 October 2012 (UTC)
Sorry to be thick, but can we define it please? Thanks. 124.101.249.63 ( talk) 14:01, 17 July 2008 (UTC)
The page points to some page that claims multiplicative hashing has poor clustering. But that page doesn't get Knuth's multiplicative hashing right -- it falls into the trap that many people do of thinking that it works as (k*a) mod m for some a. In fact, multiplicative hashing does fine if you implement it correctly, by taking only the high bits of k*a. —Preceding unsigned comment added by AndrewCMyers ( talk • contribs) 16:45, 13 January 2009 (UTC)
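To make the distinction concrete, here is a minimal sketch of the "take the high bits" form of multiplicative hashing, as opposed to (k*a) mod m. The golden-ratio constant 2654435769 = floor(2^32/phi) is the usual choice, and the table is assumed to have 2^p slots:

#include <stdint.h>

/* Knuth-style multiplicative hashing, done "correctly": multiply by a
   golden-ratio derived constant and keep the HIGH bits of the product as
   the bucket index, for a table with 2^p slots (assumes 1 <= p <= 32). */
uint32_t mult_hash(uint32_t key, unsigned p)
{
    return (key * 2654435769u) >> (32 - p);
}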
For the x-th time, 1+2+4+8+...+n = 2n-1, not 2^n-1; you're thinking of 1+2+3+4+...+n. Just put in values for n and you will see this, or do you need a formal proof? Please people, actually read the text you edit. Adrianwn ( talk) 17:22, 26 March 2009 (UTC)
This sentence has been removed:
Hash functions need not be pseudo-random, they need only spread the keys as evenly as possible. Moreover collisions are not due to pseudo-randomness of the function. -- Jorge Stolfi ( talk) 05:59, 3 April 2009 (UTC)
The article claimed that
This statement sounds rather biased. What matters is the actual hash function that maps the key to a bucket index. A "raw" hash function that gives good results when taken modulo a prime is OK as long as the table size s is a prime. One can do dynamic resizing with tables of prime size, with little extra cost. The above statement can be turned around to say "using tables whose size is a prime number increases the chance that the hashes will be uniformly distributed, since some popular hash functions are known to perform badly when the table size is a power of two and the hash index is obtained by bit masking." So, how should we rephrase this sentence? -- Jorge Stolfi ( talk) 00:40, 12 April 2009 (UTC)
There is a misunderstanding in the recnt edits. The "modulo the table size" step is technically a part of the hash function, not of the hash table algorithm. If the bucket array has size s, the hash functiion must return a number in 0..s-1. See Talk:Hash function. Briefly, the mod-s trick is not always a good idea, and cannot be discussed separately from the "raw" hash function. More importantly, all discussion about the hash function (desired properties, meaning of "perfect", etc.) assume that there is no external mod-s step. All the best, -- Jorge Stolfi ( talk) 02:21, 18 April 2009 (UTC)
At work I regularly explain to new developers the implementation of hash tables in our software product. I describe the computation prior to "mod N" as creating a fixed size digest of (some of) the entropy in the (typically larger, sometimes variable length) input value. Often we use a number of digest functions all returning the same type (typically an unsigned 32 bit integer). My developers readily grasp that the range of such digests is too great and must be transformed into an index into the N entry hash array, hence the "mod N" step. Making the size of the array prime simply ensures that all entropy in the digest participates in selection of the hash array position.
The independence of the digest step and the "mod N" step is underscored by the fact that N may change over time as values get added to the hash table, while the chosen digest function does not change. From experience I know that whenever we fail to distinguish clearly these two notions confusion ensues. 67.189.170.148 ( talk) 21:20, 8 August 2012 (UTC) "John Yates" <john@yates-sheets.org>
The "Basic algorithm" section does not seem to add to what has aleady been said in the lead section. Considering that the "mod N" trick does not belong here, the section seems redundant. -- Jorge Stolfi ( talk) 03:01, 18 April 2009 (UTC)
Array hashing (separately chained hashing using dynamic arrays to store the buckets) is advantageous only in some limited conditions. If the load factor is low (< 1) and the entries are well-distributed, most buckets have at most one entry, so the cache provides little benefit. Also, the dynamic arrays usually require a "bucket length" field; if there is only one entry, then this extra field takes away the memory saved by omitting the chain links. On the other hand, if the load factor is high (>>1), and the bucket arrays are resized by doubling/halving, then the vacant slots will waste more memory than the chain links would. All the best, -- Jorge Stolfi ( talk) 23:47, 24 April 2009 (UTC)
Dynamic arrays only require a "bucket length" field when they store 32-bit or 64-bit integer keys. It is not required for variable-length null-terminated strings (since strings are length-encoded).
The advantage of the array hash table is that it can scale much more efficiently than a chained hash table, with respect to both time and space. If a very low load factor is used (say 16 million slots and only 1 million keys), and considering that 32-bit integer keys are used, then the array hash table will consume no more space than the chained hash table, while still rivaling its speed. Once more keys are inserted, the array hash table will start to save space over the chained hash table, while retaining high access speed. The chained hash table, on the other hand, will start to consume more memory and slow down --- eventually, performance will get so bad that you will need to resize the chained hash table, further consuming more space and computing resources. This is not the case with the array hash table, which has been shown to scale well as the load factor increases, making it a more practical choice in a dynamic environment.
When variable-length string keys are used (as was the original design of the array hash table), even under low load, the array hash table will consume less space than the chained hash table, because it is cheaper to store a string in a dynamic array than it is in a node of a linked list (as a result of memory allocation overheads and the requirement of a next-link pointer).
Also, the dynamic arrays are not resized by doubling or halving. Dynamic arrays are grown in an exact-fit manner --- so they are only grown by as many bytes as needed, which is both fast and memory efficient.
cheers. —Preceding unsigned comment added by Hetori ( talk • contribs) 23:26, 11 June 2010 (UTC)
#include <stdlib.h>

typedef struct {
    void **array_ptr;           // underlying array
    size_t array_current_size;  // number of elements currently stored in `array_ptr`
} array_structure;

// Grow the array by exactly one slot (exact-fit resizing) and append the element.
int insert_new_element(array_structure *array, void *new_element) {
    void **new_ptr = realloc(array->array_ptr,
                             (array->array_current_size + 1) * sizeof(void *));
    if (new_ptr != NULL) {
        array->array_ptr = new_ptr;
        array->array_ptr[array->array_current_size++] = new_element;
        return 0;
    }
    return -1;  // allocation failed; the original array is left intact
}
{{List_data_structure_comparison}} template table -- as soon as we find a name for it. -- DavidCary ( talk) 07:00, 23 February 2022 (UTC)
Hi, a recent edit claims that hash tables (HT) allow arbitrary insertions and deletions in O(1) worst-case sense. I am thinking of usual implementations where the table is doubled/halved whenever the load factor gets above/below fixed thresholds. For these implementations the insertion cost is O(n) worst-case but O(1) only in amortized sense.
Perhaps the editor is thinking of implementations that keep 2 (or 3) arrays and spread out the copying load over many ops? Those would indeed have O(1) worst-case insertion cost in theory, but the overhead seems quite large, and I wonder whether such solutions are really practical. Are they? If not, methinks that the claim of O(1) worst-case in that section would be misleading, since the edited sentence is about practical advantages of HTs. All the best, -- Jorge Stolfi ( talk) 03:30, 27 May 2009 (UTC)
(unindent) Please stop the condescension, it's rude and annoying. Currently, this article contains no analysis whatsoever. The article should describe how hash tables perform under various models of analysis, including both the worst-case model where the hash function is unspecified and the more realistic model where it is a random variable over a sensible (say, uniform) distribution. Because the actual hash function is a parameter specified by programmer, these are useful in establishing bounds on performance and emphasizing the impact of hash function choice on performance. And as you know, the linear worst case performance emerges not from geometric table expansion, but from collisions. I'll add something, and I really hope I won't be facing resistance about it. Dcoetzee 02:19, 28 May 2009 (UTC)
From the lead paragraph of "Choosing a good hash function":
A good hash function is essential for good hash table performance... However, since poor hashing usually degrades hash table performance by a constant factor, and hashing is often a small part of the overall computation, such problems commonly go undetected.
If the latter is true, how can the former be true? If the problems caused by poor hashing commonly go undetected, and the hashing is a small part of the overall performance, it seems that a good hash function can't be called "essential for good performance" but is rather an aesthetic preference at most. One might distinguish between 'adequate' and 'bad' functions, where 'bad' might run slowly and produce uneven distributions, but if your problems are going undetected, I'd say you've at least struck 'adequate', and improving to 'good' would be an exercise in lily-gilding.
But perhaps I'm missing something. -- Fader ( talk) 14:51, 24 July 2009 (UTC)
The following text was recently added to the article:
begin text
An example of a simple hash function with good behavior is:
unsigned long f(char key[], int arrayLength) {
    unsigned long h = 0, v = 401;
    int i;
    if (arrayLength >= 2) {
        h ^= (0xffUL & key[0]);
        h ^= (0xffUL & key[1]) << 8;
        h = ((h + 9) * (h + 2) * v) >> 1;
    }
    for (i = 2; i < arrayLength; i += 2) {
        h ^= (0xffUL & key[i]);
        h ^= (0xffUL & key[i + 1]) << 8;
        h ^= ((h + 9) * (h + 2) * v) >> 1;
    }
    if ((arrayLength & 1)) {
        h ^= (0xffUL & key[arrayLength - 1]);
        h ^= ((h + 9) * (h + 2) * v) >> 1;
    }
    return h % N;
}
This function has the recommendable property that if arrayLength is 2, and N is 2^16, then the function is almost a perfect hash, filling 65531 of 65536 slots.
end text
This example needs more context before it is put in the article. Is there a reference for it? Also the comment below the code is confusing. All the best, -- Jorge Stolfi ( talk) 12:47, 17 February 2010 (UTC)
N is the size of the hash table and arrayLength is the length of the char array, if I'm not mistaken. It still needs more info on hash time and comparisons to other hash functions. I think an example is needed to give the reader a better idea of what a hash function is.
I agree, although I think that the following quote "Poor hashing usually degrades hash table performance by a constant factor,[citation needed] but hashing is often only a small part of the overall computation" was unnecessarily marked as needing a citation, as the truth of the assertion is self-evident. Consider a hash function f such that f(x) = 0 for all objects x. Obviously, this is the worst hash function possible, but it illustrates my point. Clearly, if the keys are comparable, then insertion and lookup will always have time complexity O(log n) rather than O(1). If the keys are not comparable, then insertion will have time complexity O(1) and lookup will have time complexity O(n) (since you can't binary search a list of items without an ordering). —Preceding unsigned comment added by 24.69.148.22 ( talk) 01:39, 11 April 2010 (UTC)
Oh, I see what you're saying. I assumed that "constant factor" meant a constant factor of n, i.e. proportional to the number of items mapped by the hash function. Perhaps the sentence should be edited for clarity? I think that whoever wrote that didn't mean to say that the performance degradation was in O(k*1) = O(1). I think the person meant O(kn) = O(n) for some "constant factor" k > 1. If that is what was meant, I am sure that we could both agree that it's more likely to be true than the claim that degradation is in O(1), which should be rejected prima facie. —Preceding unsigned comment added by 24.69.148.22 ( talk) 06:35, 11 April 2010 (UTC)
My clean-up of the external links section was met with opposition, so let's discuss the individual items (I'll exclude the ones that have been removed again in the meantime):
I think that all these links should be removed, for the reasons given above (maybe except for #5). – Adrian Willenbücher ( talk) 16:39, 23 September 2010 (UTC)
Any objections to me setting up auto-archive on this page? It's getting rather lengthy, and I'd rather set it up to work automatically than keep doing it by hand. me_ and 16:46, 23 September 2010 (UTC)
Currently, it turns out from this article that open addressing is absolute nonsense (except, perhaps, because of caching). Actually it's not better so much because of insertion, deletion etc. speed, but because iteration speed (over whole table) is very good even with very non-uniform hashes. Thus, it's good for example to do duplicate check for values, which you do not insert too often, where the common operation is iterating over all values (for example if you want to iterate over table of size 100 000 each time you have added 20 elements). Consequently, rehashing can also be faster. qtvali -- 83.178.58.29 ( talk) 22:03, 3 July 2011 (UTC)
"For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree."
This example and comment in the article seems to suggest that balanced search tree is even slower than a sequential list (linear search). This is absurd! :) — Preceding unsigned comment added by 151.197.168.146 ( talk) 16:15, 9 July 2011 (UTC)
You should define k to be the size of the hash table (the size of the underlying array ) and n to be the number of elements in the hash table, and unless you make the hypothesis that k < n, the space complexity should be O(k+n) rather than simply O(n).
By the way, this page has a more detailed complexity analysis for hash table that is much clearer than the mess on the wiki page
The section needs to justify the use of a hash function: Why is a hash function even needed? Why not, for example, just store with each key the address of (pointer to) the value?
The diagram above shows buckets separate from the keys, but this sentence seems to say each (full) k-v pair is in a bucket.
"Suggests"? Surely that's far too gentle a word. How about "gives" or "tells"?
Given the definition of collisions in the introduction, it's not clear why they are a problem that requires resolution. It's also not clear how uniformity has anything to do with them (apart from the mysterious hash function); it would seem that either two keys point to the same value or they do not.
JKeck ( talk) 17:14, 31 July 2013 (UTC)
2nd problem: keys vs bucket vs entries. Each bucket stores one entry or a linked list of entries. Each entry stores k-v pairs. Both the sentence and the graphic make sense. I have no idea how to improve the wording to make it clearer to confused people. More graphics, probably. 3rd: "Suggests" is a proper word, since the key is not always found at the calculated index. Sometimes it has to search more locations, either in the same array (open addressing) or in a separate bucket list of colliding entries. Also, your fourth complaint does not make much sense. It's very "clear why there is a problem that requires resolution", dependent on the quality of the initially calculated index, which is dependent on the hash function, the fill factor and the uniformity of the distribution. ReiniUrban ( talk) 16:47, 28 August 2016 (UTC)
I'm with JKeck. The section is unscholarly, and "suggests" is an unscholarly word. Hashing is a time/storage space trade-off: it takes a very long time to search a long list, and it takes a very large amount of storage to address directly with say, a 64-bit key (~10^19 slots). Hashing allows data storage and retrieval in a small and nearly constant amount of time, and storage only fractionally larger than that required for the records + keys themselves. The primary requirements for a hash function are efficiency and uniformity. Since the section on Choosing a hash function doesn't mention efficiency, one may presume that a hash function that takes a geological amount of time to compute won't be a problem, or that in any case, one doesn't need to worry about such things. Perfect hashing is a combinatorial curiosity only, and given the paucity of real information in that section, it ought to just go. I could go on and on, but I'll stop here to more carefully read over the entire article, which sort of just rambles on. Sbalfour ( talk) 20:30, 24 September 2019 (UTC)
Can anyone explain what the following is trying to get at?? The reference seems to be off line.
For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]
Hew Johns ( talk) 17:44, 19 August 2013 (UTC) Okay, I see some of the problem "open addressing" == "closed hashing", i.e. in table storage... Hew Johns ( talk) 20:08, 19 August 2013 (UTC) OK, never mind. After reading the whole thing I realize it is a very poor expression of the idea that a probe or re-hash function must also be uniformly unbiased, rather than clustering about the original hash. Hew Johns ( talk) 21:26, 19 August 2013 (UTC)
"Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are highly optimized as they are used internally to implement namespaces."
This statement is misleading. Perl used a simple and especially insecure but fast form of a Jenkins OOAT hash until 5.6, when it was attacked by collisions against the simple linear chaining. The proposed fix was to use universal hashing, to keep performance and add security against hash collisions. What was implemented instead was randomized seeding on rehashing. This trick was found to be broken with 5.17.6, and so all optimizations were thrown overboard and a rather slow and secure hash function, SipHash, was selected, instead of choosing an optimized and proper hash function (like City or Murmur) and a fast and secure collision strategy. Python chose the same road and changed to SipHash. Perl went back before the 5.18 release to a somewhat faster Jenkins OOAT hash function, but this still didn't solve the problem of the wrong collision resolution. Optimized strategies would have been open addressing, universal hashing or perfect hashing. There are furthermore no other possible optimizations implemented, like using SIMD or HW-assisted crc32 intrinsics on Intel processors, as CRC32 turns out to be the fastest and best hash function for the average case (least number of collisions). See http://blogs.perl.org/users/rurban/2014/04/statistics-for-perl-hash-tables.html
I propose to use the wording "Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e. collision attacks."
Highly optimized and concurrent hashes can be studied with Clojure, the Hopscotch table or Robin Hood. ReiniUrban ( talk) 17:01, 12 April 2014 (UTC)
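For readers following the collision-attack discussion above, the idea behind randomized seeding can be sketched as follows. This is a rough illustration of the general technique only; it is not SipHash and not the actual Perl or Python code, and the mixing step is just an FNV-1a-style stand-in. Because the per-process seed is unknown to an attacker, precomputing a large set of keys that all land in the same bucket becomes much harder:

    import os

    SEED = int.from_bytes(os.urandom(8), "little")  # random seed chosen once per process

    def seeded_hash(key: bytes, m: int) -> int:
        h = SEED
        for byte in key:
            # FNV-1a-style mixing, kept to 64 bits; 0x100000001B3 is the 64-bit FNV prime
            h = ((h ^ byte) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return h % m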
Perl is not interpreted. — Preceding unsigned comment added by 68.183.92.186 ( talk) 00:17, 14 May 2014 (UTC)
Why link to a non-existent article? — Preceding unsigned comment added by 75.79.3.84 ( talk) 20:09, 15 May 2014 (UTC)
This article needs simple examples up front so that novices can get the general idea. 2602:306:35FA:D540:2487:9F59:8A65:3095 ( talk) 04:44, 27 January 2017 (UTC)
Right now Robin Hood hashing and 2-choice hashing are listed as direct children of "Collision Resolution". They seem to fit the definition of open addressing resolution strategies. Since all other open addressing strategies are listed under the Open Addressing subsection, shouldn't these two techniques be moved there too?
Andreas Lundblad ( talk) 07:13, 10 July 2019 (UTC)
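Why Robin Hood hashing is usually classed as open addressing can be seen from a condensed insertion sketch (my own illustration, assuming a table that never fills; duplicate keys and resizing are ignored). Entries live in the table array itself and collisions are resolved by probing, with the twist that an incoming entry displaces any resident entry that is closer to its own home slot:

    def robin_hood_insert(table, key, value):
        m = len(table)
        idx, dist = hash(key) % m, 0              # home slot and current probe distance
        while table[idx] is not None:
            k, v, d = table[idx]                  # each slot stores (key, value, probe distance)
            if d < dist:                          # resident is "richer" (closer to home): swap
                table[idx] = (key, value, dist)
                key, value, dist = k, v, d
            idx, dist = (idx + 1) % m, dist + 1
        table[idx] = (key, value, dist)

    table = [None] * 8
    robin_hood_insert(table, "example", 42)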
The lead says: Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions... Such collisions must be accommodated in some way.
We don't live in an ideal world, so starting out with a statement about the ideal is supercilious. The implication is that if a hash function is 'imperfect', like an assembly line making imperfect parts, why don't we fix it so it's 'perfect'? An imperfect hash function might cause hash collisions, but the sense of this statement is vague, and it begs the question: why don't we build one that 'might not' cause hash collisions? Easy enough to postulate, eh? We might lead a novice into actually attempting to do that. I wish him luck.
How about this in place of the whole paragraph: Hash functions are designed to be efficient to compute and to minimize collisions, duplicate hash codes resulting from distinct keys. That's 19 words versus 48, and it's very informative. The lead is not the place to go into what happens in a collision: we may assume, and indeed we find, a whole section talking about just that. Hmmm... or we could add: Hash functions are accompanied by an adjunct collision-resolution method. Now I think we've got a substantial digest. Scholarly diction is an art. Sbalfour ( talk) 21:47, 24 September 2019 (UTC)
Let's try again. The lead says, In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason,.... What reason? It's a dangling reference with look-through ambiguity; it's not attached to anything. If we turn the sentence on its head, we get: Hash tables are more efficient in many situations than other things because <reason>. We need to state the reason, then we can say, For that reason,... How about: Hash tables provide nearly constant access time and compact storage for large numbers of records that can't be efficiently stored or accessed by other methods. There are ordered and unordered lists, linked lists, doubly-linked lists, circularly-linked lists, tries, heaps, stacks, queues, etc., too numerous to mention in the lead, so don't try. If it matters (and it's worth a paragraph or maybe a sub-sub-section), that is where we distinguish the situations in which we might use other methods. Sbalfour ( talk) 22:19, 24 September 2019 (UTC)
Robin Hood hashing is entirely analogous to the three ways the chains may be ordered in chained hashing: as unordered lists, as serially ordered lists (alphabetically or by some other key-dependent criterion), or as self-organizing lists ordered by probe frequency (i.e. a key-independent criterion). So we reshuffle the bucket sequences for collided keys just as we reshuffle the chains, yes? I think we lost sight of the forest for the trees on this one. Sbalfour ( talk) 22:38, 24 September 2019 (UTC)
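The third ordering mentioned above, a self-organizing chain, can be sketched in a few lines (an assumed example, not from the article): each successful lookup moves the found entry to the front of its bucket, so frequently probed keys migrate toward the head of the chain:

    def lookup_move_to_front(bucket, key):
        # bucket is a Python list of (key, value) entries for one hash slot
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.insert(0, bucket.pop(i))  # move the hit to the front of the chain
                return v
        raise KeyError(key)

    bucket = [("x", 1), ("y", 2), ("z", 3)]
    print(lookup_move_to_front(bucket, "z"))     # 3; bucket is now [("z", 3), ("x", 1), ("y", 2)]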
Would this be a good citation for the better performance of variable-sized data in chaining hash tables? https://www.researchgate.net/profile/Haval_Ahmed2/post/Any_ideas_about_hash_table_implementation/attachment/59d6430bc49f478072eabc10/AS:273806436831232@1442291949107/download/In-memory+hash+tables+for+accumulating+text+vocabularies.pdf
The first sentence of section "2.3. Hash tables" seems to be useful, but it is not backed up in the article itself.
Recently someone used "Inside the latency of hash table operations" by Andy Ke as a reference for this article. I feel this is perilously close to becoming a WP:REFLOOP, because many sentences and illustrations in that article are word-for-word identical to (an earlier version of) this "hash table" Wikipedia article. (A few of those sentences are sentences I wrote for Wikipedia.) What should we do about this? -- DavidCary ( talk) 23:24, 6 May 2020 (UTC)
I spent several hours documenting a simple example of a hash table ( [11]), only to have it deleted one week later with the edit summary "rm unsourced code, we don't need that here". I agree that it's unsourced, but I disagree with the assessment that we don't need a code example here. I'll never forget my first visit to this article many years ago -- before I had any experience with hash tables -- when I found the article seriously lacking because it didn't show a single functional example of an actual hash table. I realize that my example may not be acceptable because -- even though I personally tested it and determined that it functions exactly as explained -- it might be considered OR and therefore verboten. Having said that, I don't really care whether my example is restored or some other example is used instead, but I would argue that a simple functional example is essential to help explain this topic. Lambtron talk 22:50, 7 February 2022 (UTC)
It's not the approach I would have taken, but I can see the appeal of simply deleting my contribution rather than working collaboratively to improve it. For example, I would have used my knowledge of "Wikipedia-mandated" language-agnostic coding to translate the code -- a job an expert could probably do in a few minutes -- and left the explanation intact. But that's just me: averse to throwing out the baby with the bathwater. BTW, it's not clear to me why you think this basic example is not appropriate for this article, or why a "scholarly journal article" is required for the simple calculations involved. Frankly, I find it a bit frustrating that you seem opposed to having an easily understood example here when there's such an obvious and compelling need for one. Lambtron talk 16:51, 8 February 2022 (UTC)
A straightforward explanation alone cannot be trusted: WP:VERIFYOR. WikiLinuz🍁( talk) 20:01, 8 February 2022 (UTC)
Sometime in the past the following chart/image was added to the article:
There are several problems with this graphic, but the biggest is that, by its own description, it appears to be original research. Per the WP:NOR policy, no original research is allowed in Wikipedia articles. If the poster ( User:Dcoetzee, apparently) would like to update it with a reliable, published source for its specific data, then I will withdraw my objections. Otherwise I will remove it in the near future. — Preceding unsigned comment added by RBarryYoung ( talk • contribs) 14:51, 9 February 2022 (UTC)
The introductory sentence, with its claim that hash tables are implemented as associative arrays, is incorrect. Even the linked reference says that associative arrays are implemented as hash tables, not the other way around. Currently the associative array and hash table articles are in an infinite loop, as each claims to be implemented using the other. 2001:7D0:87E3:100:C528:86D0:1049:5205 ( talk) 06:18, 27 October 2023 (UTC)
In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values.
Where is the hash digest ...
This explanation in the section Hashing by division is unclear, because the term hash digest is not explained and does not appear in the equation above it. (Also, it should probably not be capitalized, because it is not a complete sentence.) In the following section, Hashing by multiplication, it is the other way around: the term appears in the equation but is not explained.
Gnib Bnil ( talk) 14:55, 5 June 2024 (UTC)
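For anyone trying to reconcile the two sections, these are the standard textbook forms of the two schemes (as given in Knuth and CLRS; the article's exact notation and symbol names may differ, which is part of the confusion noted above). Both map a key k to one of m slots; the multiplication method uses a constant A strictly between 0 and 1, often the reciprocal of the golden ratio:

    import math

    def hash_by_division(k: int, m: int) -> int:
        return k % m                        # h(k) = k mod m

    def hash_by_multiplication(k: int, m: int, A: float = (math.sqrt(5) - 1) / 2) -> int:
        return int(m * ((k * A) % 1))       # h(k) = floor(m * frac(k * A))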