Class | Sorting algorithm |
---|---|
Data structure | Array |
Worst-case performance | O(n²) |
Best-case performance | O(n) |
Average performance | O(n) |
Worst-case space complexity | O(n) total, O(n) auxiliary |
Proxmap sort, or Proxmap, is a sorting and searching algorithm that works by partitioning an array of data items, or keys, into a number of buckets. The name is short for computing a "proximity map," which indicates, for each key K, the beginning of a subarray in the array A where K will reside in the final sorted order. Keys are dropped into each bucket using insertion sort. If keys are "well distributed" amongst the buckets, sorting occurs in O(n) time, much faster than comparison-based sorting, which can do no better than O(n log n). The computational complexity estimates involve the number of buckets and the mapping function used. During the sort a proxmap is created, which is used to find keys in an average of O(1) time. It is a form of bucket and radix sort. The algorithm also scales well to large data sets.
In general, given an array A with n keys, proxmap sort works in four passes: compute a hit count for each bucket, compute the proxmap (the starting position of each bucket's subarray in the output), compute each key's insertion location, and finally insert each key into its subarray, keeping each subarray in sorted order. A simplified version, given an array A with n keys, is worked through below.
Note: "keys" may also contain other data, for instance an array of Student objects that contain the key plus a student ID and name. This makes proxmap suitable for organizing groups of objects, not just keys themselves.
Consider a full array A[0 to n−1] with n keys. Let i be an index of A. Sort A's keys into an array A2 of equal size.
The mapKey function will be defined as mapKey(K) = floor(K).
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1 | 6.7 | 5.9 | 8.4 | 1.2 | 7.3 | 3.7 | 11.5 | 1.1 | 4.8 | 0.4 | 10.5 | 6.1 | 1.8 |
H | 1 | 3 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 1 | |
P | 0 | 1 | -1 | 4 | 5 | 6 | 7 | 9 | 10 | -1 | 11 | 12 | |
L | 7 | 6 | 10 | 1 | 9 | 4 | 12 | 1 | 5 | 0 | 11 | 7 | 1 |
A2 | 0.4 | 1.1 | 1.2 | 1.8 | 3.7 | 4.8 | 5.9 | 6.1 | 6.7 | 7.3 | 8.4 | 10.5 | 11.5 |

(H and P are indexed by bucket, 0–11; A1, L, and A2 by position, 0–12.)
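The H, P, and L rows above can be reproduced with a short script (a sketch; the keys and the bucket function floor(K) come from the example, while names such as `map_key` are illustrative):

```python
import math

keys = [6.7, 5.9, 8.4, 1.2, 7.3, 3.7, 11.5, 1.1, 4.8, 0.4, 10.5, 6.1, 1.8]

def map_key(k):
    return math.floor(k)

num_buckets = 12  # buckets 0..11 for this data

# Hit counts: how many keys fall into each bucket
H = [0] * num_buckets
for k in keys:
    H[map_key(k)] += 1

# Proxmap: start index of each bucket's subarray; -1 marks an empty bucket
P = [-1] * num_buckets
running_total = 0
for i in range(num_buckets):
    if H[i] > 0:
        P[i] = running_total
        running_total += H[i]

# Locations: where each key's subarray begins in A2
L = [P[map_key(k)] for k in keys]

print(H)  # [1, 3, 0, 1, 1, 1, 2, 1, 1, 0, 1, 1]
print(P)  # [0, 1, -1, 4, 5, 6, 7, 9, 10, -1, 11, 12]
print(L)  # [7, 6, 10, 1, 9, 4, 12, 1, 5, 0, 11, 7, 1]
```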
for i = 0 to 11                          // compute hit counts, one per bucket
    H[i] = 0;
for i = 0 to 12
{
    pos = mapKey(A[i]);
    H[pos] = H[pos] + 1;
}
runningTotal = 0;                        // compute proxmap – location of start of each subarray
for i = 0 to 11
{
    if H[i] == 0
        P[i] = -1;                       // -1 marks an empty bucket
    else
        P[i] = runningTotal;
    runningTotal = runningTotal + H[i];
}
for i = 0 to 12                          // compute location – subarray – in A2 into which each item of A is to be placed
    L[i] = P[mapKey(A[i])];
for i = 0 to 12                          // initialize A2 to empty
    A2[i] = <empty>;
for i = 0 to 12                          // insert each item into its subarray, preserving sorted order
{
    start = L[i];                        // subarray for this item begins at this location
    insertionMade = false;
    for j = start onward, while insertion not made
    {
        if A2[j] == <empty>              // free cell: put the item here
        {
            A2[j] = A[i];
            insertionMade = true;
        }
        else if A[i] < A2[j]             // item belongs at A2[j]
        {
            end = j + 1;                 // find end of used part of subarray – where the first <empty> is
            while A2[end] != <empty>
                end = end + 1;
            for k = end - 1 downto j     // move larger keys right one cell
                A2[k+1] = A2[k];
            A2[j] = A[i];                // add in the new key
            insertionMade = true;
        }
    }
}
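The passes above can be sketched as a runnable function (a sketch, not the original implementation: the name `proxmap_sort` and the use of `None` for `<empty>` are illustrative choices, and bucket indices are assumed to be nonnegative integers):

```python
import math

def proxmap_sort(A, map_key=math.floor):
    """Sort A into a new array using a proxmap, following the pseudocode above."""
    n = len(A)
    num_buckets = max(map_key(k) for k in A) + 1

    # Pass 1: hit counts per bucket
    H = [0] * num_buckets
    for k in A:
        H[map_key(k)] += 1

    # Pass 2: proxmap – start of each bucket's subarray; -1 marks an empty bucket
    P = [-1] * num_buckets
    running_total = 0
    for i in range(num_buckets):
        if H[i] > 0:
            P[i] = running_total
            running_total += H[i]

    # Passes 3 and 4: place each key in its subarray with insertion sort
    A2 = [None] * n                      # None plays the role of <empty>
    for k in A:
        j = P[map_key(k)]                # start of this key's subarray
        while A2[j] is not None and A2[j] <= k:
            j += 1                       # walk to the insertion point
        end = j                          # find the first empty cell at or after j
        while A2[end] is not None:
            end += 1
        for m in range(end, j, -1):      # shift larger keys right one cell
            A2[m] = A2[m - 1]
        A2[j] = k
    return A2

keys = [6.7, 5.9, 8.4, 1.2, 7.3, 3.7, 11.5, 1.1, 4.8, 0.4, 10.5, 6.1, 1.8]
print(proxmap_sort(keys))
# [0.4, 1.1, 1.2, 1.8, 3.7, 4.8, 5.9, 6.1, 6.7, 7.3, 8.4, 10.5, 11.5]
```

Because the hit counts size each subarray exactly, the insertion walk never crosses a bucket boundary: any key in a later bucket maps to a larger bucket index and is therefore larger than the key being inserted.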
Here A is the array to be sorted, and the mapKey function determines the number of buckets to use. For example, floor(K) assigns one bucket per integer value occurring in A. Dividing the key by a constant such as 10, as in floor(K/10), reduces the number of buckets; other functions can be used to translate the range of elements in A to buckets, such as converting the letters A–Z to 0–25 or returning the first character (0–255) for sorting strings. Buckets are sorted as the data comes in, not all at the end as bucket sort typically does.
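A few illustrative bucket functions in the styles just described (hypothetical helpers; each must return a nonnegative integer bucket index):

```python
import math

def map_key_floor(k):
    # One bucket per integer value
    return math.floor(k)

def map_key_tens(k):
    # Coarser: one bucket per range of 10
    return math.floor(k / 10)

def map_key_letter(s):
    # Strings: bucket by first letter, A-Z -> 0-25
    return ord(s[0].upper()) - ord('A')

print(map_key_floor(6.7))         # 6
print(map_key_tens(67))           # 6
print(map_key_letter("proxmap"))  # 15
```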
Typically, proxmap sorting is used along with proxmap searching, which uses the proxMap array generated by the proxmap sort to find keys in the sorted array A2 in expected O(1) time.
function mapKey(key)
    return floor(key)

proxMap ← previously generated proxmap array of size n
A2 ← previously sorted array of size n

function proxmap-search(key)
    for i = proxMap[mapKey(key)] to length(A2) - 1
        if A2[i].key == key
            return A2[i]
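The search can be sketched in the same style (a sketch; `proxmap_search` is an illustrative name, the arrays are taken from the worked example, -1 marks an empty bucket, and the early exit on a larger key is an assumed optimization not in the pseudocode):

```python
import math

def proxmap_search(key, A2, P, map_key=math.floor):
    """Look up key in the sorted array A2 using proxmap P; return its index or -1."""
    start = P[map_key(key)]
    if start == -1:          # empty bucket: key cannot be present
        return -1
    for i in range(start, len(A2)):
        if A2[i] == key:
            return i
        if A2[i] > key:      # walked past where key would be
            return -1
    return -1

# Arrays from the worked example above
A2 = [0.4, 1.1, 1.2, 1.8, 3.7, 4.8, 5.9, 6.1, 6.7, 7.3, 8.4, 10.5, 11.5]
P = [0, 1, -1, 4, 5, 6, 7, 9, 10, -1, 11, 12]

print(proxmap_search(7.3, A2, P))  # 9
print(proxmap_search(2.5, A2, P))  # -1
```

When the buckets are small, the loop inspects only the handful of keys sharing the bucket, which is where the expected O(1) lookup comes from.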
Computing H, P, and L all take O(n) time. Each is computed with one pass through an array, with constant time spent at each array location.
Having a good mapKey function is imperative for avoiding the worst case. We must know something about the distribution of the data to choose a good key function.
Common optimizations include storing the sorted data in the same array (sorting in place) and reusing the hitCount, proxMap, and location arrays built during the sort.
Since proxmap sort is not a comparison sort, the Ω(n log n) lower bound is inapplicable. It is generally fast compared to other sorting algorithms because it is not comparison-based and it uses arrays rather than dynamically allocated objects with pointers to follow, such as binary search tree nodes. Proxmap search offers near-constant average access time, which makes it very appealing for large databases: despite the O(n) build time, the fast average access time repays it. If the data does not need to be updated often, this trade-off may make proxmap more favorable than other non-comparison-based sorts.
Like proxmap, bucket sort generally operates on a list of n numeric inputs between zero and some maximum key or value M, and divides the value range into n buckets, each of size M/n. If each bucket is sorted using insertion sort, both proxmap and bucket sort can be shown to run in expected linear time. [1] However, the performance of bucket sort degrades with clustering (or too few buckets with too many keys): if many values occur close together, they all fall into a single bucket and performance is severely diminished. The same holds for proxmap: if its buckets receive too many or too few keys, performance degrades severely.