This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
The text says that EOF is ignored when sorting, but the example seems to suggest that EOF is considered to come after all normal letters. AxelBoldt 14:14 Aug 26, 2002 (PDT)
The dot wasn't an EOF. It was just a dot that was there to make the rotations stand out visually. Hopefully it's clearer now. -- LC 08:06 Aug 27, 2002 (PDT)
Obviously sufficiently many repeats of the transform must eventually restore the original text, and obviously that provides another (usually inefficient?) way of inverting a transform (just store the inputs temporarily, and give back the last input before the text to be inverted turns up again). But one interesting thing is this: since the original is itself the result of a transform on some text, clearly the transform does not necessarily always produce something better suited for compression. It must be something that is statistically likely, based on properties that are common in many inputs. What are these, and what are the odds that a transform won't actually help and might even hurt? If we do get one of these cases, how can we tell (cheaply)? And so on. PML.
Assuming you're talking about the Bijective version (since your claim is false of the other versions), then the statistical property it takes advantage of is the tendency for letter sequences to be exactly repeated more often than merely similar letter sequences. Exact matches create runs of the same letter. Like, when you're alphabetizing rotations of a book and you get to the 'uick' section, it's a pretty good bet that the previous character (the output from the BWT for that letter) is 'q'; when you get to the 'the' section, there are going to be a lot of spaces as the output character, from all the instances of 'the', 'then', 'there', and 'their' and others, but some 'o' will be mixed in from 'other', etc. If you already have a lot of repeats or are simply high-entropy, it's not going to help.
How can repeating the transform restore the original text? The transform is not a bijection between strings of length n. It is a one-to-one function from strings of length n into the set of ordered pairs of a string of length n and an integer less than n. Thus, different inputs can produce the same output string with a different index. For example, .BANANA. and ANA..BAN produce the same output string. -- Luke Somers
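This can be checked mechanically. A minimal sketch (plain Python, plain code-point ordering assumed; `bwt` is just an illustrative name): both inputs yield the same last column, differing only in the index of the original row.

```python
def bwt(s):
    """Rotation-based BWT: (last column, index of the original string)."""
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots), rots.index(s)

print(bwt(".BANANA."))  # ('A.NNB.AA', 1)
print(bwt("ANA..BAN"))  # ('A.NNB.AA', 3)
```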
There is a simple way to avoid the use of an EOF symbol, which is treated differently in the many versions of BWT, since any rotation of the original string gives you the same BWT string output. (That is the way the Burrows–Wheeler transform is usually first discussed in the literature.) One problem with the special EOF symbol is that when dealing with a file of 256 character types, one needs to add an extra character in order to do the BWT. Then one has to pick its value: some make it the heaviest symbol, some the lightest, and in some binary versions they pick a weight halfway in between. One method that I have used to get back to the traditional BWT is to rotate the string until it is a single Lyndon word, then do the BWT as at the start of this article, and pass as an index the number of rotations needed to get back to the start of the original string. One nice feature of this is that when the string is rotated to make a single Lyndon word, the string resulting from the classical BWT and the bijective BWT are the SAME. — Preceding unsigned comment added by 69.152.152.134 ( talk) 15:23, 22 August 2012 (UTC)
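The rotate-to-a-Lyndon-word idea can be sketched as follows (a naive illustration; the quadratic `least_rotation` search is an assumption for clarity, since Booth's or Duval's algorithm would find the same rotation in linear time):

```python
def least_rotation(s):
    # Naive search for the lexicographically least rotation; that rotation
    # is a power of a Lyndon word (here, a Lyndon word itself).
    rots = [s[i:] + s[:i] for i in range(len(s))]
    k = min(range(len(s)), key=lambda i: rots[i])
    return k, rots[k]

def bwt_last_column(s):
    # Classical BWT last column, no EOF symbol.
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

k, r = least_rotation("BANANA")
print(k, r)                # 5 ABANAN
print(bwt_last_column(r))  # NNBAAA -- same rotation set as BANANA, so same column
```

The rotation count k is what would be passed along as the index in place of an EOF symbol.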
The current use of special symbols is... concerning. The original 1994 article does not even use an EOF for the "simple version"; instead it just keeps rotating the columns till it finds something that's sorted when decoding. (This could be the idea Special:Diff/715996790 was trying to get across, but... bruv you need to re-read the article.) EOF was only introduced in section 4.1, whereas the decompression algorithm does not even involve the rotation in the simple version. And don't even get me started on who came up with the idea of adding a "start-of-string marker". -- Artoria 2e5 03:18, 1 January 2020 (UTC)
This article gives a really bad example for this algorithm. Take the text:
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
and replace all repeated substrings (as deflate would detect them) with a special control character which I'll represent as _:
SIX.M_ED.P_IES._FT_XTY_.DUS_BOX_
Now take the transformed text:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
and do the same:
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Clearly, the original string compresses better. And indeed gzip and bzip2 confirm this: gzip compresses the above to 67 bytes, bzip2 to 69 bytes. (Both can hardly be called compression: the original was only 44 bytes ...) â Timwi 14:14, 9 Nov 2004 (UTC)
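For what it's worth, the sizes are easy to re-check with Python's zlib (raw DEFLATE, so byte counts differ from gzip's by header overhead, and the exact numbers depend on the compression level; treat the printed figures as illustrative, not authoritative):

```python
import zlib

original = b"SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
transformed = b"TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT"

for label, data in (("original", original), ("transformed", transformed)):
    packed = zlib.compress(data, 9)
    assert zlib.decompress(packed) == data  # sanity check: lossless round trip
    print(label, len(data), "->", len(packed), "bytes")
```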
That is unfortunately the curse of writing about data compression: in a substantial number of cases, it can be almost impossible to find an example small enough to be practical, and clear enough to show the principle involved, but large enough so that the savings achieved with the method can be immediately seen to be significant. Add in a requirement about 'no other method can do this example better' and I am afraid you're asking for the impossible. IMHO, it's much better to use an example that shows the principle clearly and note in the text if needed that 'in this example, the saving from the run-length encoding doesn't overcome the overhead of the encoding; in a larger example which produced longer runs, it would.' After all, generations of textbooks have used recursive Fibonacci algorithms to demonstrate recursion when that's not the most efficient way to calculate Fibonacci numbers.
By the way, have you noticed that your deflate example is incorrect? Deflate operates on strings of minimum length 3. Even with your simplification of all replaced substrings taking as much space as one literal, here's how it would actually look:
SIX.MIXED.PIXIES.SIFT._TY_.DUST.BOXES
vs.
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Not a stellar improvement. -- Antaeus Feldspar 15:11, 9 Nov 2004 (UTC)
But the fact is that the authors (because they didn't understand why it was needed and left some part out) didn't implement the BWT properly in bzip, causing extremely long (as in years) compression times on some inputs. The RLE pre-pass only exists to avoid these "degenerate" cases. If the BWT had been implemented properly this would not be needed, and in fact a lot of documentation still refers to fixed-size blocks of input being compressed, which is no longer the case due to the RLE pre-pass. So RLE is a bug workaround; it's not the best method to apply after a BWT, a straight MTF encoder is. 2001:470:D08D:0:0:0:0:1 ( talk) 07:20, 25 January 2013 (UTC)
| | LZ77 | BWT | RLE | Huffman/Arithmetic |
|---|---|---|---|---|
| repeated multi-byte sequences | yes, at cost of two parameters (offset, length) | converts into single-character runs | no | no |
| single-character runs | yes, at cost of two parameters (offset, length) | no | yes, at cost of a single parameter (run length) | no |
| non-flat frequency distribution of characters | no | no | no | yes, exploits optimally |
This whole discussion about a bad example is rather pointless and can be easily solved by quoting Burrows and Wheeler from their original report (to read it yourself, you can find it in the sources list of the article): "The size of the input block must be large (a few kilobytes) to achieve good compression." ~ Band B 22:40, 14 February 2007 (UTC)
Is there any reason that the C implementation given in the Polish Wikipedia shouldn't be included here? -- Shutranm 21:47, 20 May 2005 (UTC)
function bwt () {
  S=$1
  X=$(echo "$S" | sed -E 's/^(.*)(.)$/\2\1/')
  { echo "$S"
    while [ "$X" != "$S" ]; do
      echo "$X"
      X=$(echo "$X" | sed -E 's/^(.*)(.)$/\2\1/')
    done
  } | LC_ALL=C sort | sed -E 's/.*(.)$/\1/' | tr -d '\n'
  echo
}
bwt "^BANANA|"    # prints BNN^AA|A
function bwt () { generateRotations $1 | sort | takeLastChar | concatLines }
— Preceding unsigned comment added by 2806:106E:B:DCE2:B91C:FFCC:4813:8157 ( talk) 19:52, 12 October 2021 (UTC)
Nobody reading this page cares about Posix collation. Why don't we pick a sample string with no such issues? -- Doradus 17:45, 16 January 2006 (UTC)
==== Note on sorting convention ====
If you sort with [[Posix]] [[collating]], you get the slightly different string TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT instead of TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT. [[ISO 8859]] has complex collating rules, but in this case periods are ignored; Posix collating treats periods as characters.
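The two outputs can be reproduced with a quick sketch (Python; `sorted` on strings uses plain code-point order, which matches POSIX "C" collation for this ASCII input, and the period-skipping key below is only a rough imitation of a collation that ignores periods):

```python
s = "SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
rotations = [s[i:] + s[:i] for i in range(len(s))]

# Byte-wise (POSIX "C" locale) ordering:
print("".join(r[-1] for r in sorted(rotations)))
# -> TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT

# Rough imitation of a collation that ignores periods:
print("".join(r[-1] for r in sorted(rotations, key=lambda r: r.replace(".", ""))))
```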
As a visitor trying to use this page to work out my own implementation, it would be nice to know what sorting convention is being used for the example. Otherwise, I'd say that the example isn't very useful. -- 128.252.110.200 ( talk) 01:20, 7 March 2014 (UTC)
Could someone check the example, please? I was trying to verify something and my simple quick implementation of BWT outputs
TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
not
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
as the article writes.
(And note that I intend to at least replace the black dot with another character, or to remove it altogether. It is not good style to unnecessarily convey meaning only with a color variation; consider e.g. blind people.)
-- Mormegil 18:12, 18 June 2006 (UTC)
.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST
.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE
.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX
.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY
.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED
.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES
.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT
BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.
D.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXE
DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.
E.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXI
ED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIX
ES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXI
ESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOX
FT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SI
IE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIX
IES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIX
IFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.S
IX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESS
IXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.M
IXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.P
IXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.P
IXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.S
MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.
OXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.B
PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.
PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.
S.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIE
SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.
SSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXE
ST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DU
T.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUS
T.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIF
TY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIX
UST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.D
X.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSI
XED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MI
XESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BO
XIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PI
XIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PI
XTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SI
Y.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXT
Which gives "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"
Which is understandable, as a period ('.') has a smaller ASCII value than [A..Z|a..z]
-- Ambulnick 10:17, 27 Sept 2006 (GMT)
If you take the C code and run it on ".BANANA" instead of "Polska Wikipedia", you don't get the same answer as the article gives: BNN.AA@A, you get ANNB.AA. That's because "." sorts before capital letters in ascii, not after, as the example shows. Changing "." to "x" makes the example work. If someone else will verify that they see the same problem, I'll fix the example. But it's late at night -- I may just be confusing myself :-)
Cbogart2 06:04, 18 December 2006 (UTC)
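I see the same thing; a one-off check (Python, plain code-point ordering, no EOF character):

```python
s = ".BANANA"
rots = sorted(s[i:] + s[:i] for i in range(len(s)))
print("".join(r[-1] for r in rots))  # ANNB.AA -- '.' (0x2E) sorts before 'A' (0x41)
```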
The BWT is not just the transformed string. If using the method above you need to include something like the position of the original data (in the example above, where the SIX.MIXED row occurs). If we don't do this then people will wrongly think the normal BWT is just a permutation, when in fact it has two outputs: the permutation and an integer index. —Preceding unsigned comment added by 75.54.98.144 ( talk) 16:28, 2 March 2009 (UTC)
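That is, the transform proper returns a pair. A minimal sketch of both directions (Python; the names `bwt`/`ibwt` are just illustrative, and the inversion shown is the naive quadratic one):

```python
def bwt(s):
    """Rotation-based BWT: the permuted string AND the index of the original row."""
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots), rots.index(s)

def ibwt(last, idx):
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return table[idx]  # without idx, any rotation would be an equally valid answer

perm, idx = bwt("^BANANA|")
print(perm, idx)        # BNN^AA|A 6
print(ibwt(perm, idx))  # ^BANANA|
```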
Article says:
The Burrows-Wheeler transform (BWT, also called block-sorting compression)
Technically, BWT is not a compression method by itself. Should it be noted? It leads to a wrong understanding of the subject. — lim 16:54, 16 May 2007 (UTC)
Putting aside the issue that this is a really horrible example of program code, there's a major inefficiency in it: In the function rotlexcmp() if 'li' is equal to 'ri', it will needlessly traverse the entire data and then just return 0. According to my own experiments this makes it almost 100 times slower than it could be (well, depending on the exact implementation of the sorting function, of course). Simply adding a "if(li==ri) return 0;" at the beginning solves this issue and makes the program a lot faster with large datasets.
Wikipedia is not about tweaking the last bit of performance out of some algorithm. It only has to give the right output. —Preceding unsigned comment added by 129.13.72.198 ( talk) 08:35, 6 February 2009 (UTC)
We don't need two "sample implementations" and should not have both of them here. We are not a code repository. Maybe we should clean one of them up to focus on the algorithm (e.g. remove those #includes and main() in the C version)? -- Mormegil 17:54, 9 November 2007 (UTC)
Just for Fun.
BWT[str_] := Block[{list, n, data},
  list = StringSplit[str, ""]; n = Length@list;
  data = Sort@Table[RotateRight[list, i], {i, 1, n}];
  StringJoin@data[[All, -1]]
]

inBWT[str_, endOfFile_: "@"] := Block[{list, n, K},
  list = StringSplit[str, ""]; n = Length[list];
  K = Sort@Transpose@List@list;
  Do[K = Sort@Transpose@Prepend[Transpose@K, list], {n - 1}];
  StringJoin@Select[K, Last@# == endOfFile &][[1]]
] —Preceding unsigned comment added by 189.83.138.252 ( talk) 22:35, 24 October 2009 (UTC)
Does anybody have independent confirmation of the Bijective BWT actually working? After having read the link to David Scott's method, I'm extremely sceptical that his technique works.
The main reason why I'm extremely sceptical is because he essentially appears to be using the 1st column of the BWT reversed. If that works, then there's no reason why you couldn't use a binary input (instead of ASCII characters) and end up with what would surely be the world's most compressible file, consisting merely of a count of 0's and 1's. Tiggs ( talk) 00:16, 30 August 2008 (UTC)
Actually, Yuta Mori has put it in OpenBWT-v1.3, and people on one of the more active compression sites, http://encode.ru/forum/, have tested it with many files. Surprisingly, it often beats regular BWT-based compressors using code tuned for the standard BWT. The main drawback, besides being hard to understand, is that it is somewhat slower to execute than the normal BWT.
It is not the 1st column of the BWT reversed; you can try Scott's or Mori's code on any reasonably sized file to see that this is not what is happening. —Preceding unsigned comment added by 75.54.97.66 ( talk) 22:14, 3 October 2008 (UTC)
Agreed. I've spent some time looking over the source code and it is, indeed, a bijective transform.
Whether or not it should be classified as a Burrows–Wheeler transform, or whether it should be called a Scott transform, however, is a completely different question. As it changes the order that the characters are transformed into, I'd favour the latter. Tiggs ( talk) 20:53, 10 October 2008 (UTC)
Actually there really is no standard BWT, even for the index version. Many consider the one with full wrap-around on compares the original; others use a special character for EOF, some treating the EOF character as having the greatest value, some the lowest; there are many ways to do it. And there are even more ways to label the index. But if you take the BWT that uses the full wrap-around and do an UNBWTS on it, you get the original file back, except that it may be rotated, since there is no index. Actually the normal BWT could be thought of as a subset of BWTS. I have not heard of anyone calling it the Scott Transform, but even Yuta referred to it as the BWT Scottified. In fact there is a whole series of transforms that could be called bijective BWT transforms, but few know about them or understand the concept. But tests done by others show this version tends to lead to better compression than what the users there call a normal BWT. The field is wide open, not just for compression. —Preceding unsigned comment added by 75.54.97.66 ( talk) 02:33, 11 October 2008 (UTC)
Would it be possible to explain why the bijective version is interesting? Eg where is it used and why? Saving the cost of one index does not seem to justify it. The text seems to suggest it was created for more than its mathematical beauty. 128.16.7.220 ( talk) 15:17, 7 June 2014 (UTC) Bill
1) It allows the transform to be used on data that potentially saturates its character space, where you would otherwise have to make every character fatter to accommodate the EOF character. 2) Why not? A character's a character. — Preceding unsigned comment added by 108.52.71.107 ( talk) 03:23, 28 July 2014 (UTC)
I'm concerned that the only source for the bijective transformation is the paper by Gil and Scott that seems to be (in traditional terms) an "unpublished manuscript". That isn't quite up to Wikipedia standards. That isn't just useless policy-wanking in this case; the paper contains meaning-disturbing typos and mistakes that I don't think would have survived in a published paper. At least we shouldn't mislead readers into thinking this is as reliable a source as the ones we usually provide. Furthermore, Gil is an academic, but no reference to the paper appears on his web page, which could indicate that it was later retracted.
In particular: The paper presents a "string-sorting transform" with two variants, depending on how one compares strings of different length where one is a prefix of the other. It also presents an inverse transform, though one without any variants, and it is not quite stated which of the two main transforms it is an inverse of. According to my pencil-and-paper experiments, the inverse transform does seem to invert the "infinite-periodic" variant of the main transform. It does not work for the "usual lexicographical" variant, and indeed that variant is not a bijection! As a minimal example, the input strings "bab" and "abb" both transform to "bba".(*)
Unfortunately it is only the "usual lexicographical" transformation that the authors provide a linear-time algorithm for (though it is very sketchy, and I'm not convinced by the time analysis). For the infinite-periodic variant, the authors explicitly deny having a linear algorithm. They claim to have an O(nlogn) one, but do not describe it further.
While a bijective variant of the BWT is indeed an interesting and worthy topic, the one presented by the paper is not quite as ready for use as the paper's abstract suggests.
(*) Worked out side by side:

    Input:                 bab        abb
    Lyndon factorization:  b·ab       abb
    Cyclic rotations:      b; ab, ba  abb, bba, bab
    Sorted:                ab, b, ba  abb, bab, bba
    Output:                bba        bba
(The "infinte-periodic" variant sorts "b" after "ba" and therefore transforms "bab" to "bab").
— Henning Makholm ( talk) 13:44, 6 December 2010 (UTC)
You have the correct transform for abb, which is bba. However bab, which breaks up into b·ab, is sorted wrong; it should be
ab → b, ba → a, b → b,
so the BWTS of bab is bab: it is its own BWTS. You repeat each substring as needed before the comparison, so b sorts after ba, not before. One conceptual way to do this is to repeat the substrings to a common length; for example, if the substrings have lengths 3, 6, 2, 1, 5, repeat each until it is 30 characters long, then sort as you would in a normal BWT.
I also don't think the last changes to the explanation help people understand what it is. I would like to change it back, but would prefer someone else fix the English. I read what is now there and I don't think it helps many people understand the concept. — Preceding unsigned comment added by 99.152.26.69 ( talk) 15:15, 17 January 2012 (UTC)
When comparing two strings lexicographically you pretend the shorter string is padded with spaces; this is not the same as comparing two unequal-length strings in the bijective BWT, where you keep repeating the strings until there is a difference. — Preceding unsigned comment added by 66.136.119.147 ( talk) 23:25, 30 September 2012 (UTC)
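The difference is easy to show concretely (a sketch; repeating out to a fixed bound stands in for the "repeat until a difference appears" rule):

```python
def periodic_prefix(s, n):
    # first n characters of s repeated indefinitely
    return (s * (n // len(s) + 1))[:n]

# Ordinary lexicographic comparison treats the shorter string as smaller:
assert "b" < "ba"
# Under infinite repetition, "bbbb..." > "baba...", so the order flips:
assert periodic_prefix("b", 4) > periodic_prefix("ba", 4)
print(periodic_prefix("b", 4), periodic_prefix("ba", 4))  # bbbb baba
```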
Section "BWT in bioinformatics" and pages Compressed suffix array and FM-index should probably be merged into a single page. They are all about essentially the same data structure:
The differences between various compressed suffix arrays and FM-indexes are mostly in the encodings they use. The following survey articles are probably the best resources about compressed self-indexes so far:
Jouni Sirén ( talk) 23:48, 18 December 2010 (UTC)
The article is very nicely written and clear even for a beginner like me. But I didn't get one point:
Would it be possible to explain, why step 3 in inverse transformation is possible? Sort and concatenate with BW'ed string in step 2 is clarified in the text, but why one can sort again and concatenate the BW'ed string again? Though it works.
cheers, Oliver Drechsel ( talk) 08:39, 20 January 2011 (UTC)
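Regarding why the repeated sort-and-prepend works: each round recovers one more column of the sorted-rotations table. After k rounds the table holds, in sorted order, the length-k prefixes of all rotations, so sorting again before prepending the next column is exactly what maintains that invariant. A small trace (Python; '|' is assumed to be the end-of-string marker from the example):

```python
bwt_out = "BNN^AA|A"  # BWT of "^BANANA|"
table = [""] * len(bwt_out)
for step in range(len(bwt_out)):
    # prepend the BWT column to each row, then re-sort
    table = sorted(c + row for c, row in zip(bwt_out, table))
    print(step + 1, table)
# After the last round the table is exactly the sorted rotations;
# the row ending in '|' is the original string.
print(next(row for row in table if row.endswith("|")))  # ^BANANA|
```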
the added character @ should be lexicographically before any other character, so:
ANANA@^B
ANA@^BAN
A@^BANAN
BANANA@^
NANA@^BA
NA@^BANA
^BANANA@
@^BANANA
should be

@^BANANA
A@^BANAN
ANA@^BAN
ANANA@^B
BANANA@^
NA@^BANA
NANA@^BA
^BANANA@

(note that '@' also sorts before the letters in the later character positions, not just the first), and the resulting string would be ANNB^AA@ — Preceding unsigned comment added by 84.223.127.232 ( talk) 19:33, 12 April 2012 (UTC)
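Mechanically, with plain byte ordering (where '@' = 0x40 sorts before the letters at every position, not just the first, which affects the relative order of the A- and N-rows), the check looks like this; the input string is the one from the example:

```python
s = "^BANANA@"
rots = sorted(s[i:] + s[:i] for i in range(len(s)))
for r in rots:
    print(r)
print("".join(r[-1] for r in rots))  # ANNB^AA@
```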
Calling bwt('abracadabra') results in ('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) , but calling ibwt('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) results in 'aabracadabr'.
This article is beautiful. Thank you so much to all for the work in making this algorithm understandable. I mentally cried tears of joy because I was expecting the worst. I've never seen a Wikipedia article so carefully curated to its intended audience.
-- Dominic ( talk) 12:58, 23 December 2016 (UTC)
The algorithm claims to be linear in time, but clearly it cannot be, since for a string of length N it requires the construction of an N×N matrix, or, since it is symmetric (over negation), an upper or lower triangular N×N matrix, which requires O(N^2) to manipulate.
That glaring flaw aside, it is actually quite a nice construction algorithm, and elucidates some nice intuitions about the problem of BWT construction, but claiming it is efficient is simply wrong from an algorithmic perspective. Drtomconway ( talk) 22:33, 22 December 2019 (UTC)
It may be "by convention", but the Example claims that rotation is done 8 times. It. Is. Not. It is done 7 times, for a total of 8 strings if you include the original un-rotated string. That is, N = 7 for a string of 8 characters. It should also, imho, be mentioned that the sorting of 'special characters' (which in this case would include both the caret (^) and the EOS (end of string? end of line?) character (|)) depends on where they are arbitrarily positioned in the sort order (it isn't necessary that both occur before the 'a' or after the 'z' of "lexical order"). 98.21.213.235 ( talk) 23:31, 10 February 2021 (UTC)
It is common to use ^ and $ to start and end a string, respectively -- see Regular expression#POSIX basic and extended. Should we use $ instead of | in this article? — Quantling ( talk | contribs) 03:26, 14 November 2023 (UTC)
The original Burrows–Wheeler paper has neither a string-starting character (like ^) nor a string-ending character (like @ or | or $), and does not invoke them in any way. Can I remove them from this Wikipedia article? — Quantling ( talk | contribs) 00:33, 15 November 2023 (UTC)
![]() | This article is rated C-class on Wikipedia's
content assessment scale. It is of interest to the following WikiProjects: | |||||||||||||
|
The text says that EOF is ignored when sorting, but the example seems to suggest that EOF is considered to come after all normal letters. AxelBoldt 14:14 Aug 26, 2002 (PDT)
The dot wasn't an EOF. It was just a dot that was there to make the rotations stand out visually. Hopefully it's clearer now. -- LC 08:06 Aug 27, 2002 (PDT)
Obviously sufficiently many repeats of the transform must eventually restore the original text, and obviously that provides another (usually inefficient?) way of inverting a transform (just store the inputs temporarily, and give back the last input before the text to be inverted turns up again). But one interesting thing is this: since the original is itself the result of a transform on some text, clearly the transform does not necessarily always produce something better suited for compression. It must be something that is statistically likely, based on properties that are common in many inputs. What are these, and what are the odds that a transform won't actually help and might even hurt? If we do get one of these cases, how can we tell (cheaply)? And so on. PML.
Assuming you're talking about the Bijective version (since your claim is false of the other versions), then the statistical property it takes advantage of is the tendency for letter sequences to be exactly repeated more often than merely similar letter sequences. Exact matches create runs of the same letter. Like, when you're alphabetizing rotations of a book and you get to the 'uick' section, it's a pretty good bet that the previous character (the output from the BWT for that letter) is 'q'; when you get to the 'the' section, there are going to be a lot of spaces as the output character, from all the instances of 'the', 'then', 'there', and 'their' and others, but some 'o' will be mixed in from 'other', etc. If you already have a lot of repeats or are simply high-entropy, it's not going to help.
How can repeating the transform restore the original text? The transform is a not a bijection between strings of length n. It is a one-to-one function from strings of length n into the set of ordered pairs of a string of length n and an integer less than n. Thus, different inputs can produce the same output string with a different index. For example .BANANA. and ANA..BAN produce the same output string. -- Luke Somers
There is a simple way to avoid the use of an EOF symbol which is treated differently in the many verisons of BWT. Since any rotation of the original string gets you the same BWT string output. Which is the way the Burrows Wheeler Transform is usually first discussed in the literature. One problem with the Special symbol for EOF is that when dealing with a file on 256 character types one needs to add a special character in to do the BWT. Then one has to pick its value some make it the heaviest symbol some the lightest and in some binary versions they pick a weight half way in between. One method that I have used to get back to the traditional BWT is to rotate the string until it represents a single lyndon word. And then do the BWT as done in the start of this article and pass as an index the number of rotations needed to get the start of original string. One nice feature of this it that when the string is rotated to make a single lyndon word the string resulting from the classical BWT and the Bijective BWT are the SAME. â Preceding unsigned comment added by 69.152.152.134 ( talk) 15:23, 22 August 2012 (UTC)
The current use of special symbols is... concerning. The original 1994 article does not even use an EOF for the "simple version"; instead it just keeps rotating the columns till it finds something that's sorted when decoding. (This could be the idea Special:Diff/715996790 was trying to get across, but... bruv you need to re-read the article.) EOF was only introduced in section 4.1, whereas the decompression algorithm does not even involve the rotation in the simple version. And don't even get me started on who came up with the idea of adding a "start-of-string marker". -- Artoria 2e5 đ 03:18, 1 January 2020 (UTC)
This article gives a really bad example for this algorithm. Take the text:
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
and replace all repeated substrings (as deflate would detect them) with a special control character which I'll represent as _
:
SIX.M_ED.P_IES._FT_XTY_.DUS_BOX_
Now take the transformed text:
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
and do the same:
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Clearly, the original string compresses better. And indeed gzip and bzip2 confirm this: gzip compresses the above to 67 bytes, bzip2 to 69 bytes. (Both can hardly be called compression: the original was only 44 bytes ...) â Timwi 14:14, 9 Nov 2004 (UTC)
That is unfortunately the curse of writing about data compression: in a substantial number of cases, it can be almost impossible to find an example small enough to be practical, and clear enough to show the principle involved, but large enough so that the savings achieved with the method can be immediately seen to be significant. Add in a requirement about 'no other method can do this example better' and I am afraid you're asking for the impossible. IMHO, it's much better to use an example that shows the principle clearly and note in the text if needed that 'in this example, the saving from the run-length encoding doesn't overcome the overhead of the encoding; in a larger example which produced longer runs, it would.' After all, generations of textbooks have used recursive Fibonacci algorithms to demonstrate recursion when that's not the most efficient way to calculate Fibonacci numbers.
By the way, have you noticed that your deflate example is incorrect? Deflate operates on strings of minimum length 3. Even with your simplification of all replaced substrings taking as much space as one literal, here's how it would actually look:
SIX.MIXED.PIXIES.SIFT._TY_.DUST.BOXES
vs.
TEXYDST.E.XII_XXSMPPSS.B.__EEUSFXDIOI_T
Not a stellar improvement. -- Antaeus Feldspar 15:11, 9 Nov 2004 (UTC)
but the fact that the authors (because they didn't understand why it was needed and left some part out) didn't implement the BWT properly in bzip, causing extremely long (as in years) of compression time on some inputs. The RLE pre-pass only exists to avoid these "degenerate" cases. If the BWT would have been implemented properly this would not be needed, and in fact, a lot of documentation still refers to fixed-size blocks of input being compressed, which is no longer the case due to the RLE prepass. So RLE is a bug workaround, it's not the best method to apply after a BWT, a straight MFT encoder is. 2001:470:D08D:0:0:0:0:1 ( talk) 07:20, 25 January 2013 (UTC)
LZ77 | BWT | RLE | Huffman/Arithmetic | |
---|---|---|---|---|
repeated multi-byte sequences | yes, at cost of two parameters (offset, length) | converts into single-character runs | no | no |
single-character runs | yes, at cost of two parameters (offset, length) | no | yes, at cost of single parameter (length of run) | no |
non-flat frequency distribution of characters | no | no | no | yes, exploits optimally |
This whole discussing about a bad example is rather pointless and can be easily solved by quoting burrows and wheeler form there original report (to read it yourself => you can find it at the sources-list in the article) "The size of the input block must be large (a few kilobytes) to achieve good compression." ~ Band B 22:40, 14 February 2007 (UTC)
Is there any reason that the C implementation given in the Polish Wikipedia shouldn't be included here? -- Shutranm 21:47, 20 May 2005 (UTC)
function bwt () {
  S=$1
  X=$(echo "$S" | sed -E 's/^(.*)(.)$/\2\1/')
  { echo "$S"
    while [ "$X" != "$S" ]; do
      echo "$X"
      X=$(echo "$X" | sed -E 's/^(.*)(.)$/\2\1/')
    done; } | sort | sed -E 's/^.*(.)$/\1/' | tr -d '\n'
  echo
}
bwt "^BANANA|"
function bwt () { generateRotations $1 | sort | takeLastChar | concatLines }
— Preceding unsigned comment added by 2806:106E:B:DCE2:B91C:FFCC:4813:8157 ( talk) 19:52, 12 October 2021 (UTC)
Nobody reading this page cares about Posix collation. Why don't we pick a sample string with no such issues? -- Doradus 17:45, 16 January 2006 (UTC)
==== Note on sorting convention ====
If you sort with [[Posix]] [[collating]], you get the slightly different string TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT instead of TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT. [[ISO 8859]] has complex collating rules, but in this case, periods are ignored. Posix collating treats periods as characters.
As a visitor trying to use this page to work out my own implementation, it would be nice to know what sorting convention is being used for the example. Otherwise, I'd say that the example isn't very useful. -- 128.252.110.200 ( talk) 01:20, 7 March 2014 (UTC)
Could someone check the example, please? I was trying to verify something and my simple quick implementation of BWT outputs
TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
not
TEXYDST.E.XIIXIXXSMPPSS.B...S.EEUSFXDIOIIIIT
as the article writes.
(And note that I intend to at least replace the black dot with another character, or to remove it altogether. It is not a good style to unnecessarily convey meaning only with a color variation, mind e.g. blind people.)
-- Mormegil 18:12, 18 June 2006 (UTC)
.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST
.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE
.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX
.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY
.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED
.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES
.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT
BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.
D.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXE
DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.
E.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXI
ED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIX
ES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXI
ESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOX
FT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SI
IE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIX
IES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIX
IFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.S
IX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESS
IXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.M
IXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.P
IXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.P
IXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.S
MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.
OXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.B
PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.
PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.
S.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIE
SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.
SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.
SSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXE
ST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DU
T.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUS
T.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIF
TY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIX
UST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.D
X.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSI
XED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MI
XESSIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BO
XIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXTY.PI
XIES.SIFT.SIXTY.PIXIE.DUST.BOXESSIX.MIXED.PI
XTY.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SI
Y.PIXIE.DUST.BOXESSIX.MIXED.PIXIES.SIFT.SIXT
Which gives "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"
Which is understandable, as '.' has a smaller ASCII value than [A..Z] or [a..z]
-- Ambulnick 10:17, 27 Sept 2006 (GMT)
If you take the C code and run it on ".BANANA" instead of "Polska Wikipedia", you don't get the same answer as the article gives: BNN.AA@A, you get ANNB.AA. That's because "." sorts before capital letters in ascii, not after, as the example shows. Changing "." to "x" makes the example work. If someone else will verify that they see the same problem, I'll fix the example. But it's late at night -- I may just be confusing myself :-)
Cbogart2 06:04, 18 December 2006 (UTC)
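This is easy to check mechanically. A throwaway sketch (not the article's C code) that sorts rotations in plain byte order, where '.' (0x2E) comes before 'A' (0x41):

```python
def last_column(s):
    # Sort all rotations byte-wise and read off the last column.
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rows)

print(last_column('.BANANA'))  # ANNB.AA, as Cbogart2 reports
```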
The BWT is not just the transformed string. If using the method above, you need to include something
like the position of the original data — in the example above, where the SIX.MIXED row occurs. If we don't do this then people will wrongly think the normal BWT is just a permutation, when in fact it has two outputs: the permutation and an integer index. —Preceding unsigned comment added by 75.54.98.144 ( talk) 16:28, 2 March 2009 (UTC)
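A minimal sketch of the point (my own naming, not code from the article): the transform proper returns the last column together with the row index where the original string landed in the sorted table.

```python
def bwt_with_index(s):
    # Sorted table of all rotations; the output is its last column plus
    # the row number where the unrotated input ended up.
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rows), rows.index(s)
```

Without the index, the inverse can only recover the input up to rotation.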
Article says:
The Burrows-Wheeler transform (BWT, also called block-sorting compression)
Technically, BWT is not a compression method. Should it be noted? It leads to a wrong understanding of the subject — lim 16:54, 16 May 2007 (UTC)
Putting aside the issue that this is a really horrible example of program code, there's a major inefficiency in it: In the function rotlexcmp() if 'li' is equal to 'ri', it will needlessly traverse the entire data and then just return 0. According to my own experiments this makes it almost 100 times slower than it could be (well, depending on the exact implementation of the sorting function, of course). Simply adding a "if(li==ri) return 0;" at the beginning solves this issue and makes the program a lot faster with large datasets.
Wikipedia is not about tweaking the last bit of performance out of some algorithm. It only has to give the right output. —Preceding unsigned comment added by 129.13.72.198 ( talk) 08:35, 6 February 2009 (UTC)
We don't need two "sample implementations" and should not have both of them here. We are not a code repository. Maybe we should clean one of them to focus on the algorithm (e.g. remove those #includes and main() in the C version)? -- Mormegil 17:54, 9 November 2007 (UTC)
Just for Fun.
BWT[str_] := Block[{list, n, data},
  list = StringSplit[str, ""]; n = Length@list;
  data = Sort@Table[RotateRight[list, i], {i, 1, n}];
  StringJoin@data[[All, -1]]
]
inBWT[str_, endOfFile_: "@"] := Block[{list, n, K},
  list = StringSplit[str, ""]; n = Length[list];
  K = Sort@Transpose@List@list;
  Do[K = Sort@Transpose@Prepend[Transpose@K, list], {n - 1}];
  StringJoin@Select[K, Last@# == endOfFile &][[1]]
] —Preceding unsigned comment added by 189.83.138.252 ( talk) 22:35, 24 October 2009 (UTC)
Does anybody have independent confirmation of the Bijective BWT actually working? After having read the link to David Scott's method, I'm extremely sceptical that his technique works.
The main reason why I'm extremely sceptical is because he essentially appears to be using the 1st column of the BWT reversed. If that works, then there's no reason why you couldn't use a binary input (instead of ASCII characters) and end up with what would surely be the world's most compressible file, consisting merely of a count of 0's and 1's. Tiggs ( talk) 00:16, 30 August 2008 (UTC)
Actually, Yuta Mori has put it in OpenBWT-v1.3, and people on one of the more active compression sites, http://encode.ru/forum/ , have tested it with many files. Surprisingly it often beats regular BWT-based compressors using code tuned for the standard BWT; the main drawback, besides being hard to understand, is that it is somewhat slower to execute than the normal BWT.
It is not the 1st column of the BWT reversed; you can try Scott's or Mori's code on any reasonably sized file to see that this is not what is happening. —Preceding unsigned comment added by 75.54.97.66 ( talk) 22:14, 3 October 2008 (UTC)
Agreed. I've spent some time looking over the source code and it is, indeed, a bijective transform.
Whether or not it should be classified as a Burrows Wheeler transform, or whether it should be called a Scott Transform, however, is a completely different question. As it changes the order that the characters are transformed into, I'd favour the later. Tiggs ( talk) 20:53, 10 October 2008 (UTC)
Actually there really is no standard BWT, even for the indexed version. Many consider the one with full wrap-around on compares to be the original; others use a special character for EOF, some treating the EOF character as having the greatest value, some the lowest. There are many ways to do it, and even more ways to label the index. But if you take the BWT that uses the full wrap-around and do an UNBWTS on it, you get the original file back, except that it may be rotated, since there is no index. So the normal BWT could be thought of as a subset of BWTS. I have not heard of anyone calling it the Scott Transform; even Yuta referred to it as the BWT "Scottified". In fact there is a whole series of transforms that could be called bijective BWT transforms, but few know about them or understand the concept. Tests done by others show this version tends to lead to better compression than what the users there call a normal BWT. The field is wide open, not just for compression. —Preceding unsigned comment added by 75.54.97.66 ( talk) 02:33, 11 October 2008 (UTC)
Would it be possible to explain why the bijective version is interesting? Eg where is it used and why? Saving the cost of one index does not seem to justify it. The text seems to suggest it was created for more than its mathematical beauty. 128.16.7.220 ( talk) 15:17, 7 June 2014 (UTC) Bill
1) It allows it to be used on data that potentially saturates its character space, where you would otherwise have to make every character wider to accommodate the EOF character. 2) Why not? A character's a character. — Preceding unsigned comment added by 108.52.71.107 ( talk) 03:23, 28 July 2014 (UTC)
I'm concerned that the only source for the bijective transformation is the paper by Gil and Scott that seems to be (in traditional terms) an "unpublished manuscript". That isn't quite up to Wikipedia standards. That isn't just useless policy-wanking in this case; the paper contains meaning-disturbing typos and mistakes that I don't think would have survived in a published paper. At least we shouldn't mislead readers into thinking this is as reliable a source as the ones we usually provide. Furthermore, Gil is an academic, but no reference to the paper appears on his web page, which could indicate that it was later retracted.
In particular: The paper presents a "string-sorting transform" with two variants, depending on how one compares strings of different length where one is a prefix of the other. It also presents an inverse transform, though one without any variants, and it is not quite stated which of the two main transforms it is an inverse of. According to my pencil-and-paper experiments, the inverse transform does seem to invert the "infinite-periodic" variant of the main transform. It does not work for the "usual lexicographical" variant, and indeed that variant is not a bijection! As a minimal example, the input strings "bab" and "abb" both transform to "bba".(*)
Unfortunately it is only the "usual lexicographical" transformation that the authors provide a linear-time algorithm for (though it is very sketchy, and I'm not convinced by the time analysis). For the infinite-periodic variant, the authors explicitly deny having a linear algorithm. They claim to have an O(nlogn) one, but do not describe it further.
While a bijective variant of the BWT is indeed an interesting and worthy topic, the one presented by the paper is not quite as ready for use as the paper's abstract suggests.
(*)
  Input:                 bab         abb
  Lyndon factorization:  b·ab        abb
  Cyclic rotations:      b, ab, ba   abb, bba, bab
  Sorted:                ab, b, ba   abb, bab, bba
  Output:                bba         bba
(The "infinite-periodic" variant sorts "b" after "ba" and therefore transforms "bab" to "bab").
— Henning Makholm ( talk) 13:44, 6 December 2010 (UTC)
You have the correct transform for abb, which is bba;
however bab, which breaks up into b·ab, is sorted wrong. It should be
ab → b, ba → a, b → b.
You repeat each substring as needed before the comparison, so b sorts after ba, not before; thus the BWTS of bab is bab — it is its own BWTS. One conceptual way to do this is to repeat each substring to a common length: for example, if the substrings have lengths 3, 6, 2, 1, 5, use 30 characters for each, then sort as you would in a normal BWT.
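The repetition rule can be made concrete. A rough sketch of the bijective scheme as described above (Lyndon factorization via Duval's algorithm, then the rotations of all factors sorted as if each were repeated forever); this is my own reading of the scheme, not Scott's or Mori's code:

```python
def lyndon_factors(s):
    # Duval's algorithm: split s into a non-increasing run of Lyndon words.
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def bwts(s):
    # All rotations of each Lyndon factor, compared as if repeated forever
    # (a key of length 2*len(s) is enough to decide any comparison).
    rots = [w[r:] + w[:r] for w in lyndon_factors(s) for r in range(len(w))]
    n = len(s)
    rots.sort(key=lambda w: (w * (2 * n // len(w) + 1))[:2 * n])
    return ''.join(w[-1] for w in rots)
```

On Henning's minimal example this reproduces the "infinite-periodic" behaviour: bwts('abb') gives 'bba', while bwts('bab') gives 'bab' back.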
I also don't think the last changes to the explanation help people understand what it is. I would like to change it back but would prefer someone else fix the English. I read what is now there and I don't think it helps many people understand the concept. — Preceding unsigned comment added by 99.152.26.69 ( talk) 15:15, 17 January 2012 (UTC)
When one is comparing two strings lexicographically, you pretend the shorter string is padded with spaces; this is not the same as comparing two unequal-length strings in the bijective BWT, where you keep repeating the strings until there is a difference. — Preceding unsigned comment added by 66.136.119.147 ( talk) 23:25, 30 September 2012 (UTC)
Section "BWT in bioinformatics" and pages Compressed suffix array and FM-index should probably be merged into a single page. They are all about essentially the same data structure:
The differences between various compressed suffix arrays and FM-indexes are mostly in the encodings they use. The following survey articles are probably the best resources about compressed self-indexes so far:
Jouni Sirén ( talk) 23:48, 18 December 2010 (UTC)
The article is very nicely written and clear even for a beginner like me. But I didn't get one point:
Would it be possible to explain why step 3 in the inverse transformation is possible? Sorting and concatenating with the BW'ed string in step 2 is clarified in the text, but why can one sort again and concatenate the BW'ed string again? Though it works.
cheers, Oliver Drechsel ( talk) 08:39, 20 January 2011 (UTC)
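One way to see it: after k rounds, the table holds, in sorted order, the first k characters of every rotation of the text. The BWT string lists, row by row of that sorted order, the character that cyclically precedes each rotation, so prepending it extends every prefix by one correct character, and re-sorting restores the invariant for k+1. A small sketch (my own illustration, not the article's code) that records each round:

```python
def inverse_bwt_steps(code):
    # Invariant: after k rounds, `table` is the sorted list of the
    # first k characters of every rotation of the original text.
    table = [''] * len(code)
    steps = []
    for _ in range(len(code)):
        table = sorted(code[i] + table[i] for i in range(len(code)))
        steps.append(list(table))
    return steps
```

With code = 'NNBAAA' (the transform of 'BANANA' with no EOF marker), the final round contains the full sorted rotation table, including 'BANANA' itself.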
the added character @ should be lexicographically before any other character, so:
ANANA@^B
ANA@^BAN
A@^BANAN
BANANA@^
NANA@^BA
NA@^BANA
^BANANA@
@^BANANA
should be
@^BANANA
ANANA@^B
ANA@^BAN
A@^BANAN
BANANA@^
NANA@^BA
NA@^BANA
^BANANA@
and the resulting string would be ABNN^AA@ â Preceding unsigned comment added by 84.223.127.232 ( talk) 19:33, 12 April 2012 (UTC)
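Whatever convention is intended, the ordering can be checked mechanically. Under a plain byte-order sort ('@' is 0x40, just before 'A' at 0x41), a row beginning 'A@' sorts before one beginning 'AN', so the sorted table and last column come out slightly differently from the lists above; a quick sketch:

```python
s = '^BANANA@'
rows = sorted(s[i:] + s[:i] for i in range(len(s)))  # plain byte order
# rows[0] is '@^BANANA' and rows[1] is 'A@^BANAN'
print(''.join(r[-1] for r in rows))  # ANNB^AA@
```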
Calling bwt('abracadabra') results in ('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) , but calling ibwt('rdarcaaaabb', [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]) results in 'aabracadabr'.
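The output 'aabracadabr' is 'abracadabra' rotated by one place: the table was rebuilt correctly, but the wrong row was read out. A sketch (my own minimal version, not the article's code) that stores a single row index with the transform and uses it in the inverse:

```python
def bwt(s):
    # Last column of the sorted rotation table, plus the row of the input.
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rows), rows.index(s)

def ibwt(code, idx):
    # Rebuild the sorted rotation table one column per round, then
    # return the row the forward transform recorded.
    table = [''] * len(code)
    for _ in range(len(code)):
        table = sorted(code[i] + table[i] for i in range(len(code)))
    return table[idx]
```

With these, ibwt(*bwt('abracadabra')) returns 'abracadabra' rather than a rotation of it.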
This article is beautiful. Thank you so much to all for the work in making this algorithm understandable. I mentally cried tears of joy because I was expecting the worst. I've never seen a Wikipedia article so carefully curated to its intended audience.
-- Dominic ( talk) 12:58, 23 December 2016 (UTC)
The algorithm claims to be linear in time, but clearly it cannot be, since for a string of length N it requires the construction of an N×N matrix, or, since it is symmetric (over negation), an upper or lower triangular N×N matrix, which requires O(N^2) to manipulate.
That glaring flaw aside, it is actually quite a nice construction algorithm, and it elucidates some nice intuitions about the problem of BWT construction, but claiming it is efficient is simply wrong from an algorithmic perspective. Drtomconway ( talk) 22:33, 22 December 2019 (UTC)
It may be "by convention", but the example claims that rotation is done 8 times. It. Is. Not. It is done 7 times, for a total of 8 strings if you include the original un-rotated string. That is, N = 7 for a string of 8 characters. It should also, imho, be mentioned that the sorting of 'special characters' (which in this case would include both the caret (^) and the EOS (end of string? end of line?) character (|)) depends on where they are arbitrarily positioned in the sort order (it isn't necessary that both occur before the 'a' or after the 'z' of "lexical order"). 98.21.213.235 ( talk) 23:31, 10 February 2021 (UTC)
It is common to use ^ and $ to start and end a string, respectively -- see Regular expression#POSIX basic and extended. Should we use $ instead of | in this article? — Quantling ( talk | contribs) 03:26, 14 November 2023 (UTC)
The original Burrows–Wheeler paper has neither a string-starting character (like ^) nor a string-ending character (like @ or | or $), and does not invoke them in any way. Can I remove them from this Wikipedia article? — Quantling ( talk | contribs) 00:33, 15 November 2023 (UTC)