This is the talk page for discussing improvements to the Huffman coding article. This is not a forum for general discussion of the article's subject.
This article is rated C-class on Wikipedia's content assessment scale.
This section has several inaccuracies and would benefit from links to other pure entropy-encoding algorithms.
"Although Huffman's original algorithm is optimal for a symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution,"
This is incorrect. Huffman is optimal among prefix codes, but there are non-prefix-code coders that give better compression, so clearly Huffman is not optimal out of all of them. (What it has in its favour is simplicity, especially in hardware implementations.) I also don't understand why Arithmetic coding and LZW are grouped together here. That muddles up modelling with entropy encoding. Many of the most widely used LZ algorithms also utilise Huffman (e.g. gzip / Deflate) - these are totally orthogonal issues and it is confusing to the reader to imply otherwise. I would therefore drop LZW from this statement and add Asymmetric Numeral Systems instead, as it sits alongside Arithmetic Coding as a direct alternative for entropy encoding, and we should be unbiased in our listing of alternatives.
"Prefix codes tend to have inefficiency on small alphabets, where probabilities often fall between these optimal points. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2^−1 = 0.5, making the upper limit of inefficiency unbounded. Combining symbols together ("blocking") can be used to make Huffman coding as efficient as theoretically possible."
Agreed on the principle of symbol combining to improve efficiency, but it cannot make Huffman "as efficient as theoretically possible" unless we want to encode the entire stream as a single combined symbol (and then how do we encode the table?). Rather, it reduces the inefficiency. Nothing quite matches the theoretical Shannon limit, and Huffman isn't even as efficient as most arithmetic coders, even with blocking. Also, if we have a large alphabet with one very common symbol (e.g. byte 0 out of 256 byte values, all used) then symbol combining cannot solve this unless we move to an alphabet larger than 256 symbols, which brings its own set of inefficiencies when implementing. I see therefore why RLE is added here too, but again it starts to feel like we're muddling up modelling with entropy encoding at this point.
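To make the first point concrete, here is a quick Python sketch (the distribution is made up purely for illustration): once one symbol's probability gets close to 1, a binary Huffman code is pinned at one bit per symbol or more, while the entropy can be far lower.

```python
import heapq
import math

# Hypothetical skewed distribution: one dominant symbol among a few rare ones.
probs = {"a": 0.95, "b": 0.02, "c": 0.02, "d": 0.01}

def huffman_lengths(p):
    """Code lengths of a binary Huffman code for distribution p."""
    heap = [(w, [s]) for s, w in p.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in p}
    while len(heap) > 1:
        w1, g1 = heapq.heappop(heap)
        w2, g2 = heapq.heappop(heap)
        for s in g1 + g2:
            lengths[s] += 1  # every merge deepens the merged symbols by one
        heapq.heappush(heap, (w1 + w2, g1 + g2))
    return lengths

lengths = huffman_lengths(probs)
avg = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(avg, entropy)  # Huffman averages 1.08 bits/symbol; entropy is ~0.36 bits/symbol
```

No amount of code assignment can get the dominant symbol below one bit; only blocking (or a different coder) can.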
Symbol combining and RLE are valid and well-used techniques (I've used both myself, even with other more advanced entropy encoders), so it's good to have them included here to educate readers on how to practically work around the limitations if and when they crop up. However, it could perhaps do with some practical examples. The most obvious is a stream of just two symbols with no inter-symbol correlation, e.g. A A A B A A B A B A. Here p(A)=7/10 and p(B)=3/10, giving an entropy of around 0.88 bits per symbol, but Huffman can only ever assign the codes 0 and 1, giving exactly 1 bit per symbol and 0% compression. We could combine pairs into the symbols AA AB AA BA BA, which on a larger data set would offer some compression using Huffman bit lengths 1 (AA), 2 (AB) and 3 (BA, BB), for 0.905 bits per symbol. Further improvements can be made by combining more symbols together into successively larger alphabets. Such techniques do not work with an already large alphabet, requiring alternative methods such as Run Length Encoding.
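The 0.905 figure can be checked with a short Python sketch (a minimal Huffman-length routine written just for this example, assuming independent symbols so that pair probabilities multiply):

```python
import heapq

def huffman_lengths(probs):
    """Code lengths of a binary Huffman code for a symbol->probability dict."""
    heap = [(w, [s]) for s, w in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        w1, g1 = heapq.heappop(heap)
        w2, g2 = heapq.heappop(heap)
        for s in g1 + g2:
            depth[s] += 1  # each merge pushes the merged symbols one level deeper
        heapq.heappush(heap, (w1 + w2, g1 + g2))
    return depth

# Two-symbol source from the example: p(A) = 0.7, p(B) = 0.3.
single = {"A": 0.7, "B": 0.3}
d1 = huffman_lengths(single)
bits1 = sum(single[s] * d1[s] for s in single)     # exactly 1 bit/symbol

# Block pairs of symbols: p(XY) = p(X) * p(Y) under independence.
pairs = {x + y: single[x] * single[y] for x in single for y in single}
d2 = huffman_lengths(pairs)
bits2 = sum(pairs[s] * d2[s] for s in pairs) / 2   # bits per ORIGINAL symbol
print(bits1, bits2)  # ~1.0 unblocked vs ~0.905 with pair blocking
```

The blocked code's lengths come out as 1 (AA), 2 and 3 for the mixed pairs, and 3 (BB), matching the figures above.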
80.229.230.181 (talk) 09:32, 24 July 2017 (UTC) James
Under the n-ary Huffman coding section, it is mentioned that "The same algorithm applies as for binary (n equals 2) codes, except that the n least probable symbols are taken together, instead of just the 2 least probable." - but is that really the case?
Let's look at a simple example: a trinary code (n equals 3) with four symbols (call them w, x, y and z, in increasing order of probability). The algorithm as described will group w, x and y under a single node, and then (supposedly - as there are only two nodes remaining, not three, this isn't explained explicitly, but I think this is the reasonable way to understand it) group that new node and z under one more new node, which ends up as the root:
      root
      /  \
   wxy    z
  / | \
 w  x  y
So the root only has two children (which is OK, the article does mention the possibility that "additional 0-probability place holders must be added"). But surely this can't be the optimum, as better compression would be achieved by moving one of the first three symbols (optimally y) to occupy the free position directly under the root, rather than its current, deeper position:
     root
    / | \
  wx  y  z
 /  \
w    x
This prompted me to look at Huffman's paper (http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf as referred to from the article), where I think the issue is actually addressed, by saying (left column of the fourth - and last - page, which is page 1101) "... it will be required that all these brackets, with the possible exception of one combining the least probable messages of the original ensemble, always combine a number of messages equal to D." (emphasis mine).
I think the article should be fixed to reflect this and to explain the algorithm more clearly. And I'd rather leave this task to someone more proficient than myself (but I might do it myself eventually if I notice in some time that this hasn't been addressed). 95.90.235.121 (talk) 12:00, 29 December 2020 (UTC)
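For what it's worth, the rule from Huffman's paper can be sketched in Python (a toy implementation with made-up probabilities for w < x < y < z): padding with zero-probability placeholders until the symbol count satisfies (count − 1) mod (n − 1) = 0 makes the "short" merge happen first, among the least probable items, and reproduces the second (better) tree above.

```python
import heapq

def nary_huffman_lengths(probs, n):
    """Code lengths for an n-ary Huffman code, padding with zero-probability
    dummy symbols so every merge combines exactly n nodes."""
    items = [(w, [s]) for s, w in probs.items()]
    # Pad so that (count - 1) is a multiple of (n - 1); the dummies soak up
    # the short first merge described in Huffman's paper.
    pad = 0
    while (len(items) + pad - 1) % (n - 1) != 0:
        pad += 1
    for i in range(pad):
        items.append((0.0, ["_dummy%d" % i]))
    depth = {s: 0 for w, g in items for s in g}
    heapq.heapify(items)
    while len(items) > 1:
        merged_w, merged_g = 0.0, []
        for _ in range(n):  # always combine the n least probable nodes
            w, g = heapq.heappop(items)
            merged_w += w
            merged_g += g
        for s in merged_g:
            depth[s] += 1
        heapq.heappush(items, (merged_w, merged_g))
    return depth

depth = nary_huffman_lengths({"w": 0.1, "x": 0.2, "y": 0.3, "z": 0.4}, n=3)
print({s: depth[s] for s in "wxyz"})  # {'w': 2, 'x': 2, 'y': 1, 'z': 1}
```

With one placeholder added, the first (ternary) merge takes the dummy, w and x, leaving y and z as direct children of the root - exactly the shallower tree.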
What is called a Huffman tree here is referred to in some of the bibliography as a Kraft tree. --138.4.177.124 (talk) 15:20, 16 February 2021 (UTC)
The applications section needs some work. The first paragraph compares Huffman codes to Arithmetic coding (see the rest of the drama in this talk page), which is not an application. The second paragraph describes some compression algorithms that use prefix codes which are specifically described as NOT Huffman codes.
So... the applications section has no applications in it.
J2kun (talk) 02:00, 4 May 2021 (UTC)
The definition speaks of sets, but doesn't say finite. Does the tree construction still work out OK for infinite alphabets and trees of infinite depth? Clearly one can't work from the leaves combining the least-probable nodes if there are infinitely many. See Golomb code, and the last part of this edit, where I'm not so sure about what I said. Dicklyon (talk) 16:07, 12 July 2021 (UTC)
The current § Problem definition subsection is overly abstract. I found a paper that describes in much simpler terms how to get a Huffman code for a given set of symbols (probability values). No hard math involved; take a look: Huffman Offline.pdf AXONOV (talk) ⚑ 15:18, 3 December 2021 (UTC)
I think there's some kind of "mistake" in the description: "this code requires 135 (or 147) bits, as opposed to 288 (or 180) bits if 36 characters of 8 (or 5) bits were used." There are only 16 symbols listed in the table given below the tree graphic, so for a FAIR comparison only 4 bits are required to binary-code 16 different symbols. Please don't compare the most wasteful ASCII coding with the Huffman code, and please don't limit Huffman coding to text data. If it comes to the GAIN everybody is greedily looking for, it's 135 bits used by Huffman coding (without codebook and decoder) versus 36 chars * 4 binary code bits = 144 bits (NOT the 288 or 180 given in the description!). I know it looks much less impressive if you honestly say "only 144-135=9 bits gain in this example, and that's with the codebook and decoder overhead left out!". I ask for a fair comparison, because everybody should face the truth of how hard it is to find powerful algorithms for data compression. Don't try to make Huffman coding better than it is, don't feed the "magic function theory". And PLEASE also show an example where beloved Huffman coding clearly FAILS! — Preceding unsigned comment added by 84.115.238.220 (talk) 05:47, 3 June 2023 (UTC)
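For reference, the 4-bit fixed-length baseline works out as follows (the 135-bit Huffman total is quoted from the article's example, not recomputed here):

```python
import math

# The article's example: a 36-character message over a 16-symbol alphabet.
n_symbols, msg_len = 16, 36
bits_per_char = math.ceil(math.log2(n_symbols))  # 4 bits suffice for 16 symbols
fixed_total = msg_len * bits_per_char            # fixed-length baseline
huffman_total = 135                              # figure quoted from the article
print(bits_per_char, fixed_total, fixed_total - huffman_total)  # 4 144 9
```

So against the tightest fixed-length code the saving is 144 − 135 = 9 bits, before accounting for the codebook.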