Source Encoding and Compression
Department of Information Technology
University of Turku
This course is about data compression, which means reducing the size of source data
representation by decreasing the redundancy occurring in it. In practice this means choosing a
suitable coding scheme for the source symbols. There are two practical motivations for
compression: Reduction of storage requirements, and increase of transmission speed in data
communication. Actually, the former can be regarded as transmission of data ‘from now to
then’. We shall concentrate on lossless compression, where the source data must be recovered exactly, in its original form, in the decompression phase. The other alternative is
lossy compression, where it is sufficient to recover the original data approximately, within
specified error bounds. Lossless compression is typical for alphabetic and other symbolic
source data types, whereas lossy compression is most often applied to numerical data which
result from digitally sampled continuous phenomena, such as sound and images.
There are two main fields of coding theory, namely
1. Source coding, which tries to represent the source symbols in minimal form for storage or transmission, and
2. Channel coding, the purpose of which is to enhance detection and correction of trans-
mission errors, by choosing symbol representations which are far apart from each other.
Data compression can be considered an extension of source coding. It can be divided into two phases:
1. Modelling of the information source means defining suitable units for coding, such as characters or words, and estimating the probability distribution of these units.
2. Source coding (also called statistical or entropy coding) is applied to the units, using their estimated probabilities.
The theory of the latter phase is nowadays quite mature, and optimal methods are known.
Modelling, instead, still offers challenges, because it is most often approximate, and can be
made more and more precise. On the other hand, there are also practical considerations, in
addition to minimizing the data size, such as compression and decompression speed, and also
the size of the model itself.
In data transmission, the different steps can be thought to be performed in sequence, as follows:

Source encoding → Channel encoding → (transmission channel) → Channel decoding → Source decoding
We consider source and channel coding to be independent, and concentrate on the former.
Both phases can also be performed either by hardware or software. We shall discuss only the
software solutions, since they are more flexible.
In compression, we usually assume two alphabets, one for the source and the other for the
target. The former could be for example ASCII or Unicode, and the latter is most often
binary. There are many ways to classify the huge number of suggested compression methods.
One is based on the grouping of source and target symbols in compression:
1. Fixed-to-fixed: A fixed number of source symbols are grouped and represented by a fixed
number of target symbols. This is seldom applicable; an example could be reducing the
ASCII codes of numbers 0-9 to 4 bits, if we know (from our model) that only numbers
occur in the source.
2. Variable-to-fixed: A variable number of source symbols are grouped and represented by
fixed-length codes. These methods are quite popular and many of the commercial compression packages belong to this class. A manual example is the Braille code, developed for the blind, where 2×3 arrays of dots represent characters but also some combinations of characters.
3. Fixed-to-variable: The source symbols are represented by a variable number of target
symbols. A well-known example of this class is the Morse code, where each character is
represented by a variable number of dots and dashes. The most frequent characters are
assigned the shortest codes, for compression. The target alphabet of Morse code is not
strictly binary, because there appear also inter-character and inter-word spaces.
4. Variable-to-variable: A variable-size group of source symbols is represented by a variable-
size code. We could, for example, make an index of all words occurring in a source text,
and assign them variable-length binary codes; the shortest codes of course go to the most frequent words.
The first two categories are often called block coding methods, where the block refers to the
fixed-size result units. The best (with respect to compressed size) methods today belong to
category 3, where extremely precise models of the source result in very high gains. For
example, English text can typically be compressed to about 2 bits per source character.
Category 4 is, of course, the most general, but it seems that it does not offer notable
improvement over category 3; instead, modelling of the source becomes more complicated, in
respect of both space and time. Thus, in this course we shall mainly take example methods from categories 2 and 3. The former are often called dictionary methods, and the latter statistical methods.
Another classification of compression methods is based on the availability of the source data:
1. Off-line methods assume that all the source data (the whole message) is available at the
start of the compression process. Thus, the model of the data can be built before the actual
encoding. This is typical of storage compression trying to reduce the consumed disk space.
2. On-line methods can start the coding without seeing the whole message at once. It is
possible that the tail of the message is not even generated yet. This is the normal situation
in data transmission, where the sender generates source symbols one-by-one and
compresses them on the fly.
We shall use the word message to represent the unit which is given as input to one execution of the compression program. In storage compression, the message usually means a file.
Still another classification, related to this one, is based on the emergence and status of the model:
1. Static methods assume a fixed model, which is used by both the encoder and decoder, and
which is common for all messages from the source. This is acceptable, if there is not much
variation between distinct messages, for example, if they are all English texts from the
same application domain.
2. Semi-adaptive methods construct a model (for example a codebook) separately for each
source message. The model is based on the whole message and built before starting the
actual encoding. Thus, two passes through the data are needed. The encoder must first
send the model and then the encoded message to the decoder. This presupposes an off-line
method. In some data communication situations this kind of delay is not acceptable.
3. Adaptive methods construct the model on the fly. When encoding a certain input symbol
(or group of symbols), the model is based on the previous part of the message, already
encoded and transmitted to the decoder. This means that the decoder can construct the
same model intact. Both partners gradually learn the properties of the source. The process
starts with some initial model, for example a uniform distribution for all possible symbols,
and then changes the model to converge towards the actual distribution.
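As an illustrative sketch of this learning process (the function name and interface are ours, not from the notes), the following Python fragment keeps symbol counts over the already-coded prefix only, starting from a uniform initial model, so encoder and decoder always agree on the estimates:

```python
from collections import Counter

def adaptive_estimates(message, alphabet):
    """Yield (symbol, probability estimate) pairs, each estimate based only
    on the already-coded prefix -- information the decoder also has.
    The initial model is uniform: every symbol starts with count 1."""
    counts = Counter({sym: 1 for sym in alphabet})
    for sym in message:
        total = sum(counts.values())
        yield sym, counts[sym] / total   # estimate made before coding sym
        counts[sym] += 1                 # update afterwards; decoder does the same

for sym, p in adaptive_estimates("abracadabra", "abcdr"):
    print(sym, round(p, 3))
```

Note how the estimate for 'a' grows from the initial 0.2 towards the symbol's actual frequency as both partners learn the source.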
Even though semi-adaptive methods can in principle create a more precise model of the
source, it has turned out that adaptive methods can achieve the same compression efficiency.
One reason is that they do not have to transmit the model from the encoder to the decoder.
Another advantage is that they may adapt to local changes in the statistical properties within
the message. Adaptive methods are quite common today in general-purpose compression software.
Compression efficiency is normally measured by the compression ratio, i.e. the ratio of the
source message size to its compressed size, both measured in bits. In text compression, an
objective measure is bits per source character (bpc), because it is independent of the original
coding. In image compression, bits per pixel is a suitable measure.
Earlier, character codes were normally of fixed length, such as ASCII (7 or 8 bpc), and ISO-
8859 character sets (8 bits per character). The more recent Unicode has been adopted in
many contexts (Java, XML). Some Unicode versions have a fixed number of bits per
character: UCS-2 with 16 bpc, and UCS-4 with 32 bpc. The others are of variable length:
UTF-8 with 8, 16, 24 or 32 bpc, and UTF-16 with 16 or 32 bpc. Even though our source text
characters may have a variable number of bits, we can still talk about e.g. fixed-to-variable coding, if we encode one source character at a time, because a character is logically a fixed-size unit.
2. Coding-theoretic foundations
Here we study the coding of messages, consisting of symbols from a given source alphabet,
denoted S = {s_1, ..., s_q}. In practice, the actual alphabet, used in coding, can be an extension
of the original, so that also groups of symbols are assigned their own codes. A codebook
contains a codeword w_i for each (extended) symbol s_i. Here we assume that the codewords
can be of variable length. In adaptive compression, the codebook can also be dynamic, so that
it may change during the coding of the source message. However, for each position of the
message, there is a unique, discrete sequence of bits for every possible symbol.
The codebook may also be implicit if the code is a computable function of the source
symbols. The function is determined from our model of the source. As mentioned, it is
possible to encode a group of source symbols as a unit. Since the codebook grows
exponentially with the group size, large groups can only be coded with implicit methods. An
extreme is a method which treats the whole message as one group, and computes a single
codeword for it. In such coding, there is no precise correspondence between individual source
symbols and coded bits: For a certain bit in the coded sequence, we cannot say exactly, which
source symbol has produced it. Arithmetic coding, to be discussed later, is an example of this
kind of method.
Let us study the desirable properties that the codebook (explicit or implicit), denoted W = {w_1, ..., w_q}, should have. The first necessary condition is, of course, uniqueness: for each s_i, s_j ∈ S: s_i ≠ s_j ⇒ C(s_i) ≠ C(s_j), where C is the coding function. This condition would be
sufficient if we should transmit only a single symbol (and if end-of-code can be signalled by
some means). However, this is not usually the case. For multi-symbol messages (and variable-
length codewords) we have two options:
(a) use some special symbols (so called commas) between the codes,
(b) develop self-punctuating codes.
The former would make inefficient use of the special symbols, so we concentrate on the latter.
Let X and Y be any two sequences of source symbols. If X ≠ Y ⇒ C(X) ≠ C(Y), then code C is
called uniquely decodable (decipherable). The following is an example of an ill-designed codebook:
a = 0
b = 01
c = 11
d = 00
Now consider the encoded sequence “00110”. It can result either from message “dca” or
“aaca”. The codebook (or code, for short) is not uniquely decodable. The following codebook
is, instead, such:
a = 0
b = 10
c = 110
d = 111
Here the only interpretation of “00110” is “aac”. This code is uniquely decodable and,
moreover, a prefix-free code (prefix code, for short): no codeword is the prefix of any other codeword.
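Instantaneous decoding with such a prefix-free codebook amounts to greedy bit-by-bit matching; a minimal sketch (helper names are ours, not from the notes):

```python
def decode_prefix(bits, codebook):
    """Greedy decoder for a prefix-free codebook: since no codeword is a
    prefix of another, the first complete match is the only possible one,
    and each symbol is known as soon as its last bit arrives."""
    inverse = {code: sym for sym, code in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # instantaneous decision point
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)

codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(decode_prefix("00110", codebook))   # -> aac
```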
Codebooks are often illustrated by binary trees (called also decoding trees or decision trees).
The latter of the above codebooks can be visualized as a binary tree in which a left branch denotes 0 and a right branch 1: 'a' is a leaf at depth 1, 'b' at depth 2, and 'c' and 'd' at depth 3.
It is easy to see that a prefix-free code is represented by a binary tree where each source
symbol corresponds to a leaf node. Most often such coding trees are also complete, i.e. all internal nodes have two children. A prefix-free code is always an instantaneous code, i.e.
we know the source symbol as soon as all its code bits have been received. Instantaneous
decoding is extremely useful, but not absolutely necessary for unique decodability. For
example, the following code is uniquely decodable, even though it is not prefix-free:
a = 0
b = 01
c = 011
d = 111
When decoding the bit string 00110, we cannot say after the first bit, whether it is part of ‘a’,
‘b’, or ‘c’. After the second bit we can, since there is no code with “00”. The next source
symbol can be decided only after the last ‘0’ has been seen. A more general worst case is the
sequence “0111...111”, for which the first source symbol can be decided only after seeing the
very last bit. The reader might notice that each codeword is the reverse of the corresponding
prefix-free codeword in the earlier codebook. Unique decodability results from the property
that the coded sequence can be decoded backwards as if the codebook were prefix-free
(actually it is suffix-free).
If the decoding tree is fully balanced, it means that all codewords are of the same length, i.e.
we have a block code. If the size of the source alphabet is not a power of 2, at least the lowest
level is incomplete. If the difference between the longest and shortest codeword length is 1,
then the code is called a shortened block code. For example, the 5-symbol alphabet {a, b, c, d, e} can be encoded as
a = 00
b = 01
c = 10
d = 110
e = 111
A rather simple condition has been proved about the lengths l_1, ..., l_q of an instantaneous code, namely the Kraft inequality:
Theorem 2.1. An instantaneous codebook W of q codewords exists if and only if the lengths l_1, ..., l_q of the codewords satisfy K = Σ_i 2^(-l_i) ≤ 1.
Proof. For an instantaneous code we have the symbols in the leaves of the decoding tree, and any binary tree with q leaves can be interpreted as an instantaneous code. Thus we show that for any binary tree the depths of the leaves satisfy the inequality. We apply induction: For q=1, there is one leaf with depth 1, so the sum K is equal to 1/2 ≤ 1. (Note that the code length cannot be zero, so a root-only tree is illegal.) For q=2 we have two leaves with depth 1, so the sum is 1/2 + 1/2 ≤ 1. Now assume an arbitrary tree with q leaves, with sum K ≤ 1.
(a) Extend any leaf (with depth d) with one new leaf (making the old leaf internal). The new leaf has depth d+1, so the sum of the theorem becomes K - 1/2^d + 1/2^(d+1) ≤ K ≤ 1.
(b) Extend any leaf (with depth d) with two new leaves (making the old leaf internal). The new leaves have depth d+1. Now the number of leaves becomes q+1, and the sum value is K - 1/2^d + 1/2^(d+1) + 1/2^(d+1) = K ≤ 1.
Since any binary tree can be constructed step by step using extensions (a) and (b), the theorem follows.
In general form, the Kraft inequality has radix r in the denominator, instead of 2, but we restrict ourselves to a binary target alphabet. Moreover, it can also be shown that, for a binary target alphabet, equality (K = 1) should hold; otherwise the code is inefficient.
Let us examine the validity of the theorem by studying the previous shortened block code, with codeword lengths 2, 2, 2, 3, 3. We obtain the sum 1/4 + 1/4 + 1/4 + 1/8 + 1/8 = 1. Thus it is easy to check for a given set of lengths whether a codebook can be constructed or not, and if yes, whether the code is inefficient. Consider the set of lengths {1, 3, 3, 3}, which produces the sum 7/8. We can directly say that the code is inefficient. The related tree has a leaf with no sibling; therefore the leaf can be omitted, and the related symbol gets a one bit shorter codeword (corresponding to the parent). Now, the set {1, 2, 3, 3} produces K = 1.
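Checking the Kraft inequality for a given set of lengths is a one-line computation; a small sketch reproducing the examples above (function name ours):

```python
def kraft_sum(lengths):
    """K = sum of 2^(-l) over the codeword lengths; an instantaneous
    binary code with these lengths exists iff K <= 1, and K < 1 signals
    an inefficient (incomplete) code."""
    return sum(2 ** -l for l in lengths)

print(kraft_sum([2, 2, 2, 3, 3]))  # 1.0   -> feasible and complete
print(kraft_sum([1, 3, 3, 3]))     # 0.875 -> feasible but inefficient
print(kraft_sum([1, 2, 3, 3]))     # 1.0   -> the improved set
```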
The Kraft inequality holds for instantaneous codes. The McMillan inequality says that the same holds for any uniquely decodable code. The proof is not quite as straightforward as above. The important point is that, when choosing a codebook, we can safely restrict ourselves to instantaneous codes, because for any set of lengths in a uniquely decodable code, we can derive an instantaneous code with the same lengths.
In some applications, the source alphabet can in principle be infinite (but yet countable). One
such situation is run-length coding, where we should encode the number of similar successive
symbols in a sequence. Of course, all runs are of finite length, but no upper bound can be
given. In this case we need a general coding system for all integers; the traditional fixed-length
representation cannot be used. A coding system is in a sense universal, if it works for all
possible sources in some class. Usually, an additional requirement is set for true universality,
such that the code satisfies a certain efficiency constraint (to be explained later). Since the
alphabet is infinite, we cannot store an explicit codebook, but we must provide an algorithm
(function), which produces the code for any integer. Two desirable properties for such a code are:
· Effectiveness: There is an effective procedure that can decide whether a given sequence of bits belongs to the codebook or not.
· Completeness: Adding any new codeword would create a codebook that is not uniquely decodable.
There are various ways of designing the universal code; Peter Elias has described and
analysed potential coding approaches for positive integers. A simplified description of them is given below.
1. α-coding: This is unary coding, where number n is represented by n-1 zeroes plus an ending 1-bit: 1→1; 2→01; 3→001; 4→0001; 5→00001; ... Actually, α-coding is not universal in the strict sense, because it does not satisfy the efficiency constraint.
2. β-coding: This is the traditional (positional) representation of numbers, excluding the leading zeroes. Since its codes are not self-punctuating, we need an end symbol (denote it '#'), and thus our code alphabet is ternary. The leading 1-bit can be omitted: 1→#; 2→0#; 3→1#; 4→00#; 5→01#; 6→10#; 7→11#; 8→000#; ... This can be transformed to the normal binary representation e.g. by further coding: 0→0; 1→10; #→11. The final encoding would thus be 1→11; 2→011; 3→1011; 4→0011; 5→01011; 6→10011; 7→101011; 8→00011. Notice that the lengths of the codewords are not monotonically increasing with the number to be encoded.
3. γ-coding: To avoid using the end symbol, the positional representation is prefixed by an α-coded length. Again, the leading 1-bit is omitted from the number; actually we can think that the 1-bit that ends the unary code is simultaneously the leading 1-bit of the number. The small integers are now coded as follows: 1→1; 2→010; 3→011; 4→00100; 5→00101; 6→00110; 7→00111; 8→0001000; ...
4. δ-coding: similar to γ-coding, but the length is represented not with unary coding but with γ-coding. Examples: 1→1; 2→0100; 3→0101; 4→01100; 5→01101; 6→01110; 7→01111; ... δ-coding is theoretically better than γ-coding for large numbers (cf. next section).
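As a check of the γ-codewords listed under item 3, the construction (a unary length prefix whose ending 1-bit doubles as the leading 1-bit of the number) can be sketched as follows (function name ours):

```python
def elias_gamma(n):
    """Elias gamma code for n >= 1: a unary prefix giving the length of
    the positional representation, whose ending 1-bit is shared with the
    leading 1-bit of the number itself."""
    assert n >= 1
    binary = bin(n)[2:]                      # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary  # len-1 zeroes, then the number

for n in range(1, 9):
    print(n, elias_gamma(n))   # reproduces 1->1, 2->010, ..., 8->0001000
```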
There are many other useful properties of codes. One such property is robustness against
transmission errors. In the worst case, inverting a single bit in a sequence of variable-length
codes may scramble the whole tail of the coded sequence. The synchronization property
refers to the behaviour of the decoding process in the context of errors (add/drop/complement
a bit). Synchronization means how fast the decoder catches the correct codeword boundaries again.
An example of an elegant, robust, universal coding scheme for integers is the so-called Zeckendorf representation (also called the Fibonacci representation). Its simplest form encodes
Peter Elias: “Universal Codeword Sets and Representations of the Integers”, IEEE Trans. on Inf. Theory,
Vol. IT-21, No. 2, 1975.
A. S. Fraenkel, S. T. Klein: “Robust Universal Complete Codes for Transmission and Compression”, Discrete Applied Mathematics, Vol. 64, 1996, pp. 31-55.
numbers by bit-weighted sums of Fibonacci numbers …, 34, 21, 13, 8, 5, 3, 2, 1. For
example, the value 17 can be represented by the sum 0·34 + 0·21 + 1·13 + 0·8 + 0·5 + 1·3 +
0·2 + 1·1. The robustness of Zeckendorf code results from the fact that every number can be
represented with a code with no pairs of adjacent 1-bits (results immediately from the
Fibonacci system). Thus, by extending the code with two additional 1-bits we can make the
code self-delimiting. Moreover, by transmitting the least significant bits first, up to the most
significant 1-bit, only a single extra 1-bit is needed to delimit the code. The codebook for
small integers looks as follows:
Number   Weights (8 5 3 2 1)   Code
 1       0 0 0 0 1             11
 2       0 0 0 1 0             011
 3       0 0 1 0 0             0011
 4       0 0 1 0 1             1011
 5       0 1 0 0 0             00011
 6       0 1 0 0 1             10011
 7       0 1 0 1 0             01011
 8       1 0 0 0 0             000011
 9       1 0 0 0 1             100011
10       1 0 0 1 0             010011
11       1 0 1 0 0             001011
12       1 0 1 0 1             101011
 …       …                     …
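A sketch of the encoder behind this table (greedy choice of the largest fitting Fibonacci weight, emission least-significant weight first, plus the delimiting 1-bit; function name ours):

```python
def zeckendorf_code(n):
    """Fibonacci (Zeckendorf) code for n >= 1: write n as a sum of
    non-adjacent Fibonacci numbers (greedily), emit the weight bits
    least-significant first, and append the delimiting 1-bit."""
    fibs = [1, 2]                       # Fibonacci weights 1, 2, 3, 5, 8, ...
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs):            # greedy, largest weight first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    msb_first = "".join(bits).lstrip("0")
    return msb_first[::-1] + "1"        # LSB first, plus the ending 1-bit

for n in range(1, 13):
    print(n, zeckendorf_code(n))        # reproduces the table above
```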
Since no two consecutive 1-bits can exist within a codeword, excluding the end, single-bit errors can only propagate to the next codeword, but not further. What is more, it has been shown that basic arithmetic operations can be performed directly with Zeckendorf-coded numbers.
However, transmission errors are not our main concern in this course, so we shall later only
mention some results and observations from this topic.
As an example of a parametric code, we present the so-called Golomb code, which is optimal for specific types of numeric sources. As a preliminary step, we introduce a code for a uniform distribution over a finite set of symbols (i.e. all q symbols are equally likely). If q is a power of two (q = 2^k), then the codes for the alphabet naturally consist of all possible k-bit combinations. However, if q is not a power of 2, a fixed-length code of ⌈log₂ q⌉ bits would be redundant (some code values unused). In this case we choose a so-called semi-fixed-length code, where some of the codewords have length ⌊log₂ q⌋, and the rest have length ⌈log₂ q⌉. The following table shows semi-fixed-length codebooks for q in 2…8.
x    q=2   q=3   q=4   q=5   q=6   q=7   q=8
0    0     0     00    00    00    00    000
1    1     10    01    01    01    010   001
2          11    10    10    100   011   010
3                11    110   101   100   011
4                      111   110   101   100
5                            111   110   101
6                                  111   110
7                                        111

P. Fenwick: “Zeckendorf Integer Arithmetic”, Fibonacci Quarterly, Nov 2003, pp. 405-413.
S. W. Golomb: “Run-Length Encodings”, IEEE Trans. on Inf. Theory, Vol. 12, 1966, pp. 399-401.
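One standard construction reproduces these codebooks: with k = ⌊log₂ q⌋, the first r = 2^(k+1) - q values get k-bit codes, and each remaining value x gets the (k+1)-bit representation of x + r. A sketch (function name ours):

```python
def semi_fixed(x, q):
    """Semi-fixed-length codeword for symbol number x (0 <= x < q) of a
    uniform q-symbol alphabet: the first r = 2^(k+1) - q values get
    k = floor(log2 q) bits, the rest get k+1 bits (as the value x + r)."""
    k = q.bit_length() - 1      # floor(log2 q)
    r = 2 ** (k + 1) - q        # number of short codewords
    if x < r:
        return format(x, "0{}b".format(k))
    return format(x + r, "0{}b".format(k + 1))

for q in (3, 5, 7):
    print(q, [semi_fixed(x, q) for x in range(q)])
```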
Golomb coding is applied to numeric data (i.e. integers). It has one parameter m, which is used in the encoding of an integer x as follows. Encode the quotient a = ⌊x/m⌋ with unary coding (here a 1-bits terminated by a 0-bit), and the remainder x - ma with semi-fixed-length coding, where the alphabet size q = m. The two parts are concatenated to form the codeword. An example with m = 3 is given in the table below. The parts of the codewords are separated by a space, for clarity. In the special case where m is a power of 2 (denote m = 2^k) we obtain the so-called Rice code, which is structurally simpler, as shown in the table for m = 2.
x    Golomb (m = 3)   Rice (m = 2)
0    0 0              0 0
1    0 10             0 1
2    0 11             10 0
3    10 0             10 1
4    10 10            110 0
5    10 11            110 1
6    110 0            1110 0
7    110 10           1110 1
8    110 11           11110 0
9    1110 0           11110 1
Both are very efficient in practice, if the distribution of source numbers is close to geometric. Such a distribution results e.g. from repeating binary events (sequences of ‘success’/’failure’) where a success is more probable than a failure, and we encode the number of successes up to the next failure. Another interpretation of the same situation is run-length encoding of a binary sequence. If p = P(0) > P(1), then m should be chosen such that the probability of m 0-bits before the next 1-bit is about 0.5. From this it follows that m ≈ -1/log₂ p is the best choice.
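The encoding rule and the parameter choice can be sketched together (function names ours; the unary convention, 1-bits closed by a 0-bit, follows the table):

```python
import math

def unary(a):
    """a >= 0 as a 1-bits terminated by a 0-bit (the table's convention)."""
    return "1" * a + "0"

def semi_fixed(x, q):
    """Semi-fixed-length codeword, as for the uniform-alphabet table."""
    k = q.bit_length() - 1
    r = 2 ** (k + 1) - q
    if x < r:
        return format(x, "0{}b".format(k))
    return format(x + r, "0{}b".format(k + 1))

def golomb(x, m):
    """Golomb code: unary quotient floor(x/m), then the remainder x mod m
    coded with the semi-fixed-length code for alphabet size m."""
    a, rem = divmod(x, m)
    return unary(a) + semi_fixed(rem, m)

def golomb_parameter(p):
    """Choose m so that a run of m 0-bits has probability about 0.5,
    i.e. m ~ -1/log2(p) for p = P(0)."""
    return max(1, round(-1 / math.log2(p)))

print(golomb(7, 3), golomb(7, 2))   # 11010 11101 (cf. the table, spaces removed)
```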
The next chapter studies the connection between symbol probabilities and optimal codes more closely.
R. F. Rice: “Some Practical Universal Noiseless Coding Techniques”, Jet Propulsion Laboratory, JPL Publication 79-22, Pasadena, CA, 1979.
3. Information-theoretic foundations
Information theory, mainly founded by Claude Shannon at Bell Labs, studies both noise
protection and efficient use of the communication channel. It gives bounds for the ultimate
transmission rate of the channel, and for the ultimate data compression. Of these two we
concentrate on the latter. First we shall define a quantitative measure of information for both
single symbols and the whole alphabet. Notice that the word ‘information’ is here a quite
concrete technical concept, and bears only loose analogy with the ‘normal’ concept of
information in human communication.
Suppose that we have a source alphabet S = {s_1, ..., s_q}, with probabilities P = {p_1, ..., p_q}. The
question is, how much information the receiver gets when seeing a certain symbol s_i. The principle is that a less frequent symbol gives more information than a more frequent one. Thus, the amount of information measures the degree of surprise (uncertainty). If some p_i = 1, then all symbols to be sent are s_i, and there is no surprise, hence no information. For constructing
an information function I(p), three properties should be satisfied:
1. I(p) ≥ 0,
2. I(p_1 p_2) = I(p_1) + I(p_2) for independent events,
3. I(p) is a continuous function.
From condition 2 it follows that I(p^n) = n·I(p). We see that I satisfies the same properties as the logarithm function. In fact, the solution is essentially unique, and we define I(p) = k log p. From property 1 (and noticing that 0 ≤ p ≤ 1), we deduce the value k = -1. Hence we have:
I(p) = -log p = log (1/p)
The base of logarithm can be chosen. Normally we take base = 2, and define the unit of
information to be a bit. This word has now two meanings: a digit in base-2 representation and
a unit in measuring information. Of course, this is not a coincidence, as seen from the
following example: Consider tossing a (fair) coin with P(heads) = P(tails) = ½. Then I(heads)
= I(tails) = -log ½ = 1 (bit), which is consistent with the idea that with one binary digit we
can tell the outcome. If the coin is unfair, such that P(heads) = 1/8 and P(tails) = 7/8, then we have I(heads) = -log (1/8) = 3 bits and I(tails) = -log (7/8) ≈ 0.193 bits. Thus getting heads is a much bigger surprise than getting tails. Notice also that the information amount is a real-valued quantity, and thus more general than the binary digit. For studying the additive property, consider tossing a fair coin three times. The information amount of the sequence heads-tails-heads (or any other 3-sequence) is I(½·½·½) = I(1/8) = -log (1/8) = 3 bits = I(½) + I(½) + I(½), as expected.
Next we define the average information when receiving a symbol from alphabet S, with
probability distribution P. Since we get amount I(p_i) with probability p_i, the average (expected) information per symbol, called entropy, is a weighted sum:

H(P) = Σ_i p_i · I(p_i) = Σ_i p_i · log (1/p_i)
C. E. Shannon: “A Mathematical Theory of Communication”, Bell System Technical Journal, Vol. 27, pp.
379-423, 623-656, 1948.
Example. Let P = {0.1, 0.2, 0.3, 0.4}. Then H(P) = -0.1 log 0.1 - 0.2 log 0.2 - 0.3 log 0.3 - 0.4 log 0.4 ≈ 0.332 + 0.464 + 0.521 + 0.529 = 1.846. Compare this with the uniform distribution P = {0.25, 0.25, 0.25, 0.25}, which has entropy H(P) = -4 × 0.25 log 0.25 = 2. It is easy to
prove the observation generally: a uniform distribution gives the maximum entropy. Notice
the analogy with physical phenomena e.g. in thermodynamics, where the same word is used.
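The two entropy computations in the example can be reproduced directly (function name ours):

```python
import math

def entropy(probs):
    """H(P) = sum of p * log2(1/p): the expected information per symbol."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(round(entropy([0.1, 0.2, 0.3, 0.4]), 3))   # 1.846
print(entropy([0.25] * 4))                        # 2.0, the maximum for q = 4
```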
Shannon has proved the following fundamental “noiseless source encoding theorem”:
Theorem 3.1. Entropy H(S) gives a lower bound to the average code length L for any
instantaneously decodable system. (Proof skipped)
Even though precise symbol probabilities are seldom known, entropy is an excellent reference
measure when studying the compression power of a source coding method. The difference
L-H(S) between the average codeword length and entropy is called redundancy of the code.
The equality in Theorem 3.1 holds if the probabilities p_i are negative powers of 2, i.e. p_i = 1/2^(l_i). In this case we reach the theoretic optimum with redundancy = 0. Even if this were not the case, L can always be limited to at most H(S) + 1 by choosing

log₂ (1/p_i) ≤ l_i ≤ log₂ (1/p_i) + 1
A code is called universal if L ≤ c_1 H(S) + c_2 asymptotically for some constants c_1 and c_2 and for nonzero entropy H(S). The code is asymptotically optimal if c_1 = 1.
Consider the encoding of natural numbers n = 1, 2, etc. Often we do not know their exact
probabilities, and they are hard to find out experimentally. However, the following
fundamental result has been proved:
Theorem 3.2. There exist codes, which satisfy the above universality condition for any
distribution satisfying P(1) ≥ P(2) ≥ P(3) ≥ … ≥ P(n) ≥ … (Proof skipped).
The codeword lengths for such encodings are non-decreasing with n. Of the coding methods for numbers (Chapter 2), it can be shown that the α-code is not universal in the above sense, whereas the γ- and δ-codes are. Moreover, the γ-code has constant c_1 = 2, while the δ-code has c_1 = 1, the latter being asymptotically optimal. The Zeckendorf code (also called the Fibonacci code) is also universal, but not asymptotically optimal; its c_1-value is about 1.440. Nevertheless, for a large subset of integers, the Zeckendorf codewords are actually shorter than δ-codes.
Even though the α-code (= unary code) is not universal, it is optimal if the probabilities are: P(1) = 1/2, P(2) = 1/4, P(3) = 1/8, …, P(n) = 1/2^n, … In fact, this kind of distribution is not uncommon. Note that the sum of probabilities 1/2 + 1/4 + 1/8 + … = 1, as required.
Above we assumed the successive symbols of the message to be independent. In practice this
is not the case, so we should extend the concept of entropy. This leads us to the modelling
part of compression, which we shall study later more deeply. Here we mention the natural
extension to conditional probabilities. Suppose that we know the m previous symbols before s_i, namely s_{i1}, ..., s_{im}, and the related conditional probability P(s_i | s_{i1}, ..., s_{im}). The conditional information of s_i in this m-memory source is naturally log₂ (1 / P(s_i | s_{i1}, ..., s_{im})), and the conditional entropy is

H(S | s_{i1}, ..., s_{im}) = Σ_{s_i ∈ S} P(s_i | s_{i1}, ..., s_{im}) · log₂ (1 / P(s_i | s_{i1}, ..., s_{im}))

See: Peter Elias: “Universal Codeword Sets and Representations of the Integers”, IEEE Trans. on Inf. Theory, IT-21, No. 2, pp. 194-203, 1975.
The situation can be described by an m’th order Markov process, which is a finite automaton
with states for all m-grams of symbols, and transitions corresponding to successors of m-
grams, with the conditional probabilities. In an ergodic Markov process the state probabilities
approach the so-called equilibrium. The global entropy is

H(S) = Σ_{S^m} P(s_{i1}, ..., s_{im}) · Σ_{s_i ∈ S} P(s_i | s_{i1}, ..., s_{im}) · log₂ (1 / P(s_i | s_{i1}, ..., s_{im}))
     = Σ_{S^(m+1)} P(s_{i1}, ..., s_{im}, s_i) · log₂ (1 / P(s_i | s_{i1}, ..., s_{im}))
There is another measure for the complexity of a message which is called Kolmogorov com-
plexity, and which can be considered more general than entropy. It is defined as the length of
the shortest (binary) program for generating the message. If the sequence of symbols is drawn
at random from a distribution that has entropy H, then the Kolmogorov complexity is
approximately equal to H. However, consider a pseudo random number generator, which
produces a seemingly random sequence of symbols. The generating algorithm can be
regarded as the compressed representation of the message, because the sequence can be
recovered by applying it.
Unfortunately, Kolmogorov complexity has been shown to be non-computable, i.e. there is no
algorithm to determine it (cf. Gödel’s incompleteness theorem). This makes the concept
rather impractical. However, a somewhat related idea is used in fractal compression of
images, where the compressor tries to find self-repeating components from the image, i.e.
tries to find the ‘hidden algorithm’ that has generated the image. However, this can only be
done approximately (i.e. the method is lossy), unless the image is a pure fractal.
4. Source coding methods
In this section we shall study the second part of compression, namely source coding, and
leave the first, ‘harder’ part (= modelling) to the next sections. Source coding is based on the
assumed probability distribution (P) of the source symbols (S). For a finite alphabet it is
usually possible to give at least estimates of the probabilities. For an infinite alphabet the
distribution is often implicit. For example, we could assume that the probability of a number is
inversely proportional to its magnitude. At least, it must hold that p ® 0 as i ® ¥, because
Sp = 1. We shall concentrate on finite alphabets.
4.1. Shannon-Fano coding
The theory of Section 3 gives a symbol s_i with probability p_i the information log2(1/p_i) bits.
An optimal coding solution would give the symbol a code of this length. However, the length
is an integer only if p_i is a negative power of 2. If this is not the case, we can take as the actual
length the next larger integer, i.e. ⌈log2(1/p_i)⌉. It is easy to see that this choice satisfies the
Kraft inequality:

l_i ≥ log2(1/p_i) ⇒ 2^l_i ≥ 1/p_i ⇒ p_i ≥ (1/2)^l_i ⇒ Σ p_i ≥ Σ (1/2)^l_i ⇒ 1 ≥ Σ (1/2)^l_i

Thus, there is a prefix code with lengths ⌈log2(1/p_i)⌉ for i = 1, ..., q. Since every codeword
length is at most one larger than the theoretical optimum, it also holds that the average code
length L satisfies:

H(S) ≤ L ≤ H(S) + 1
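These bounds are easy to verify numerically. The following sketch computes the lengths ⌈log2(1/p_i)⌉, the Kraft sum, the average length and the entropy for the distribution used in the example below:

```python
import math

def shannon_lengths(probs):
    """Codeword lengths ceil(log2(1/p)) for each symbol probability."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]

probs = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
lengths = shannon_lengths(probs)
kraft = sum(2.0 ** -l for l in lengths)                # must be <= 1 for a prefix code
avg = sum(p * l for p, l in zip(probs, lengths))       # average code length L
entropy = sum(p * math.log2(1.0 / p) for p in probs)   # H(S)
print(lengths, kraft, avg, round(entropy, 3))
```

The Kraft sum comes out strictly below 1 here, which already hints that the code is not as tight as it could be.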
Example. Let p_1 = p_2 = 0.3, and p_3 = p_4 = p_5 = p_6 = 0.1. From log2(1/0.3) ≈ 1.737 and
log2(1/0.1) ≈ 3.322, we get the code lengths 2, 2, 4, 4, 4, 4. The codes can be assigned in
lexicographic order for the list of lengths sorted into ascending order:

s_1 = 00, s_2 = 01, s_3 = 1000, s_4 = 1001, s_5 = 1010, s_6 = 1011

which has the average code length of 2.8, while the entropy is H(S) ≈ 2.371. If we study the
decoding tree, we notice that it is not complete (not succinct): the subtree 11... is missing.
Checking the Kraft inequality confirms the property (the sum is not 1):

Σ 2^(-l_i) = 1/2^2 + 1/2^2 + 1/2^4 + 1/2^4 + 1/2^4 + 1/2^4 = 3/4 ≠ 1
The situation is more obvious if we take a binary source alphabet with p_1 = 0.99 and p_2 =
0.01, for which the above rule gives code lengths l_1 = 1 and l_2 = 7, although the only sensible
possibility is to choose both lengths to be = 1 (corresponding to codes 0 and 1). Nevertheless,
for lengths 1 and 7, the average code length would yet be only 0.99×1 + 0.01×7 = 1.06, while
the entropy is

−0.99·log2 0.99 − 0.01·log2 0.01 ≈ 0.08 bits
The above description is the theoretical basis for Shannon-Fano coding (presented
independently by Shannon & Weaver and Fano in 1949), but the actual algorithm is more
constructive, and also avoids non-complete decoding trees. Its idea is to divide the
'probability mass' (with related symbols) repeatedly into two approximately same-sized
halves, and simultaneously construct the decoding tree in a top-down fashion. With no loss of
generality, we assume here that the probabilities p_1, ..., p_q are in descending order. Sorting may
thus precede the following codebook generation algorithm:
Algorithm 4.1. Shannon-Fano codebook generation.
Input: Alphabet S = {s_1, ..., s_q} with probability distribution P = {p_1, ..., p_q}, where p_i ≥ p_{i+1}.
Output: Decoding tree for S.
Create a root vertex r and associate alphabet S with it.
If S has only one symbol then return a tree with r as the only node (root).
Choose index j (≠ 0 and ≠ q), for which the sums Σ_{i=1..j} p_i and Σ_{i=j+1..q} p_i are the closest.
Find decoding trees, rooted by r_1 and r_2, for the sub-alphabets s_1, ..., s_j and
s_{j+1}, ..., s_q recursively and add them as subtrees of r. Attach labels 0 and 1 to the
corresponding parent-child relationships r → r_1 and r → r_2.
Return the tree rooted by r.
It can be proved that the same properties hold for this code as for the previous code based on
direct computation of the codeword lengths, i.e. the redundancy is at most 1. To be precise,
the selection of an optimal split index j is not trivial (generally, optimal partitioning is NP-
hard). In practice the suggested condition (based on closest sums) is sufficient.
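Algorithm 4.1 can be sketched in a few lines of Python; this is a straightforward recursive version, with the split index chosen by the linear scan suggested above (symbols are assumed to arrive sorted by descending probability):

```python
def shannon_fano(symbols, probs):
    """Recursive Shannon-Fano codebook generation (top-down splitting)."""
    if len(symbols) == 1:
        return {symbols[0]: ''}
    total = sum(probs)
    # choose split index j so that the two halves' probability sums are closest
    best_j, best_diff, acc = 1, float('inf'), 0.0
    for j in range(1, len(symbols)):
        acc += probs[j - 1]
        diff = abs(acc - (total - acc))
        if diff < best_diff:
            best_j, best_diff = j, diff
    code = {}
    for s, c in shannon_fano(symbols[:best_j], probs[:best_j]).items():
        code[s] = '0' + c          # left subtree gets prefix bit 0
    for s, c in shannon_fano(symbols[best_j:], probs[best_j:]).items():
        code[s] = '1' + c          # right subtree gets prefix bit 1
    return code

print(shannon_fano(list('abcdef'), [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]))
```

Running it on the example distribution of this section reproduces the codebook a=00, b=01, c=100, d=101, e=110, f=111.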
Example. S = {a, b, c, d, e, f}, P = {0.3, 0.3, 0.1, 0.1, 0.1, 0.1}. The codebook generation
proceeds step by step, as shown in the figure on the next page, with subsets of symbols and
related sums of probabilities marked for nodes.

The first split is done into probability subsets {0.3, 0.3} and {0.1, 0.1, 0.1, 0.1}, attached with
bits 0 and 1 (to be prefixes for the codes in the subsets). The former is split again, and
produces singleton subsets {0.3} and {0.3} with final codes 00 and 01 for a and b. Subset
{0.1, 0.1, 0.1, 0.1} is split first into {0.1, 0.1} and {0.1, 0.1}, producing prefixes (= paths
from the root) 10 and 11. The final splits produce four singleton subsets, each with probability
0.1. The related codes are 100, 101, 110 and 111. For the codebook a=00, b=01, c=100,
d=101, e=110, f=111 the average code length is 2.4 bits, having redundancy ≈ 0.029 bits per
symbol. In fact, the obtained code is optimal (among methods assigning distinct codewords
for symbols), although generally Shannon-Fano code does not guarantee optimality.
The tree structure is suitable for the decoder: by branching according to the received bits it is
easy to pick up the correct original symbol from the leaf. The encoder, instead, needs some
other data structure, presumably a table with an entry for each symbol and the related codeword.
[Figure: Shannon-Fano codebook generation for the example, shown as a sequence of steps: the set {a,b,c,d,e,f} with probability sum 1.0 is split into {a,b}: 0.6 and {c,d,e,f}: 0.4 (branch labels 0 and 1), and further down to the singleton leaves a: 0.3, b: 0.3, c: 0.1, d: 0.1, e: 0.1, f: 0.1.]
4.2. Huffman coding
The best-known source coding method is without doubt Huffman coding. The generation of a
Huffman code resembles the Shannon-Fano method, but the direction is opposite: Shannon-
Fano generates the decoding tree top-down, while Huffman method does it bottom-up. The
average codeword length of the Huffman code reaches the entropy if the probabilities are
integer powers of ½ (cf. Chapter 3). It also holds that, even in cases where the entropy is not
reached, Huffman coding is optimal among the methods that assign distinct codewords for
separate, independent symbols of the source alphabet.
The idea of Huffman code is developed as follows. First assume, without loss of generality,
that the probabilities of symbols are sorted into decreasing order: p_1 ≥ p_2 ≥ ... ≥ p_q. Obviously,
this implies that the codeword lengths must satisfy: l_1 ≤ l_2 ≤ ... ≤ l_q. Since all non-root nodes
in an efficient (complete) decoding tree have a sibling, and symbol s_q is on the lowest level
(leaf), also s_{q-1} should be a leaf, more precisely, the sibling of s_q, and have the same codeword
length. (The choice may not be unique, if there are equal probabilities, but then we can make
an arbitrary choice.) Therefore, we can assign suffixes 0 and 1 to the last two symbols; their
final codewords will be of the same length and differ only in the last bit position. The pair can
be considered a single 'meta-symbol' (= sub-alphabet {s_{q-1}, s_q}), which has probability p_{q-1} +
p_q, and becomes the parent of the last two symbols. From this we can continue by repeating
the process for the reduced alphabet of q-1 symbols, and attaching new parents to already
generated subtrees. More formally, the algorithm proceeds as follows:
Algorithm 4.2. Huffman codebook generation.
Input: Alphabet S = {s_1, ..., s_q} with probability distribution P = {p_1, ..., p_q}, where p_i ≥ p_{i+1}.
Output: Decoding tree for S.
Initialize forest F to contain a distinct single-node tree T_i for each symbol s_i and set
weight(T_i) = p_i.
while |F| > 1 do
  Let X and Y be two trees in the forest which have the lowest weights.
  Create a binary tree Z, with X and Y as subtrees (equipped with labels 0 and 1).
  The root of Z is a new node representing the combined symbol set of X and Y.
  Set weight(Z) = weight(X) + weight(Y).
  Add Z to forest F and remove X and Y from it.
Return the single remaining tree of forest F.
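Algorithm 4.2 can be sketched using a binary min-heap as the priority queue for the forest; the running counter only breaks ties between equal weights, so the heap never compares the tree payloads themselves:

```python
import heapq
import itertools

def huffman(symbols, probs):
    """Bottom-up Huffman codebook via a min-heap of (weight, tiebreak, tree)."""
    counter = itertools.count()            # tie-breaker for equal weights
    heap = [(p, next(counter), s) for s, p in zip(symbols, probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)    # two lowest-weight trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (t1, t2)))
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):        # internal node: (left, right)
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:                              # leaf: original symbol
            code[tree] = prefix
    walk(heap[0][2], '')
    return code

probs = [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]
code = huffman(list('abcdef'), probs)
avg = sum(p * len(code[s]) for s, p in zip('abcdef', probs))
print(code, avg)
```

For the running example the codeword lengths come out as 2, 2, 3, 3, 3, 3 and the average length as 2.4 bits, matching the tree construction below.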
The resulting code is a prefix code, because the original symbols will be leaves of the final
tree. The proof of optimality is skipped here, but essentially we proved it already in the above
derivation. As Huffman coding is optimal, and Shannon-Fano satisfies

H(S) ≤ L ≤ H(S) + 1

then also Huffman code satisfies it. In fact, it has been proved that an upper bound for the
redundancy is p_1 + 0.086, where p_1 is the highest probability among the symbols. To be
precise, the upper bound is min(p_1 + 0.086, 1).
The generated codebook is not unique, if there occur equal probabilities during the process;
ties in choosing the two smallest weights can be broken arbitrarily, without affecting the
average length L of codewords. However, if we always favour the smallest trees, we obtain a
code, the longest codeword of which is minimal. This may have practical advantage. When we
still notice that, in each combining step, 0 and 1 can be assigned to subtrees in either way, it is
clear that the number of different optimal codebooks is exponential.
Example. Determine the Huffman code for alphabet S = {a, b, c, d, e, f} where P = {0.3, 0.3,
0.1, 0.1, 0.1, 0.1}. Forest F develops in the steps shown in the picture on the following page.
We obtain the same codebook 00, 01, 100, 101, 110, 111 as with the Shannon-Fano method,
which in this case is also optimal. Technically the Huffman algorithm can be implemented e.g.
in the following ways:

(1) Build and maintain a heap (more precisely, a min-heap) structure for the symbols (both
original and meta-symbols), ordered by weight. The heap can be built in linear time using
Floyd's algorithm. Extracting the smallest and adding a new meta-symbol both need
O(log q) operations for q nodes, and thus the total complexity is O(q log q).

(2) If the alphabet is already sorted by probability, we can build the tree in linear time by
keeping a queue of the created meta-symbols: they will be generated in increasing order of
weight. The two smallest weights are thus found either among the sorted list of original
symbols or from the queue of meta-symbols.
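Implementation (2) can be sketched as follows: with the probabilities sorted in ascending order, the fronts of the two queues always hold the two smallest weights, so no heap is needed and the whole construction is linear. Only codeword lengths are computed here, which suffices to define the code:

```python
from collections import deque

def huffman_lengths_two_queues(probs):
    """Huffman codeword lengths via the two-queue method; probs must be
    sorted in ascending order. Merged meta-symbols are appended to q2 in
    non-decreasing weight order, so a minimum is always at a queue front."""
    q1 = deque((p, [i]) for i, p in enumerate(probs))   # (weight, leaf indices)
    q2 = deque()                                        # meta-symbols

    def pop_min():
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    lengths = [0] * len(probs)
    while len(q1) + len(q2) > 1:
        w1, a = pop_min()
        w2, b = pop_min()
        for i in a + b:             # every leaf under the merge gets one level deeper
            lengths[i] += 1
        q2.append((w1 + w2, a + b))
    return lengths

print(huffman_lengths_two_queues([0.1, 0.1, 0.1, 0.1, 0.3, 0.3]))
```

For the example distribution (sorted ascending) this yields lengths 3, 3, 3, 3 for the four rare symbols and 2, 2 for the two common ones, as before.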
[Figure: evolution of forest F in six steps: the single-node trees a: 0.3, b: 0.3, c: 0.1, d: 0.1, e: 0.1, f: 0.1 are combined pairwise into meta-symbols of weights 0.2 and 0.2, then 0.4 and 0.6, and finally the root 1.0, with labels 0 and 1 attached to the branches.]
There are some special probability distributions worth looking at. First, if all symbols are
equally likely, and if the number of symbols q is a power of two, we get a perfectly balanced
binary tree and a block code with each codeword having length log2 q. If symbols are equally
likely but q is not a power of two, we obtain a shortened block code, where the lengths differ
by at most one. A somewhat weaker condition is sufficient for this: if the sum of the two
smallest probabilities (p_{q-1} + p_q) is greater than the biggest probability (p_1), then we also
end up with a (shortened) block code. This is seen immediately from technique (2) above,
because all actual symbols are combined before any of the meta-symbols, and the meta-symbols
are combined in the order they were generated.
Another extreme is the (negative) exponential distribution p_i = 1/2^i, the Huffman tree of
which is a degenerate tree where each internal node has a leaf child (and two leaves on the
lowest level). This kind of code is similar to the unary code (11...10), except for the one
codeword that has all its bits = 1.
A distribution that often occurs in practice, e.g. for symbols in natural language, is the Zipf
distribution, where the i'th symbol in the alphabet, ordered in decreasing order of probability,
has probability proportional to 1/i. Since the sum of the terms 1/i diverges as i approaches
infinity, the Zipf distribution is valid only for finite alphabets. A typical compression result is
about 5 bits per character.
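The entropy of a Zipf-distributed alphabet is easy to compute numerically. The sketch below (with alphabet sizes chosen arbitrarily for illustration) suggests that for an alphabet of around a hundred symbols the entropy is indeed close to 5 bits per symbol:

```python
import math

def zipf_entropy(q):
    """Entropy of the Zipf distribution p_i proportional to 1/i over q symbols."""
    norm = sum(1.0 / i for i in range(1, q + 1))        # normalizing constant
    probs = [1.0 / (i * norm) for i in range(1, q + 1)]
    return sum(p * math.log2(1.0 / p) for p in probs)

# entropy grows only slowly with the alphabet size q
for q in (26, 100, 1000):
    print(q, round(zipf_entropy(q), 2))
```

Note how slowly the entropy grows with q, a consequence of the heavy head of the distribution.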
In static compression, the Huffman code can be determined once for all inputs. In semi-
adaptive coding, however, the code is determined for each input message separately.
Therefore, in transmitting the message to the decoder, one has to send both the codebook and
the encoded message. This reduces the compression gain but, having a finite alphabet, the loss
is negligible for long messages. Anyway, it is worthwhile to consider the compression of the
codebook (= decoding tree) itself. The shape of the tree can be expressed with 2q-1 bits: the
whole tree always contains this number of nodes, and for each node we must tell whether it is
a leaf or not. In addition, we have to transmit the order of symbols in the leaves from left to
right. In total, the representation amounts to 2q-1 + q⌈log2 q⌉ bits. It is also possible to express
the codebook by transmitting only the lengths of codewords (in alphabetic order), or counts
of different lengths plus symbols in probability order. We shall return to the latter alternative later.
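As a quick arithmetic check, for the six-symbol example of this section the tree shape takes 2·6-1 = 11 bits and the leaf order 6·⌈log2 6⌉ = 18 bits, 29 bits in total:

```python
import math

def codebook_bits(q):
    """Bits to transmit a Huffman decoding tree of q symbols:
    2q-1 bits for the tree shape (a leaf/internal flag per node)
    plus q*ceil(log2 q) bits for the leaf symbols from left to right."""
    return (2 * q - 1) + q * math.ceil(math.log2(q))

print(codebook_bits(6))
```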
4.2.1. Extended Huffman code
For large alphabets, with the largest probability not too big, Huffman coding is generally
efficient. However, for very small alphabets with a skew distribution it can be rather bad,
compared to the entropy. For example, if we have S = {a, b, c} with P = {0.8, 0.18, 0.02},
the optimal codebook will be {0, 10, 11}, with average codeword length = 1.2 bits, while the
entropy is only 0.816 bits. The redundancy (0.384 bits per character) can be reduced by
extending the alphabet S to S^(n), consisting of n-symbol blocks, called n-grams. Their count is
q^n, and the probability of each can be computed as the product of elementary probabilities.
Then we can construct the Huffman code for the extended alphabet. As before, we have the
following condition for the average codeword length: H(S^(n)) ≤ L^(n) ≤ H(S^(n)) + 1. Since
H(S^(n)) = n·H(S) for independent symbols, the number (L) of bits per original symbol is
bounded as follows:

H(S^(n))/n ≤ L ≤ (H(S^(n)) + 1)/n ⇒ H(S) ≤ L ≤ H(S) + 1/n
In the limit, we get arbitrarily close to the entropy. In the above example, by taking pairs of
symbols (aa, ab, ac, ba, bb, bc, ca, cb, cc) as the new coding alphabet (n = 2), the average
codeword length reduces to about 0.861 bits per original symbol, leaving a redundancy of
only about 0.046 bits per symbol.
Of course, the size of the codebook, as well as the effort of computing it, will explode when n
gets larger. This restricts the applicability of the method. Moreover, many of the n-grams will
not even occur in the actual message. It should somehow be possible to compute the needed
codes on the fly, without building an explicit decoding tree. This idea will be used later,
especially in arithmetic coding.
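The effect of the extension can be checked numerically. The sketch below computes the optimal average length without building an explicit tree, using the fact that the sum of all merge (internal-node) weights equals the expected codeword length; independence of successive symbols is assumed when forming the n-gram probabilities:

```python
import heapq
import itertools
import math

def huffman_avg_length(probs):
    """Average Huffman codeword length = sum of all merge weights."""
    counter = itertools.count()             # tie-breaker for equal weights
    heap = [(p, next(counter)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        w1, _ = heapq.heappop(heap)
        w2, _ = heapq.heappop(heap)
        total += w1 + w2                    # each merge deepens every leaf below by 1
        heapq.heappush(heap, (w1 + w2, next(counter)))
    return total

base = [0.8, 0.18, 0.02]
entropy = sum(p * math.log2(1.0 / p) for p in base)
for n in (1, 2, 3):
    # probabilities of all n-grams, assuming independent symbols
    probs = [math.prod(t) for t in itertools.product(base, repeat=n)]
    print(n, round(huffman_avg_length(probs) / n, 4), round(entropy, 4))
```

The per-symbol rate drops toward the entropy as n grows, while the number of n-grams (and thus the codebook size) grows as q^n.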
4.2.2. Adaptive Huffman coding
Normal Huffman coding needs knowledge of the symbol probabilities before starting the
encoding. This means that the compression is done off-line, in two passes. If this is not
applicable, we should develop a modification where the code adapts to the probabilities
symbol by symbol, while the message is being transmitted. Then the decoder can maintain the
same code synchronously with the encoder, and there is no need to transmit the decoding tree.
A naive solution would be to start with some default, e.g. uniform distribution, maintain
symbol frequencies, and compute the whole code repeatedly; once per each transmitted
symbol. Although this is in a sense optimal, it is clearly too laborious. It should be noticed that
small local changes in the symbol probabilities do not usually cause global changes in the
Huffman tree structure. We shall take advantage of this observation. In what follows we talk
about symbol and node weights (corresponding to frequencies), instead of probabilities. This
has no effect on the resulting code, only the scaling of numbers is different. Weights are thus
integers, and a parent has always a weight which is the sum of the weights of its children. A
Huffman tree has the sibling property if the following two conditions hold:
(1) Each node, except the root, has a sibling (i.e. the binary tree is complete).
(2) The tree nodes can be listed in non-decreasing order of weight in such a way that each
node is adjacent to its sibling in the list.
Theorem. A binary tree with weights associated with its nodes, as defined above, is a
Huffman tree if and only if it has the sibling property.
The proof is skipped. Obviously, it must be possible to swap nodes during the course of
adaptation, as the order of weights (frequencies) changes. To do this, we maintain a threaded
list of nodes in non-decreasing order. In this list, the adjacent nodes with equal weights are
called a block. When, during encoding, a new symbol x has been transmitted, the tree must be
updated: The related leaf weight is increased by one. Now it may happen that the node is not
anymore in the correct place. Therefore, it is swapped with the rightmost node y of the block;
the next block must have weights at least as large as the increased value, so the order is
correct. If x = y then x was the only node in its block, and no swapping is required. Next
assume that x ≠ y and denote the sibling of x by z, and the sibling of y by w. Now we check
that the sibling property holds. Since before the update weight(x) = weight(y), after the
swap the sibling property holds for y and z. For w, two cases can be distinguished:

1) weight(x) = weight(w) before the update, and the order of nodes is x...wy... After the swap
and increase of weight(x) it holds: weight(w) ≤ weight(x), and the order is y...wx... Thus the
sibling property holds.

2) weight(x) < weight(w) before the update, and the order of nodes is x...yw... After the swap
and increase of weight(x) it holds: weight(x) ≤ weight(w), and the order is y...xw... Again,
the sibling property holds.
Notice also that in swapping, the possibly attached subtrees are carried along. The procedure
continues to the parent of x, the weight of which is also increased by one, and the same test
plus possible swap are done. The process stops at the root.
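The maintained invariant can be expressed as a small check. Here a tree is represented, purely for illustration, by its non-root nodes in threaded-list order, each as a (weight, parent_id) pair:

```python
def has_sibling_property(nodes):
    """nodes: non-root nodes as (weight, parent_id) pairs in threaded-list
    order. Weights must be non-decreasing, and each adjacent pair
    (positions 0-1, 2-3, ...) must be siblings, i.e. share a parent."""
    weights = [w for w, _ in nodes]
    if weights != sorted(weights):
        return False
    if len(nodes) % 2 != 0:
        return False          # non-root nodes of a complete tree come in pairs
    return all(nodes[i][1] == nodes[i + 1][1] for i in range(0, len(nodes), 2))

# leaves a:1 and b:1 under parent x; leaf c:2 and node x:2 under the root
ok = has_sibling_property([(1, 'x'), (1, 'x'), (2, 'root'), (2, 'root')])
# same nodes listed out of weight order: the property is violated
bad = has_sibling_property([(1, 'x'), (2, 'root'), (1, 'x'), (2, 'root')])
print(ok, bad)
```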
The following example shows the evolution (including the swap x ↔ y) of a sample Huffman
tree after an increase of weight(a). Node a maintains its position, but the increase of the weight
of its parent x from 3 to 4 makes it out of order, and relocation of x follows, whereby its
children a and b move along.