Source Encoding and Compression
Jukka Teuhola
Computer Science, Department of Information Technology, University of Turku
Spring 2016
Lecture notes

1. Introduction

This course is about data compression, which means reducing the size of the source data representation by decreasing the redundancy occurring in it. In practice this means choosing a suitable coding scheme for the source symbols. There are two practical motivations for compression: reduction of storage requirements, and increase of transmission speed in data communication. Actually, the former can be regarded as transmission of data 'from now to then'.

We shall concentrate on lossless compression, where the source data must be recovered in the decompression phase exactly into the original form. The other alternative is lossy compression, where it is sufficient to recover the original data approximately, within specified error bounds. Lossless compression is typical for alphabetic and other symbolic source data types, whereas lossy compression is most often applied to numerical data which result from digitally sampled continuous phenomena, such as sound and images.

There are two main fields of coding theory, namely
1. Source coding, which tries to represent the source symbols in minimal form for storage or transmission efficiency.
2. Channel coding, the purpose of which is to enhance detection and correction of transmission errors, by choosing symbol representations which are far apart from each other.

Data compression can be considered an extension of source coding. It can be divided into two phases:
1. Modelling of the information source means defining suitable units for coding, such as characters or words, and estimating the probability distribution of these units.
2. Source coding (called also statistical or entropy coding) is applied to the units, using their probabilities.

The theory of the latter phase is nowadays quite mature, and optimal methods are known. Modelling, instead, still offers challenges, because it is most often approximate, and can be made more and more precise. On the other hand, there are also practical considerations, in addition to minimizing the data size, such as compression and decompression speed, and also the size of the model itself.

In data transmission, the different steps can be thought to be performed in sequence, as follows:

[Figure: transmission chain Source → Source encoding → Channel encoding → Communication channel (errors) → Channel decoding → Source decoding → Sink, with the model shared by the source encoder and decoder.]

We consider source and channel coding to be independent, and concentrate on the former. Both phases can also be performed either by hardware or software. We shall discuss only the software solutions, since they are more flexible.

In compression, we usually assume two alphabets, one for the source and the other for the target. The former could be for example ASCII or Unicode, and the latter is most often binary.

There are many ways to classify the huge number of suggested compression methods. One is based on the grouping of source and target symbols in compression:
1. Fixed-to-fixed: A fixed number of source symbols are grouped and represented by a fixed number of target symbols. This is seldom applicable; an example could be reducing the ASCII codes of numbers 0-9 to 4 bits, if we know (from our model) that only numbers occur in the source.
2. Variable-to-fixed: A variable number of source symbols are grouped and represented by fixed-length codes. These methods are quite popular and many of the commercial compression packages belong to this class.
A manual example is the Braille code, developed for the blind, where 2×3 arrays of dots represent characters, but also some combinations of characters.
3. Fixed-to-variable: The source symbols are represented by a variable number of target symbols. A well-known example of this class is the Morse code, where each character is represented by a variable number of dots and dashes. The most frequent characters are assigned the shortest codes, for compression. The target alphabet of Morse code is not strictly binary, because there appear also inter-character and inter-word spaces.
4. Variable-to-variable: A variable-size group of source symbols is represented by a variable-size code. We could, for example, make an index of all words occurring in a source text, and assign them variable-length binary codes; the shortest codes of course for the most common words.

The first two categories are often called block coding methods, where the block refers to the fixed-size result units. The best (with respect to compressed size) methods today belong to category 3, where extremely precise models of the source result in very high gains. For example, English text can typically be compressed to about 2 bits per source character. Category 4 is, of course, the most general, but it seems that it does not offer notable improvement over category 3; instead, modelling of the source becomes more complicated, in respect of both space and time. Thus, in this course we shall mainly take example methods from categories 2 and 3. The former are often called also dictionary methods, and the latter statistical methods.

Another classification of compression methods is based on the availability of the source during compression:
1. Off-line methods assume that all the source data (the whole message¹) is available at the start of the compression process. Thus, the model of the data can be built before the actual encoding. This is typical of storage compression trying to reduce the consumed disk space.
2. On-line methods can start the coding without seeing the whole message at once. It is possible that the tail of the message is not even generated yet. This is the normal situation in data transmission, where the sender generates source symbols one by one and compresses them on the fly.

¹ We shall use the word message to represent the unit which is given as input to one execution of the compression program. In storage compression, the message usually means a file.

Still another classification, related to this one, is based on the emergence and status of the source model:
1. Static methods assume a fixed model, which is used by both the encoder and decoder, and which is common for all messages from the source. This is acceptable, if there is not much variation between distinct messages, for example, if they are all English texts from the same application domain.
2. Semi-adaptive methods construct a model (for example a codebook) separately for each source message. The model is based on the whole message and built before starting the actual encoding. Thus, two passes through the data are needed. The encoder must first send the model and then the encoded message to the decoder. This presupposes an off-line method. In some data communication situations this kind of delay is not acceptable.
3. Adaptive methods construct the model on the fly. When encoding a certain input symbol (or group of symbols), the model is based on the previous part of the message, already encoded and transmitted to the decoder.
This means that the decoder can construct the same model intact. Both partners gradually learn the properties of the source. The process starts with some initial model, for example a uniform distribution for all possible symbols, and then changes the model to converge towards the actual distribution.

Even though semi-adaptive methods can in principle create a more precise model of the source, it has turned out that adaptive methods can achieve the same compression efficiency. One reason is that they do not have to transmit the model from the encoder to the decoder. Another advantage is that they may adapt to local changes in the statistical properties within the message. Adaptive methods are quite common today in general-purpose compression software.

Compression efficiency is normally measured by the compression ratio, i.e. the ratio of the source message size to its compressed size, both measured in bits. In text compression, an objective measure is bits per source character (bpc), because it is independent of the original coding. In image compression, bits per pixel is a suitable measure.

Earlier, character codes were normally of fixed length, such as ASCII (7 or 8 bpc) and the ISO-8859 character sets (8 bits per character). The more recent Unicode has been adopted in many contexts (Java, XML). Some Unicode versions have a fixed number of bits per character: UCS-2 with 16 bpc, and UCS-4 with 32 bpc. The others are of variable length: UTF-8 with 8, 16, 24 or 32 bpc, and UTF-16 with 16 or 32 bpc. Even though our source text characters may have a variable number of bits, we can still talk about e.g. fixed-to-variable coding, if we encode one source character at a time, because a character is logically a fixed unit.

2. Coding-theoretic foundations

Here we study the coding of messages, consisting of symbols from a given source alphabet, denoted S = {s_1, ..., s_q}. In practice, the actual alphabet used in coding can be an extension of the original, so that also groups of symbols are assigned their own codes. A codebook contains a codeword w_i for each (extended) symbol s_i. Here we assume that the codewords can be of variable length. In adaptive compression, the codebook can also be dynamic, so that it may change during the coding of the source message. However, for each position of the message, there is a unique, discrete sequence of bits for every possible symbol. The codebook may also be implicit, if the code is a computable function of the source symbols. The function is determined from our model of the source.

As mentioned, it is possible to encode a group of source symbols as a unit. Since the codebook grows exponentially with the group size, large groups can only be coded with implicit methods. An extreme is a method which treats the whole message as one group, and computes a single codeword for it. In such coding, there is no precise correspondence between individual source symbols and coded bits: for a certain bit in the coded sequence, we cannot say exactly which source symbol has produced it. Arithmetic coding, to be discussed later, is an example of this kind of method.

Let us study the desirable properties that the codebook (explicit or implicit), denoted W = {w_1, ..., w_q}, should have. The first necessary condition is, of course, uniqueness: for each s_i, s_j ∈ S: s_i ≠ s_j ⇒ C(s_i) ≠ C(s_j), where C is the coding function. This condition would be sufficient if we should transmit only a single symbol (and if end-of-code can be signalled by some means).
However, this is not usually the case. For multi-symbol messages (and variable-length codewords) we have two options: (a) use some special symbols (so-called commas) between the codes, or (b) develop self-punctuating codes. The former would be inefficient use of the special symbols, thus we concentrate on the latter. Let X and Y be any two sequences of source symbols. If X ≠ Y ⇒ C(X) ≠ C(Y), then code C is called uniquely decodable (decipherable).

The following is an example of an ill-designed codebook:
a = 0
b = 01
c = 11
d = 00
Now consider the encoded sequence "00110". It can result either from message "dca" or "aaca". The codebook (or code, for short) is not uniquely decodable. The following codebook is, instead, uniquely decodable:
a = 0
b = 10
c = 110
d = 111
Here the only interpretation of "00110" is "aac". This code is uniquely decodable and, moreover, a prefix-free code (prefix code, for short): no codeword is the prefix of any other codeword.

Codebooks are often illustrated by binary trees (called also decoding trees or decision trees). The latter of the above codebooks can be visualized as follows:

[Figure: decoding tree. From the root, bit 0 leads to leaf a; bit 1 leads to a node whose 0-branch is leaf b and whose 1-branch is a node with 0-branch c and 1-branch d.]

It is easy to see that a prefix-free code is represented by a binary tree where each source symbol corresponds to a leaf node. Most often this kind of coding trees are also complete, i.e. all internal nodes have two children. A prefix-free code is always an instantaneous code, i.e. we know the source symbol as soon as all its code bits have been received. Instantaneous decoding is extremely useful, but not absolutely necessary for unique decodability. For example, the following code is uniquely decodable, even though it is not prefix-free:
a = 0
b = 01
c = 011
d = 111
When decoding the bit string 00110, we cannot say after the first bit whether it is part of 'a', 'b', or 'c'. After the second bit we can, since there is no codeword with "00". The next source symbol can be decided only after the last '0' has been seen. A more general worst case is the sequence "0111...111", for which the first source symbol can be decided only after seeing the very last bit. The reader might notice that each codeword is the reverse of the corresponding prefix-free codeword in the earlier codebook. Unique decodability results from the property that the coded sequence can be decoded backwards as if the codebook were prefix-free (actually it is suffix-free).

If the decoding tree is fully balanced, it means that all codewords are of the same length, i.e. we have a block code. If the size of the source alphabet is not a power of 2, at least the lowest level is incomplete. If the difference between the longest and shortest codeword length is 1, then the code is called a shortened block code. For example, the 5-symbol alphabet {a, b, c, d, e} can be encoded as
a = 00
b = 01
c = 10
d = 110
e = 111
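To make instantaneous decoding concrete, the following small Python sketch (not part of the original notes; the function name and dictionary representation are only illustrative) decodes a bit string with the prefix-free codebook above by accumulating bits until a complete codeword is recognized:

    # A minimal sketch of instantaneous (prefix-code) decoding.
    # The codebook is the prefix-free example given above.
    CODEBOOK = {"0": "a", "10": "b", "110": "c", "111": "d"}

    def decode_prefix(bits, codebook=CODEBOOK):
        """Decode a bit string produced with a prefix-free codebook."""
        symbols = []
        current = ""
        for bit in bits:
            current += bit
            if current in codebook:      # a complete codeword has been seen
                symbols.append(codebook[current])
                current = ""             # instantaneous: emit the symbol and restart
        if current:
            raise ValueError("trailing bits do not form a codeword")
        return "".join(symbols)

    # decode_prefix("00110") returns "aac", as in the example above.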
A rather simple condition has been proved about the lengths l_1, ..., l_q of an instantaneous code, namely the Kraft inequality:

Theorem 2.1. An instantaneous codebook W of q codewords exists if and only if the lengths l_1, ..., l_q of the codewords satisfy K = Σ_{i=1..q} 1/2^(l_i) ≤ 1.

Proof. For an instantaneous code we have the symbols in the leaves of the decoding tree, and any binary tree with q leaves can be interpreted into an instantaneous code. Thus we show that for any binary tree the depths of the leaves satisfy the inequality. We apply induction: For q = 1, there is one leaf with depth 1, so the sum K is equal to 1/2 ≤ 1. (Note that the code length cannot be zero, so a root-only tree is illegal.) For q = 2 we have two leaves with depth 1, so the sum is 1/2 + 1/2 ≤ 1. Now assume an arbitrary tree with q leaves, with sum K ≤ 1.
(a) Extend any leaf (with depth d) with one new leaf (making the old leaf internal). The new leaf has depth d+1. Thus the sum of the theorem becomes K − 1/2^d + 1/2^(d+1) ≤ K ≤ 1.
(b) Extend any leaf (with depth d) with two new leaves (making the old leaf internal). The new leaves have depth d+1. Now the number of leaves becomes q+1, and the sum value is K − 1/2^d + 1/2^(d+1) + 1/2^(d+1) = K ≤ 1.
Since any binary tree can be constructed step by step using extensions (a) and (b), the theorem holds. ∎

In general form, the Kraft inequality has radix r in the denominator, instead of 2, but we restrict ourselves to the binary target alphabet. Moreover, it can also be shown that, for a binary target alphabet, strict equality should hold; otherwise the code is inefficient. Let us examine the validity of the theorem by studying the previous shortened block code, with codeword lengths 2, 2, 2, 3, 3. We obtain the sum 1/4 + 1/4 + 1/4 + 1/8 + 1/8 = 1. Thus it is easy to check for a given set of lengths whether a codebook can be constructed or not, and if yes, whether the code is inefficient. Consider the set of lengths 1, 3, 3, 3, which produces the sum 7/8. We can directly say that the code is inefficient. The related tree has a leaf with no brother; therefore the leaf can be omitted, and the related symbol gets a one bit shorter codeword (corresponding to the parent). Now, the set 1, 2, 3, 3 produces K = 1.

The Kraft inequality holds for instantaneous codes. The MacMillan inequality says that the same holds for any uniquely decodable code. The proof is not quite as straightforward as above. What is important is that, when choosing a codebook, we can calmly restrict ourselves to instantaneous codes, because for any set of lengths in a uniquely decodable code, we can derive an instantaneous code with the same lengths.

In some applications, the source alphabet can in principle be infinite (but yet countable). One such situation is run-length coding, where we should encode the number of similar successive symbols in a sequence. Of course, all runs are of finite length, but no upper bound can be given. In this case we need a general coding system for all integers; the traditional fixed-length representation cannot be used. A coding system is in a sense universal, if it works for all possible sources in some class. Usually, an additional requirement is set for true universality, such that the code satisfies a certain efficiency constraint (to be explained later). Since the alphabet is infinite, we cannot store an explicit codebook, but we must provide an algorithm (function), which produces the code for any integer. Two desirable properties for such a code are:
· Effectiveness: There is an effective procedure that can decide whether a given sequence of bits belongs to the codebook or not.
· Completeness: Adding any new code would create a codebook that is not uniquely decipherable.

There are various ways of designing the universal code; Peter Elias¹ has described and analysed potential coding approaches for positive integers. A simplified description of them is as follows:

1. α-coding: This is unary coding, where number n is represented by n−1 zeroes plus an ending one-bit: 1→1; 2→01; 3→001; 4→0001; 5→00001; ... Actually, α-coding is not universal in the strict sense, because it does not satisfy the efficiency constraint.
2. β-coding: This is the traditional (positional) representation of numbers, excluding the leading zeroes. Since its codes are not self-punctuating, we need an end symbol (written here as '#'), and thus our code alphabet is ternary. The leading 1-bit can be omitted: 1→#; 2→0#; 3→1#; 4→00#; 5→01#; 6→10#; 7→11#; 8→000#; ... This can be transformed to the normal binary representation e.g. by further coding: 0→0; 1→10; #→11. The final encoding would thus be 1→11; 2→011; 3→1011; 4→0011; 5→01011; 6→10011; 7→101011; 8→00011. Notice that the lengths of the codewords are not monotonically increasing with the number to be encoded.

3. γ-coding: To avoid using the end symbol, the positional representation is prefixed by an α-coded length. Again, the leading 1-bit is omitted from the number - actually we can think that the 1-bit that ends the unary code is simultaneously the leading 1-bit. The small integers are now coded as follows: 1→1; 2→010; 3→011; 4→00100; 5→00101; 6→00110; 7→00111; 8→0001000; ...

4. δ-coding: Similar to γ-coding, but the length is not represented with unary coding, but with γ-coding. Examples: 1→1; 2→0100; 3→0101; 4→01100; 5→01101; 6→01110; 7→01111; 8→00100000; ... δ-coding is theoretically better than γ-coding for large numbers (cf. next section).

¹ Peter Elias: "Universal Codeword Sets and Representations of the Integers", IEEE Trans. on Inf. Theory, Vol. IT-21, No. 2, 1975.

There are many other useful properties of codes. One such property is robustness against transmission errors. In the worst case, inverting a single bit in a sequence of variable-length codes may scramble the whole tail of the coded sequence. The synchronization property refers to the behaviour of the decoding process in the context of errors (add/drop/complement a bit). Synchronization means how fast the decoder will catch again the correct codeword boundaries.

An example of an elegant, robust, universal coding scheme for integers is the so-called Zeckendorf representation (called also Fibonacci representation)². Its simplest form encodes numbers by bit-weighted sums of Fibonacci numbers ..., 34, 21, 13, 8, 5, 3, 2, 1. For example, the value 17 can be represented by the sum 0·34 + 0·21 + 1·13 + 0·8 + 0·5 + 1·3 + 0·2 + 1·1. The robustness of the Zeckendorf code results from the fact that every number can be represented with a code with no pairs of adjacent 1-bits (this results immediately from the Fibonacci system). Thus, by extending the code with two additional 1-bits we can make the code self-delimiting. Moreover, by transmitting the least significant bits first, up to the most significant 1-bit, only a single extra 1-bit is needed to delimit the code. The codebook for small integers looks like follows.

Number   Weights 8 5 3 2 1   Code
1        0 0 0 0 1           11
2        0 0 0 1 0           011
3        0 0 1 0 0           0011
4        0 0 1 0 1           1011
5        0 1 0 0 0           00011
6        0 1 0 0 1           10011
7        0 1 0 1 0           01011
8        1 0 0 0 0           000011
9        1 0 0 0 1           100011
10       1 0 0 1 0           010011
11       1 0 1 0 0           001011
12       1 0 1 0 1           101011
...      ...                 ...

² A. S. Fraenkel, S. T. Klein: "Robust Universal Complete Codes for Transmission and Compression", Discrete Applied Mathematics, Vol. 64, 1996, pp. 31-55.

Since no two consecutive 1-bits can exist within a codeword, excluding the end, single-bit errors can only propagate to the next codeword, but not further. What is more, it has been shown that basic arithmetic operations can be performed directly with Zeckendorf-coded integers¹. However, transmission errors are not our main concern in this course, so we shall later only mention some results and observations from this topic.

¹ P. Fenwick: "Zeckendorf Integer Arithmetic", Fibonacci Quarterly, Nov 2003, pp. 405-413.
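The construction just described can be written down compactly; the following Python sketch (not from the original notes; the function name is only illustrative) reproduces the codewords of the table above by choosing Fibonacci weights greedily from the largest downwards and then emitting the bits from the least significant weight up to the most significant 1-bit, followed by the single delimiting 1-bit:

    def zeckendorf_code(n):
        """Zeckendorf (Fibonacci) codeword of a positive integer n, LSB first."""
        fibs = [1, 2]                            # Fibonacci weights 1, 2, 3, 5, 8, ...
        while fibs[-1] < n:
            fibs.append(fibs[-1] + fibs[-2])
        bits = [0] * len(fibs)
        remaining = n
        for i in range(len(fibs) - 1, -1, -1):   # greedy: largest weight first
            if fibs[i] <= remaining:
                bits[i] = 1
                remaining -= fibs[i]
        msb = max(i for i, b in enumerate(bits) if b == 1)
        # transmit least significant bits first, then one extra 1-bit as delimiter
        return "".join(str(b) for b in bits[:msb + 1]) + "1"

    # zeckendorf_code(12) returns "101011" and zeckendorf_code(17) returns "1010011",
    # in agreement with the table and with the example value 17 = 13 + 3 + 1.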
As an example of a parametric code, we present the so-called Golomb code², which is optimal for specific types of numeric sources. As a preliminary step, we introduce a code for a uniform distribution of a finite set of symbols (i.e. all q symbols are equally likely). If q is a power of two (q = 2^k), then the codes for the alphabet naturally consist of all possible k-bit combinations. However, if q is not a power of 2, a fixed-length code of ⌈log2 q⌉ bits would be redundant (some code values unused). In this case we choose a so-called semi-fixed-length code, where some of the codewords have length ⌊log2 q⌋, and the rest have length ⌈log2 q⌉. The following table shows semi-fixed-length codebooks for q in 2...8.

² S. W. Golomb: "Run-Length Encodings", IEEE Trans. on Inf. Theory, Vol. 12, 1966, pp. 399-401.

x    q = 2   q = 3   q = 4   q = 5   q = 6   q = 7   q = 8
0    0       0       00      00      00      00      000
1    1       10      01      01      01      010     001
2            11      10      10      100     011     010
3                    11      110     101     100     011
4                            111     110     101     100
5                                    111     110     101
6                                            111     110
7                                                    111

Golomb coding is applied to numeric data (i.e. integers). It has one parameter m, which is used in the encoding of an integer x as follows. Encode the quotient a = ⌊x/m⌋ with unary coding, and the remainder x − m·a with semi-fixed-length coding, where the alphabet size q = m. The two parts are concatenated to codewords. An example with m = 3 is given in the table below. The parts of the codewords are separated by a space, for clarity. In the special case where m is a power of 2 (denote 2^k) we obtain a so-called Rice code¹, which is structurally simpler, as shown in the table for m = 2.

¹ R. F. Rice: "Some Practical Universal Noiseless Coding Techniques", Jet Propulsion Laboratory, JPL Publication 79-22, Pasadena, CA, 1979.

x    Golomb, m = 3   Rice, m = 2
0    0 0             00
1    0 10            01
2    0 11            100
3    10 0            101
4    10 10           1100
5    10 11           1101
6    110 0           11100
7    110 10          11101
8    110 11          111100
9    1110 0          111101

Both are very efficient in practice, if the distribution of the source numbers is close to geometric. Such a distribution results e.g. from repeating binary events (sequences of 'success'/'failure') where a success is more probable than a failure, and we encode the number of successes up to the next failure. Another interpretation of the same situation is run-length encoding of a binary sequence. If p = P(0) > P(1), then m should be chosen such that the probability of m 0-bits before the next 1-bit is about 0.5. From this it follows that m ≈ −1/log2 p is the best choice.

The next chapter studies the connection between symbol probabilities and optimal codes more deeply.
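Before that, the Golomb/Rice encoding described above can be sketched as follows (a Python illustration, not from the original notes; the remainder part uses the semi-fixed-length scheme of the first table, and the unary part is written as in the second table, i.e. as quotient many 1-bits followed by a 0-bit):

    def semi_fixed_code(r, q):
        """Semi-fixed-length codeword of symbol r (0 <= r < q) for alphabet size q."""
        k = max(1, (q - 1).bit_length())     # ceil(log2 q) for q >= 2
        t = (1 << k) - q                     # number of shorter, (k-1)-bit codewords
        if r < t:
            return format(r, "0{}b".format(k - 1)) if k > 1 else ""
        return format(r + t, "0{}b".format(k))

    def golomb_code(x, m):
        """Golomb codeword of a non-negative integer x with parameter m."""
        a, r = divmod(x, m)                  # quotient and remainder
        return "1" * a + "0" + semi_fixed_code(r, m)

    # golomb_code(7, 3) gives "11010" (written "110 10" in the table above), and
    # golomb_code(5, 2) gives the Rice codeword "1101".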
3. Information-theoretic foundations

Information theory, mainly founded by Claude Shannon¹ at Bell Labs, studies both noise protection and efficient use of the communication channel. It gives bounds for the ultimate transmission rate of the channel, and for the ultimate data compression. Of these two we concentrate on the latter. First we shall define a quantitative measure of information for both single symbols and the whole alphabet. Notice that the word 'information' is here a quite concrete technical concept, and bears only loose analogy with the 'normal' concept of information in human communication.

¹ C. E. Shannon: "A Mathematical Theory of Communication", Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, 1948.

Suppose that we have a source alphabet S = {s_1, ..., s_q}, with probabilities P = {p_1, ..., p_q}. The question is how much information the receiver gets when seeing a certain symbol s_i. The principle is that a less frequent character gives more information than a more frequent one. Thus, the amount of information measures the degree of surprise (uncertainty). If some p_i = 1, then all symbols to be sent are = s_i, and there is no surprise - no information. For constructing an information function I(p), three properties should be satisfied:
1. I(p) ≥ 0,
2. I(p_1 · p_2) = I(p_1) + I(p_2) for independent events,
3. I(p) is a continuous function.

From condition 2 it follows that I(p^n) = n·I(p). We see that I satisfies the same properties as the logarithm function. In fact, the solution is essentially unique, and we define I(p) = k·log p. From property 1 (and noticing that 0 ≤ p ≤ 1), we deduce the value k = −1. Hence we have:

I(p) = −log p = log(1/p)

The base of the logarithm can be chosen. Normally we take base 2, and define the unit of information to be a bit. This word has now two meanings: a digit in base-2 representation and a unit in measuring information. Of course, this is not a coincidence, as seen from the following example: Consider tossing a (fair) coin with P(heads) = P(tails) = 1/2. Then I(heads) = I(tails) = −log 1/2 = 1 (bit), which is consistent with the idea that with one binary digit we can tell the outcome. If the coin is unfair, such that P(heads) = 1/8 and P(tails) = 7/8, then we have I(heads) = −log(1/8) = 3 bits and I(tails) = −log(7/8) ≈ 0.193 bits. Thus getting heads is a much bigger surprise than getting tails. Notice also that the information amount is a real-valued quantity, and thus more general than the binary digit. For studying the additive property, consider tossing a fair coin three times. The information amount of the sequence heads-tails-heads (or any other 3-sequence) is I(1/2 · 1/2 · 1/2) = I(1/8) = −log(1/8) = 3 bits = I(1/2) + I(1/2) + I(1/2), as expected.

Next we define the average information when receiving a symbol from alphabet S, with probability distribution P. Since we get amount I(p_i) with probability p_i, the average (expected) information per symbol, called entropy, is a weighted sum:

H(P) = Σ_{i=1..q} p_i · I(p_i) = Σ_{i=1..q} p_i · log2(1/p_i)

Example. Let P = {0.1, 0.2, 0.3, 0.4}. Then H(P) = −0.1·log 0.1 − 0.2·log 0.2 − 0.3·log 0.3 − 0.4·log 0.4 ≈ 0.332 + 0.464 + 0.521 + 0.529 = 1.846. Compare this with the uniform distribution P = {0.25, 0.25, 0.25, 0.25}, which has entropy H(P) = −4 × 0.25·log 0.25 = 2. It is easy to prove the observation generally: a uniform distribution gives the maximum entropy. Notice the analogy with physical phenomena e.g. in thermodynamics, where the same word is used.

Shannon has proved the following fundamental "noiseless source encoding theorem":

Theorem 3.1. Entropy H(S) gives a lower bound to the average code length L for any instantaneously decodable system. (Proof skipped.)

Even though precise symbol probabilities are seldom known, entropy is an excellent reference measure when studying the compression power of a source coding method. The difference L − H(S) between the average codeword length and the entropy is called the redundancy of the code. The equality in Theorem 3.1 holds if the probabilities p_i are powers of 2, i.e. p_i = 1/2^(l_i). In this case we reach the theoretical optimum with redundancy = 0. Even if this were not the case, L can always be limited to at most H(S) + 1 by choosing

log2(1/p_i) ≤ l_i < log2(1/p_i) + 1
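As a small illustration (not part of the original notes), the entropy of a distribution can be computed directly from the definition above:

    import math

    def entropy(probabilities):
        """Entropy H(P) in bits: the sum of p * log2(1/p) over the distribution."""
        return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

    # Reproduces the examples above (values rounded):
    # entropy([0.1, 0.2, 0.3, 0.4])  ->  about 1.846 bits
    # entropy([0.25] * 4)            ->  2.0 bits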
A code is called universal if L ≤ c_1·H(S) + c_2 asymptotically for some constants c_1 and c_2 and for nonzero entropy H(S). The code is asymptotically optimal if c_1 = 1.

Consider the encoding of natural numbers n = 1, 2, etc. Often we do not know their exact probabilities, and they are hard to find out experimentally. However, the following fundamental result has been proved¹:

Theorem 3.2. There exist codes which satisfy the above universality condition for any distribution satisfying P(1) ≥ P(2) ≥ P(3) ≥ ... ≥ P(n) ≥ ... (Proof skipped.)

¹ See: Peter Elias: "Universal Codeword Sets and Representations of the Integers", IEEE Trans. on Inf. Theory, IT-21, No. 2, pp. 194-203, 1975.

The codeword lengths for such encodings are non-decreasing with n. Of the coding methods for numbers (Chapter 2), it can be shown that the α-code is not universal in the above sense, whereas the γ- and δ-codes are. Moreover, the γ-code has constant c_1 = 2, while the δ-code has c_1 = 1, the latter being asymptotically optimal. The Zeckendorf code (called also Fibonacci code) is also universal, but not asymptotically optimal; its c_1-value is about 1.440. Nevertheless, for a large subset of integers, the Zeckendorf codewords are actually shorter than δ-codes.

Even though the α-code (= unary code) is not universal, it is optimal if the probabilities are P(1) = 1/2, P(2) = 1/4, P(3) = 1/8, ..., P(n) = 1/2^n, ... In fact, this kind of distribution is not uncommon. Note that the sum of the probabilities 1/2 + 1/4 + 1/8 + ... = 1, as required.

Above we assumed the successive symbols of the message to be independent. In practice this is not the case, so we should extend the concept of entropy. This leads us to the modelling part of compression, which we shall study later more deeply. Here we mention the natural extension to conditional probabilities. Suppose that we know m previous symbols before s_i, namely s_{i_1}, ..., s_{i_m}, and the related conditional probability P(s_i | s_{i_1}, ..., s_{i_m}). The conditional information of s_i in this m-memory source is naturally log2(1 / P(s_i | s_{i_1}, ..., s_{i_m})), and the conditional entropy is

H(S | s_{i_1}, ..., s_{i_m}) = Σ_S P(s_i | s_{i_1}, ..., s_{i_m}) · log2(1 / P(s_i | s_{i_1}, ..., s_{i_m}))

The situation can be described by an m'th order Markov process, which is a finite automaton with states for all m-grams of symbols, and transitions corresponding to successors of m-grams, with the conditional probabilities. In an ergodic Markov process the state probabilities approach the so-called equilibrium. The global entropy is

H(S) = Σ_{S^m} P(s_{i_1}, ..., s_{i_m}) · Σ_S P(s_i | s_{i_1}, ..., s_{i_m}) · log2(1 / P(s_i | s_{i_1}, ..., s_{i_m}))
     = Σ_{S^(m+1)} P(s_{i_1}, ..., s_{i_m}, s_i) · log2(1 / P(s_i | s_{i_1}, ..., s_{i_m}))

There is another measure for the complexity of a message, called Kolmogorov complexity, which can be considered more general than entropy. It is defined as the length of the shortest (binary) program for generating the message. If the sequence of symbols is drawn at random from a distribution that has entropy H, then the Kolmogorov complexity is approximately equal to H. However, consider a pseudo-random number generator, which produces a seemingly random sequence of symbols. The generating algorithm can be regarded as the compressed representation of the message, because the sequence can be recovered by applying it. Unfortunately, Kolmogorov complexity has been shown to be non-computable, i.e. there is no algorithm to determine it (cf. Gödel's incompleteness theorem). This makes the concept rather impractical. However, a somewhat related idea is used in fractal compression of images, where the compressor tries to find self-repeating components from the image, i.e. tries to find the 'hidden algorithm' that has generated the image. However, this can only be done approximately (i.e. the method is lossy), unless the image is a pure fractal.
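To connect the conditional entropy above with data, the following Python sketch (illustrative only, not from the original notes; it assumes the sample text is longer than m) estimates the m-memory entropy of a sample string from its m-gram and (m+1)-gram frequencies, following the last formula of this chapter:

    from collections import Counter
    import math

    def conditional_entropy(text, m):
        """Estimate the order-m (m-memory) entropy of a text, in bits per symbol."""
        contexts = Counter(text[i:i + m] for i in range(len(text) - m))
        extended = Counter(text[i:i + m + 1] for i in range(len(text) - m))
        n = sum(extended.values())
        h = 0.0
        for gram, count in extended.items():
            p_joint = count / n                        # P(s_i1, ..., s_im, s_i)
            p_cond = count / contexts[gram[:m]]        # P(s_i | s_i1, ..., s_im)
            h += p_joint * math.log2(1.0 / p_cond)
        return h

    # conditional_entropy(text, 0) gives the plain memoryless estimate; increasing m
    # typically gives smaller values, reflecting the dependencies between symbols.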
4. Source coding methods

In this section we shall study the second part of compression, namely source coding, and leave the first, 'harder' part (= modelling) to the next sections. Source coding is based on the assumed probability distribution (P) of the source symbols (S). For a finite alphabet it is usually possible to give at least estimates of the probabilities. For an infinite alphabet the distribution is often implicit. For example, we could assume that the probability of a number is inversely proportional to its magnitude. At least, it must hold that p_i → 0 as i → ∞, because Σ p_i = 1. We shall concentrate on finite alphabets.

4.1. Shannon-Fano coding

The theory of Section 3 gives a symbol s_i with probability p_i the information log2(1/p_i) bits. An optimal coding solution would give the symbol a code of this length. However, the length is an integer only if p_i is a negative power of 2. If this is not the case, we can take as the actual length the next larger integer, i.e. ⌈log2(1/p_i)⌉. It is easy to see that this choice satisfies the Kraft inequality:

l_i ≥ log2(1/p_i)  ⇒  2^(l_i) ≥ 1/p_i  ⇒  p_i ≥ 1/2^(l_i)  ⇒  Σ p_i ≥ Σ (1/2)^(l_i)  ⇒  1 ≥ Σ (1/2)^(l_i)

Thus, there is a prefix code with lengths ⌈log2(1/p_i)⌉ for i = 1, ..., q. Since every codeword length is at most one larger than the theoretical optimum, it also holds that the average code length L satisfies:

H(S) ≤ L ≤ H(S) + 1

Example. Let p_1 = p_2 = 0.3, and p_3 = p_4 = p_5 = p_6 = 0.1. From log2(1/0.3) ≈ 1.737 and log2(1/0.1) ≈ 3.322, we get the code lengths 2, 2, 4, 4, 4, 4. The codes can be assigned in lexicographic order for the list of lengths sorted into ascending order:

s_1 = 00, s_2 = 01, s_3 = 1000, s_4 = 1001, s_5 = 1010, s_6 = 1011

which has the average code length of 2.8, while the entropy is H(S) ≈ 2.371. If we study the decoding tree, we notice that it is not complete (not succinct): the subtree 11... is missing. Checking the Kraft inequality confirms the property (the sum is less than 1):

Σ_{i=1..6} 1/2^(l_i) = 1/2² + 1/2² + 1/2⁴ + 1/2⁴ + 1/2⁴ + 1/2⁴ = 3/4 < 1

The situation is more obvious if we take a binary source alphabet with p_1 = 0.99 and p_2 = 0.01, for which the above rule gives code lengths l_1 = 1 and l_2 = 7, although the only sensible possibility is to choose both lengths to be = 1 (corresponding to codes 0 and 1). Nevertheless, for lengths 1 and 7, the average code length would yet be only 0.99×1 + 0.01×7 = 1.06, while the entropy is −0.99·log2 0.99 − 0.01·log2 0.01 ≈ 0.08 bits.

The above description is the theoretical basis for Shannon-Fano coding (presented independently by Shannon & Weaver and Fano in 1949), but the actual algorithm is more constructive, and also avoids non-complete decoding trees. Its idea is to divide the 'probability mass' (with related symbols) repeatedly into two approximately same-sized halves, and simultaneously construct the decoding tree in a top-down fashion. With no loss of generality, we assume here that the probabilities p_1, ..., p_q are in descending order. Sorting may thus precede the following codebook generation algorithm:

Algorithm 4.1. Shannon-Fano codebook generation.
Input: Alphabet S = {s_1, ..., s_q} with probability distribution P = {p_1, ..., p_q}, where p_i ≥ p_{i+1}.
Output: Decoding tree for S.
begin
  Create a root vertex r and associate alphabet S with it.
  If S has only one symbol, then return a tree with r as the only node (root).
  Choose index j (≠ 0 and ≠ q), for which the sums Σ_{i=1..j} p_i and Σ_{i=j+1..q} p_i are the closest.
  Find decoding trees, rooted by r_1 and r_2, for the sub-alphabets {s_1, ..., s_j} and {s_{j+1}, ..., s_q} recursively, and add them as subtrees of r. Attach labels 0 and 1 to the corresponding parent-child relationships r→r_1 and r→r_2.
  Return the tree rooted by r.
end

It can be proved that the same properties hold for this code as for the previous code based on direct computation of the codeword lengths, i.e. the redundancy is at most 1. To be precise, the selection of an optimal split index j is not trivial (generally, optimal partitioning is NP-hard). In practice the suggested condition (based on closest sums) is sufficient.

Example. S = {a, b, c, d, e, f}, P = {0.3, 0.3, 0.1, 0.1, 0.1, 0.1}. The codebook generation proceeds step by step, as shown in the figure below, with subsets of symbols and related sums of probabilities marked for the nodes. The first split is done into probability subsets {0.3, 0.3} and {0.1, 0.1, 0.1, 0.1}, attached with bits 0 and 1 (to be prefixes for the codes in the subsets). The former is split again, and produces singleton subsets {0.3} and {0.3} with final codes 00 and 01 for a and b. Subset {0.1, 0.1, 0.1, 0.1} is split first into {0.1, 0.1} and {0.1, 0.1}, producing prefixes (= paths from the root) 10 and 11. The final splits produce four singleton subsets, each with probability 0.1. The related codes are 100, 101, 110 and 111. For the codebook

a = 00, b = 01, c = 100, d = 101, e = 110, f = 111

the average code length is 2.4 bits, having redundancy ≈ 0.029 bits per symbol. In fact, the obtained code is optimal (among methods assigning distinct codewords for symbols), although generally the Shannon-Fano code does not guarantee optimality. The tree structure is suitable for the decoder: by branching according to the received bits it is easy to pick up the correct original symbol from the leaf. The encoder, instead, needs some other data structure, presumably a table with an entry for each symbol and the related codeword.

[Figure: steps (1)-(5) of the construction. The root {a,b,c,d,e,f}: 1.0 is split into {a,b}: 0.6 (bit 0) and {c,d,e,f}: 0.4 (bit 1); {a,b} is split into leaves a: 0.3 and b: 0.3; {c,d,e,f} is split into {c,d}: 0.2 and {e,f}: 0.2, which are finally split into the leaves c, d, e and f, each 0.1.]
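A minimal Python sketch of Algorithm 4.1 (not from the original notes; it returns the codewords directly as bit strings instead of building an explicit tree, and assumes the probabilities are given in descending order):

    def shannon_fano(symbols, probs):
        """Recursive Shannon-Fano codebook construction."""
        if len(symbols) == 1:
            return {symbols[0]: ""}
        total = sum(probs)
        # choose the split index j for which the two partial sums are closest
        best_j, best_diff, running = 1, float("inf"), 0.0
        for j in range(1, len(symbols)):
            running += probs[j - 1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_j, best_diff = j, diff
        left = shannon_fano(symbols[:best_j], probs[:best_j])
        right = shannon_fano(symbols[best_j:], probs[best_j:])
        codebook = {s: "0" + c for s, c in left.items()}
        codebook.update({s: "1" + c for s, c in right.items()})
        return codebook

    # shannon_fano(list("abcdef"), [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]) yields the codebook
    # {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '111'} of the example.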
4.2. Huffman coding

The best-known source coding method is without doubt Huffman coding. The generation of a Huffman code resembles the Shannon-Fano method, but the direction is opposite: Shannon-Fano generates the decoding tree top-down, while the Huffman method does it bottom-up. The average codeword length of the Huffman code reaches the entropy if the probabilities are integer powers of 1/2 (cf. Chapter 3). It also holds that, even in cases where the entropy is not reached, Huffman coding is optimal among the methods that assign distinct codewords for separate, independent symbols of the source alphabet.

The idea of the Huffman code is developed as follows. First assume, without loss of generality, that the probabilities of the symbols are sorted into decreasing order: p_1 ≥ p_2 ≥ ... ≥ p_q. Obviously, this implies that the codeword lengths must satisfy l_1 ≤ l_2 ≤ ... ≤ l_q. Since all non-root nodes in an efficient (complete) decoding tree have a sibling, and symbol s_q is on the lowest level (leaf), also s_{q-1} should be a leaf, more precisely the sibling of s_q, and have the same codeword length. (The choice may not be unique, if there are equal probabilities, but then we can make an arbitrary choice.) Therefore, we can assign suffixes 0 and 1 to the last two symbols; their final codewords will be of the same length and differ only in the last bit position. The pair can be considered a single 'meta-symbol' (= sub-alphabet {s_{q-1}, s_q}), which has probability p_{q-1} + p_q, and becomes the parent of the last two symbols. From this we can continue by repeating the process for the reduced alphabet of q−1 symbols, and attaching new parents to already generated subtrees. More formally, the algorithm proceeds as follows:

Algorithm 4.2. Huffman codebook generation.
Input: Alphabet S = {s_1, ..., s_q} with probability distribution P = {p_1, ..., p_q}, where p_i ≥ p_{i+1}.
Output: Decoding tree for S.
begin
  Initialize forest F to contain a distinct single-node tree T_i for each symbol s_i and set weight(T_i) = p_i.
  while |F| > 1 do begin
    Let X and Y be two trees in the forest which have the lowest weights.
    Create a binary tree Z, with X and Y as subtrees (equipped with labels 0 and 1).
    The root of Z is a new node representing the combined symbol set of X and Y.
    Set weight(Z) = weight(X) + weight(Y).
    Add Z to forest F and remove X and Y from it.
  end
  Return the single remaining tree of forest F.
end

The resulting code is a prefix code, because the original symbols will be leaves of the final tree. The proof of optimality is skipped here, but essentially we proved it already in the above derivation. As Huffman coding is optimal, and Shannon-Fano satisfies H(S) ≤ L ≤ H(S) + 1, then also the Huffman code satisfies it. In fact, it has been proved that an upper bound for the redundancy is p_1 + 0.086, where p_1 is the highest probability among the symbols. To be precise, the upper bound is min(p_1 + 0.086, 1).

The generated codebook is not unique, if there occur equal probabilities during the process; ties in choosing the two smallest weights can be broken arbitrarily, without affecting the average length L of the codewords. However, if we always favour the smallest trees, we obtain a code the longest codeword of which is minimal. This may have practical advantage. When we still notice that, in each combining step, 0 and 1 can be assigned to the subtrees in either way, it is clear that the number of different optimal codebooks is exponential.

Example. Determine the Huffman code for alphabet S = {a, b, c, d, e, f}, where P = {0.3, 0.3, 0.1, 0.1, 0.1, 0.1}. Forest F develops in the steps shown in the figure below. We obtain the same codebook 00, 01, 100, 101, 110, 111 as with the Shannon-Fano method, which in this case is also optimal.

[Figure: evolution of forest F in six steps. (1) Six single-node trees with weights 0.3, 0.3, 0.1, 0.1, 0.1, 0.1. (2)-(3) Two pairs of 0.1-weight leaves are combined into metasymbols of weight 0.2. (4) The two 0.2-trees are combined into a tree of weight 0.4. (5) Leaves a and b are combined into a tree of weight 0.6. (6) The 0.6- and 0.4-trees are combined into the final tree of weight 1.0, giving the codes a=00, b=01, c=100, d=101, e=110, f=111.]

Technically the Huffman algorithm can be implemented e.g. in the following ways:
(1) Build and maintain a heap (more precisely, a min-heap) structure for the symbols (both original and meta-symbols), ordered by weight. The heap can be built in linear time using Floyd's algorithm. Extracting the smallest and adding a new meta-symbol both need O(log q) operations for q nodes, and thus the total complexity is O(q log q). (A small sketch of this technique is given after this list.)
(2) If the alphabet is already sorted by probability, we can build the tree in linear time by keeping a queue of the created metasymbols: they will be generated in increasing order of weight. The two smallest weights are thus found either among the sorted list of original symbols or from the queue of metasymbols.
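The sketch below (Python, not from the original notes) follows technique (1) using a min-heap; for brevity, each tree is represented directly as a partial codebook that maps its symbols to code suffixes, instead of explicit nodes:

    import heapq
    from itertools import count

    def huffman_codebook(symbols, probs):
        """Huffman codebook construction with a min-heap of (weight, tree) entries."""
        tiebreak = count()          # breaks weight ties so that trees are never compared
        heap = [(p, next(tiebreak), {s: ""}) for s, p in zip(symbols, probs)]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)    # the two lowest-weight trees
            w2, _, right = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
        return heap[0][2]

    # huffman_codebook(list("abcdef"), [0.3, 0.3, 0.1, 0.1, 0.1, 0.1]) gives an optimal
    # code with lengths 2, 2, 3, 3, 3, 3; the exact bit patterns depend on tie-breaking.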
There are some special probability distributions worth looking at. First, if all symbols are equally likely, and if the number of symbols q is a power of two, we get a perfectly balanced binary tree and a block code with each codeword having length log2 q. If the symbols are equally likely but q is not a power of two, we obtain a shortened block code, where the lengths differ by at most one. A somewhat weaker condition is sufficient for this: if the sum of the two smallest probabilities (p_{q-1} + p_q) is greater than the biggest probability (p_1), then we also end up with a (shortened) block code. This is seen immediately from technique (2) above, because all actual symbols are combined before any of the metasymbols, and the metasymbols are combined in the order they were generated.

Another extreme is the (negative) exponential distribution p_i = 1/2^i, the Huffman tree of which is a degenerated tree where each internal node has a leaf child (two leaves on the lowest level). This kind of code is similar to the unary code (11...10), except for one codeword that has all its bits = 1.

A distribution that often occurs in practice, e.g. for symbols in natural language, is the Zipf distribution, where the i'th symbol in the alphabet, ordered in decreasing order of probability, has probability proportional to 1/i. Since the sum of 1/i is not limited when i approaches infinity, the Zipf distribution is valid only for finite alphabets. A typical compression result is about 5 bits per character.

In static compression, the Huffman code can be determined once for all inputs. In semi-adaptive coding, however, the code is determined for each input message separately. Therefore, in transmitting the message to the decoder, one has to send both the codebook and the encoded message. This reduces the compression gain but, for a finite alphabet, the loss is negligible for long messages. Anyway, it is worthwhile to consider the compression of the codebook (= decoding tree) itself. The shape of the tree can be expressed with 2q−1 bits: the whole tree always contains this number of nodes, and for each node we must tell whether it is a leaf or not. In addition, we have to transmit the order of the symbols in the leaves from left to right. In total, the representation amounts to 2q−1 + q⌈log2 q⌉ bits. It is also possible to express the codebook by transmitting only the lengths of the codewords (in alphabetic order), or the counts of different lengths plus the symbols in probability order. We shall return to the latter alternative later.

4.2.1. Extended Huffman code

For large alphabets, with the largest probability not too big, Huffman coding is generally efficient. However, for very small alphabets with a skew distribution it can be rather bad, compared to the entropy. For example, if we have S = {a, b, c} with P = {0.8, 0.18, 0.02}, the optimal codebook will be {0, 10, 11}, with average codeword length 1.2 bits, while the entropy is only 0.816 bits.
The redundancy (0.384 bits per character) can be reduced by extending the alphabet S to S^(n), consisting of n-symbol blocks, called n-grams. Their count is q^n, and the probability of each can be computed as the product of the elementary probabilities. Then we can construct the Huffman code for the extended alphabet. As before, we have the following condition for the average codeword length: H(S^(n)) ≤ L^(n) ≤ H(S^(n)) + 1. Therefore, the number (L) of bits per original symbol is bounded as follows:

H(S^(n))/n ≤ L ≤ (H(S^(n)) + 1)/n  ⇒  H(S) ≤ L ≤ H(S) + 1/n

In the limit, we get arbitrarily close to the entropy. In the above example, by taking pairs of symbols (aa, ab, ac, ba, bb, bc, ca, cb, cc) as the new coding alphabet (n = 2), the average codeword length will reduce to 0.8758 bits per original symbol, having a redundancy of only about 0.06 bits per symbol. Of course, the size of the codebook, as well as the effort of computing it, will explode when n gets larger. This restricts the applicability of the method. Moreover, many of the n-grams will not even occur in the actual message. It should somehow be possible to compute the needed codes on the fly, without building an explicit decoding tree. This idea will be used later, especially in arithmetic coding.

4.2.2. Adaptive Huffman coding

Normal Huffman coding needs knowledge of the symbol probabilities before starting the encoding. This means that the compression is done off-line, in two passes. If this is not applicable, we should develop a modification where the code adapts to the probabilities symbol by symbol, while the message is being transmitted. Then the decoder can maintain the same code synchronously with the encoder, and there is no need to transmit the decoding tree. A naive solution would be to start with some default, e.g. uniform distribution, maintain symbol frequencies, and compute the whole code repeatedly, once per each transmitted symbol. Although this is in a sense optimal, it is clearly too laborious. It should be noticed that small local changes in the symbol probabilities do not usually cause global changes in the Huffman tree structure. We shall take advantage of this observation.

In what follows we talk about symbol and node weights (corresponding to frequencies), instead of probabilities. This has no effect on the resulting code; only the scaling of the numbers is different. Weights are thus integers, and a parent always has a weight which is the sum of the weights of its children. A Huffman tree has the sibling property if the following two conditions hold:
(1) Each node, except the root, has a sibling (i.e. the binary tree is complete).
(2) The tree nodes can be listed in non-decreasing order of weight in such a way that each node is adjacent to its sibling in the list.

Theorem. A binary tree with weights associated with its nodes, as defined above, is a Huffman tree if and only if it has the sibling property. (The proof is skipped.)

Obviously, it must be possible to swap nodes during the course of adaptation, as the order of the weights (frequencies) changes. To do this, we maintain a threaded list of nodes in non-decreasing order of weight. In this list, the adjacent nodes with equal weights are called a block. When, during encoding, a new symbol x has been transmitted, the tree must be updated: the weight of the related leaf is increased by one. Now it may happen that the node is not anymore in the correct place.
Therefore, it is swapped with the rightmost node y of the block; the next block must have weights at least as large as the increased value, so the order is correct. If x = y, then x was the only node in its block, and no swapping is required. Next assume that x ≠ y, and denote the sibling of x by z, and the sibling of y by w. Now we check that the sibling property holds. Since before the update weight(x) = weight(y), after the swap the sibling property holds for y and z. For w, two cases can be distinguished:
1) weight(x) = weight(w) before the update, and the order of the nodes is x...wy... After the swap and the increase of weight(x) it holds that weight(w) < weight(x), and the order is y...wx... Thus the sibling property holds.
2) weight(x) < weight(w) before the update, and the order of the nodes is x...yw... After the swap and the increase of weight(x) it holds that weight(x) ≤ weight(w), and the order is y...xw... Again, the sibling property holds.
Notice also that in swapping, the possibly attached subtrees are carried along. The procedure continues to the parent of x, the weight of which is also increased by one, and the same test plus possible swap are done. The process stops at the root. The following example shows the evolution (including the swap x ↔ y) of a sample Huffman tree, after an increase of weight(a). Node a maintains its position, but the increase of the weight of its parent x from 3 to 4 makes it out of order, and a relocation of x follows, whereby its children a and b move along.