How does information theory work?

Information Theory

This section briefly introduces Shannon's information theory, founded in 1948, which represents the basis of all communication systems. Although the theory is used here only with respect to communication systems, it can be applied in a much broader context, for example, for the analysis of stock markets (Sloane and Wyner 1993). Furthermore, the emphasis is on the channel coding theorem; source coding and cryptography are not addressed. The channel coding theorem delivers ultimate bounds on the efficiency of communication systems. Hence, we can evaluate the performance of practical systems as well as encoding and decoding algorithms. However, the theorem is not constructive in the sense that it shows us how to design good codes. Nevertheless, practical codes have already been found that approach the limits predicted by Shannon (ten Brink 2000b).

This chapter starts with some definitions concerning information, entropy, and redundancy for scalars as well as vectors. On the basis of these definitions, Shannon's channel coding theorem with channel capacity, Gallager exponent, and cutoff rate will be presented. The meaning of these quantities is illustrated for the Additive White Gaussian Noise (AWGN) and flat fading channels. Next, the general method to calculate capacity will be extended to vector channels with multiple inputs and outputs. Finally, some information theoretic aspects of multiuser systems are explained.

2.1 Basic Definitions

2.1.1 Information, Redundancy, and Entropy

In order to obtain a tool for evaluating communication systems, the term information must be mathematically defined and quantified. A random process $\mathcal{X}$ is assumed that can take on values out of a finite alphabet $\mathbb{X}$ consisting of elements $X_\mu$ with probabilities $\Pr\{X_\mu\}$. By intuition, the information $I(X_\mu)$ of a symbol $X_\mu$ should fulfill the following conditions.

1. The information of an event is always nonnegative, that is, $I(X_\mu) \ge 0$.
2. The information of an event $X_\mu$ depends on its probability, that is, $I(X_\mu) = f(\Pr\{X_\mu\})$. Additionally, the information of a rare event should be larger than that of a frequently occurring event.
3. For statistically independent events $X_\mu$ and $X_\nu$ with $\Pr\{X_\mu, X_\nu\} = \Pr\{X_\mu\}\Pr\{X_\nu\}$, the common information of both events should be the sum of the individual contents, that is, $I(X_\mu, X_\nu) = I(X_\mu) + I(X_\nu)$.

Combining conditions two and three leads to the relation $f(\Pr\{X_\mu\} \cdot \Pr\{X_\nu\}) = f(\Pr\{X_\mu\}) + f(\Pr\{X_\nu\})$. The only function that fulfills this condition is the logarithm. Taking care of $I(X_\mu) \ge 0$, the information of an event or a symbol $X_\mu$ is defined by (Shannon 1948)

$I(X_\mu) = \log_2 \dfrac{1}{\Pr\{X_\mu\}} = -\log_2 \Pr\{X_\mu\}.$   (2.1)

Since digital communication systems are based on the binary representation of symbols, the logarithm to base 2 is generally used and $I(X_\mu)$ is measured in bits. However, different definitions exist using, for example, the natural logarithm (nat) or the logarithm to base 10 (Hartley).

The average information of the process $\mathcal{X}$ is called entropy and is defined by

$\bar{I}(\mathcal{X}) = E_{\mathcal{X}}\{I(X_\mu)\} = -\sum_\mu \Pr\{X_\mu\} \cdot \log_2 \Pr\{X_\mu\}.$   (2.2)

It can be shown that the entropy becomes maximum for equally probable symbols $X_\mu$. In this case, the entropy of an alphabet consisting of $2^k$ elements equals

$\bar{I}_{\max}(\mathcal{X}) = \sum_\mu 2^{-k} \cdot \log_2 2^k = \log_2 |\mathbb{X}| = k$ bit.   (2.3)

Generally, $0 \le \bar{I}(\mathcal{X}) \le \log_2 |\mathbb{X}|$ holds.
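As a small illustration of (2.1)-(2.3), the following Python sketch (not part of the original text) computes symbol information and entropy for a made-up source; the probability values are arbitrary examples.

```python
# Illustrative sketch: symbol information (2.1) and entropy (2.2)-(2.3) in bits.
import numpy as np

def information(p):
    """Information of a single symbol with probability p, cf. (2.1)."""
    return -np.log2(p)

def entropy(probs):
    """Average information (entropy) of the alphabet, cf. (2.2)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log2(probs)))

probs = [0.5, 0.25, 0.125, 0.125]          # made-up source statistics
print([information(p) for p in probs])     # 1, 2, 3, 3 bits
print(entropy(probs))                      # 1.75 bit
print(entropy([0.25] * 4))                 # uniform case: log2(4) = 2 bit, the maximum (2.3)
```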
For an alphabet consisting of only two elements with probabilities $\Pr\{X_1\} = P_e$ and $\Pr\{X_2\} = 1 - P_e$, we obtain the binary entropy function

$\bar{I}_2(P_e) = -P_e \cdot \log_2(P_e) - (1 - P_e) \cdot \log_2(1 - P_e).$   (2.4)

This is depicted in Figure 2.1. Obviously, the entropy reaches its maximum $\bar{I}_{\max} = 1$ bit for the highest uncertainty at $\Pr\{X_1\} = \Pr\{X_2\} = P_e = 0.5$. It is zero for $P_e = 0$ and $P_e = 1$ because the symbols are already known a priori and do not contain any information. Moreover, the entropy is a concave function with respect to $P_e$. This is a very important property that also holds for more than two variables.

A practical interpretation of the entropy can be obtained from rate distortion theory (Cover and Thomas 1991). It states that the minimum average number of bits required for representing the events x of a process $\mathcal{X}$ without losing information is exactly its entropy $\bar{I}(\mathcal{X})$. Encoding schemes that use fewer bits cause distortions. Finding powerful schemes that need as few bits as possible to represent a random variable is generally nontrivial and subject to source or entropy coding.

Figure 2.1 Binary entropy function $\bar{I}_2(P_e)$ versus $P_e$

The difference between the average number $\bar{m}$ of bits a particular entropy encoder needs and the entropy is called redundancy

$R = \bar{m} - \bar{I}(\mathcal{X}); \qquad r = \dfrac{\bar{m} - \bar{I}(\mathcal{X})}{\bar{I}(\mathcal{X})}.$   (2.5)

In (2.5), R and r denote the absolute and the relative redundancy, respectively. Well-known examples are Huffman and Fano codes, run-length codes and Lempel-Ziv codes (Bell et al. 1990; Viterbi and Omura 1979; Ziv and Lempel 1977).

2.1.2 Conditional, Joint and Mutual Information

Since the scope of this work is the communication between two or more subscribers, at least two processes $\mathcal{X}$ and $\mathcal{Y}$ with symbols $X_\mu \in \mathbb{X}$ and $Y_\nu \in \mathbb{Y}$, respectively, have to be considered. The first process represents the transmitted data, the second the corresponding received symbols. For the moment, the channel is supposed to have discrete input and output symbols and can be statistically described by the joint probabilities $\Pr\{X_\mu, Y_\nu\}$ or, equivalently, by the conditional probabilities $\Pr\{Y_\nu \mid X_\mu\}$ and $\Pr\{X_\mu \mid Y_\nu\}$ together with the a priori probabilities $\Pr\{X_\mu\}$ and $\Pr\{Y_\nu\}$. Following the definitions given in the previous section, the joint information of two events $X_\mu \in \mathbb{X}$ and $Y_\nu \in \mathbb{Y}$ is

$I(X_\mu, Y_\nu) = \log_2 \dfrac{1}{\Pr\{X_\mu, Y_\nu\}} = -\log_2 \Pr\{X_\mu, Y_\nu\}.$   (2.6)

Consequently, the joint entropy of both processes is given by

$\bar{I}(\mathcal{X},\mathcal{Y}) = E_{\mathcal{X},\mathcal{Y}}\{I(X_\mu, Y_\nu)\} = -\sum_\mu \sum_\nu \Pr\{X_\mu, Y_\nu\} \cdot \log_2 \Pr\{X_\mu, Y_\nu\}.$   (2.7)

Figure 2.2 illustrates the relationships between the different kinds of entropies. Besides the terms $\bar{I}(\mathcal{X})$, $\bar{I}(\mathcal{Y})$, and $\bar{I}(\mathcal{X},\mathcal{Y})$ already defined, three additional important entropies exist.

Figure 2.2 Illustration of the entropies $\bar{I}(\mathcal{X})$, $\bar{I}(\mathcal{Y})$, $\bar{I}(\mathcal{X},\mathcal{Y})$, $\bar{I}(\mathcal{X}|\mathcal{Y})$, $\bar{I}(\mathcal{Y}|\mathcal{X})$, and $\bar{I}(\mathcal{X};\mathcal{Y})$ for two processes

At the receiver, y is totally known and the term $\bar{I}(\mathcal{X}|\mathcal{Y})$ represents the information of $\mathcal{X}$ that is not part of $\mathcal{Y}$. Therefore, the equivocation $\bar{I}(\mathcal{X}|\mathcal{Y})$ represents the information that was lost during transmission:

$\bar{I}(\mathcal{X}|\mathcal{Y}) = \bar{I}(\mathcal{X},\mathcal{Y}) - \bar{I}(\mathcal{Y}) = E_{\mathcal{X},\mathcal{Y}}\{-\log_2 \Pr\{X_\mu \mid Y_\nu\}\} = -\sum_\mu \sum_\nu \Pr\{X_\mu, Y_\nu\} \cdot \log_2 \Pr\{X_\mu \mid Y_\nu\}.$   (2.8)

From Figure 2.2, we recognize that $\bar{I}(\mathcal{X}|\mathcal{Y})$ equals the difference between the joint entropy $\bar{I}(\mathcal{X},\mathcal{Y})$ and the sink's entropy $\bar{I}(\mathcal{Y})$. Equivalently, we can write $\bar{I}(\mathcal{X},\mathcal{Y}) = \bar{I}(\mathcal{X}|\mathcal{Y}) + \bar{I}(\mathcal{Y})$, leading to the general chain rule for entropies.
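Before turning to the chain rule, the following sketch (not part of the original text) evaluates (2.7) and (2.8) numerically for a made-up joint probability table describing a binary input observed through a noisy binary channel.

```python
# Illustrative sketch: joint entropy (2.7) and equivocation (2.8) for two discrete processes.
import numpy as np

p_xy = np.array([[0.45, 0.05],      # made-up joint probabilities Pr{X_mu, Y_nu}
                 [0.10, 0.40]])

def joint_entropy(p_xy):
    p = p_xy[p_xy > 0]
    return float(-np.sum(p * np.log2(p)))          # (2.7)

p_x = p_xy.sum(axis=1)                             # marginal Pr{X_mu}
p_y = p_xy.sum(axis=0)                             # marginal Pr{Y_nu}
h_xy = joint_entropy(p_xy)
h_x  = float(-np.sum(p_x * np.log2(p_x)))
h_y  = float(-np.sum(p_y * np.log2(p_y)))

equivocation = h_xy - h_y                          # (2.8): I(X|Y), information lost on the channel
print(h_x, h_y, h_xy, equivocation)
```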
Chain Rule for Entropies

In Appendix B.1, it is shown that the entropy's chain rule (Cover and Thomas 1991)

$\bar{I}(\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n) = \sum_{i=1}^{n} \bar{I}(\mathcal{X}_i \mid \mathcal{X}_{i-1}, \ldots, \mathcal{X}_1)$   (2.9)

holds for a set of random variables $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_n$ belonging to a joint probability $\Pr\{X_1, X_2, \ldots, X_n\}$.

On the contrary, $\bar{I}(\mathcal{Y}|\mathcal{X})$ represents the information of $\mathcal{Y}$ that is not contained in $\mathcal{X}$. Therefore, it cannot stem from the source $\mathcal{X}$ and is termed irrelevance:

$\bar{I}(\mathcal{Y}|\mathcal{X}) = \bar{I}(\mathcal{X},\mathcal{Y}) - \bar{I}(\mathcal{X}) = E_{\mathcal{Y},\mathcal{X}}\{-\log_2 \Pr\{Y_\nu \mid X_\mu\}\} = -\sum_\mu \sum_\nu \Pr\{X_\mu, Y_\nu\} \cdot \log_2 \Pr\{Y_\nu \mid X_\mu\}.$   (2.10)

Naturally, the average information of a process $\mathcal{X}$ cannot be increased by some knowledge about $\mathcal{Y}$, so that

$\bar{I}(\mathcal{X}|\mathcal{Y}) \le \bar{I}(\mathcal{X})$   (2.11)

holds. Equality in (2.11) is obtained for statistically independent processes.

The most important entropy $\bar{I}(\mathcal{X};\mathcal{Y})$ is called mutual information and describes the average information common to $\mathcal{X}$ and $\mathcal{Y}$. According to Figure 2.2, it can be determined by

$\bar{I}(\mathcal{X};\mathcal{Y}) = \bar{I}(\mathcal{X}) - \bar{I}(\mathcal{X}|\mathcal{Y}) = \bar{I}(\mathcal{Y}) - \bar{I}(\mathcal{Y}|\mathcal{X}) = \bar{I}(\mathcal{X}) + \bar{I}(\mathcal{Y}) - \bar{I}(\mathcal{X},\mathcal{Y}).$   (2.12)

Mutual information is the term that has to be maximized in order to design a communication system with the highest possible spectral efficiency. The maximum mutual information that can be obtained is called channel capacity and will be derived for special cases in subsequent sections. Inserting (2.2) and (2.7) into (2.12) yields

$\bar{I}(\mathcal{X};\mathcal{Y}) = \sum_\mu \sum_\nu \Pr\{X_\mu, Y_\nu\} \cdot \log_2 \dfrac{\Pr\{X_\mu, Y_\nu\}}{\Pr\{X_\mu\} \cdot \Pr\{Y_\nu\}} = \sum_\mu \sum_\nu \Pr\{X_\mu\} \Pr\{Y_\nu \mid X_\mu\} \log_2 \dfrac{\Pr\{Y_\nu \mid X_\mu\}}{\sum_l \Pr\{Y_\nu \mid X_l\} \Pr\{X_l\}}.$   (2.13)

As can be seen, the mutual information depends on the conditional probabilities $\Pr\{Y_\nu \mid X_\mu\}$ determined by the channel and on the a priori probabilities $\Pr\{X_\mu\}$. Hence, the only parameter that can be optimized for a given channel in order to maximize the mutual information is the statistics of the input alphabet.

Chain Rule for Information

If the mutual information is conditioned on a signal or parameter z, (2.12) changes to $\bar{I}(\mathcal{X};\mathcal{Y} \mid \mathcal{Z}) = \bar{I}(\mathcal{X} \mid \mathcal{Z}) - \bar{I}(\mathcal{X} \mid \mathcal{Y},\mathcal{Z})$. This leads directly to the general chain rule for information (Cover and Thomas 1991) (cf. Appendix B.2)

$\bar{I}(\mathcal{X}_1, \ldots, \mathcal{X}_n; \mathcal{Z}) = \sum_{i=1}^{n} \bar{I}(\mathcal{X}_i; \mathcal{Z} \mid \mathcal{X}_{i-1}, \ldots, \mathcal{X}_1).$   (2.14)

For only two random variables $\mathcal{X}$ and $\mathcal{Y}$, (2.14) becomes

$\bar{I}(\mathcal{X},\mathcal{Y};\mathcal{Z}) = \bar{I}(\mathcal{X};\mathcal{Z}) + \bar{I}(\mathcal{Y};\mathcal{Z} \mid \mathcal{X}) = \bar{I}(\mathcal{Y};\mathcal{Z}) + \bar{I}(\mathcal{X};\mathcal{Z} \mid \mathcal{Y}).$   (2.15)

From (2.15), we learn that first detecting x from z and subsequently y – now for known x – leads to the same mutual information as starting with y and proceeding with the detection of x. As a consequence, the detection order of x and y has no influence from the information theoretic point of view. However, this presupposes an error-free detection of the first signal, which usually cannot be ensured in practical systems, resulting in error propagation.

Data Processing Theorem

With (2.14), the data processing theorem can now be derived. Imagine a Markov chain $\mathcal{X} \to \mathcal{Y} \to \mathcal{Z}$ of three random processes $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$, that is, $\mathcal{Y}$ depends on $\mathcal{X}$ and $\mathcal{Z}$ depends on $\mathcal{Y}$, but $\mathcal{X}$ and $\mathcal{Z}$ are mutually independent for known y. Hence, the entire information about $\mathcal{X}$ contained in $\mathcal{Z}$ is delivered by $\mathcal{Y}$, and $\bar{I}(\mathcal{X};\mathcal{Z} \mid \mathcal{Y}) = 0$ holds. With this assumption, the data processing theorem

$\bar{I}(\mathcal{X};\mathcal{Z}) \le \bar{I}(\mathcal{X};\mathcal{Y}) \quad \text{and} \quad \bar{I}(\mathcal{X};\mathcal{Z}) \le \bar{I}(\mathcal{Y};\mathcal{Z})$   (2.16)

is derived in Appendix B.3. If $\mathcal{Z}$ is a function of $\mathcal{Y}$, (2.16) states that the information about $\mathcal{X}$ obtained from $\mathcal{Y}$ cannot be increased by further processing of $\mathcal{Y}$ leading to $\mathcal{Z}$. Equality holds if $\mathcal{Z}$ is a sufficient statistic of $\mathcal{Y}$, which means that $\mathcal{Z}$ contains exactly the same information about $\mathcal{X}$ as $\mathcal{Y}$, that is, $\bar{I}(\mathcal{X};\mathcal{Y} \mid \mathcal{Z}) = 0$ holds.
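The following sketch (not part of the original text) computes the mutual information via (2.12) and checks the data processing theorem (2.16) numerically for a made-up Markov chain of two cascaded binary channels.

```python
# Illustrative sketch: mutual information (2.12) and data processing theorem (2.16).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_xy):
    """I(X;Y) = I(X) + I(Y) - I(X,Y), cf. (2.12)."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.flatten())

p_x   = np.array([0.5, 0.5])                    # a priori probabilities Pr{X}
p_y_x = np.array([[0.9, 0.1], [0.1, 0.9]])      # first stage Pr{Y|X} (made-up channel)
p_z_y = np.array([[0.8, 0.2], [0.2, 0.8]])      # second stage Pr{Z|Y} (made-up processing)

p_xy = p_x[:, None] * p_y_x                     # joint Pr{X,Y}
p_xz = p_xy @ p_z_y                             # joint Pr{X,Z} via the Markov chain X -> Y -> Z

print(mutual_information(p_xy))                 # I(X;Y)
print(mutual_information(p_xz))                 # I(X;Z) <= I(X;Y), data processing theorem
```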
2.1.3 Extension for Continuous Signals

If the random process $\mathcal{X}$ consists of continuously distributed variables, the probabilities $\Pr\{X_\mu\}$ defined earlier have to be replaced by the probability density $p_X(x)$. Consequently, all sums become integrals and the differential entropy is defined by

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = -\int_{-\infty}^{\infty} p_X(x) \cdot \log_2 p_X(x)\, dx = E\{-\log_2 p_X(x)\}.$   (2.17)

Contrary to the earlier definition, the differential entropy is not restricted to be nonnegative. Hence, the aforementioned interpretation is not valid anymore. Nevertheless, $\bar{I}_{\mathrm{diff}}(\mathcal{X})$ can still be used for the calculation of mutual information and channel capacity, which will be demonstrated in Section 2.2.

For a real random process $\mathcal{X}$ with a constant probability density $p_X(x) = 1/(2a)$ in the range $|x| \le a$, $a$ being a positive real constant, the differential entropy has the value

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \int_{-a}^{a} \dfrac{1}{2a} \cdot \log_2(2a)\, dx = \log_2(2a).$   (2.18)

For a real Gaussian distributed process with mean $\mu_X$ and variance $\sigma_X^2$, we obtain

$p_X(x) = \dfrac{1}{\sqrt{2\pi\sigma_X^2}} \cdot \exp\left(-\dfrac{(x - \mu_X)^2}{2\sigma_X^2}\right)$

and

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \dfrac{1}{2} \cdot \log_2(2\pi e \sigma_X^2).$   (2.19a)

If the random process is circularly symmetric complex, that is, real and imaginary parts are independent with powers $\sigma_{X'}^2 = \sigma_{X''}^2 = \sigma_X^2/2$, the Gaussian probability density function (PDF) has the form

$p_X(x) = p_{X'}(x') \cdot p_{X''}(x'') = \dfrac{1}{\pi\sigma_X^2} \cdot \exp\left(-\dfrac{|x - \mu_X|^2}{\sigma_X^2}\right).$

In this case, the entropy is

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \log_2(\pi e \sigma_X^2).$   (2.19b)

Comparing (2.19a) and (2.19b), we observe that the differential entropy of a complex Gaussian random variable equals the joint entropy of two independent real Gaussian variables with halved variance.
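The sketch below (not part of the original text) checks the differential entropies (2.18), (2.19a), and (2.19b) by estimating $E\{-\log_2 p_X(x)\}$ from random samples; the sample size and parameter values are arbitrary choices.

```python
# Illustrative sketch: Monte Carlo check of the differential entropies (2.18), (2.19a), (2.19b).
import numpy as np

rng = np.random.default_rng(0)
n, a, sigma2 = 200_000, 2.0, 1.5

# Uniform on [-a, a]: I_diff = log2(2a), cf. (2.18)
x_u = rng.uniform(-a, a, n)
print(np.log2(2 * a), np.mean(-np.log2(np.full(n, 1 / (2 * a)))))

# Real Gaussian with variance sigma2: I_diff = 0.5*log2(2*pi*e*sigma2), cf. (2.19a)
x_g = rng.normal(0.0, np.sqrt(sigma2), n)
p_g = np.exp(-x_g**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(0.5 * np.log2(2 * np.pi * np.e * sigma2), np.mean(-np.log2(p_g)))

# Circularly symmetric complex Gaussian with total power sigma2: I_diff = log2(pi*e*sigma2), cf. (2.19b)
x_c = rng.normal(0, np.sqrt(sigma2 / 2), n) + 1j * rng.normal(0, np.sqrt(sigma2 / 2), n)
p_c = np.exp(-np.abs(x_c)**2 / sigma2) / (np.pi * sigma2)
print(np.log2(np.pi * np.e * sigma2), np.mean(-np.log2(p_c)))
```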
2.1.4 Extension for Vectors and Matrices

When dealing with vector channels that have multiple inputs and outputs, we use the vector notation described in Section 1.2.4. Therefore, we stack n random variables $x_1, \ldots, x_n$ of the process $\mathcal{X}$ into the vector $\mathbf{x}$. With the definition of the joint entropy in (2.7), we obtain

$\bar{I}(\mathcal{X}) = -\sum_{\mathbf{x} \in \mathbb{X}^n} \Pr\{\mathbf{x}\} \cdot \log_2 \Pr\{\mathbf{x}\} = -\sum_{\nu_1=1}^{|\mathbb{X}|} \cdots \sum_{\nu_n=1}^{|\mathbb{X}|} \Pr\{X_{\nu_1}, \ldots, X_{\nu_n}\} \cdot \log_2 \Pr\{X_{\nu_1}, \ldots, X_{\nu_n}\}.$   (2.20)

Applying the chain rule for entropies in (2.9) recursively leads to an upper bound

$\bar{I}(\mathcal{X}) = \sum_{\mu=1}^{n} \bar{I}(\mathcal{X}_\mu \mid x_1, \ldots, x_{\mu-1}) \le \sum_{\mu=1}^{n} \bar{I}(\mathcal{X}_\mu)$   (2.21)

where equality holds exactly for statistically independent processes $\mathcal{X}_\mu$. Following the previous subsection, the differential entropy for real random vectors becomes

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = -\int_{\mathbb{R}^n} p_X(\mathbf{x}) \cdot \log_2 p_X(\mathbf{x})\, d\mathbf{x} = E\{-\log_2 p_X(\mathbf{x})\}.$   (2.22)

Under the restriction $\|\mathbf{x}\| \le a$, $a$ being a positive real constant, the entropy is maximized for a uniform distribution. Analogous to Section 2.1.1, we obtain

$p_X(\mathbf{x}) = \begin{cases} 1/V_n(a) & \text{for } \|\mathbf{x}\| \le a \\ 0 & \text{else} \end{cases} \qquad \text{with} \qquad V_n(a) = \dfrac{2\pi^{n/2} a^n}{n\,\Gamma(n/2)},$   (2.23)

that is, the PDF is constant within a ball of radius a in n-dimensional space. The gamma function in (2.23) is defined by $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$ (Gradshteyn 2000). It becomes $\Gamma(n) = (n-1)!$ and $\Gamma(n + \tfrac{1}{2}) = (2n)!\,\sqrt{\pi}/(n!\cdot 2^{2n})$ for $n = 1, 2, 3, \ldots$ The expectation in (2.22) now delivers

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \log_2\left(\dfrac{2\pi^{n/2} a^n}{n\,\Gamma(n/2)}\right).$   (2.24)

On the contrary, for a given covariance matrix $\boldsymbol{\Phi}_{XX} = E\{\mathbf{x}\mathbf{x}^T\}$ of a real-valued process $\mathcal{X}$, the maximum entropy is achieved by a multivariate Gaussian density

$p_X(\mathbf{x}) = \dfrac{1}{\sqrt{\det(2\pi \boldsymbol{\Phi}_{XX})}} \cdot \exp\left(-\dfrac{\mathbf{x}^T \boldsymbol{\Phi}_{XX}^{-1} \mathbf{x}}{2}\right)$   (2.25)

and amounts to

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \dfrac{1}{2} \cdot \log_2 \det(2\pi e \boldsymbol{\Phi}_{XX}).$   (2.26)

For complex elements of $\mathbf{x}$ with the same variance $\sigma_X^2$, the Gaussian density becomes

$p_X(\mathbf{x}) = \dfrac{1}{\det(\pi \boldsymbol{\Phi}_{XX})} \cdot \exp\left(-\mathbf{x}^H \boldsymbol{\Phi}_{XX}^{-1} \mathbf{x}\right)$   (2.27)

with $\boldsymbol{\Phi}_{XX} = E\{\mathbf{x}\mathbf{x}^H\}$, and the corresponding entropy has the form

$\bar{I}_{\mathrm{diff}}(\mathcal{X}) = \log_2 \det(\pi e \boldsymbol{\Phi}_{XX}),$   (2.28)

if the real and imaginary parts are statistically independent.

Figure 2.3 Simple model of a communication system: FEC encoder with code rate $R_c = k/n$, channel, and FEC decoder

2.2 Channel Coding Theorem for SISO Channels

2.2.1 Channel Capacity

This section describes the channel capacity and the channel coding theorem defined by Shannon. Figure 2.3 depicts the simple system model. A Forward Error Correction (FEC) encoder, which is explained in more detail in Chapter 3, maps k data symbols represented by the vector d onto a vector x of length n > k. The ratio $R_c = k/n$ is termed code rate and determines the portion of information in the whole message x. The vector x is transmitted over the channel, resulting in the output vector y of the same length n. Finally, the FEC decoder tries to recover d on the basis of the observation y and the knowledge of the code's structure.

As already mentioned in Section 2.1.2, the mutual information $\bar{I}(\mathcal{X};\mathcal{Y})$ is the crucial parameter that has to be maximized. According to (2.12), it only depends on the conditional probabilities $\Pr\{Y_\nu \mid X_\mu\}$ and the a priori probabilities $\Pr\{X_\mu\}$. Since the $\Pr\{Y_\nu \mid X_\mu\}$ are given by the channel characteristics and can hardly be influenced, the mutual information can only be maximized by properly adjusting $\Pr\{X_\mu\}$. Therefore, the channel capacity C describes the maximum mutual information

$C = \sup_{\Pr\{X\}} \sum_\mu \sum_\nu \Pr\{Y_\nu \mid X_\mu\} \cdot \Pr\{X_\mu\} \cdot \log_2 \dfrac{\Pr\{Y_\nu \mid X_\mu\}}{\sum_l \Pr\{Y_\nu \mid X_l\} \cdot \Pr\{X_l\}}$   (2.29)

obtained by optimally choosing the source statistics $\Pr\{X\}$. (If the maximum capacity is actually reached by a certain distribution, the supremum can be replaced by the maximum operator.) It can be shown that the mutual information is a concave function with respect to $\Pr\{X\}$. Hence, only one maximum exists, which can be determined by the sufficient conditions

$\dfrac{\partial C}{\partial \Pr\{X_\mu\}} = 0 \quad \forall\, X_\mu \in \mathbb{X}.$   (2.30)

Owing to the use of the logarithm to base 2, C is measured in bits/channel use or bits/s/Hz.

In many practical systems, the statistics of the input alphabet is fixed or the effort for optimizing it is prohibitively high. Therefore, uniformly distributed input symbols are assumed and the expression

$\bar{I}(\mathcal{X};\mathcal{Y}) = \log_2 |\mathbb{X}| + \dfrac{1}{|\mathbb{X}|} \sum_\mu \sum_\nu \Pr\{Y_\nu \mid X_\mu\} \cdot \log_2 \dfrac{\Pr\{Y_\nu \mid X_\mu\}}{\sum_l \Pr\{Y_\nu \mid X_l\}}$   (2.31)

is called channel capacity although the maximization with respect to $\Pr\{X\}$ is missing. The first term in (2.31) represents $\bar{I}(\mathcal{X})$ and the second the negative equivocation $-\bar{I}(\mathcal{X}|\mathcal{Y})$.

Channel Coding Theorem

The famous channel coding theorem of Shannon states that at least one code of rate $R_c \le C$ exists for which an error-free transmission can be ensured. The theorem assumes perfect Maximum A Posteriori (MAP) or maximum likelihood decoding (cf. Section 1.3) and the code's length may be arbitrarily long. However, the theorem does not show a way to find this code. For $R_c > C$, it can be shown that an error-free transmission is impossible even with tremendous effort (Cover and Thomas 1991).

For continuously distributed signals, the probabilities in (2.29) have to be replaced by the corresponding densities and the sums by integrals. In the case of a discrete signal alphabet and a continuous channel output, we obtain the expression

$C = \sup_{\Pr\{X\}} \sum_\mu \int_{\mathbb{Y}} p_{Y|X_\mu}(y) \cdot \Pr\{X_\mu\} \cdot \log_2 \dfrac{p_{Y|X_\mu}(y)}{\sum_l p_{Y|X_l}(y) \cdot \Pr\{X_l\}}\, dy.$   (2.32)

Examples of capacities for different channels and input alphabets are presented later in this chapter.
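As a numerical companion to (2.31), the sketch below (not part of the original text) evaluates the uniform-input mutual information of a discrete memoryless channel; the binary symmetric channel used as an example is symmetric, so the uniform input here also achieves the capacity (2.29).

```python
# Illustrative sketch: mutual information (2.31) for a uniform input over a discrete channel.
import numpy as np

def uniform_input_rate(p_y_x):
    """I(X;Y) in bit/channel use for Pr{X_mu} = 1/|X|, evaluating (2.31) directly."""
    m = p_y_x.shape[0]                              # alphabet size |X| (rows = inputs)
    col_sum = p_y_x.sum(axis=0)                     # sum_l Pr{Y_nu | X_l}
    terms = np.where(p_y_x > 0, p_y_x * np.log2(p_y_x / col_sum), 0.0)
    return float(np.log2(m) + terms.sum() / m)

# Binary symmetric channel with crossover probability 0.1:
bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
print(uniform_input_rate(bsc))                      # ~0.531 bit/channel use = 1 - I_2(0.1)
```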
2.2.2 Cutoff Rate

Up to this point, no expression relating the attainable error rate to a certain code rate $R_c$ and codeword length n has been derived. This drawback can be overcome with the cutoff rate and the corresponding Bhattacharyya bound. Valid codewords are denoted by x, and the code representing the set of all codewords is denoted by $\Gamma$. Furthermore, assuming that $\mathbf{x} \in \Gamma$ of length n was transmitted, its decision region $D(\mathbf{x})$ is defined such that the decoder decides correctly for all received vectors $\mathbf{y} \in D(\mathbf{x})$. For a discrete output alphabet of the channel, the word error probability $P_w(\mathbf{x})$ of x can be expressed by

$P_w(\mathbf{x}) = \Pr\{\mathcal{Y} \notin D(\mathbf{x}) \mid \mathbf{x}\} = \sum_{\mathbf{y} \notin D(\mathbf{x})} \Pr\{\mathbf{y} \mid \mathbf{x}\}.$   (2.33)

Since the decision regions $D(\mathbf{x})$ for different x are disjoint, we can alternatively sum the probabilities $\Pr\{\mathcal{Y} \in D(\mathbf{x}') \mid \mathbf{x}\}$ of all competing codewords $\mathbf{x}' \ne \mathbf{x}$, and (2.33) can be rewritten as

$P_w(\mathbf{x}) = \sum_{\mathbf{x}' \in \Gamma \setminus \{\mathbf{x}\}} \Pr\{\mathcal{Y} \in D(\mathbf{x}') \mid \mathbf{x}\} = \sum_{\mathbf{x}' \in \Gamma \setminus \{\mathbf{x}\}} \sum_{\mathbf{y} \in D(\mathbf{x}')} \Pr\{\mathbf{y} \mid \mathbf{x}\}.$   (2.34)

The right-hand side of (2.34) replaces $\mathbf{y} \notin D(\mathbf{x})$ by the sum over all competing decision regions $D(\mathbf{x}' \ne \mathbf{x})$. Since $\Pr\{\mathbf{y} \mid \mathbf{x}'\}$ is larger than $\Pr\{\mathbf{y} \mid \mathbf{x}\}$ for all $\mathbf{y} \in D(\mathbf{x}')$,

$\Pr\{\mathbf{y} \mid \mathbf{x}'\} \ge \Pr\{\mathbf{y} \mid \mathbf{x}\} \;\Rightarrow\; \sqrt{\dfrac{\Pr\{\mathbf{y} \mid \mathbf{x}'\}}{\Pr\{\mathbf{y} \mid \mathbf{x}\}}} \ge 1$   (2.35)

holds. The multiplication of (2.34) with (2.35) and the extension of the inner sum in (2.34) to all possible received words $\mathbf{y} \in \mathbb{Y}^n$ leads to an upper bound

$P_w(\mathbf{x}) \le \sum_{\mathbf{x}' \in \Gamma \setminus \{\mathbf{x}\}} \sum_{\mathbf{y} \in D(\mathbf{x}')} \Pr\{\mathbf{y} \mid \mathbf{x}\} \cdot \sqrt{\dfrac{\Pr\{\mathbf{y} \mid \mathbf{x}'\}}{\Pr\{\mathbf{y} \mid \mathbf{x}\}}} = \sum_{\mathbf{x}' \in \Gamma \setminus \{\mathbf{x}\}} \sum_{\mathbf{y} \in \mathbb{Y}^n} \sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\} \cdot \Pr\{\mathbf{y} \mid \mathbf{x}'\}}.$   (2.36)

The computational cost of evaluating (2.36) is very high for practical systems because the number of codewords and especially the number of possible received words are very large. Moreover, we do not know a good code yet and we are not interested in the error probabilities of single codewords x. A solution is to calculate the average error probability over all possible codes $\Gamma$, that is, we determine the expectation $E_X\{P_w(\mathbf{x})\}$ with respect to $\Pr\{X\}$. Since all possible codes are considered with equal probability, all words $\mathbf{x} \in \mathbb{X}^n$ are possible. In order to reach this goal, it is assumed that x and x' are identically distributed and independent so that $\Pr\{\mathbf{x}, \mathbf{x}'\} = \Pr\{\mathbf{x}\} \cdot \Pr\{\mathbf{x}'\}$ holds. (This assumption also includes codes that map different information words onto the same codeword, leading to x = x'. Since the probability of these codes is very low, their contribution to the ergodic error rate is rather small.) The expectation of the square root in (2.36) becomes

$E\left\{\sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\} \Pr\{\mathbf{y} \mid \mathbf{x}'\}}\right\} = \sum_{\mathbf{x} \in \mathbb{X}^n} \sum_{\mathbf{x}' \in \mathbb{X}^n} \sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\}\, \Pr\{\mathbf{y} \mid \mathbf{x}'\}}\; \Pr\{\mathbf{x}\} \Pr\{\mathbf{x}'\} = \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\}} \cdot \Pr\{\mathbf{x}\}\right)^2.$   (2.37)

Since (2.37) no longer depends on x', the outer sum in (2.36) becomes a constant factor $2^k - 1$ that can be approximated by $2^{nR_c}$ with $R_c = k/n$. We obtain (Cover and Thomas 1991)

$\bar{P}_w = E_X\{P_w(\mathbf{x})\} \le 2^{nR_c} \cdot \sum_{\mathbf{y} \in \mathbb{Y}^n} \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\}} \cdot \Pr\{\mathbf{x}\}\right)^2$   (2.38a)

$\phantom{\bar{P}_w} = 2^{\,nR_c + \log_2 \sum_{\mathbf{y} \in \mathbb{Y}^n} \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \sqrt{\Pr\{\mathbf{y}\mid\mathbf{x}\}}\, \Pr\{\mathbf{x}\}\right)^2}$   (2.38b)

which is still a function of the input statistics $\Pr\{X\}$. In order to minimize the average error probability, the second part of the exponent in (2.38b) has to be minimized. Defining the cutoff rate as

$R_0 = \max_{\Pr\{X\}} \left\{ -\dfrac{1}{n} \cdot \log_2 \sum_{\mathbf{y} \in \mathbb{Y}^n} \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \sqrt{\Pr\{\mathbf{y} \mid \mathbf{x}\}} \cdot \Pr\{\mathbf{x}\}\right)^2 \right\},$   (2.39)

which depends only on the conditional probabilities $\Pr\{\mathbf{y} \mid \mathbf{x}\}$ of the channel, we obtain an upper bound for the minimum average error rate

$\min_{\Pr\{X\}} E\{P_w\} \le 2^{-n(R_0 - R_c)} = 2^{-n \cdot E_B(R_c)}.$   (2.40)

In (2.40), $E_B(R_c) = R_0 - R_c$ denotes the Bhattacharyya error exponent. This result demonstrates that arbitrarily low error probabilities can be achieved for $R_c < R_0$. If the code rate $R_c$ approaches $R_0$, the length n of the code has to be increased towards infinity for an error-free transmission. Furthermore, (2.40) now allows an approximation of error probabilities for finite codeword lengths.

For memoryless channels, the vector probabilities can be factorized into symbol probabilities, simplifying the calculation of (2.39) tremendously. Applying the distributive law, we finally obtain

$R_0 = \max_{\Pr\{X\}} \left\{ -\log_2 \sum_{y \in \mathbb{Y}} \left(\sum_{x \in \mathbb{X}} \sqrt{\Pr\{y \mid x\}} \cdot \Pr\{x\}\right)^2 \right\}.$   (2.41)

Owing to the applied approximations, $R_0$ is always smaller than the channel capacity C. For code rates with $R_0 < R_c < C$, the bound in (2.40) cannot be applied. Moreover, owing to the introduction of the factor in (2.35), the bound becomes very loose for a large number of codewords.
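The sketch below (not part of the original text) evaluates the memoryless cutoff rate (2.41) for a binary symmetric channel; the uniform input distribution is assumed to be optimal here because the channel is symmetric.

```python
# Illustrative sketch: cutoff rate (2.41) of a memoryless channel with uniform input.
import numpy as np

def cutoff_rate_uniform(p_y_x):
    """R0 in bit/channel use from (2.41) with Pr{x} = 1/|X|."""
    m = p_y_x.shape[0]
    inner = (np.sqrt(p_y_x) / m).sum(axis=0)        # sum_x sqrt(Pr{y|x}) * Pr{x}, per output y
    return float(-np.log2(np.sum(inner ** 2)))

bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
print(cutoff_rate_uniform(bsc))     # R0 ~ 0.322, smaller than the capacity ~0.531 from (2.31)
```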
Continuously Distributed Output

In Benedetto and Biglieri (1999, page 633), an approximation of $R_0$ is derived for the AWGN channel with a discrete input alphabet $\mathbb{X}$ and a continuously distributed output. The derivation starts with the calculation of the average error probability and finally arrives at the result in (2.40). Using our notation, we obtain

$R_0 = \log_2(|\mathbb{X}|) - \log_2\big[1 + r(\mathbb{X}, N_0)\big]$   (2.42)

with

$r(\mathbb{X}, N_0) = \min_{\Pr\{X\}} |\mathbb{X}| \cdot \sum_{\mu=1}^{|\mathbb{X}|} \sum_{\substack{\nu=1 \\ \nu\ne\mu}}^{|\mathbb{X}|} \Pr\{X_\mu\} \Pr\{X_\nu\} \exp\left(-\dfrac{|X_\mu - X_\nu|^2}{4N_0}\right).$   (2.43)

However, performing the maximization is a difficult task and hence a uniform distribution of $\mathcal{X}$ is often assumed. In this case, the factor in front of the double sum and the a priori probabilities cancel each other.

2.2.3 Gallager Exponent

As already mentioned, the error exponent of Bhattacharyya becomes very loose for large codeword sets. In order to tighten the bound in (2.38a), Gallager introduced an optimization parameter $\rho \in [0, 1]$, leading to the expression (Cover and Thomas 1991)

$\bar{P}_w(\rho) = E_X\{P_w(\rho, \mathbf{x})\} \le 2^{\rho n R_c} \cdot \sum_{\mathbf{y} \in \mathbb{Y}^n} \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \Pr\{\mathbf{y} \mid \mathbf{x}\}^{\frac{1}{1+\rho}} \cdot \Pr\{\mathbf{x}\}\right)^{1+\rho} = 2^{\,\rho n R_c + \log_2 \sum_{\mathbf{y}} \left(\sum_{\mathbf{x}} \Pr\{\mathbf{y}\mid\mathbf{x}\}^{\frac{1}{1+\rho}}\, \Pr\{\mathbf{x}\}\right)^{1+\rho}}.$   (2.44)

Similar to the definition of the cutoff rate in (2.39), we can now define the Gallager function

$E_0(\rho, \Pr\{X\}) = -\dfrac{1}{n} \cdot \log_2 \sum_{\mathbf{y} \in \mathbb{Y}^n} \left(\sum_{\mathbf{x} \in \mathbb{X}^n} \Pr\{\mathbf{y} \mid \mathbf{x}\}^{\frac{1}{1+\rho}} \cdot \Pr\{\mathbf{x}\}\right)^{1+\rho}.$   (2.45)

Comparing (2.37) with (2.45), it becomes obvious that the bounds of Gallager and Bhattacharyya are identical for $\rho = 1$, and $R_0 = \max_{\Pr\{X\}} E_0(1, \Pr\{X\})$ holds. The average word error probability in (2.40) becomes

$\bar{P}_w(\rho) \le 2^{-n(E_0(\rho, \Pr\{X\}) - \rho \cdot R_c)}.$   (2.46)

For memoryless channels, the Gallager function can be simplified to

$E_0(\rho, \Pr\{X\}) = -\log_2 \sum_{y \in \mathbb{Y}} \left(\sum_{x \in \mathbb{X}} \Pr\{y \mid x\}^{\frac{1}{1+\rho}} \cdot \Pr\{x\}\right)^{1+\rho}.$   (2.47)

With the Gallager exponent

$E_G(R_c) = \max_{\Pr\{X\}} \max_{\rho \in [0,1]} \big(E_0(\rho, \Pr\{X\}) - \rho \cdot R_c\big),$   (2.48)

we finally obtain the minimum error probability

$\bar{P}_w = \min_{\Pr\{X\}, \rho} \bar{P}_w(\rho) \le 2^{-n \cdot E_G(R_c)}.$   (2.49)

A curve sketching of $E_G(R_c)$ is now discussed. By partial derivation of $E_0(\rho, \Pr\{X\})$ with respect to $\rho$, it can be shown that the Gallager function increases monotonically with $\rho \in [0, 1]$ from 0 to its maximum $R_0$. Furthermore, for fixed $\rho$ in (2.48), $E_G(R_c)$ describes a straight line with slope $-\rho$ and offset $E_0(\rho, \Pr\{X\})$. As a consequence, we have a set of straight lines – one for each $\rho$ – whose initial values at $R_c = 0$ grow with increasing $\rho$. Each of these lines is determined by searching for the optimum statistics $\Pr\{X\}$. The Gallager exponent is finally obtained by finding the maximum among all lines for each code rate $R_c$. This procedure is illustrated in Figure 2.4.

Figure 2.4 Curve sketching of the Gallager exponent $E_G(R_c)$
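The following sketch (not part of the original text) evaluates the memoryless Gallager function (2.47) and the exponent (2.48) for a binary symmetric channel, maximizing over a grid of rho values with a uniform input assumed.

```python
# Illustrative sketch: Gallager function (2.47) and error exponent (2.48) for a BSC.
import numpy as np

def gallager_e0(rho, p_y_x, p_x):
    """E0(rho, Pr{X}) for a memoryless channel, cf. (2.47)."""
    inner = (p_x[:, None] * p_y_x ** (1.0 / (1.0 + rho))).sum(axis=0)
    return -np.log2(np.sum(inner ** (1.0 + rho)))

bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
p_x = np.array([0.5, 0.5])                        # uniform input (optimal by symmetry)
rhos = np.linspace(1e-3, 1.0, 1000)

for rate in (0.1, 0.3, 0.5):
    exponents = [gallager_e0(r, bsc, p_x) - r * rate for r in rhos]
    print(rate, max(exponents))                   # E_G(R_c); positive as long as R_c < C ~ 0.531
```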
The critical rate

$R_{\mathrm{crit}} = \left.\dfrac{\partial}{\partial \rho} E_0(\rho, \Pr\{X\})\right|_{\rho=1}$   (2.50)

represents the maximum code rate for which $\rho = 1$ is the optimal choice. It is important to mention that $\Pr\{X\}$ in (2.50) already represents the optimal choice for a maximal rate. In the range $0 \le R_c \le R_{\mathrm{crit}}$, the parametrization by Gallager does not affect the result and $E_G(R_c)$ equals the Bhattacharyya exponent $E_B(R_c)$ given in (2.40). Hence, the cutoff rate can be used for approximating the error probability. For $R_c > R_{\mathrm{crit}}$, the Bhattacharyya bound cannot be applied anymore and the tighter Gallager bound with $\rho < 1$ has to be used.

According to (2.49), we can achieve arbitrarily low error probabilities by appropriately choosing n as long as $E_G(R_c) > 0$ holds. The maximum rate for which an error-free transmission can be ensured is reached at the point where $E_G(R_c)$ approaches zero. It can be shown that this point is obtained for $\rho \to 0$, resulting in

$R_{\max} = \lim_{\rho \to 0} \dfrac{E_0(\rho, \Pr\{X\})}{\rho} = \max_{\Pr\{X\}} \bar{I}(\mathcal{X};\mathcal{Y}) = C.$   (2.51)

Therefore, the maximum rate for which an error-free transmission can be ensured is exactly the channel capacity C, as already stated in the channel coding theorem. Transmitting at $R_c = C$ requires an infinite codeword length $n \to \infty$. For the sake of completeness, it has to be mentioned that an expurgated exponent $E_x(\rho, \Pr\{X\})$ with $\rho \ge 1$ exists, leading to tighter results than the Gallager exponent for rates below $R_{\mathrm{ex}} = \left.\frac{\partial}{\partial\rho} E_x(\rho, \Pr\{X\})\right|_{\rho=1}$ (Cover and Thomas 1991).

2.2.4 Capacity of the AWGN Channel

AWGN Channel with Gaussian Distributed Input

In this and the next section, the results for some practical channels are discussed, starting with the equivalent baseband representation of the AWGN channel depicted in Figure 1.11. If the generally complex input and output signals are continuously distributed, differential entropies have to be used. Since the information in $\mathcal{Y}$ for known $\mathcal{X}$ can only stem from the noise $\mathcal{N}$, the mutual information illustrated in Figure 2.3 has the form

$\bar{I}(\mathcal{X};\mathcal{Y}) = \bar{I}_{\mathrm{diff}}(\mathcal{Y}) - \bar{I}_{\mathrm{diff}}(\mathcal{Y}|\mathcal{X}) = \bar{I}_{\mathrm{diff}}(\mathcal{Y}) - \bar{I}_{\mathrm{diff}}(\mathcal{N}).$   (2.52)

The maximization of (2.52) with respect to $p_X(x)$ only affects the term $\bar{I}_{\mathrm{diff}}(\mathcal{Y})$ because the background noise cannot be influenced. For statistically independent processes $\mathcal{X}$ and $\mathcal{N}$, the corresponding powers simply add, $\sigma_Y^2 = \sigma_X^2 + \sigma_N^2$, and, hence, fixing the transmit power directly fixes $\sigma_Y^2$. According to Section 2.1.3, the maximum differential entropy for a fixed power is obtained for a Gaussian distributed process $\mathcal{Y}$. However, this can only be achieved for a Gaussian distribution of $\mathcal{X}$. Hence, we have to substitute (2.19b) into (2.52). Inserting the results of Section 1.2.2 ($\sigma_X^2 = 2BE_s$ and $\sigma_N^2 = 2BN_0$), we obtain the channel capacity

$C^{2\text{-dim}} = \log_2(\pi e \sigma_Y^2) - \log_2(\pi e \sigma_N^2) = \log_2 \dfrac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} = \log_2\left(1 + \dfrac{E_s}{N_0}\right).$   (2.53)

Obviously, the capacity grows logarithmically with the transmit power or, equivalently, with $E_s/N_0$. If only the real part of $\mathcal{X}$ is used for data transmission – such as for real-valued binary phase shift keying (BPSK) or amplitude shift keying (ASK) – the number of bits transmitted per channel use is halved. However, we have to take into account that only the real part of the noise disturbs the transmission, so that the effective noise power is also halved ($\sigma_{N'}^2 = \frac{1}{2}\sigma_N^2 = BN_0$). If the transmit power remains unchanged ($\sigma_{X'}^2 = \sigma_X^2 = 2BE_s$), (2.53) becomes

$C^{1\text{-dim}} = \dfrac{1}{2} \cdot \log_2(2\pi e \sigma_{Y'}^2) - \dfrac{1}{2} \cdot \log_2(2\pi e \sigma_{N'}^2) = \dfrac{1}{2} \cdot \log_2\left(1 + 2\,\dfrac{E_s}{N_0}\right).$   (2.54)
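A minimal Python sketch (not part of the original text) of (2.53) and (2.54) follows; the listed SNR values are arbitrary examples.

```python
# Illustrative sketch: AWGN capacity with Gaussian input, complex (2.53) and real (2.54).
import numpy as np

def c_awgn_complex(es_n0_db):
    es_n0 = 10 ** (np.asarray(es_n0_db) / 10)
    return np.log2(1 + es_n0)                    # (2.53), bit/channel use

def c_awgn_real(es_n0_db):
    es_n0 = 10 ** (np.asarray(es_n0_db) / 10)
    return 0.5 * np.log2(1 + 2 * es_n0)          # (2.54)

for snr_db in (0, 10, 20, 30):
    print(snr_db, c_awgn_complex(snr_db), c_awgn_real(snr_db))
# asymptotically the complex capacity gains 1 bit/s/Hz per 3 dB, the real one only 0.5 bit
```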
In many cases, the evaluation of systems in terms of a required $E_s/N_0$ does not lead to a fair comparison. This is especially the case when the number of channel symbols transmitted per information bit varies. Therefore, a better comparison is obtained by evaluating systems with respect to the required energy $E_b$ per information bit. For binary modulation schemes with m = 1, it is related to $E_s$ by

$k \cdot E_b = n \cdot E_s \;\Longrightarrow\; E_s = \dfrac{k}{n} \cdot E_b = R_c \cdot E_b$   (2.55)

because the FEC encoder should not change the energy. Substituting $E_s$ in (2.53) and (2.54) delivers

$C^{2\text{-dim}} = \log_2\left(1 + R_c \dfrac{E_b}{N_0}\right), \qquad C^{1\text{-dim}} = \dfrac{1}{2} \cdot \log_2\left(1 + 2R_c \dfrac{E_b}{N_0}\right).$

Since the highest spectral efficiency maintaining an error-free transmission is obtained for $R_c = C$, these equations determine C only implicitly. We can resolve them with respect to $E_b/N_0$ and obtain the common result

$\dfrac{E_b}{N_0} = \dfrac{2^{2C} - 1}{2C}$   (2.56)

for real- as well as complex-valued signal alphabets. For $C \to 0$, the required $E_b/N_0$ does not tend to zero but to a finite value

$\lim_{C \to 0} \dfrac{E_b}{N_0} = \lim_{C \to 0} \dfrac{2 \ln 2 \cdot 2^{2C}}{2} = \ln 2 \;\hat{=}\; -1.59\ \mathrm{dB}.$   (2.57)

Hence, no error-free transmission is possible below this ultimate bound. Figure 2.5 illustrates the channel capacity of the AWGN channel with Gaussian input. Obviously, real- and complex-valued transmissions have the same capacity for $E_s/N_0 \to 0$ or, equivalently, $E_b/N_0 \to \ln 2$. For larger signal-to-noise ratios (SNRs), the complex system has a higher capacity because it can transmit twice as many bits per channel use compared to the real-valued system. This advantage affects the capacity linearly, whereas the drawback of a halved SNR compared to the real-valued system has only a logarithmic influence. Asymptotically, doubling the SNR (a 3 dB step) increases the capacity by 1 bit/s/Hz in the complex case.

Figure 2.5 Channel capacities of the AWGN channel with Gaussian distributed input: a) capacity versus $E_s/N_0$, b) capacity versus $E_b/N_0$

AWGN Channel with Discrete Input

Unfortunately, no closed-form expressions exist for discrete input alphabets and (2.32) has to be evaluated numerically. For the reasons discussed below (2.31), we assume a uniform distribution of $\mathcal{X}$:

$C = \log_2(|\mathbb{X}|) + \dfrac{1}{|\mathbb{X}|} \sum_\mu \int_{\mathbb{Y}} p_{Y|X_\mu}(y) \cdot \log_2 \dfrac{p_{Y|X_\mu}(y)}{\sum_l p_{Y|X_l}(y)}\, dy.$   (2.58)

An approximation of the cutoff rate was already presented in (2.42). Figure 2.6 shows the capacities of the AWGN channel for different PSK schemes. Obviously, for very low SNRs, $E_s/N_0 \to 0$, no difference between discrete input alphabets and a continuously Gaussian distributed input can be observed. However, for higher SNR, the Gaussian input represents an upper bound that cannot be reached by discrete modulation schemes. Their maximum capacity is limited to the number of bits transmitted per symbol ($\log_2 |\mathbb{X}|$). Since BPSK consists of real symbols $\pm\sqrt{E_s/T_s}$, its capacity is upper bounded by that of a continuously Gaussian distributed real input, and the highest spectral efficiency that can be obtained is 1 bit/s/Hz. The other schemes have to be compared to a complex Gaussian input.

Figure 2.6 Capacity of the AWGN channel for different PSK constellations (BPSK, QPSK, 8-PSK, 16-PSK, 32-PSK, and real/complex Gaussian references): a) capacity versus $E_s/N_0$, b) capacity versus $E_b/N_0$
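The sketch below (not part of the original text) estimates (2.58) by Monte Carlo for M-PSK over a complex AWGN channel; the unit-energy constellation, the noise variance $\sigma_N^2 = N_0$ (with $E_s = 1$), and the sample size are assumptions of this example, not taken from the book.

```python
# Illustrative sketch: Monte Carlo estimate of the uniform-input capacity (2.58) for M-PSK.
import numpy as np

def psk_capacity(m, es_n0_db, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    es_n0 = 10 ** (es_n0_db / 10)
    n0 = 1.0 / es_n0                                           # Es = 1 by normalization
    symbols = np.exp(2j * np.pi * np.arange(m) / m)            # unit-energy M-PSK alphabet
    x = rng.choice(symbols, n_samples)
    noise = rng.normal(0, np.sqrt(n0 / 2), (n_samples, 2)) @ np.array([1, 1j])
    y = x + noise
    # conditional densities p(y | X_mu) up to the common factor 1/(pi*N0), which cancels
    metric = np.exp(-np.abs(y[:, None] - symbols[None, :]) ** 2 / n0)
    num = np.exp(-np.abs(y - x) ** 2 / n0)
    return np.log2(m) + np.mean(np.log2(num / metric.sum(axis=1)))

for snr_db in (0, 5, 10):
    print(snr_db, psk_capacity(2, snr_db), psk_capacity(4, snr_db))
```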
For very high SNRs, the uniform distribution is optimum again since the maximum capacity reaches exactly the number of bits per symbol. Regarding ASK and quadrature amplitude modulation (QAM) schemes, approximating a Gaussian distribution of the alphabet by signal shaping can improve the mutual information, although it need not be the optimum choice. The maximum gain is determined by the power ratio of uniform and Gaussian distributions if both have the same differential entropy. With (2.18) and (2.19a) for real-valued transmissions, we obtain

$\log_2(2a) = \dfrac{1}{2} \log_2(2\pi e \sigma_X^2) \;\Rightarrow\; \sigma_X^2 = \dfrac{2a^2}{\pi e}.$   (2.59)

Since the average power of the uniformly distributed signal is

$\int_{-\infty}^{\infty} p_X(x)\, x^2\, dx = \int_{-a}^{a} \dfrac{x^2}{2a}\, dx = \left.\dfrac{x^3}{6a}\right|_{-a}^{a} = \dfrac{a^2}{3},$

the power ratio between the uniform and the Gaussian distribution for identical entropies becomes (with (2.59))

$\dfrac{a^2/3}{\sigma_X^2} = \dfrac{a^2/3}{2a^2/(\pi e)} = \dfrac{\pi e}{6} \;\hat{=}\; 1.53\ \mathrm{dB}.$   (2.60)

Theoretically, we can therefore save 1.53 dB of transmit power when changing from a uniform to a continuous Gaussian distribution without loss of entropy. The distribution of the discrete signal alphabet has the form (Fischer et al. 1998)

$\Pr\{X_\mu\} = K(\lambda) \cdot e^{-\lambda |X_\mu|^2}$   (2.61)

where $K(\lambda)$ must be chosen appropriately to fulfill the condition $\sum_\mu \Pr\{X_\mu\} = 1$. The parameter $\lambda \ge 0$ has to be optimized for each SNR. For $\lambda = 0$, the uniform distribution with $K(0) = |\mathbb{X}|^{-1}$ is obtained. Figure 2.7 depicts the corresponding results. We observe that signal shaping can close the gap between the capacities for a continuous Gaussian input and a discrete uniform input over a wide range of $E_s/N_0$. However, the absolute gains are rather low for these small alphabet sizes and amount to about 1 dB for 64-QAM. As mentioned before, for high SNRs, $\lambda$ tends to zero, resulting in a uniform distribution achieving the highest possible mutual information.

Figure 2.7 Capacity of the AWGN channel for different QAM constellations (16-QAM, 64-QAM, and Gaussian reference; solid lines: uniform distribution, dashed lines: Gaussian distribution): a) mutual information versus $E_s/N_0$, b) mutual information versus $E_b/N_0$

The last aspect in this subsection addresses the influence of quantization on the capacity. Quantizing the output of an AWGN channel leads to a model with discrete inputs and outputs that can be fully described by the conditional probabilities $\Pr\{Y_\nu \mid X_\mu\}$. They depend on the SNR of the channel and also on the quantization thresholds. In the following, we concentrate on BPSK modulation. A hard decision at the output delivers the binary symmetric channel (BSC). Its capacity can be calculated by

$C = 1 + P_s \log_2(P_s) + (1 - P_s) \log_2(1 - P_s) = 1 - \bar{I}_2(P_s)$   (2.62)

where $P_s = \frac{1}{2} \operatorname{erfc}\!\left(\sqrt{E_s/N_0}\right)$ denotes the symbol error probability. Generally, we obtain $2^q$ output symbols $Y_\nu$ for a q-bit quantization. The quantization thresholds have to be chosen such that the probabilities $\Pr\{Y_\nu \mid X_\mu\}$ with $1 \le \mu \le |\mathbb{X}|$ and $1 \le \nu \le 2^q$ maximize the mutual information. Figure 2.8 shows the corresponding results.
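A minimal sketch (not part of the original text) of the hard-decision case (2.62) follows, using the given error probability $P_s = \frac{1}{2}\operatorname{erfc}(\sqrt{E_s/N_0})$; the SNR values are arbitrary examples.

```python
# Illustrative sketch: BSC capacity (2.62) after hard-decision BPSK over the AWGN channel.
import numpy as np
from scipy.special import erfc

def binary_entropy(p):
    p = np.clip(p, 1e-300, 1 - 1e-16)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)        # (2.4)

def bsc_capacity(es_n0_db):
    es_n0 = 10 ** (np.asarray(es_n0_db) / 10)
    ps = 0.5 * erfc(np.sqrt(es_n0))                           # hard-decision error probability
    return 1 - binary_entropy(ps)                             # (2.62)

for snr_db in (-5, 0, 5, 10):
    print(snr_db, bsc_capacity(snr_db))
```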
On the one hand, the loss due to a hard decision prior to decoding can be up to 2 dB, that is, the minimum $E_b/N_0$ for which an error-free transmission is principally possible is approximately 0.4 dB. On the other hand, a 3-bit quantization loses only little compared to the continuous case. For high SNRs, the influence of quantization is rather small.

Figure 2.8 Capacity of the AWGN channel for BPSK and different quantization levels q (including q = ∞ and the Gaussian reference): a) channel capacity versus $E_s/N_0$, b) channel capacity versus $E_b/N_0$

2.2.5 Capacity of Fading Channels

In Section 1.3.3, the error probability for frequency-nonselective fading channels was discussed and it was recognized that the error rate itself is a random variable that depends on the instantaneous fading coefficient h. For the derivation of channel capacities, we encounter the same situation. Again, we can distinguish between ergodic and outage capacity. The ergodic capacity $\bar{C}$ represents the average capacity over all channel states and is mainly chosen for fast fading channels where coding is performed over many channel states. On the contrary, the outage capacity $C_{\mathrm{out}}$ denotes the capacity that cannot be reached with an outage probability $P_{\mathrm{out}}$. It is particularly used for slowly fading channels where the coherence time of the channel is much larger than a coding block, which is therefore affected by a single channel realization. For the sake of simplicity, we restrict the derivation to complex Gaussian distributed inputs because no closed-form expressions exist for discrete signal alphabets. Starting with the result of the previous section, we obtain the instantaneous capacity

$C(\gamma) = \log_2\left(1 + |h|^2 \cdot \dfrac{E_s}{N_0}\right) = \log_2(1 + \gamma)$   (2.63)

that depends on the squared magnitude of the instantaneous channel coefficient h and, thus, on the current SNR $\gamma = |h|^2 E_s/N_0$. Averaging (2.63) with respect to $\gamma$ delivers the ergodic channel capacity

$\bar{C} = E\{C(\gamma)\} = \int_0^\infty \log_2(1 + \xi) \cdot p_\gamma(\xi)\, d\xi.$   (2.64)

In order to compare the capacities of fading channels with that of the AWGN channel, we have to apply Jensen's inequality (Cover and Thomas 1991). Since $C(\gamma)$ is a concave function, it states that

$E_X\{f(x)\} \le f\big(E_X\{x\}\big).$   (2.65)

For convex functions, '≤' has to be replaced by '≥'. Moving the expectation in (2.64) into the logarithm leads to

$\bar{C} = E\{\log_2(1 + \gamma)\} \le \log_2\big(1 + E\{\gamma\}\big).$   (2.66)

From (2.66), we immediately see that the capacity of a fading channel with average SNR $E\{\gamma\} = E_s/N_0$ (for $\sigma_H^2 = 1$) is always smaller than that of an AWGN channel with the same average SNR.

Ergodic Capacity

We now want to calculate the ergodic capacity for particular fading processes. If $\mathcal{H}$ is Rayleigh distributed, we know from Section 1.5 that $|\mathcal{H}|^2$ and $\gamma$ are chi-squared distributed with two degrees of freedom. According to Section 1.3.3, we have to insert $p_\gamma(\xi) = 1/\bar\gamma \cdot \exp(-\xi/\bar\gamma)$ with $\bar\gamma = \sigma_H^2 \cdot E_s/N_0$ into (2.64). Applying partial integration, we obtain

$\bar{C} = \int_0^\infty \log_2(1 + \xi) \cdot \dfrac{1}{\bar\gamma} \cdot e^{-\xi/\bar\gamma}\, d\xi = \log_2(e) \cdot \exp\!\left(\dfrac{1}{\sigma_H^2 E_s/N_0}\right) \cdot \operatorname{expint}\!\left(\dfrac{1}{\sigma_H^2 E_s/N_0}\right)$   (2.67)

where the exponential integral function is defined as $\operatorname{expint}(x) = \int_x^\infty e^{-t}/t\, dt$ (Gradshteyn 2000). Figure 2.9 shows a comparison between the capacities of the AWGN and the flat Rayleigh fading channel (bold lines).
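The sketch below (not part of the original text) evaluates the closed form (2.67) for $\sigma_H^2 = 1$ and cross-checks it by averaging $\log_2(1+\gamma)$ over exponentially distributed SNR samples.

```python
# Illustrative sketch: ergodic Rayleigh-fading capacity (2.67) and a Monte Carlo check.
import numpy as np
from scipy.special import exp1          # exp1(x) = int_x^inf e^{-t}/t dt = expint(x)

def ergodic_capacity(es_n0_db):
    g = 10 ** (es_n0_db / 10)           # average SNR (sigma_H^2 = 1)
    return np.log2(np.e) * np.exp(1 / g) * exp1(1 / g)       # (2.67)

rng = np.random.default_rng(0)
for snr_db in (0, 10, 20):
    g = 10 ** (snr_db / 10)
    gamma = g * rng.exponential(1.0, 200_000)                # chi-squared(2) SNR samples
    print(snr_db, ergodic_capacity(snr_db), np.mean(np.log2(1 + gamma)))
```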
For sufficiently large SNR, the curves are parallel and we can observe a loss of roughly 2.5 dB due to fading. Compared with the bit error rate (BER) loss of approximately 17 dB in the uncoded case, this loss is rather small. It can be explained by the fact that the channel coding theorem presupposes infinitely long codewords, allowing the decoder to exploit a high diversity gain. This leads to a relatively small loss in capacity compared to the AWGN channel. Astonishingly, the ultimate limit of −1.59 dB is the same for the AWGN and the Rayleigh fading channel.

Outage Probability and Outage Capacity

With the same argumentation as in Section 1.3, we now define the outage capacity $C_{\mathrm{out}}$ with the corresponding outage probability $P_{\mathrm{out}}$. The latter describes the probability of the instantaneous capacity $C(\gamma)$ falling below a threshold $C_{\mathrm{out}}$:

$P_{\mathrm{out}} = \Pr\{C(\gamma) < C_{\mathrm{out}}\} = \Pr\{\log_2(1+\gamma) < C_{\mathrm{out}}\}.$   (2.68)

Inserting the density $p_\gamma(\xi)$ with $\bar\gamma = E_s/N_0$ into (2.68) leads to

$P_{\mathrm{out}} = \Pr\{\gamma < 2^{C_{\mathrm{out}}} - 1\} = 1 - \exp\!\left(\dfrac{1 - 2^{C_{\mathrm{out}}}}{E_s/N_0}\right).$   (2.69)

Resolving the last equation with respect to $C_{\mathrm{out}}$ yields

$C_{\mathrm{out}} = \log_2\big(1 - E_s/N_0 \cdot \ln(1 - P_{\mathrm{out}})\big).$   (2.70)

Conventionally, $C_{\mathrm{out}}$ is written as $C_p$, where $p = P_{\mathrm{out}}$ determines the corresponding outage probability. Figure 2.9 shows the outage capacities for different values of $P_{\mathrm{out}}$. For $P_{\mathrm{out}} = 0.5$, $C_{50}$, which can be ensured with a probability of only 50%, is close to the ergodic capacity $\bar{C}$. The outage capacity $C_{\mathrm{out}}$ decreases dramatically for smaller $P_{\mathrm{out}}$. At a spectral efficiency of 6 bit/s/Hz, the loss in terms of $E_b/N_0$ amounts to nearly 8 dB for $P_{\mathrm{out}} = 0.1$ and roughly 18 dB for $P_{\mathrm{out}} = 0.01$ compared to the AWGN channel.

Figure 2.9 Ergodic and outage capacity of flat Rayleigh fading channels for Gaussian input ($C_p$ denotes the outage capacity for $P_{\mathrm{out}} = p$)

Figure 2.10 depicts the outage probability versus the outage capacity for different values of $E_s/N_0$. As expected, for high SNRs large capacities can be achieved with very low outage probability. However, $P_{\mathrm{out}}$ grows rapidly with decreasing $E_s/N_0$. The asterisks denote the outage probability of the ergodic capacity $\bar{C}$. As already observed in Figure 2.9, it is close to 0.5.

2.2.6 Channel Capacity and Diversity

As explained in the previous subsection, the instantaneous channel capacity is a random variable that depends on the squared magnitude of the actual channel coefficient h. Since $|h|^2$ and, thus, also $\gamma$ vary over a wide range, the capacity also has a large variance. Besides the ergodic capacity $\bar{C}$, the outage capacity $C_{\mathrm{out}}$ is an appropriate measure for the channel quality. From Section 1.5, we know that diversity reduces the SNR's variance and, therefore, also reduces the outage probability of the channel. Assuming that a signal x is transmitted over D statistically independent channels with coefficients $h_\ell$, $1 \le \ell \le D$, the instantaneous capacity after maximum ratio combining becomes

$C(\gamma) = \log_2\left(1 + \sum_{\ell=1}^{D} |h_\ell|^2 \dfrac{E_s}{D N_0}\right) = \log_2(1 + \gamma).$   (2.71)
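To round off the outage discussion, the following sketch (not part of the original text) implements (2.69) and (2.70) for the flat Rayleigh fading channel with Gaussian input; the SNR and outage-probability values are arbitrary examples, and the round trip through both formulas serves as a consistency check.

```python
# Illustrative sketch: outage probability (2.69) and outage capacity (2.70), Rayleigh fading.
import numpy as np

def outage_probability(c_out, es_n0_db):
    g = 10 ** (es_n0_db / 10)
    return 1 - np.exp((1 - 2 ** c_out) / g)                   # (2.69)

def outage_capacity(p_out, es_n0_db):
    g = 10 ** (es_n0_db / 10)
    return np.log2(1 - g * np.log(1 - p_out))                 # (2.70)

for snr_db in (10, 20, 30):
    c01 = outage_capacity(0.01, snr_db)
    # the last value recovers P_out = 0.01, confirming that (2.69) and (2.70) are inverses
    print(snr_db, c01, outage_capacity(0.1, snr_db), outage_probability(c01, snr_db))
```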