How does Wireless Video Streaming work

Dr.NaveenBansal,India,Teacher
Published Date:25-10-2017
11 Interactive Video Communications

11.1 Video Conferencing and Telephony

Video conferencing and telephony are typical examples of interactive video communications, now that mobile phones and portable devices are so widespread. The major difference between video conferencing and telephony and other multimedia applications, such as multimedia messaging (MMS) and multimedia streaming, is that video conferencing is delay sensitive while the others are not.

In the past decade, point-to-point protocols (PPP) have been in common use for TCP/IP communications (the protocol suite used by the Internet) over a telephone line. Using the Internet, multimedia communication can be achieved without incurring any long-distance charges. Cable modems and DSL technology, in turn, have enabled broadband Internet access. A cable modem delivers broadband Internet access by taking advantage of the unused bandwidth of the cable television network. The bandwidth of the cable connection varies from 3 Mbits/s to 30 Mbits/s, and the upstream speed ranges from 384 kbits/s to 6 Mbits/s. The DSL modem takes advantage of the unused frequencies in the telephone line and varies in speed from hundreds of kbits/s to a few Mbits/s. With the development of 3G and 4G, the bandwidth constraint on carrying video content over a wireless system has been lifted. In addition, more sophisticated access protocols such as HSPA and EDGE have reached the market and given a strong impetus to the mobile telephony business.

11.1.1 IP and Broadband Video Telephony

Compared to the very low bit rate (around 40 kbits/s) of video telephony over the PSTN (Public Switched Telephone Network), cable modem and Digital Subscriber Line (DSL) connections offer Internet connections at much higher bit rates. Thus, IP video telephony in conjunction with DSL and cable modems can now offer video communications at a much higher quality than before.
IP video telephony uses the H.323 standard, which can be used over any packet data network, such as those using the Internet protocol (IP). Owing to the interest in video telephony over IP, many of the existing commercial implementations use the H.323 standard.

(Source: 4G Wireless Video Communications, Haohong Wang, Lisimachos P. Kondi, Ajay Luthra and Song Ci. © 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-77307-9)

The Digital Subscriber Line (DSL) connection and cable Internet connections are both referred to as broadband since they use different channels to send the digital information simultaneously with the audio signal or the cable television signal. A DSL connection uses a dedicated line from the subscriber to the telephone company, while the cable Internet service is provided to a neighborhood by a single coaxial cable line. Hence, the cable connection speed varies depending on how many people are using the service.

Cable modems send the data signal over the cable television infrastructure. They are used to deliver broadband Internet access by taking advantage of the unused cable network bandwidth. DSL, on the other hand, uses a conventional twisted wire pair for data transmission. ADSL [1] uses two frequency bands known as the upstream and downstream bands. The upstream band is used for communication from the end user to the telephone central office, while the downstream band is used for communication from the central office to the end user. ADSL provides dedicated local bandwidth, in contrast to the cable modem, which gives shared bandwidth. Its upstream and downstream speeds vary depending on the distance of the end user from the telephone office. Conventional ADSL has a downstream speed of approximately 8 Mbits/s and an upstream speed of around 1 Mbits/s. Thus, acceptable quality video telephony is achievable with the advances in modem technology and in audio and video compression.
11.1.2 Wireless Video Telephony

Video telephony is offered in 3G networks in both the circuit-switched mobile core network and the packet network. The former provides 64 kbits/s per circuit-switched path, while the latter may provide greater bandwidth, but that bandwidth is not guaranteed during the call because no dedicated circuit is reserved. 3GPP uses the 3G-324M protocol to support video telephony services. In January 2004, NTT DoCoMo (a Japanese operator) announced that its FOMA (freedom of mobile multimedia access) 3G video telephony service had passed the milestone of 2 million customers. Almost all recent mobile phones supporting UMTS networks can hold videophone conversations with other UMTS users, and it was estimated that there were more than 130 million UMTS users in mid-2007. In the next section we provide more detail on the 3G-324M protocol that is widely used in videophone applications.

11.1.3 3G-324M Protocol

3G-324M [2] is the 3GPP umbrella protocol for video telephony in 3G mobile networks. It is based on the H.324 specification (a standard for low bit rate GSTN networks) for multimedia conferencing over circuit-switched networks. 3G-324M comprises the following sub-protocols:
• H.245 for call control.
• H.223 for multiplexing bitstreams into data packets and demultiplexing them.
• H.223 Annex A and B for error handling of low and medium BER: detection, correction and concealment.
• Adaptation layers.

[Figure 11.1: 3G-324M basic structure. Blocks: video inputs to video codec (H.261/H.263/H.264/MPEG4); audio inputs to audio codec (AMR/G.723); system control to control (H.245); multiplex (H.223); modem (V.34/V.8/V.250).]

The basic structure of 3G-324M is shown in Figure 11.1. It consists of a multiplexer which mixes the various media types into a single bitstream (H.223), an audio compression algorithm (either an AMR or a G.723 codec), a video compression algorithm (either an H.261, H.263, H.264 or MPEG4 codec) and a control protocol which performs automatic capability negotiation and logical channel control (H.245). The goal of this standard is to combine low multiplexer delay with high efficiency and the ability to handle bursty data traffic from a variable number of sources.

11.1.3.1 Multiplexing and Error Handling

3G-324M uses a multiplex standard, H.223, to mix the various streams of audio, video, data and the control channel together into a single bitstream for transmission over the modem. H.223 has a flexible mapping scheme suitable for a variety of media and for a variable frame length. In its mobile extension, it obtains greater synchronization and control of channel errors without losing its flexibility. H.223 consists of a lower multiplex layer and a set of adaptation layers. The lower multiplex layer mixes the different media streams, whereas the adaptation layers perform logical framing, sequence numbering, error detection and error correction by retransmission. Each adaptation layer is suitable for a different type of information channel. In H.223, there are three operation modes (levels), which are chosen according to the degree of error resiliency required in a 3G-324M system.
In the first level, multiplexing and QoS control are supported; in the second level, a 16-bit pseudorandom noise sequence is employed to improve synchronization; in the third level, the payload length and FEC information are added to the header in order to improve error resilience.

11.1.3.2 Adaptation Layers

There are three adaptation layers in 3G-324M: AL1, AL2, and AL3. AL1 is intended primarily for transferring data and control information, and provides no error detection or correction mechanism. AL2 is intended primarily for transferring digital audio, and includes an 8-bit cyclic redundancy code (CRC), which is used to identify transmission errors. AL3 is intended primarily for digital video, and includes provision for retransmission and a 16-bit CRC.

11.1.3.3 The Control Channel

The H.245 protocol controls the following items:
• Logical channels, which are opened or closed for media transmissions.
• Master/slave determination, which designates the master terminal at the beginning of a session.
• Capability exchange between both terminals, such as the mode of multiplexing, codec support, data sharing mode, etc.
• Operation mode requests, which are sent from the receiver side to the transmitter side to convey the receiver's preference, within its capability, for the codec and the associated parameters.
• Call control commands and indications that check the status of the terminals and communications.

In addition, H.245 supports the numbered simple retransmission protocol (NSRP) and the control channel segmentation and reassembly layer (CCSRL) sub-layer in order to ensure reliable operation; therefore all terminals support both NSRP and SRP modes.

11.1.3.4 Audio and Video Channels

The 3G-324M specifications define the AMR codec as mandatory and G.723.1 as a recommended audio codec. They also declare the H.263 codec as mandatory and MPEG-4 as a recommended codec for video processing. The details of these video codecs have been discussed in Chapter 5.
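The interplay between capability exchange, the receiver's mode preference and the mandatory codecs can be illustrated with a small sketch. It is not an H.245 implementation: the function name `select_codec`, the dictionary layout and the fallback order are assumptions made for illustration; only the codec names and their mandatory or recommended status come from the text.

```python
# Illustrative sketch only: data layout and fallback order are assumptions,
# not part of the H.245 or 3G-324M specifications.
MANDATORY = {"audio": "AMR", "video": "H.263"}
RECOMMENDED = {"audio": "G.723.1", "video": "MPEG-4"}


def select_codec(media, local_caps, remote_caps, preference=None):
    """Pick a codec supported by both terminals.

    media       -- "audio" or "video"
    local_caps  -- codec names this terminal supports
    remote_caps -- codec names announced by the far-end terminal
    preference  -- codec requested by the receiver, if any
    """
    common = set(local_caps) & set(remote_caps)
    # Honour the receiver's preferred mode when both sides support it.
    if preference in common:
        return preference
    # Otherwise fall back to the codec every terminal must implement.
    if MANDATORY[media] in common:
        return MANDATORY[media]
    raise ValueError("no common " + media + " codec")


if __name__ == "__main__":
    caps = ["H.263", "MPEG-4"]
    print(select_codec("video", caps, caps, preference="MPEG-4"))
    print(select_codec("video", caps, ["H.263"], preference="MPEG-4"))
```

Because the mandatory codec is always part of a compliant terminal's capability set, the fallback branch is what guarantees that two terminals can at least open an AMR audio channel and an H.263 video channel.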
11.1.3.5 Call Setup

There are seven phases in the call setup procedure, designated by the letters A through G. In Phase A, an ordinary telephone connection is established. In Phase B, a regular analog telephone conversation can take place before the actual multimedia communication. When either user decides to start the multimedia communication, Phase C takes place: the two modems communicate with each other and digital communication is established. Then, in Phase D, the terminals communicate with each other using the H.245 control channel; detailed terminal capabilities are exchanged and logical channels are opened. In Phase E, actual multimedia communication takes place. Phase F is entered when either user wishes to end the call: the logical channels are closed and an H.245 message is sent to the far-end terminal to specify the new mode (disconnect, back to voice mode, or another digital mode). Finally, in Phase G, the terminals actually enter the mode specified in the previous phase.

11.2 Region-of-Interest Video Communications

As mentioned in Chapter 6, ROI based video coding and communication has been very popular for wireless video telephony. As shown in Figure 11.2, the ROI based video communications system architecture gives users greater flexibility and interactivity in specifying their desires, and enables encoders to control the visual quality of coded video sequences more efficiently. In this section, we demonstrate a few of the latest advances in ROI based bit allocation [3] and adaptive background skipping [4, 5] techniques.

11.2.1 ROI based Bit Allocation

In the literature, many ROI bit allocation algorithms [6–10] are based on a weighted version of the H.263+ TMN8 model [11], where a cost function is created and the distortion components of the various regions in the function are penalized differently by using a set of preset weights.
As with most of the other video standards, as mentioned in Chapter 6, TMN8 uses a Q-domain rate control scheme, which models the rate and distortion as functions of the quantization step size (QP). However, recent advances in rate control research and development have demonstrated that the ρ-domain rate control model [12] (ρ represents the number of non-zero AC coefficients in a macroblock in video coding) is more accurate and thus effectively reduces rate fluctuations. The ρ-domain rate control approach has also already been used in industry trials [13–16]. To the best of our knowledge, there is so far no general optimized ρ-domain bit allocation model for ROI video coding, although [17] used the ρ-domain rate control model in an effort to obtain an ad hoc bit allocation solution. In this section, we introduce a ρ-domain optimized weighted bit allocation scheme for ROI video coding.

11.2.1.1 Quality Metric for ROI Video

Video quality measurement is still an open issue for ROI video coding. Most of the literature uses the PSNRs of the ROI and Non-ROI, respectively, as a measurement for evaluating regional visual quality; however, a quality measure for the whole image has not been addressed. In [18], a weighted mean squared error (MSE) metric was proposed in order to measure perceptual video quality. In this metric, the macroblocks are classified as activity macroblocks and static macroblocks, and different weights are assigned to these macroblocks for calculating the weighted MSE of the overall image. Although this measurement can be extended for use with ROI video, here we introduce a new quality measurement for ROI video coding which takes into account further aspects such as spatial and temporal visual quality.

[Figure 11.2: An example of ROI video coding and communications system. Blocks: video, intelligent processor, ROI video coding encoder, network, decoder, video.]
In general, the evaluation of ROI video quality should consider at least three aspects: users' interest, video fidelity and the perceptual quality of the reconstructed video data. The users' interest directly determines the classification of a video frame into ROI and Non-ROI parts and their associated perceptual importance factors. In video telephony applications, the speaker's face region is a typical ROI because a human being's facial expressions are very complicated and small variations can convey a large quantity of information. For the video fidelity factor, PSNR is a good measurement, which indicates the total amount of distortion of the reconstructed video frame compared to the original frame. In most cases, fidelity is the most important consideration for video coding, where any improvement might yield better subjective visual quality. However, this is not always the case, which is why perceptual quality factors should also be taken into account. Perceptual quality considers both spatial errors, for example blocking and ringing artifacts, and temporal errors, such as temporal flicker, where the visual quality of the frames changes non-uniformly along the temporal axis.

Let us denote by $D_R$ and $D_{NR}$ the normalized per pixel distortion of the ROI and Non-ROI, and by $\alpha$ the ROI perceptual importance factor.
If we assume that the relationship among the aspects mentioned above can be simplified into a linear function in video quality evaluation, then we can represent the overall distortion of the video sequence as:

$$
\begin{aligned}
D_{sequence} &= \alpha D_R + (1-\alpha) D_{NR} \\
&= \frac{\alpha}{M}\left[\beta \sum_{i=1}^{M} D_{RF}(f_i,\tilde{f}_i) + \gamma \sum_{i=1}^{M} D_{RS}(\tilde{f}_i) + (1-\beta-\gamma)\, D_{RT}(\tilde{f}_1,\ldots,\tilde{f}_M)\right] \\
&\quad + \frac{1-\alpha}{M}\left[\beta \sum_{i=1}^{M} D_{NF}(f_i,\tilde{f}_i) + \gamma \sum_{i=1}^{M} D_{NS}(\tilde{f}_i) + (1-\beta-\gamma)\, D_{NT}(\tilde{f}_1,\ldots,\tilde{f}_M)\right]
\end{aligned}
\tag{11.1}
$$

where $f_i$ and $\tilde{f}_i$ are the ith original and reconstructed frames within the $M$ frames in the video sequence, $\beta$ and $\gamma$ are weighting factors, $D_R$ and $D_{NR}$ are the total distortion for the ROI and Non-ROI, $D_{RF}$, $D_{RS}$ and $D_{RT}$ are the normalized errors of the ROI in fidelity, spatial perceptual quality and temporal perceptual quality, and $D_{NF}$, $D_{NS}$ and $D_{NT}$ are their counterparts for the Non-ROI. It is clear that $\alpha$, $\beta$ and $\gamma$ should be assigned real values between 0 and 1.

In low-bitrate video applications, such as wireless video telephony, blocking artifacts are the major concern for spatial perceptual quality. This kind of artifact is caused by quantization, in which most of the high-frequency coefficients are removed (set to zero). The resulting effect is that the smoothed image blocks make the block boundaries quite pronounced. In extreme low bit-rate cases, only the DC coefficients are coded, which turns the decoded image into piece-wise constant blocks. In this work, we define $D_{RS}$ (and similarly $D_{NS}$) as the normalized blockiness distortion, that is:

$$
D_{RS}(\tilde{f}) = \frac{\text{Number of boundaries with discontinuities}}{\text{Number of boundaries}}
\tag{11.2}
$$

where every boundary between blocks is checked to see if perceivable discontinuities exist. The discontinuity detection approach used in [19] is adopted, which checks the sum of the mean squared difference of the intensity slope across the block boundaries.
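Equations (11.1) and (11.2) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the detector of [19]: the slope comparison, the default `threshold` value, the 8-pixel block size and the function names are all hypothetical, and frame dimensions are assumed to be multiples of the block size.

```python
def _count_vertical(frame, block, threshold):
    """Count vertical block boundaries and those flagged as discontinuous."""
    height, width = len(frame), len(frame[0])
    flagged = total = 0
    for top in range(0, height, block):
        rows = frame[top:top + block]
        for x in range(block, width, block):
            # Mean squared difference between the slope across the boundary
            # and the average slope just inside the two neighbouring blocks.
            msd = sum(
                ((r[x] - r[x - 1])
                 - 0.5 * ((r[x - 1] - r[x - 2]) + (r[x + 1] - r[x]))) ** 2
                for r in rows
            ) / len(rows)
            total += 1
            flagged += msd > threshold  # threshold value is an assumption
    return flagged, total


def blockiness(frame, block=8, threshold=4.0):
    """Eq. (11.2): fraction of block boundaries with a discontinuity."""
    f_v, t_v = _count_vertical(frame, block, threshold)
    # Horizontal boundaries are handled by transposing the frame.
    f_h, t_h = _count_vertical([list(c) for c in zip(*frame)], block, threshold)
    total = t_v + t_h
    return (f_v + f_h) / total if total else 0.0


def sequence_distortion(alpha, beta, gamma, roi, non_roi):
    """Eq. (11.1). Each region is a dict with per-frame lists 'fidelity'
    and 'spatial' and a single 'temporal' score for the sequence."""
    def region(d):
        m = len(d["fidelity"])
        return (beta * sum(d["fidelity"]) + gamma * sum(d["spatial"])
                + (1 - beta - gamma) * d["temporal"]) / m
    return alpha * region(roi) + (1 - alpha) * region(non_roi)


if __name__ == "__main__":
    ramp = [[x + y for x in range(16)] for y in range(16)]
    blocky = [[50 * (x // 8 + y // 8) for x in range(16)] for y in range(16)]
    print(blockiness(ramp), blockiness(blocky))
```

A smooth ramp has identical slopes on both sides of every boundary and scores 0, while a piece-wise constant image, the extreme low bit-rate case described above, scores 1.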
The assumption of this approach is that the slopes on both sides of a block boundary are supposed to be identical, and an abrupt change in slope is probably due to quantization.

In equation (11.1), $D_{RT}$ (or $D_{NT}$) is defined as an assigned score in the range [0, 1] based on the variance of $D_{RS}$ (or $D_{NS}$) over all the frames in the sequence. In this way, the terms for fidelity, spatial perceptual quality and temporal perceptual quality are normalized and can be bridged by the weighting parameters $\alpha$, $\beta$ and $\gamma$ to form a controllable video quality measurement. The selection of these weighting parameters is up to users, based on their requirements and expectations. Again, this measurement is not a perfect metric, but it will be shown in the subsequent text that it helps the bit allocation process to favor subjective perception.

11.2.1.2 Bit Allocation Scheme for ROI Video

In video coding applications, a typical problem is to minimize $D_{sequence}$ with a given bit budget for the video sequence. The optimal solution for this complicated problem relies on an optimal frame-level rate control algorithm and an optimal macroblock-level bit allocation scheme. However, for real-time applications such as wireless video telephony, where very limited information about future frames is available when coding the current frame, it is not practical or feasible to pursue an optimal frame-level rate control. Typically a popular greedy algorithm is used, which assumes that the complexity of the video content is distributed uniformly along the frames in the video sequence, and thus allocates an equal fraction of the available bits to each of the remaining frames. For the same reason, taking care of $D_{NT}(\tilde{f}_1,\ldots,\tilde{f}_M)$ in the rate control is very difficult for these applications. Therefore, to find a practical solution and to simplify the problem, we assume that good frame-level rate control is available and thus narrow down the problem to a macroblock-level bit allocation problem.
At the same time, we propose a background skipping approach, which increases the chance of reducing the value of the term $D_{NT}(\tilde{f}_1,\ldots,\tilde{f}_M)$, because the skipped region will present the same perceptual quality as that of the previous frame and thus might reduce the fluctuation of the perceptual quality between consecutive frames.

For measuring the image quality of a video frame, we use equation (11.1) with $\beta + \gamma = 1$. Let us denote by $R_{budget}$ the total bit budget for a given frame $f$ and by $R$ the bit rate for coding the frame. Then the problem can be represented by:

$$
\begin{aligned}
\text{Minimize}\quad & \alpha\left[\beta D_{RF}(f,\tilde{f}) + (1-\beta) D_{RS}(\tilde{f})\right] + (1-\alpha)\left[\beta D_{NF}(f,\tilde{f}) + (1-\beta) D_{NS}(\tilde{f})\right] \\
\text{such that}\quad & R \le R_{budget}
\end{aligned}
\tag{11.3}
$$

Clearly, this optimization problem can be solved by Lagrangian relaxation and dynamic programming in the same fashion as in [20]. However, the computational complexity is a great deal higher than a real-time system can bear. Therefore, a low-complexity near-optimal solution is preferred. We propose a two-stage bit allocation algorithm in the ρ-domain to solve this problem. In the first stage, we solve the optimization problem:

$$
\text{Minimize}\quad \alpha D_{RF}(f,\tilde{f}) + (1-\alpha) D_{NF}(f,\tilde{f}), \quad \text{such that } R \le R_{budget}
\tag{11.4}
$$

After the optimal coding parameters for (11.4) are obtained, in the second stage we adjust the coding parameters iteratively to reduce the term $\alpha D_{RS}(\tilde{f}) + (1-\alpha) D_{NS}(\tilde{f})$ until a local minimum is reached. Clearly, the result will be very close to the optimal solution when $\beta$ is a relatively large number. When $\beta = 1$, problems (11.3) and (11.4) are identical. In this section, we will focus on the first stage and solve problem (11.4).

11.2.1.3 Bit Allocation Models

In ROI video coding, let us denote by $N$ the number of macroblocks in the frame, and by $\{\rho_i\}$, $\{\sigma_i\}$, $\{R_i\}$ and $\{D_i\}$ the sets of ρ values, standard deviations, rates and distortions (sum of squared error) for the ith macroblocks. Thus, $R = \sum_{i=1}^{N} R_i$.
We define a set of weights $\{w_i\}$, one for each macroblock, as:

$$
w_i = \begin{cases}
\dfrac{\alpha}{K} & \text{if it belongs to the ROI} \\[2ex]
\dfrac{1-\alpha}{N-K} & \text{if it belongs to the Non-ROI}
\end{cases}
\tag{11.5}
$$

where $K$ is the number of macroblocks within the ROI. Therefore, the weighted distortion of the frame is:

$$
D = \sum_{i=1}^{N} w_i D_i = \left[\alpha D_{RF}(f,\tilde{f}) + (1-\alpha) D_{NF}(f,\tilde{f})\right] \cdot 255^2 \cdot 384
\tag{11.6}
$$

Hence problem (11.4) can be rewritten as:

$$
\text{Minimize}\quad D, \quad \text{such that } R \le R_{budget}
\tag{11.7}
$$

We propose to solve (11.7) by using a modeling-based bit allocation approach. As shown in [21], the distribution of the AC coefficients of a natural image can be best approximated by a Laplacian distribution $p(x) = \frac{\eta}{2} e^{-\eta |x|}$. Therefore, in [11], the rate and distortion of the ith macroblock are modeled in (11.8) and (11.9) as functions of ρ:

$$
R_i = A\rho_i + B
\tag{11.8}
$$

where $A$ and $B$ are constant modeling parameters; $A$ can be thought of as the average number of bits needed to encode a non-zero coefficient and $B$ as the bits due to non-texture information.

$$
D_i = 384\,\sigma_i^2\, e^{-\theta\rho_i/384}
\tag{11.9}
$$

where $\theta$ is an unknown constant.

Here we optimize $\rho_i$ instead of the quantizers because we assume that an accurate enough ρ-QP table is available to generate a decent quantizer from any selected $\rho_i$. In general, (11.7) can be solved by using Lagrangian relaxation, in which the constrained problem is converted into the unconstrained problem:

$$
\underset{\rho_i}{\text{Minimize}}\quad J_\lambda = \lambda R + D = \sum_{i=1}^{N} (\lambda R_i + w_i D_i) = \sum_{i=1}^{N} \left[\lambda (A\rho_i + B) + 384\, w_i \sigma_i^2\, e^{-\theta\rho_i/384}\right]
\tag{11.10}
$$

where $\lambda^*$ is the solution that enables $\sum_{i=1}^{N} R_i = R_{budget}$.
By setting the partial derivatives to zero in (11.10), we obtain the following expression for the optimized $\rho_i$. Let

$$
\frac{\partial J_\lambda}{\partial \rho_i} = \frac{\partial \sum_{i=1}^{N} \left[\lambda (A\rho_i + B) + 384\, w_i \sigma_i^2\, e^{-\theta\rho_i/384}\right]}{\partial \rho_i} = 0
\tag{11.11}
$$

which is

$$
\lambda A - \theta w_i \sigma_i^2\, e^{-\theta\rho_i/384} = 0
\tag{11.12}
$$

so

$$
e^{-\theta\rho_i/384} = \frac{\lambda A}{\theta w_i \sigma_i^2}
\tag{11.13}
$$

and

$$
\rho_i = \frac{384}{\theta}\left[\ln(\theta w_i \sigma_i^2) - \ln(\lambda A)\right]
\tag{11.14}
$$

On the other hand, since

$$
R_{budget} = \sum_{i=1}^{N} R_i = \frac{384A}{\theta} \sum_{i=1}^{N} \left[\ln(\theta w_i \sigma_i^2) - \ln(\lambda A)\right] + NB
\tag{11.15}
$$

we have

$$
\ln(\lambda A) = \frac{1}{N} \sum_{i=1}^{N} \ln(\theta w_i \sigma_i^2) - \frac{\theta}{384NA}\,(R_{budget} - NB)
\tag{11.16}
$$

From (11.14) and (11.16), we obtain the following model:

$$
\begin{aligned}
\rho_i &= \frac{384}{\theta}\left[\ln(\theta w_i \sigma_i^2) - \frac{1}{N}\sum_{j=1}^{N} \ln(\theta w_j \sigma_j^2) + \frac{\theta}{384NA}\,(R_{budget} - NB)\right] \\
&= \frac{R_{budget} - NB}{NA} + \frac{384}{\theta}\left[\ln(\theta w_i \sigma_i^2) - \frac{\sum_{j=1}^{N} \ln(\theta w_j \sigma_j^2)}{N}\right]
\end{aligned}
\tag{11.17}
$$

As mentioned in [20], another model can be obtained if we assume a uniform quantizer; the distortion is then modeled differently from equation (11.9), and the model can be derived as:

$$
\rho_i = \frac{\sqrt{w_i}\,\sigma_i}{\sum_{j=1}^{N} \sqrt{w_j}\,\sigma_j}\,\rho_{budget}
\tag{11.18}
$$

It is also indicated that both models perform well, close to the optimal solution.

11.2.2 Content Adaptive Background Skipping

The concept of content-adaptive frame/object/macroblock skipping has attracted a great deal of attention recently. The trade-off between spatial and temporal quality was first studied in [21], where a perceptual rationale is employed: the human visual system (HVS) is more sensitive to temporal changes when the frame contains high motion activity, and otherwise is more sensitive to spatial details. The same logic is also used by [22–25] in determining the skip modes. In [22], a weighted function of motion and variance of the residue was used to evaluate the target bits for objects in bit allocation, which assigned more bits to objects with a more complicated texture (a higher variance) or more activity (higher motion).
The skipping decision for objects is based on the optimization of a cost function which considers both the coded distortion owing to quantization error and the skipped distortion owing to skipped objects. This approach is difficult to apply in real-time video systems, which have tight time constraints and cannot obtain future frames in advance. In [23], an adaptive macroblock skipping approach was proposed for ROI transcoding, where thresholds for motion and for the MAAD (mean of accumulated absolute difference) of the residue are used to skip inactive Non-ROI macroblocks. In [24], the decision for frame skipping depends jointly on the temporal and spatial contents of the video and on the fullness of the buffer, using empirical rules. In [25], considering the HVS model mentioned above, the decision for frame skipping is determined adaptively by motion, quantization parameter and buffer status. The motion is evaluated based on a sorted version of the most recent motion activities and a dynamically adjusted threshold that is coupled with available resources, spatial quality, quantization parameters and motion activity. Utilizing the HVS model, superior temporal quality is maintained by avoiding frame skipping during high-motion scenes. By skipping frames during low-motion scenes that are less temporally sensitive, coding bits can be saved for subsequent non-skipped frames, and spatial quality can be enhanced. Furthermore, in [25] the overall temporal-spatial quality is enhanced when compared to the no-skipping and fixed-pattern solutions, given limited coding resources.

In this section, a low-complexity content adaptive background skipping scheme for ROI video coding is introduced. In this context, we use background and Non-ROI as interchangeable terms because Non-ROI in video telephony applications generally refers to the background region.
In this framework we consider background skipping jointly with frame-level and macroblock-level bit allocation. The skip mode is determined mainly by foreground shape deformation, foreground motion, background motion and the accumulated skipped distortion owing to the skipped background. A self-learning and classification approach based on the Bayesian model is proposed in order to estimate the number of skipped backgrounds (in the future frames) based on the context of motion and background texture complexity. In addition, a weighted rate control and bit allocation algorithm is proposed in order to allocate bits for the foreground and background regions.

In Figure 11.3, the system architecture of our ROI video coding system is shown, which follows a frame-by-frame processing sequence. The system adopts a ρ-domain frame-level rate control algorithm [12] and a weighted macroblock-level bit allocation algorithm. When a frame is fetched into the system, a greedy frame-level rate control module is called to assign a target ρ budget for the frame, considering the remaining bits and the number of frames in the rate control window. The model is based on the assumption that the content complexity of the video frames in the rate control window is distributed uniformly, and thus the bits should be allocated uniformly among the remaining frames. After that, the ROI of the frame is detected or tracked and the macroblocks in the frame are classified into ROI macroblocks and Non-ROI macroblocks. Then, motion estimation is conducted for all of the macroblocks in the current frame, and the obtained motion information is used as part of the content cues in the following background skip mode decision. Once the decision of whether or not to skip the current Non-ROI is made, the ρ budget for the current frame is adjusted, and then the macroblock-level bit allocation and the following DCT transformation, quantization and entropy coding are conducted in the same way as described in section 11.2.1.
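The two rate-control steps of this pipeline, the greedy frame-level budget and the macroblock-level allocation of section 11.2.1, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the conversion from bits to a ρ budget simply sums model (11.8) over the N macroblocks, both regions are assumed to be non-empty, and no clipping of negative ρ values is attempted.

```python
import math


def frame_rho_budget(remaining_bits, remaining_frames, n_mb, A, B):
    """Greedy frame-level budget: an equal share of the remaining bits,
    converted to rho with R = A * sum(rho_i) + N * B (eq. 11.8 summed)."""
    frame_bits = remaining_bits / remaining_frames
    return (frame_bits - n_mb * B) / A


def macroblock_weights(is_roi, alpha):
    """Eq. (11.5): alpha/K for ROI macroblocks, (1-alpha)/(N-K) otherwise."""
    n, k = len(is_roi), sum(is_roi)
    return [alpha / k if roi else (1 - alpha) / (n - k) for roi in is_roi]


def allocate_rho_log(r_budget, weights, sigmas, A, B, theta):
    """Eq. (11.17): allocation under the Laplacian-based models."""
    n = len(weights)
    logs = [math.log(theta * w * s * s) for w, s in zip(weights, sigmas)]
    mean_log = sum(logs) / n
    base = (r_budget - n * B) / (n * A)
    return [base + 384.0 / theta * (l - mean_log) for l in logs]


def allocate_rho_sqrt(rho_budget, weights, sigmas):
    """Eq. (11.18): allocation under the uniform-quantizer assumption."""
    scores = [math.sqrt(w) * s for w, s in zip(weights, sigmas)]
    total = sum(scores)
    return [rho_budget * s / total for s in scores]


if __name__ == "__main__":
    # Parameter values below are arbitrary illustrations.
    weights = macroblock_weights([True, True, False, False], alpha=0.8)
    print(allocate_rho_sqrt(90.0, weights, [10.0] * 4))
    print(allocate_rho_log(1000.0, weights, [10.0] * 4, A=6.5, B=10.0, theta=7.0))
```

With equal texture complexity in every macroblock, model (11.18) gives each ROI macroblock twice the ρ of a Non-ROI macroblock in this example, because the weights differ by a factor of four and enter through a square root. The rates implied by (11.17) add up to the frame budget by construction.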
11.2.2.1 Content-based Skip Mode Decision

Let us first define two filters $F(x_n, M, Th)$ and $G(x_n, M, Th)$, where $\{x_n\}$ is a set of real numbers in which $x_n$ is the nth item, $M$ is an integer and $Th$ is a threshold in the range [0, 1]:

$$
F(x_n, M, Th) = \begin{cases}
1 & \text{if } x_n \text{ is greater than } Th \cdot 100\% \text{ of the items in } \{x_{n-M},\ldots,x_{n-1}\} \\
0 & \text{otherwise}
\end{cases}
\tag{11.19}
$$

and

$$
G(x_n, M, Th) = \begin{cases}
1 & \text{if } \dfrac{x_n - x_{n-M}}{x_{n-M}} \ge Th \\
0 & \text{otherwise}
\end{cases}
\tag{11.20}
$$

[Figure 11.3: System architecture of the ROI video coding. Flowchart steps: initialization; frame-level rho budget estimation; fetch a new frame; ROI detection/tracking; motion estimation (all MBs); first 2 frames? (if no: skip mode decision and rho budget adjustment); macroblock-level bit allocation; coding macroblocks; last macroblock?; last frame?; end.]

Filter (11.19) detects, within a local window of fixed length $M$, whether the current value $x_n$ is in the top position (above more than $Th \cdot 100\%$ of the items), and filter (11.20) detects whether there is an increase from $x_{n-M}$ to $x_n$ of more than $Th \cdot 100\%$. These filters will be used to detect the content status or a status change, which indirectly affects the skip mode decision.

In [24, 25], the value of the summed and averaged motion vectors in the frame (or recent frames) is used to represent the frame motion. The higher the motion, the less skipping should be activated, in order to protect possible content transition information. In ROI video coding, both foreground and background activities are considered. When a large amount of motion occurs in background regions, the frequency of background skipping should be reduced. On the other hand, when the foreground contains a large amount of activity, skipping the background might be helpful so as to reallocate more bits to code the foreground.
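The two filters (11.19) and (11.20) introduced at the start of this subsection can be sketched as follows. The function names and the list-based history are assumptions for illustration. Two boundary choices are also assumptions, since the text does not fix them: a filter returns 0 while fewer than M earlier values are available, and `filter_f` counts "greater than Th·100% of items" as at least Th·M items so that Th = 1 means "greater than all of them".

```python
def filter_f(history, m, th):
    """Eq. (11.19). history[-1] is x_n; the window is x_{n-M} .. x_{n-1}."""
    window = history[-m - 1:-1]
    if len(window) < m:
        return 0  # assumption: not enough history yet
    exceeded = sum(1 for v in window if history[-1] > v)
    return 1 if exceeded >= th * m else 0


def filter_g(history, m, th):
    """Eq. (11.20): relative increase from x_{n-M} to x_n of at least th."""
    if len(history) < m + 1:
        return 0  # assumption: not enough history yet
    previous = history[-m - 1]
    if previous == 0:
        return 0  # assumption: avoid dividing by zero
    return 1 if (history[-1] - previous) / previous >= th else 0


if __name__ == "__main__":
    # A value well above its recent window triggers F.
    print(filter_f([1.0, 2.0, 3.0, 10.0], m=3, th=0.8))
    # A 50% jump between consecutive frames triggers G at Th = 0.4.
    print(filter_g([10.0, 15.0], m=1, th=0.4))
```

F answers "is the current value high relative to its recent past", whereas G answers "did the value just jump"; the skip mode decision uses both for each content cue.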
Let us denote by $\{\chi_n\}$ the amount of background activity and by $\{\zeta_n\}$ the amount of foreground activity for the frame sequence. Then:

$$
\chi_n = \sum_{i \in \text{Non-ROI}} \left(|MVx_i| + |MVy_i|\right)
\tag{11.21}
$$

where $MVx_i$ and $MVy_i$ are the x and y components of the motion vector of the ith macroblock in the nth frame, and:

$$
\zeta_n = \mu_n \times \kappa_n
\tag{11.22}
$$

where $\mu_n$ is the ROI shape deformation factor and $\kappa_n$ is the ROI local movement factor, with

$$
\mu_n = \frac{\text{Number of pixels in non-overlapped regions of the ROIs of the } (n-1)\text{th and } n\text{th frames}}{\text{Number of pixels in the ROI of the } n\text{th frame}}
\tag{11.23}
$$

and

$$
\kappa_n = \sum_{i \in \text{ROI}} \left(|MVx_i| + |MVy_i|\right)
\tag{11.24}
$$

Clearly, $\zeta_n$ can characterize the degree of the foreground activities, because $\mu_n$ represents the degree of global activities such as object movement/rotation and shape deformation, and $\kappa_n$ represents local activities such as a change of facial expression. Two examples of these foreground activities are shown in Figure 11.4.

Let us denote by $\{\sigma^2_{B_n}\}$ the total energy of the background residue per frame for the frame sequence. Clearly, it is also the distortion due to the skipped background. We can now represent the skip mode decision as:

$$
\begin{aligned}
S_n ={} & F(\zeta_n, M_2, Th_{\zeta 1})\, G(\zeta_n, 1, Th_{\zeta 2}) \\
& + \left[1 - F(\zeta_n, M_2, Th_{\zeta 1})\, G(\zeta_n, 1, Th_{\zeta 2})\right] \left[1 - G(\sigma^2_{B_n}, p, Th_{\sigma})\right] \left[1 - F(\chi_n, M_1, Th_{\chi 1})\right] \left[1 - G(\chi_n, 1, Th_{\chi 2})\right]
\end{aligned}
\tag{11.25}
$$

where $Th_{\sigma}$, $M_1$, $Th_{\chi 1}$, $Th_{\chi 2}$, $M_2$, $Th_{\zeta 1}$ and $Th_{\zeta 2}$ are thresholds and local window sizes defined by users, and $p-1$ is the number of consecutive frames preceding the current frame that skipped their background (in other words, the $(n-p)$th frame coded its background but the $(n-p+1)$th, $(n-p+2)$th, ... and $(n-1)$th frames skipped it). When $S_n = 1$, the background of the current frame is skipped; otherwise, it is coded.
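The activity measures (11.21) to (11.24) and the decision rule (11.25) can be sketched as below. Several details are assumptions made for illustration: motion vector components are summed by magnitude, the "non-overlapped regions" of (11.23) are taken as the symmetric difference of the two ROI pixel sets, the function names are hypothetical, and the five filter outputs are passed in as 0/1 values rather than recomputed here.

```python
def background_activity(motion_vectors, is_roi):
    """Eq. (11.21): summed motion magnitude over Non-ROI macroblocks."""
    return sum(abs(mx) + abs(my)
               for (mx, my), roi in zip(motion_vectors, is_roi) if not roi)


def foreground_activity(motion_vectors, is_roi, prev_roi_pixels, roi_pixels):
    """Eqs. (11.22)-(11.24): shape deformation times local movement.
    The ROI masks are given as sets of (row, column) pixel coordinates."""
    kappa = sum(abs(mx) + abs(my)
                for (mx, my), roi in zip(motion_vectors, is_roi) if roi)
    # Pixels covered by exactly one of the two ROIs (assumed reading).
    mu = len(prev_roi_pixels ^ roi_pixels) / len(roi_pixels)
    return mu * kappa


def skip_mode(f_fg, g_fg, g_dist, f_bg, g_bg):
    """Eq. (11.25) with the filter outputs supplied as 0 or 1:
    f_fg, g_fg -- F and G applied to the foreground activity
    g_dist     -- G applied to the skipped-background distortion
    f_bg, g_bg -- F and G applied to the background activity
    Returns 1 to skip the background, 0 to code it."""
    fg = f_fg * g_fg
    return fg + (1 - fg) * (1 - g_dist) * (1 - f_bg) * (1 - g_bg)


if __name__ == "__main__":
    mvs = [(1, -2), (0, 3), (-4, 0)]
    roi_flags = [True, False, False]
    print(background_activity(mvs, roi_flags))
    print(foreground_activity(mvs, roi_flags, {(0, 0), (0, 1)}, {(0, 1), (0, 2)}))
    print(skip_mode(0, 0, 1, 0, 0))
```

Written this way, the structure of (11.25) is easy to read: the foreground term alone can force a skip, and otherwise any one of the three remaining filters firing is enough to force the background to be coded.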
Clearly, from (11.25) it is observed that the system chooses to skip the background when there is a sharp increase in the amount of foreground activity or the foreground contains large activity; otherwise, if the background contains large motion or the accumulated distortion due to the skipped background is rather high, then the background will be coded.

[Figure 11.4: Examples of frames with large activity in foreground. (a) Global activity (face movement); (b) Local activity (change of facial expression).]

11.2.2.2 ρ Budget Adjustment

In Figure 11.3, the frame-level ρ budget estimation is based on the assumption that the whole frame is coded. However, in this system some backgrounds in the sequence will be skipped, so an adjustment of the ρ budget is necessary. Here we consider three types of strategies:
1) Greedy strategy, which simply reduces the ρ budget based on the texture complexity of the ROI and Non-ROI when the skip mode is on, and does nothing if the background is coded;
2) 'Banker' strategy, which reduces the ρ budget when the skip mode is on, but stores the savings of these ρ's for future frames. A frame that codes its background obtains all the ρ's saved from the previous frames with background skipping;
3) 'Investor' strategy, which estimates the future skipping events based on the statistics and patterns of the previous background skipping history, and then determines the ρ budget based on the estimation.

Let us denote by $\rho_n^{budget}$ the ρ budget obtained from the frame-level rate controller, by $\rho_n^{adjusted}$ the adjusted ρ budget, and by $n$ the index of the current frame. In the following we describe these strategies in more detail and compare them.
Greedy strategy

The $\rho_n^{adjusted}$ using this strategy can be calculated by

$$
\rho_n^{adjusted} =
\begin{cases}
\rho_n^{budget} & \text{if } S_n = 0 \\[1ex]
\dfrac{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i}{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i + \sum_{i \in \text{NON-ROI}} \sqrt{w_i}\,\sigma_i}\, \rho_n^{budget} & \text{otherwise}
\end{cases}
\tag{11.26}
$$

where $\sigma_i$ represents the standard deviation of the DCT coefficients of the $i$th macroblock in the current frame, and $w_i$ is the associated weight for the macroblock in the macroblock-level weighted bit allocation as defined in section 11.2.1. Equation (11.26) comes as an extension of equation (11.18).

'Banker' strategy

This strategy is a conservative approach similar to a traditional banking operation, where a customer can withdraw at most the total deposit in the account. In this case, the ρ's saved in frames with background skipping act as a deposit of resources for the nearest future frame that codes its background. The adjusted ρ budget is obtained by:

$$
\rho_n^{adjusted} =
\begin{cases}
p\,\rho_{n-p+1}^{budget} - \displaystyle\sum_{i=1}^{p-1} \rho_{n-i}^{adjusted} & \text{if } S_n = 0 \\[2ex]
\dfrac{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i}{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i + \sum_{i \in \text{NON-ROI}} \sqrt{w_i}\,\sigma_i}\, \rho_n^{budget} & \text{otherwise}
\end{cases}
\tag{11.27}
$$

where $p-1$ is the number of consecutive preceding frames of the current frame with skipped background, and the $(n-p)$th frame coded its background.

'Investor' strategy

A more aggressive approach is to predict future possible events and allocate resources based on the prediction.
Here we assume that the future frames with skipped backgrounds have a similar complexity in foreground as the current frame. Therefore, once we estimate that there will be $q$ frames with skipped background following the current frame, we can calculate the adjusted ρ budget by:

$$
\rho_n^{adjusted} =
\begin{cases}
p\,\rho_{n-p+1}^{budget} - \displaystyle\sum_{i=1}^{p-1} \rho_{n-i}^{adjusted} & \text{if } S_n = 0 \text{ and } n \le 50 \\[2ex]
\dfrac{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i + \sum_{i \in \text{NON-ROI}} \sqrt{w_i}\,\sigma_i}{2\left( \sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i + \frac{1}{q+1} \sum_{i \in \text{NON-ROI}} \sqrt{w_i}\,\sigma_i \right)}\, \rho_n^{budget} + \dfrac{p\,\rho_{n-p+1}^{budget} - \sum_{i=1}^{p-1} \rho_{n-i}^{adjusted}}{2} & \text{if } S_n = 0 \text{ and } n > 50 \\[2ex]
\dfrac{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i}{\sum_{i \in \text{ROI}} \sqrt{w_i}\,\sigma_i + \sum_{i \in \text{NON-ROI}} \sqrt{w_i}\,\sigma_i}\, \rho_n^{budget} & \text{otherwise}
\end{cases}
\tag{11.28}
$$

In equation (11.28), the 'investor' strategy acts exactly the same as the 'banker' strategy for the first 50 frames. In this period the statistics are collected for future $q$ estimation. When $n > 50$ and $S_n = 0$, $\rho_n^{adjusted}$ is assigned an average value considering the previous saving and the predicted future saving due to background skipping.

We estimate $q$ by using a Bayesian model and convert the problem into a multi-class classification problem, where the classes are represented by all possibilities of $q$ (for example, classes 0, 1, 2, 3, 4, 5 if we limit $q$ to be less than 6), and the feature vector used in making the classification decision is $x_n = (\chi_n, \zeta_n, \sigma_{B_n}^2)$. By defining thresholds for $\chi_n$, $\zeta_n$ and $\sigma_{B_n}^2$, we can map the space of $x_n$ into eight classes $y_n$ ($y_n = 0, 1, \ldots,$ or 7). Therefore, for the current frame, the best selection for $q$ is the one maximizing the probability:

$$P(q \mid y_n) = \frac{P(y_n \mid q)\,P(q)}{P(y_n)}, \tag{11.29}$$

thus it is the $q$ that maximizes $P(y_n \mid q)\,P(q)$. The probabilities $P(y_n \mid q)$ and $P(q)$ can be obtained by a histogram technique based on the statistics of the previously processed frames.
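The three budget adjustments of equations (11.26) to (11.28), together with the selection of $q$ in equation (11.29), can be sketched as follows. This is a minimal sketch rather than the authors' implementation: all function and parameter names are hypothetical, `c_roi` and `c_non_roi` stand for the sums of $\sqrt{w_i}\,\sigma_i$ over ROI and Non-ROI macroblocks, `hist[q][y]` stands for a table of histogram counts collected from previously processed frames, and $P(q)$ is assumed to be the share of all counts that fall in row $q$.

```python
import math
from typing import Sequence


def complexity(weights: Sequence[float], sigmas: Sequence[float]) -> float:
    """Sum of sqrt(w_i) * sigma_i over a set of macroblocks."""
    return sum(math.sqrt(w) * s for w, s in zip(weights, sigmas))


def skipped_budget(rho_budget: float, c_roi: float, c_non_roi: float) -> float:
    """Budget for a frame whose background is skipped (S_n = 1)."""
    return rho_budget * c_roi / (c_roi + c_non_roi)


def greedy(rho_budget, c_roi, c_non_roi, skip):
    """Equation (11.26): cut the budget only when the background is skipped."""
    return skipped_budget(rho_budget, c_roi, c_non_roi) if skip else rho_budget


def banker(rho_budget, c_roi, c_non_roi, skip, first_skipped_budget, prev_adjusted):
    """Equation (11.27).

    first_skipped_budget : rho^budget of frame n-p+1
    prev_adjusted        : adjusted budgets of the p-1 preceding skipped frames
    """
    if skip:
        return skipped_budget(rho_budget, c_roi, c_non_roi)
    p = len(prev_adjusted) + 1
    return p * first_skipped_budget - sum(prev_adjusted)


def investor(rho_budget, c_roi, c_non_roi, skip, first_skipped_budget,
             prev_adjusted, n, q, warmup=50):
    """Equation (11.28): banker's rule during warm-up, then the average of
    the banked amount and the share predicted from q future skipped frames."""
    if skip:
        return skipped_budget(rho_budget, c_roi, c_non_roi)
    banked = banker(rho_budget, c_roi, c_non_roi, False,
                    first_skipped_budget, prev_adjusted)
    if n <= warmup:
        return banked
    predicted = rho_budget * (c_roi + c_non_roi) / (c_roi + c_non_roi / (q + 1))
    return 0.5 * (predicted + banked)


def estimate_q(hist: Sequence[Sequence[int]], y: int) -> int:
    """Return the q maximizing P(y|q) * P(q), as in (11.29).

    Assumption: P(y|q) is the row-normalized count and P(q) is the share
    of all counts in row q. Rows with no observations are ignored.
    """
    total = sum(sum(row) for row in hist)
    best_q, best_score = 0, -1.0
    for q, row in enumerate(hist):
        row_total = sum(row)
        if row_total == 0:
            continue
        score = (row[y] / row_total) * (row_total / total)
        if score > best_score:
            best_q, best_score = q, score
    return best_q
```

With `c_roi = 3`, `c_non_roi = 1` and a frame budget of 100, a skipped frame receives 75 under all three strategies; after two such frames, the 'banker' rule gives the next coded frame 3 × 100 − 150 = 150.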
Let us denote by $H_q(y)$ the number of frames with coded background and feature class $y$ that were followed by $q$ frames with skipped background. Then:

$$P(y_n \mid q) = \frac{H_q(y_n)}{\sum_{y} H_q(y)} \tag{11.30}$$

and $P(q)$ can be obtained by a similar approach. The diagram of the skip mode decision and ρ budget adjustment module with this strategy is shown in Figure 11.5.

Figure 11.5 Diagram of the skip mode decision and rho budget adjustment module (flowchart: three yes/no decisions on whether the foreground contains large activity, the background contains large motion, and the accumulated skipped distortion is too high, followed by the steps 'Update statistical table', 'Guess the number of future skipped frames' and 'Recalculate Rho frame')

In Figure 11.6, three bit allocation strategies are compared in coding the Carphone sequence. As mentioned in section 11.2.1, an ROI perceptual importance factor $\alpha$ is defined in order to bridge the distortion of ROI and Non-ROI so as to form a weighted distortion measurement for the frame. Therefore, the perceptual PSNR is defined as:

$$\text{Perceptual PSNR} = -10 \log_{10} \left[ \alpha D_R(f, \tilde{f}) + (1 - \alpha) D_{NR}(f, \tilde{f}) \right] \tag{11.31}$$

where $f$ and $\tilde{f}$ are the original and reconstructed frames, and $D_R$ and $D_{NR}$ are the normalized per pixel distortion of the ROI and Non-ROI.

Figure 11.6 Comparison of three bit allocation strategies (ROI video coding on the Carphone sequence, alpha = 0.9; perceptual PSNR in dB versus rate in kbps for the greedy, 'banker' and 'investor' strategies)

Clearly, both the 'banker' and 'investor' strategies outperform the greedy strategy. The 'investor' strategy slightly outperformed the 'banker' strategy at the higher bit rate end. Although it requires extra computational complexity for $q$ estimation, this strategy might perform better for video sequences that have repeated patterns or self-similarity characteristics.

In a further experiment, the 15 fps Carphone and other QCIF sequences were tested at bit rates from 32 kbps to 64 kbps in the H.263 Profile 3 simulation system.
Four different rate control approaches are compared:
• Macroblock-level greedy algorithm [12], where the bits are allocated to the macroblocks in a uniformly distributed manner.
• Frame skipping algorithm that skips every other frame during encoding.
• Unit-based background skipping algorithm that groups every two frames into a unit and skips the background of the second frame within each unit.
• The proposed approach, which content-adaptively determines the frames with skipped background, and uses the 'investor' strategy for bit allocation.

As shown in Figure 11.7, the proposed approach outperformed all other approaches over the whole bit rate range, and the gain is up to 2 dB. Figure 11.8 shows the frame-level detail of these algorithms at 48 kbps. Figure 11.9 shows the reconstructed 15th frame for the compared algorithms; the advantage of the proposed approach is almost 5 dB compared to the greedy algorithm and 3 dB compared to the unit-based background skipping approach.

Figure 11.7 Comparison of various approaches in coding 'Carphone' sequence (alpha = 0.9; perceptual PSNR in dB versus rate in kbps for frame skipping, unit-based background skipping, the greedy algorithm and the proposed approach)

Figure 11.8 Comparison of various approaches at 48 kbps (ROI video coding on the Carphone sequence, alpha = 0.9; perceptual PSNR in dB for the greedy algorithm, unit-based background skipping and the proposed approach)

Figure 11.9 Comparison of reconstructed frames by various approaches at 48 kbps: (a) Greedy algorithm, PPSNR = 32.05 dB; (b) Unit-based background skipping, PPSNR = 33.99 dB; (c) Proposed approach, PPSNR = 36.98 dB

Figure 11.10 Visual artifacts due to background skipping: (a) Original frame; (b) Reconstructed frame
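The perceptual PSNR of equation (11.31), the quality measure reported in these comparisons, can be computed with a short Python sketch. The helper names are hypothetical, and the sketch assumes that the "normalized per pixel distortion" is the mean squared error divided by the squared peak value (255 for 8-bit samples), so that identical weights on ROI and Non-ROI reduce to an ordinary PSNR.

```python
import math
from typing import Sequence


def normalized_mse(original: Sequence[float], reconstructed: Sequence[float],
                   peak: float = 255.0) -> float:
    """Mean squared error per pixel, normalized by peak^2 (assumed form of D)."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("pixel lists must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    return mse / (peak * peak)


def perceptual_psnr(d_roi: float, d_non_roi: float, alpha: float = 0.9) -> float:
    """Equation (11.31): -10 log10(alpha * D_R + (1 - alpha) * D_NR).

    alpha = 0.9 matches the value quoted in the figure titles.
    """
    weighted = alpha * d_roi + (1.0 - alpha) * d_non_roi
    if weighted <= 0.0:
        return math.inf  # no distortion in either region
    return -10.0 * math.log10(weighted)
```

Because $\alpha$ weights the ROI distortion, the same drop in background quality costs far less perceptual PSNR than an equal drop inside the ROI, which is why skipping the background and spending the saved budget on the foreground can raise the score.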
We have to point out that background skipping can sometimes cause visual artifacts if too many backgrounds are skipped. For example, as shown in Figure 11.10(b), the coded foreground and the background copied from the previous frame are not aligned well, which causes artifacts at the collar. Clearly, this kind of artifact is very difficult to detect and conceal because a certain degree of semantic information might be required in the processing. Further study on better background substitution or interpolation algorithms might be helpful in reducing such artifacts.

References

1. K. Maxwell, "Asymmetric digital subscriber line: Interim technology for the next forty years", IEEE Commun. Mag., October, pp. 100–106, 1996.
2. Eli Orr, "Understanding the 3G-324M Spec", can be downloaded from the weblink: http://www.commsdesign.com/designcorner/OEG20030121S0009.
3. H. Wang, K. El-Maleh, "Joint adaptive background skipping and weighted bit allocation for wireless video telephony", in Proc. International Conference on Wireless Networks, Communications, and Mobile Computing, Maui, Hawaii, USA, June 2005.
4. H. Wang, K. El-Maleh, and Y. J. Liang, "Real-time region-of-interest video coding using content-adaptive background skipping with dynamic bit reallocation", in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2006.
5. Y. J. Liang, H. Wang, and K. El-Maleh, "Design and implementation of ROI video coding using content-adaptive background skipping", in Proc. IEEE International Symposium on Circuits and Systems, Kos, Greece, May 2006.
6. M. Chen, M. Chi, C. Hsu and J. Chen, "ROI video coding based on H.263+ with robust skin-color detection technique", IEEE Trans. Consumer Electronics, Vol. 49, No. 3, Aug. 2003, pp. 724–730.
7. C. Lin, Y. Chang and Y. Chen, "A low-complexity face-assisted coding scheme for low bit-rate video telephony", IEICE Trans. Inf. & Syst., Vol. E86-D, No. 1, Jan. 2003, pp. 101–108.
8. S. Sengupta, S. K. Gupta, and J. M. Hannah, "Perceptually motivated bit-allocation for H.264 encoded video sequences", ICIP'03, Vol. III, pp. 797–800.
9. X. K. Yang, W. S. Lin, Z. K. Lu, X. Lin, S. Rahardja, E. P. Ong, and S. S. Yao, "Local visual perceptual clues and its use in videophone rate control", ISCAS'2004, Vol. III, pp. 805–808.
10. D. Tancharoen, H. Kortrakulkij, S. Khemachai, S. Aramvith, and S. Jitapunkul, "Automatic face color segmentation based rate control for low bit-rate video coding", in Proc. 2003 International Symposium on Circuits and Systems (ISCAS'03), Vol. II, pp. 384–387.
11. J. Ribas-Corbera and S. Lei, "Rate control in DCT video coding for low-delay communications", IEEE Trans. Circuits Systems for Video Technology, Vol. 9, No. 1, pp. 172–185, Feb. 1999.
12. Z. He and S. K. Mitra, "A linear source model and a unified rate control algorithm for DCT video coding", IEEE Trans. Circuits and System for Video Technology, Vol. 12, No. 11, Nov. 2002, pp. 970–982.
13. H. Wang and N. Malayath, "Macroblock level bit allocation", US patent pending, May 2005.
14. H. Wang and K. El-Maleh, "Region-of-Interest coding in video telephony using Rho domain bit allocation", US patent pending, March 2005.
15. H. Wang and N. Malayath, "Two pass rate control techniques for video coding using a MINMAX approach", US patent pending, Sept. 2005.
16. H. Wang and N. Malayath, "Two pass rate control techniques for video coding using a rate-distortion characteristics", US patent pending, Sept. 2005.
17. T. Adiono, T. Isshiki, K. Ito, T. Ohtsuka, D. Li, C. Honsawek and H. Kunieda, "Face focus coding under H.263+ video coding standard", in Proc. IEEE Asia-Pacific Conf. Circuits and Systems, Dec. 2000, Tianjin, China, pp. 461–464.
18. C. Wong, O. Au, B. Meng, and H. Lam, "Perceptual rate control for low-delay video communications", ICME'2003, Vol. III, pp. 361–364.
19. S. Minami and A. Zakhor, "An optimization approach for removing blocking effects in transform coding", IEEE Trans. Circuits Systems for Video Technology, Vol. 5, No. 2, pp. 74–82, April 1995.
20. H. Wang, G. M. Schuster, A. K. Katsaggelos, "Rate-distortion optimal bit allocation scheme for object-based video coding", IEEE Trans. Circuits and System for Video Technology, July-September, 2005.