Packet video applications over the current Internet have to be adaptive to packet losses, occurring frequently due to network congestion. One adaptation approach that is especially suited for applications with real-time constraints, limited available bandwidth, or multicast distribution, is to carefully conceal the information loss at the receiver. In this paper we propose a hybrid loss concealment mechanism that exploits both the spatial and the temporal redundancy that is present in video streams. Our spatial concealment algorithm is a smoothing operation that expresses each pixel of the missing block as a distance-dependent linear combination of all the surrounding pixels. The decision between spatial and temporal concealment is taken individually for each quadrant of the missing block, and it is based on the presence of motion in the closest adjacent block. The basic idea in this approach is that temporal concealment, based on the previous frame, gives best results in areas where there is no significant motion, while spatial concealment, based on the current frame, gives better (smoother) results in areas where there is motion. We have evaluated the effectiveness of this hybrid concealment mechanism with different video sequences and the results are quite satisfactory.
Packet video applications over the Internet, such as RealPlayer, or Windows Media Player, become increasingly popular. The current Internet, however, as a best-effort network that does not offer quality-of-service mechanisms, poses some hard challenges on such applications. One of them is that packets are frequently dropped in the network routers due to congestion. A possible way to deal with losses is through packet retransmissions. This approach however increases significantly the transfer delay of lost packets, which may be unacceptable for applications with real-time constraints [1]. Additionally, retransmissions are not well suited for multicast-based applications [2]. A different approach is to shield the transmitted data with forward-error-correction (FEC) codes, so that up to a certain number of lost packets can be fully recovered at the receiver [3]. The FEC-based schemes, however, require a considerable increase in the transmission bandwidth [4], which may be prohibitive for applications that run over low-bandwidth links (such as dial-up modems).
A third approach, that we pursue in this paper, is to perform loss concealment at the receiver. The goal in this approach is to mask the missing visual information based on the temporal and spatial redundancy that is present in video sequences. Of course, it may not be possible to completely reconstruct the lost image blocks. The basic idea, however, is that if the losses are not very frequent and the quality requirements are not very strict, the lost image blocks can be effectively approximated using the adjacent blocks in the time and/or space domains. The loss concealment approach can be combined with retransmission or FEC mechanisms. For example, a video-conference application can use FEC-based loss recovery for the audio stream and loss concealment for the video stream, since it is generally accepted that in many applications the audio quality is more important than the video quality [5].
The loss concealment mechanism that we propose in this paper is hybrid, in the sense that it masks a lost block using information from both the previous video frame (temporal concealment), and from the surrounding blocks of the current frame (spatial concealment), depending on the presence of motion in that area of the frame. Although this is probably not a new idea, the algorithm that we use to select when and where to perform each of the two types of concealment is, to the extent of our knowledge, original. In addition, we propose a new algorithm for spatial concealment. The algorithm produces a very smooth estimate of the lost block, because it expresses each lost pixel as a distant-dependent linear combination of all the surrounding pixels. This operation, which is the major computative part of the algorithm, can be efficiently performed for the entire lost block with a single matrix multiplication. We experimented with this hybrid mechanism, as well as with other schemes of similar complexity, on several video sequences. Our conclusion is that the proposed loss concealment mechanism is quite effective on parts of the video sequence which are either spatially smooth or relatively still. On the other hand, areas with significant motion context tend to be slightly blurred, without the annoying artifacts, however, that would be produced doing temporal concealment only.
The rest of this paper is structured as follows. In Section 2 we discuss some issues that are related with the problem of loss concealment, and review the existing work on this field. Section 3 presents the proposed mechanism in detail, and compares it with another technique of similar complexity. Section 4, finally, shows a few video frames before and after the loss concealment.
Some earlier works focus on loss concealment for specific coders, such as predicting lost DCT coefficients [6,7], or lost motion vectors in MPEG streams [7, 8], or they work only with layered video codecs [8, 9]. We assume, on the contrary, that the loss concealment process is performed after decoding, and so it is codec-independent (the only requirement is that the coder is block-based). The particular coding scheme used, however, has a strong impact on the relation between the packet loss rate and the magnitude of the actual information loss. For example, suppose that two video sequences, an MPEG and an MJPEG, encounter a packet loss in the network. In the MJPEG sequence only one frame is affected by the loss, while in the MPEG sequence, if the loss occurred in an I frame, an entire `group of pictures' (e.g., nine frames) is affected. This example illustrates the interesting trade-off between a high compression rate, and the ability to effectively deal with packet losses. The question of an "optimal" middle choice in this trade-off is largely an open question, especially in the context of the high loss rates of the current Internet. It is in general agreed, though, that the coders which achieve a very high compression ratio are inappropriate for lossy network paths, since they are not resilient to packet losses.
Another important issue is the mapping between packets and encoded blocks. For example, in the H.261 encoder a packet includes one macroblock, i.e., 4 adjacent blocks for grey-scale images, or 4 luminance and 2 chrominance adjacent blocks for YUV-color images. Consequently, when a packet is lost several adjacent blocks are lost and the spatial concealment techniques perform poorly. The solution that is widely accepted nowadays, and that we assume as deployed here, is that some form of transmission interleaving is used, i.e., a packet does not contain adjacent blocks but disperse ones [6]. In our implementation each packet maps to four blocks of the same column which are four rows apart. The transmission interleaving technique reduces the probability of adjacent lost blocks but it does not eliminate it. For example, if 20 out of the 396 blocks of an intracoded CIF image are randomly lost, there is a probability of about 22% that at least one of the adjacent blocks will also be lost. This shows that the concealment mechanisms have to be able to deal effectively with the case of adjacent lost blocks.
Several previous works have considered only spatial approaches [7, 10], or only temporal approaches [8,9]. We believe that it is beneficial to exploit both the spatial and the temporal redundancy that is present in the decoded video sequence, using a certain criterion on which approach will lead to best results. Another critical issue is that the loss concealment computations should not be prohibitively long. Specifically, real-time performance with software-only implementations is a major requirement [11]. There are some sophisticated schemes, such as techniques for performing projections onto convex sets, or attempting edge-continuity using directional filters [10], that can probably lead to higher-quality concealment than the mechanism of this paper. Such approaches, however, are computationally very expensive because they are iterative, and so they are probably inappropriate for real-time execution in high frame rates and with high loss rates. Finally, we describe the proposed scheme for grey-scale video sequences. For YUV-color sequences, the same algorithm can be applied independently on the luminance and on the chrominance blocks. It has to be noted, however, that in color video sequences additional concealment techniques can be developed, exploiting the relative sensitivity of the human visual system to the different chroma components. We did not pursue this idea further.
Let F be the video frame on which we perform loss concealment, and Fp be the previous video frame, in which losses have been already concealed. Let X be a missing block in F, and E the estimate of X that results from the loss concealment. P is the block of Fp which resides in the same location as X. The adjacent blocks with X are U for up, D for down, L for left, and R for right. Some of the blocks U,D,L,R may be also lost, without having been concealed yet. In that case, we temporarily replace the missing adjacent block with the corresponding block from Fp. Additionally, we split the lost block X to its quadrants, as shown in Figure 1. The four quadrants are denoted as XU,XD,XL,XR, based on their adjacent block.
The proposed spatial concealment technique uses the pixels that are directly adjacent to X and that make up the four 8-by-1 surrounding vectors u,d,l,r (see Figure 1). Specifically, u is the bottom row of block U, d the top row of block D, l the right-most column of block L, and r the left-most column of block R. If block X resides at the boundary of the image, one or two of the vectors u,d,l,r are not defined. In that case, we predict those vectors based on the closest known pixels. For example, if block X is at the top of the frame (but not at the corners), u is predicted as u(i) = (l[1]+r[1]) / 2 for i=1..8. If it is at the top-left corner, u is predicted as u(i) = r[1] and l as l(i)=d[1] for i=1..8.
The high-level description of the hybrid loss concealment mechanism is shown in Figure 2 (note that this is just a description of the algorithm and that the implementation can be more efficient in several ways). If we detect motion in one of the adjacent blocks U,D,L,R, we mask the corresponding quadrant of X using a smooth estimate XS of X computed based on the adjacent vectors u,d,l,r (spatial concealment). Otherwise, if there is no significant motion in that area of X, we replace that quadrant of X with the corresponding quadrant of the block P (temporal concealment). The intuition in this algorithm is that if there is no significant motion context close to a certain area of the lost block, it is probable that that area will be more or less as it was in the previous frame. On the contrary, if there is motion in that area of the lost block, using information from the previous frame would create artifacts, and so we mask the lost bits with a smooth estimate computed from the nearby pixels of the surrounding blocks of the current frame. The details of spatial concealment, temporal concealment, and motion detection follow in the next sections.
The smoothness constraint that we use in the spatial concealment is described next (see also Figure 3). Let x be a pixel of the spatially concealed block XS, and xu,xd,xl,xr be the up, down, left, and right pixels, respectively. We require that for all pixels x of XS:
For an 8-by-8 block we can write 64 such averaging equations, one for each pixel x. The resulting system of linear equations has 64 unknowns, the pixels of X, while the constant terms depend on the surrounding vectors u,d,l,r. It is straightforward to see that we can rewrite these 64 equations as a linear system:
where the matrix A is the 64-by-64 block matrix:
I8 is the 8-by-8 identity matrix, while L is the Laplacian-like 8-by-8 matrix:
The vector of unknowns v is the pixels of XS in anti-lexicographic order (the anti-lexicographic order of a matrix is to first read all the elements of the first column, then of the second column, and so on). The constant term c is the 64-by-1 vector that results from the anti-lexicographic order of the matrix:
The matrix A is invertible, and so we can directly compute the pixels of XS, as:
It is interesting to compare this spatial concealment approach with that presented in [12]. In that work, each missing pixel x is estimated from a distance-dependent linear combination of the four closest pixels of the blocks U,D,L,R. A similar algorithm was discussed in [8]. The major difference between those schemes and ours is that, here, each missing pixel is calculated from a distance-dependent linear combination of all the surrounding pixels, and not only of the four closest ones. This leads to much smoother concealment, as will be shown in Section 4. On the other hand, the algorithm of [12] is more effective in preserving vertical or horizontal edges in the area of the lost blocks. We use the term Periphery Spatial Concealment for the proposed smoothing algorithm of this paper, and the term 4-point Spatial Concealment for the smoothing algorithm of [12].
We detect if there is motion in each of the surrounding blocks U,D,L,R using the Pixel Difference Classification (PDC) method. The PDC operator takes two blocks as arguments and returns the number of pixels in these blocks that differ in absolute magnitude more than a threshold T. Specifically,
The value of T has to be chosen based on the noise level of the image. If T is too small, motion will be detected even under minor light variations or insignificant camera movements, while if it is too large important motion will not be detected at all. For the video sequences that we experimented with, a value around 10 was leading to accurate motion detection most of the times.
Given two same-location blocks B1 and B2 from frames F and Fp respectively, we detect that there is motion present in the area of these blocks, and so we switch to spatial concealment, if
where MT is the motion threshold. The value of MT can be determined through experimentation, and it certainly depends on the specific video streams. We found that values around 16 to 24 (for 8-by-8 blocks) leads to the best results.
Figure 5 shows a frame from the commonly used video-conference sequence `Mother and Daughter'. In this particular frame the mother moves her left hand down, while some slight movements also occur in the area of the two faces. We show the original frame, the frame after random block losses which are shown as black blocks (assuming the transmission interleaving scheme mentioned in Section 2), the frame that results from 4-point spatial concealment, the frame that results from periphery spatial concealment, the frame that results from temporal concealment, and finally, the frame that results from the proposed hybrid concealment mechanism. In this particular experiments we assume an MJPEG type of coder, in which all the frame blocks are transmitted and are susceptible to losses. The fraction of lost blocks is about 30\%. A coder, on the contrary, that only transmits the blocks that have been modified since the last frame (partial replenishment) would result in the same fraction of lost blocks, but only in the areas of the moving objects. In that case, the hybrid loss concealment mechanism would perform more spatial concealments than temporal concealments.
First, note that the periphery spatial concealment leads to smoother loss masking than the 4-point spatial concealment, which is positive because smooth areas are in general less noticeable to the human eye. This can be mainly observed in the areas of the two faces and bodies. In the areas of vertical or horizontal edges, however, such as at the wall-picture frame, the 4-point approach preserves the edges in the lost blocks, while the periphery approach does not. The temporal concealment approach leads, as expected, to excellent concealment in areas where there are no moving objects, or where the motion is not clearly visible. In the area of the mother's moving hand, however, performing only temporal concealment leads to annoying artifacts. It is exactly those areas that the hybrid mechanism attempts to smoothly mask using the periphery spatial concealment mechanism. Comparing the original frame with the frame that results from hybrid concealment shows that the proposed mechanism can lead to satisfactory results, that are quite close to the original frame.






In order to demonstrate cases where the hybrid concealment mechanism (as well as all the other approaches that we consider in this work) do not perform quite satisfactorily, we also show a frame from the `Mobile' sequence. In this video sequence the train at the bottom of the image moves to the left, while the frame of numbers at the upper part of the image moves to the right. Additionally, the motion is not horizontal, but rather cyclic. What makes the loss concealment a very hard task in this sequence is the moving numbers, which consist of relatively narrow and highly curved edges. Note that performing only temporal concealment leads to several artifacts, like lines and parts of the digits that were not present in the original frame. Such artifacts are easily noticeable by the human observer, even in a high frame-rate video sequence. The hybrid concealment approach on the other hand, replaces those artifacts with blurred areas, which although are not as annoying, they are still easily noticeable. Such cases show the intrinsic limitations of any loss concealment approach. Our conclusion is that a hybrid loss concealment mechanism, like the scheme proposed in this paper, can be a satisfactory solution for real-time applications that do not have strict quality requirements, such as video-calls or video-conferences over the Internet. Applications with higher quality requirements should probably rely on retransmissions or FEC-based mechanisms, which are of course more expensive in terms of transfer delay and network bandwidth.