ITU - Telecommunications Standardization Sector
STUDY GROUP 16 Question 6
Video Coding Experts Group (VCEG)
44th Meeting: San Jose, CA, USA, 03-10 February 2012
Document: VCEG-AR13
Filename: VCEG-AR13.doc
Question: Q.6/SG16 (VCEG)
Source: Christian Bartnik, Sebastian Bosse, Heribert Brust, Tobias Hinz, Haricharan Lakshman, Detlev Marpe, Philipp Merkle, Karsten Müller, Hunn Rhee, Heiko Schwarz, Gerhard Tech, Thomas Wiegand, Martin Winken
Fraunhofer HHI, Einsteinufer 37, 10587 Berlin, Germany
Email: (firstName).(lastName)@hhi.fraunhofer.de
Title: HEVC Extension for Multiview Video Coding and Multiview Video plus Depth Coding
Purpose: Proposal

In this document, an HEVC extension for Multiview Video Coding and Multiview Video plus Depth Coding is proposed. Besides the known concept of disparity-compensated prediction, the proposed HEVC extension includes additional inter-view prediction techniques and depth coding tools. The extensions for multiview video coding and multiview video plus depth coding were integrated into the HM-3.0 software. Experimental results for four stereo test sequences show an average overall bit rate reduction of about 28% relative to HEVC simulcast for multiview video coding, and of about 33% relative to HEVC simulcast for multiview video plus depth coding. For the standardization of Multiview Video Coding and Multiview Video plus Depth Coding, four 1080p25/30 stereo test sequences with automatically generated depth maps are proposed.

1 Introduction
This document describes a proposal for a data format suitable for delivering 3D video in future applications and a coding scheme for representing the data format. The 3D video is transmitted in the Multiview plus Depth (MVD) format, which contains two or more captured views as well as associated depth maps. Based on the coded videos and depth maps, additional views suitable for displaying the 3D video on autostereoscopic displays can be generated using depth-image-based rendering (DIBR) techniques. The video sequences as well as the sequences of depth maps are coded using an extension of HEVC. The proposed coding format is backwards compatible with HEVC in the sense that a sub-bitstream representing a single view can be extracted from the 3D video bitstream and independently decoded with an HEVC-conforming decoder. The data format also provides view scalability. Optionally, the data format provides independent decodability of the video sequences, so that, for example, a sub-bitstream representing conventional stereo video can be extracted from a 3D bitstream and decoded. The proposed coding format can also be used for conventional multiview video coding (without the coding of depth data).
2 Data Format and System Description

In the proposed HEVC extension, 3D video is in general represented using the Multiview Video plus Depth (MVD) format, in which a small number of captured views as well as associated depth maps are coded and the resulting bitstream packets are multiplexed into a 3D video bitstream. After decoding the video and depth data, additional intermediate views suitable for displaying the 3D content on an autostereoscopic display can be synthesized using depth-image-based rendering (DIBR) techniques. For the purpose of view synthesis, camera parameters, or more accurately, parameters specifying a conversion of the depth data into disparity vectors, are additionally included in the bitstream. The bitstream packets include header information, which signals, in connection with transmitted parameter sets, a view identifier and an indication whether the packet contains video or depth data. Sub-bitstreams containing only some of the coded components can easily be extracted by discarding bitstream packets that contain non-required data. One of the views, which is also referred to as the base view or the independent view, is coded independently of the other views and the depth data using a conventional 2D video coder; HEVC is used as the 2D video codec. The sub-bitstream containing the independent view can be decoded by an unmodified 2D HEVC decoder and displayed on a conventional 2D display. Optionally, the encoder can be configured in a way that a sub-bitstream representing two views without depth data can be extracted and independently decoded for displaying the 3D video on a conventional stereo display. The codec can also be used for coding multiview video signals without depth data. Furthermore, when using depth data, it can be configured in a way that the video pictures can be decoded independently of the depth data.
Figure 1: Overview of the system structure and the data format for the transmission of 3D video.

The basic concept of the proposed system and data format is illustrated in Figure 1. In general, the input signal for the encoder consists of multiple views, associated depth maps, and corresponding camera parameters. However, as described above, the codec can also be operated without depth data. The input component signals are coded using a 3D video encoder, which represents an extension of HEVC. Within this extension, the base view is coded using an unmodified HEVC encoder. The 3D video encoder generates a bitstream, which represents the input videos and depth data in a coded format. If the bitstream is decoded using a 3D video decoder, the input videos, the associated depth data, and the camera parameters are reconstructed with the given fidelity. For displaying the 3D video on an autostereoscopic display, additional intermediate views are generated by a DIBR algorithm using the reconstructed views and depth data. If the 3D video decoder is connected to a conventional stereo display instead of an autostereoscopic display, the view synthesizer can also generate a pair of stereo views, in case such a pair is not actually present in the bitstream. In doing so, it is possible to adjust the rendered stereo views to the stereo geometry of the viewing conditions. One of the decoded views or an intermediate view at an arbitrary virtual camera position can also be used for displaying a single view on a conventional 2D display. A simplified sketch of such a DIBR warping step is given below.
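To make the role of the view synthesizer more concrete, the following is a minimal sketch of DIBR by horizontal forward warping of a single sample row. It is purely illustrative and not part of the proposed decoding process; the function depthToDisparity stands in for the depth-to-disparity conversion defined in Section 3, and the foreground test assumes that larger depth sample values denote samples closer to the camera.

    #include <cstdint>
    #include <vector>

    // Illustrative DIBR sketch: forward-warp one row of luma samples into a
    // virtual view using per-sample horizontal disparities. Occlusions are
    // resolved by keeping the sample closest to the camera (assumption:
    // larger depth sample value = closer to the camera).
    void warpRow(const std::vector<uint8_t>& src,    // decoded luma row
                 const std::vector<uint8_t>& depth,  // associated depth row
                 std::vector<uint8_t>& dst,          // synthesized luma row
                 int (*depthToDisparity)(uint8_t))   // conversion, see Section 3
    {
        std::vector<int> zBuf(dst.size(), -1);       // depth of written samples
        for (std::size_t x = 0; x < src.size(); ++x) {
            int dv = depthToDisparity(depth[x]);     // vertical component is 0
            long tx = static_cast<long>(x) + dv;     // target sample position
            if (tx < 0 || tx >= static_cast<long>(dst.size()))
                continue;                            // warped outside the view
            if (static_cast<int>(depth[x]) > zBuf[tx]) { // foreground wins
                zBuf[tx] = depth[x];
                dst[tx] = src[x];
            }
        }
        // Disocclusion holes (unwritten dst samples) still need filling,
        // e.g., by background extrapolation.
    }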
The 3D video bitstream is constructed in a way that the sub-bitstream representing the coded representation of the base view can be extracted by simple means. The bitstream packets representing the base view can be identified by inspecting the transmitted parameter sets and the packet headers. The sub-bitstream for the base view can be extracted by discarding all packets that contain depth data or data for the dependent views; the extracted sub-bitstream can then be directly decoded with an unmodified HEVC decoder and displayed on a conventional 2D video display.

Besides the option that a stereo pair can be rendered based on the output of a 3D video decoder, the encoder can also be configured in a way that a sub-bitstream containing only the two stereo views can be extracted and directly decoded using a stereo decoder. The encoder can also be configured in a way that the views can generally be decoded independently of the depth data. A sketch of this packet-level extraction is given below.
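The described packet-level extraction can be sketched as follows. This is a simplified illustration, assuming each bitstream packet has already been associated with its view identifier and depth flag via the transmitted parameter sets; the type NalUnit and its fields are hypothetical and do not reproduce the actual software or syntax.

    #include <vector>

    // Hypothetical, simplified view of a bitstream packet after its
    // parameter set has been inspected.
    struct NalUnit {
        int  viewId;   // view identifier signaled via the parameter sets
        bool isDepth;  // true if the packet carries depth data
        std::vector<unsigned char> payload;
    };

    // Extract a sub-bitstream by discarding non-required packets:
    //   maxViewId = 0, keepDepth = false -> HEVC base-view sub-bitstream
    //   maxViewId = 1, keepDepth = false -> stereo sub-bitstream without depth
    std::vector<NalUnit> extractSubBitstream(const std::vector<NalUnit>& in,
                                             int maxViewId, bool keepDepth)
    {
        std::vector<NalUnit> out;
        for (const NalUnit& nalu : in) {
            if (nalu.viewId > maxViewId) continue;    // dependent view not needed
            if (nalu.isDepth && !keepDepth) continue; // depth data not needed
            out.push_back(nalu);
        }
        return out;
    }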
3 Coding Algorithm

In the following, we describe the coding algorithm based on the MVD format, in which each video picture is associated with a depth map. As mentioned in Section 2, the coding algorithm can also be used for a multiview format without depth maps. The video pictures and, when present, the depth maps are coded access unit by access unit, as illustrated in Figure 2. An access unit includes all video pictures and depth maps that correspond to the same time instant. It should be noted that the coding order of access units does not need to be identical to the capture or display order. In general, the reconstructed data of already coded access units can be used for an efficient coding of the current access unit. Random access is enabled by so-called random access units or instantaneous decoding refresh (IDR) access units, in which the video pictures and depth maps are coded without referring to previously coded access units. Furthermore, an access unit does not reference any access unit that precedes the previous random access unit in coding order.
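The referencing constraint at random access units can be expressed as a simple check, sketched below for illustration. Access units are identified here by their position in coding order, and lastRandomAccess denotes the position of the most recent random access unit at or before the current access unit; these names are chosen for this sketch only.

    // Illustrative check of the constraint above: an access unit may only
    // reference already coded access units, and none that precede the
    // previous random access unit in coding order.
    bool referenceAllowed(int currentPos, int referencePos, int lastRandomAccess)
    {
        return referencePos < currentPos          // already coded in coding order
            && referencePos >= lastRandomAccess;  // not across the random access
                                                  // boundary
    }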
Figure 2: Access unit structure and coding order of view components.

The video pictures and depth maps corresponding to a particular camera position are indicated by a view identifier (viewId). All video pictures and depth maps that belong to the same camera position are associated with the same value of viewId. The view identifiers are used for specifying the coding order inside the access units and for detecting missing views in error-prone environments. Inside an access unit, the video picture and, when present, the associated depth map with viewId equal to 0 are coded first, followed by the video picture and depth map with viewId equal to 1, etc. A video picture and depth map with a particular value of viewId are transmitted after all video pictures and depth maps with smaller values of viewId. The video picture is always coded before the associated depth map (i.e., the depth map with the same value of viewId). It should be noted that the value of viewId does not necessarily represent the arrangement of the cameras in the camera array. For ordering the reconstructed video pictures and depth maps after decoding, each value of viewId is associated with another identifier called the view order index (VOI). The view order index is a signed integer value, which specifies the ordering of the coded views from left to right: if a view A has a smaller value of VOI than a view B, the camera for view A is located to the left of the camera for view B. In addition, camera parameters required for converting depth values into disparity vectors are included in the bitstream. For a linear camera setup, these conversion parameters consist of a scale factor and an offset. The vertical component of a disparity vector is always equal to 0. The horizontal component is derived according to

    dv = ( s * v + o ) >> n,

where v is the depth sample value, s is the transmitted scale factor, o is the transmitted offset, and n is a shift parameter that depends on the required accuracy of the disparity vectors.
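In integer arithmetic, the conversion can be implemented as in the following sketch. The variable names follow the formula above; the concrete data types and the 8-bit depth sample are assumptions for illustration.

    #include <cstdint>

    // Depth-to-disparity conversion for a linear camera setup, following
    //     dv = ( s * v + o ) >> n
    // v: depth sample value, s: transmitted scale factor, o: transmitted
    // offset, n: shift parameter controlling the disparity accuracy.
    // The vertical disparity component is always 0 and is therefore omitted.
    int32_t depthToDisparity(uint8_t v, int32_t s, int32_t o, uint8_t n)
    {
        return (s * static_cast<int32_t>(v) + o) >> n;
    }

For example, with the hypothetical values s = 2, o = 256, and n = 8, a depth sample v = 128 yields dv = (2 * 128 + 256) >> 8 = 2 samples of horizontal disparity.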
Each video sequence and depth sequence is associated with a separate sequence parameter set and a separate picture parameter set. The picture parameter set syntax, the NAL unit header syntax, and the slice header syntax for the coded slices have not been modified for including a mechanism by which the content of a coded slice NAL unit can be associated with a component signal. Instead, the sequence parameter set syntax for all component sequences except for the base view has been extended. These sequence parameter sets contain the following additional parameters (an illustrative data structure sketch is given below):
- the view identifier (indicates the coding order of a view);
- the depth flag (indicates whether video data or depth data are present);
- the view order index (indicates the location of the view relative to other coded views);
- an indicator specifying whether camera parameters are present in the sequence parameter set or in the slice headers;
- when camera parameters are present in a sequence parameter set, for each viewId value smaller than the current view identifier, a scale and an offset specifying the conversion of a depth sample of the current view to a horizontal disparity between the current view and the view with viewId;
- when camera parameters are present in a sequence parameter set, for each viewId value smaller than the current view identifier, a scale and an offset specifying the conversion of a depth sample of the view with viewId to a horizontal disparity between the current view and the view with viewId.

The sequence parameter set for the base view does not contain the additional parameters. Here, the view identifier is inferred to be equal to 0, the depth flag is inferred to be equal to 0, and the view order index is inferred to be equal to 0.

The sequence parameter sets for dependent views include a flag, which specifies whether the camera parameters are constant for a coded video sequence or whether they can change on a picture-by-picture basis. If this flag indicates that the camera parameters are constant for a coded video sequence, the camera parameters (i.e., the scale and offset values described above) are present in the sequence parameter set. Otherwise, the camera parameters are not present in the sequence parameter set, but are instead coded in the slice headers that reference the corresponding sequence parameter set.
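For illustration, the additional high-level syntax can be grouped in a data structure like the following. The field names and types are chosen for readability here and do not reproduce the actual syntax element names of the proposal.

    #include <cstdint>
    #include <vector>

    // Depth-to-disparity conversion parameters (scale s and offset o of the
    // formula in this section) for one pair of views.
    struct DepthConversion {
        int32_t scale;
        int32_t offset;
    };

    // Illustrative grouping of the sequence parameter set extension for all
    // component sequences except the base view. For the base view, viewId,
    // depthFlag, and viewOrderIdx are all inferred to be equal to 0.
    struct SpsExtension3D {
        uint32_t viewId;          // coding order of the view
        bool     depthFlag;       // true: depth data, false: video data
        int32_t  viewOrderIdx;    // left-to-right camera order (signed)
        bool     camParamsInSps;  // false: camera parameters in slice headers

        // Present only when camParamsInSps is true; one entry for each viewId
        // smaller than the current view identifier:
        std::vector<DepthConversion> currentToRef; // depth of current view ->
                                                   // disparity to view viewId
        std::vector<DepthConversion> refToCurrent; // depth of view viewId ->
                                                   // disparity to current view
    };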
Figure 3: Basic codec structure with inter-component prediction (red arrows).

The basic structure of the 3D video codec is shown in the block diagram of Figure 3. In principle, each component signal is coded using an HEVC-based codec. The resulting bitstream packets, or more accurately, the resulting Network Abstraction Layer (NAL) units, are multiplexed to form the 3D video bitstream. The base or independent view is coded using an unmodified HEVC codec. Given the 3D video bitstream, the NAL units containing data for the base view can be identified by parsing the parameter sets and the NAL unit headers of coded slice NAL units (up to the picture parameter set identifier). Based on these data, the sub-bitstream for the base view can be extracted and directly decoded using a conventional HEVC decoder.

For coding the dependent views and the depth data, modified HEVC codecs are used, which are extended by including additional coding tools and inter-component prediction techniques that employ already coded data inside the same access unit, as indicated by the red arrows in Figure 3. For enabling an optional discarding of depth data from the bitstream, e.g., for supporting the decoding of a stereo video suitable for conventional stereo displays, the inter-component prediction can be configured in a way that video pictures can be decoded independently of the depth data.
For improving the coding efficiency for dependent views and depth data, the following modifications have been integrated (the reduced motion vector accuracy is illustrated with a sketch after the list):
- disparity-compensated prediction: A technique for using already coded and reconstructed pictures (or depth maps) inside an access unit as additional reference pictures for inter prediction. The same concept is found in MVC.
- inter-view motion prediction: A technique for employing the motion parameters of already coded video pictures of other views (inside an access unit) for predicting the motion parameters of a current video picture.
- inter-view residual prediction: A technique for employing the coded residuals of already coded video pictures of other views (inside an access unit) for predicting the residuals of a current video picture.
- reduced motion vector accuracy for depth data: A technique for increasing the coding efficiency of depth data (and decreasing the decoding complexity) by reducing the motion vector accuracy.
- disabling of in-loop filters for depth data: An encoding technique for i
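To illustrate the reduced motion vector accuracy for depth data listed above: motion vectors for video are typically stored in quarter-sample units, and reducing them to full-sample accuracy lets depth motion compensation copy samples without interpolation filtering. The rounding rule in the sketch below is one plausible choice and not necessarily the one used in the proposal.

    // Motion vectors in quarter-sample units, as used for the video pictures.
    struct MotionVector {
        int x; // horizontal component in quarter-sample units
        int y; // vertical component in quarter-sample units
    };

    // Round a quarter-sample vector to the nearest full-sample position and
    // return it again in quarter-sample units (i.e., as a multiple of 4), so
    // that depth motion compensation needs no interpolation.
    MotionVector roundToFullSample(MotionVector mv)
    {
        auto roundQpel = [](int v) {
            return ((v >= 0 ? v + 2 : v - 2) / 4) * 4; // round half away from 0
        };
        return { roundQpel(mv.x), roundQpel(mv.y) };
    }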