Algorithms for Processing RGBD Images and Videos for Depth-Based 3D Video Systems
In recent times, immersive visual media such as Virtual Reality (VR), Augmented Reality (AR), 3DTV and Free Viewpoint Television (FTV) have garnered tremendous interest. Immersive visual media content typically provides interactivity and a more realistic viewing experience, thereby reducing the divide between the real and the virtual world. Such media have found applications in various fields such as gaming, education and training, and entertainment. With the increased availability of depth-sensing cameras, the demand for depth-based 3D video systems is on the rise. Depth-sensing cameras acquire 3D information of the scene and store it in the form of a two-dimensional array known as a depth map. Depth-based 3D video systems are a natural choice for immersive media. This is because the depth information not only gives the viewer a perception of depth but also plays an important role in enabling viewers to switch viewpoints as if they were moving around the scene. Display systems such as 3DTV and FTV play a major role in creating the effect of immersion for viewers. However, the efficient functioning of these display systems is tightly coupled with various aspects such as acquisition of 3D content, representation of the content in a manner suitable for processing, compression, and transmission. All these functions together comprise the end-to-end 3D video system. In this thesis, we address a few problems that are encountered in different functional blocks of the depth-based 3D video system. The problems addressed in this thesis are relevant at the acquisition, representation and display stages. In the first two contributing chapters (Chapters 2 and 3), we address the problem of depth map upsampling using a guidance image. Depth map upsampling is performed to obtain depth information corresponding to every pixel in the color image. 
This is necessary since depth sensors such as Time-of-Flight cameras provide depth maps whose resolutions can be significantly lower than those of the color images. In Chapter-2, we perform upsampling of a low-resolution depth map using a high-resolution color image as guidance without making use of any learning based technique. We first upscale the depth map, i.e., increase the resolution of the depth map to that of the color image, using bicubic interpolation. We then perform refinement of the upscaled depth map to retain the abrupt transitions at the object boundaries. We pose the refinement as a segmentation problem and solve it on a patchwise basis using a Normalized Cuts based segmentation algorithm. In Chapter-3, we use a supervised learning based approach to perform depth map upsampling. Here, instead of using the high-resolution color image, we make use of its edge map for guidance. Similar to the previous chapter, we first upscale the low-resolution depth map to the resolution of the guidance image and then perform refinement on a patchwise basis. We train a convolutional neural network to learn the mapping between the patches of the upscaled depth map and the corresponding patches of the ground-truth depth map. We extend this network to propose an iterative refinement network, loosely based on the concept of recurrent neural networks, to perform the refinement of depth in an iterative manner. We generate training samples and train the network from scratch to perform patchwise refinement of the upscaled depth map. In Chapter-4, we address the problem of RGBD image segmentation. We propose an unsupervised method that performs segmentation of the given RGBD image in a multi-stage manner. We first divide the RGBD image into superpixels and then iteratively cluster the superpixels. This is performed in multiple stages where different features are used for clustering in every stage. 
Information extracted from the color image, the albedo image, the depth map, the surface normals, the plane information obtained from the surface normals and the edge maps obtained from both the color and depth images is utilized to construct similarity matrices, which are then used in the process of clustering the superpixels. We then address the problem of performing salient object segmentation in a given RGBD image in Chapter-5. We first segment the RGBD image using the algorithm proposed in Chapter-4. For each segment, we calculate scores using features such as center-bias, color and depth contrasts and frequency information. We then use the superpixels belonging to the segment having the highest score as a query for graph-based manifold ranking. This process is performed on full-resolution and half-resolution RGBD images, whose results are then combined to obtain the final saliency map. Finally, in Chapter-6, we propose a fast yet effective algorithm to synthesize a virtual video from multiple synchronized input RGBD videos. This is necessary to provide the viewer with a realistic and immersive viewing experience. We first consider frames from the two input videos closest to the specified virtual viewpoint, perform 3D warping on each and blend them to obtain an initial version of the virtual video frame. We then propose a modified non-local means filtering based technique to fill the disocclusion holes present in the initial version. While only spatial information is used to fill holes in the first frame, information from the previously synthesized frames is used to fill the subsequent frames.
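The 3D warping step mentioned for Chapter-6 can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes a pinhole camera model, a single intrinsic matrix `K` shared by the input and virtual views, depth measured along the camera z-axis, and nearest-pixel splatting with a z-buffer. The function name `warp_to_virtual_view` and all parameter names are hypothetical. The sketch also returns a mask of the disocclusion holes that a later filling step would need to address.

```python
import numpy as np

def warp_to_virtual_view(color, depth, K, R, t):
    """Forward-warp one RGBD frame into a virtual view (illustrative sketch).

    color : (H, W, C) image, depth : (H, W) depth along the z-axis (0 = invalid).
    K     : (3, 3) intrinsics, assumed identical for both views.
    R, t  : rotation (3, 3) and translation (3,) taking source-camera
            coordinates to virtual-camera coordinates.
    Returns the warped image and a boolean mask of unfilled (hole) pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])   # (3, N) homogeneous pixels
    z = depth.ravel().astype(np.float64)

    # Back-project every pixel to a 3D point, then move it into the virtual camera.
    pts_src = (np.linalg.inv(K) @ pix) * z
    pts_virt = R @ pts_src + t.reshape(3, 1)
    z_virt = pts_virt[2]
    ok = (z > 0) & (z_virt > 0)

    # Project into the virtual image plane and round to the nearest pixel.
    proj = K @ pts_virt[:, ok]
    u_virt = np.rint(proj[0] / proj[2]).astype(int)
    v_virt = np.rint(proj[1] / proj[2]).astype(int)
    inside = (u_virt >= 0) & (u_virt < w) & (v_virt >= 0) & (v_virt < h)

    src_idx = np.flatnonzero(ok)[inside]
    dst_idx = v_virt[inside] * w + u_virt[inside]
    z_in = z_virt[ok][inside]

    # Z-buffer: sort by depth so the nearest point claims each target pixel.
    order = np.argsort(z_in, kind="stable")
    targets, first = np.unique(dst_idx[order], return_index=True)
    winners = src_idx[order][first]

    channels = color.shape[2]
    out = np.zeros((h * w, channels), dtype=color.dtype)
    out[targets] = color.reshape(h * w, channels)[winners]
    holes = np.ones(h * w, dtype=bool)
    holes[targets] = False
    return out.reshape(color.shape), holes.reshape(h, w)
```

In a two-view setup such as the one described above, a function like this would be called once per input view, and the two warped frames would then be blended before the remaining holes are filled.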