dc.description.abstract | In recent times, immersive visual media such as Virtual Reality (VR), Augmented Reality (AR),
3DTV and Free Viewpoint Television (FTV) have garnered tremendous interest. Immersive
visual media content typically provides interactivity and a more realistic viewing experience,
thereby reducing the divide between the real and the virtual world. Such media have found
applications in various fields such as gaming, education and training, and entertainment.
With increased availability of depth sensing cameras, the demand for depth-based 3D video
systems is on the rise. Depth sensing cameras acquire 3D information of the scene and store it
in the form of a two-dimensional array known as a depth map. Depth-based 3D video systems are a
natural choice for immersive media, because the depth information not only gives the viewer a
perception of depth but also allows the viewer to switch viewpoints and view the scene as if
moving around it.
Display systems such as 3DTV and FTV play a major role in creating the effect of
immersion for the viewer. However, the efficient functioning of these display systems is tightly
coupled with various aspects such as acquisition of 3D content, representation of the content in
a manner suitable for processing, compression and transmission. All these functions together
comprise the end-to-end 3D video system. In this thesis, we address a few problems that are
encountered in different functional blocks of the depth-based 3D video system. The problems
addressed in this thesis are relevant at the acquisition, representation and display stages.
In the first two contributing chapters (Chapters 2 and 3), we address the problem of depth
map upsampling using a guidance image. Depth map upsampling is performed to obtain depth
information corresponding to every pixel in the color image. This is necessary since the depth
sensors such as Time-of-Flight cameras provide depth maps whose resolutions can be significantly
lower than those of the color images. In Chapter 2, we perform upsampling of a low-resolution
depth map using a high-resolution color image as guidance, without making use of any learning
based technique. We first upscale the depth map, i.e., increase the resolution of the depth
map to that of the color image, using bicubic interpolation. We then refine the upscaled
depth map to retain the abrupt transitions at object boundaries. We pose the refinement as a
segmentation problem and solve it on a patchwise basis using a Normalized Cuts based
segmentation algorithm.
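As a rough illustration of this two-stage design, the sketch below performs the bicubic upscaling and stubs out the patchwise refinement; it assumes OpenCV and NumPy, which are not tools prescribed by the thesis:

```python
# Minimal sketch of the Chapter 2 pipeline, assuming OpenCV and NumPy.
import cv2
import numpy as np

def upscale_depth(depth_lr: np.ndarray, color: np.ndarray) -> np.ndarray:
    """Bicubically upscale a low-resolution depth map to the resolution
    of the guidance color image (cv2.resize expects (width, height))."""
    h, w = color.shape[:2]
    return cv2.resize(depth_lr, (w, h), interpolation=cv2.INTER_CUBIC)

def refine_patchwise(depth_up: np.ndarray, patch: int = 32) -> np.ndarray:
    # Stub for the refinement stage: visit fixed-size patches and, for
    # patches straddling an object boundary, re-label the depths on each
    # side (the thesis does this with Normalized Cuts segmentation).
    out = depth_up.copy()
    for y in range(0, out.shape[0], patch):
        for x in range(0, out.shape[1], patch):
            pass  # per-patch Normalized Cuts relabeling would go here
    return out
```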
based segmentation algorithm. In Chapter-3, we use a supervised learning based approach to
perform depth map upsampling. Here, instead of using the high-resolution color image, we
make use of its edge map for guidance. Similar to the previous chapter, we first upscale the
low-resolution depth map to the resolution of guidance image and then perform refinement on
a patchwise basis. We train a convolutional neural network to learn the mapping between the
patches of upscaled depth map and the corresponding patches of the ground-truth depth map.
We extend this network to propose an iterative refinement network loosely based on the concept
of recurrent neural networks to perform the re finement of depth in an iterative manner. We
generate training samples and train the network from scratch to perform patchwise refinement
of the upscaled depth map.
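The PyTorch sketch below is an illustrative stand-in, not the thesis architecture: a small residual CNN over two-channel (depth, edge) patches, re-applied a few times to mimic the unrolled recurrent-style refinement:

```python
# Illustrative patchwise refinement network; layer sizes are assumptions.
import torch
import torch.nn as nn

class PatchRefineNet(nn.Module):
    """Maps an upscaled-depth patch plus its guidance edge-map patch
    to a residual correction of the depth patch."""
    def __init__(self, feats: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 1, 3, padding=1),
        )

    def forward(self, depth_patch, edge_patch):
        x = torch.cat([depth_patch, edge_patch], dim=1)  # 2-channel input
        return depth_patch + self.body(x)                # residual refinement

def iterative_refine(net, depth_patch, edge_patch, steps: int = 3):
    # Re-feeding the output mimics an unrolled recurrent refinement.
    for _ in range(steps):
        depth_patch = net(depth_patch, edge_patch)
    return depth_patch
```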
In Chapter 4, we address the problem of RGBD image segmentation. We propose an unsupervised
method that segments the given RGBD image in a multi-stage manner.
We first divide the RGBD image into superpixels and then iteratively cluster the superpixels.
This is performed in multiple stages, with different features used for clustering at each
stage. Information extracted from the color image, the albedo image, the depth map, the
surface normals, the plane information derived from the surface normals, and the edge maps
obtained from both the color and depth images is used to construct the similarity matrices
that drive the clustering of superpixels.
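A single stage of such a pipeline might look like the hedged sketch below, which uses SLIC superpixels (scikit-image) and spectral clustering (scikit-learn) over mean color-plus-depth features; the multi-stage feature schedule of the thesis is not reproduced here:

```python
# One clustering stage over superpixels; libraries and parameters are
# illustrative assumptions, not the thesis's exact method.
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import SpectralClustering

def segment_rgbd(color, depth, n_superpixels=400, n_clusters=8, sigma=1.0):
    # SLIC superpixels on the color image; one feature vector per
    # superpixel (mean color + mean depth), z-scored so the Gaussian
    # similarity treats the channels comparably.
    labels = slic(color, n_segments=n_superpixels, start_label=0)
    n = labels.max() + 1
    feats = np.stack([
        np.append(color[labels == i].mean(axis=0), depth[labels == i].mean())
        for i in range(n)
    ])
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)   # pairwise distances
    W = np.exp(-d2 / (2 * sigma ** 2))                   # similarity matrix
    sp_labels = SpectralClustering(n_clusters=n_clusters,
                                   affinity='precomputed').fit_predict(W)
    return sp_labels[labels]                             # per-pixel labels
```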
We then address the problem of salient object segmentation in a given RGBD
image in Chapter 5. We first segment the RGBD image using the algorithm proposed in
Chapter 4. For each segment, we compute scores using features such as center bias, color
and depth contrasts, and frequency information. We then use the superpixels belonging to the
segment with the highest score as the query for graph-based manifold ranking. This process is
performed on full-resolution and half-resolution RGBD images, and the two results are combined
to obtain the final saliency map.
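Graph-based manifold ranking has a well-known closed form, f = (I - alpha * S)^{-1} y; the sketch below applies it to an affinity matrix W over superpixels, with the query vector y marking the superpixels of the highest-scoring segment (the construction of W and y is assumed, not taken from the thesis):

```python
# Closed-form manifold ranking over a superpixel graph.
import numpy as np

def manifold_rank(W: np.ndarray, query: np.ndarray, alpha: float = 0.99):
    """f = (I - alpha*S)^(-1) y, where S is the symmetrically
    normalized affinity matrix D^(-1/2) W D^(-1/2)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    f = np.linalg.solve(np.eye(len(W)) - alpha * S, query)
    return f / (f.max() + 1e-12)   # normalized relevance (saliency) scores
```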
Finally, in Chapter 6, we propose a fast yet effective algorithm to synthesize a virtual-viewpoint
video from multiple synchronized input RGBD videos. This is necessary
to provide the viewer a realistic and immersive viewing experience. We first take the frames
from the two input videos captured closest to the specified virtual viewpoint, perform 3D warping on
each and blend them to obtain an initial version of the virtual video frame. We then propose a
modified non-local means filtering based technique to fill the disocclusion holes present in this
initial version. While only spatial information is used to fill holes in the first frame, information
from the previously synthesized frames is used to fill holes in subsequent frames.
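The blending step can be sketched as below; the two 3D-warped frames and their validity masks are assumed already computed, and pixels visible in neither warp are flagged as disocclusion holes for the subsequent non-local means style filling:

```python
# Hedged sketch of blending two warped views; the weight w_l would in
# practice depend on the distance of each camera to the virtual viewpoint.
import numpy as np

def blend_warped(warp_l, warp_r, valid_l, valid_r, w_l=0.5):
    blend = np.zeros_like(warp_l, dtype=np.float32)
    both = valid_l & valid_r
    blend[both] = w_l * warp_l[both] + (1 - w_l) * warp_r[both]
    blend[valid_l & ~valid_r] = warp_l[valid_l & ~valid_r]
    blend[~valid_l & valid_r] = warp_r[~valid_l & valid_r]
    holes = ~(valid_l | valid_r)   # disocclusions to be filled afterwards
    return blend, holes
```
| en_US |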