Global Methods for Camera Motion Estimation
Abstract
This thesis is a contribution towards the problem of Structure-from-Motion (SfM), which has been of interest to the 3D computer vision community for many decades. Structure-from-Motion falls under the class of 3D reconstruction problems in computer vision, whose objective is to recover the 3D structure of a scene from a collection of images of it. Advancements in camera sensor technologies, along with the rapid growth of digital communication systems and the internet, have made it easy to capture and share photo collections. Additionally, the continuous evolution of computing hardware has made it possible to perform complex tasks such as 3D reconstruction with affordable computing resources. These developments position SfM as one of the core problems in 3D vision, with a wide range of scientific, industrial and domestic applications that either use SfM directly or are influenced by progress in SfM.
Given a 3D scene, cameras capture different parts of the scene, resulting in a collection of images that form a graph. Every node in the graph represents a camera, and every edge represents an overlap in the scene between a pair of cameras. Such a graph is called a viewgraph in 3D reconstruction problems.
The output of SfM is a 3D reconstruction of the scene corresponding to the input images, represented as a point cloud. The output also includes a rotation and a translation for each camera, together referred to as the camera motion, which is estimated during the reconstruction process and is generally not available a priori. This makes estimating camera motions a crucial component of SfM. There are two main paradigms for solving SfM: incremental and global. Incremental pipelines grow the reconstruction of a scene by repeatedly adding new cameras, estimating their motions and solving for new 3D points. Global pipelines exploit the fact that the number of cameras is much smaller than the number of 3D points: they simultaneously estimate the motions of all cameras from pairwise relationships between them and then solve for the 3D points.
In this thesis, we focus on obtaining accurate, reliable and efficient estimation of camera motions in SfM. We examine the issues concerning the estimation of camera motions under three different aspects, namely input quality, choice of cost function and the underlying graph representation, i.e. the viewgraph, and develop global methods to address these issues. In the following, we briefly summarize the contributions of this thesis.
Input Quality: The first aspect of camera motion estimation considered in this thesis is the quality of relative camera motions, which are the input to global SfM pipelines. The quality of relative motions depends on many factors associated with their estimation. It is well known that relative translation directions are more prone to errors than relative rotations, both of which are obtained from epipolar geometry. We look into the problem of translation averaging, where the aim is to estimate camera translations from relative translation directions between the cameras. Existing translation averaging methods exploit the consistency relationships in a viewgraph to solve for camera translations, but the solution is limited by the quality of the input relative directions. Therefore, instead of working only with relative translation directions, we also make use of the keypoint correspondences between image pairs, from which the relative motions are estimated through epipolar geometry. We propose a modular framework to iteratively reweight point correspondences based on their global consistency via translation averaging methods. The modularity of our framework allows us to incorporate any translation averaging method. Our proposed framework improves relative translation directions on benchmark datasets, thereby improving camera translation estimates over those obtained by the underlying translation averaging scheme alone.
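The translation averaging subproblem that the framework above wraps admits a compact linear least-squares formulation. The sketch below is purely illustrative, in the spirit of Govindu-style linear methods and not the reweighting framework of this thesis: each edge contributes the constraint t_j − t_i = s_e v_ij with an auxiliary per-edge scale, and the gauge is fixed by pinning one camera and one scale. The toy viewgraph and all names are assumptions for the demo.

```python
import numpy as np

def translation_averaging(n, edges, dirs):
    """Illustrative linear least-squares translation averaging.

    Unknowns: camera positions t_0..t_{n-1} and one scale s_e per edge.
    Each edge (i, j) with unit direction v adds: t_j - t_i - s_e * v = 0.
    Gauge fixed by t_0 = 0 and s_0 = 1 (global translation and scale).
    """
    m = len(edges)
    A = np.zeros((3 * m + 4, 3 * n + m))
    b = np.zeros(3 * m + 4)
    for e, ((i, j), v) in enumerate(zip(edges, dirs)):
        r = 3 * e
        A[r:r+3, 3*j:3*j+3] = np.eye(3)    # +t_j
        A[r:r+3, 3*i:3*i+3] = -np.eye(3)   # -t_i
        A[r:r+3, 3*n + e] = -v             # -s_e * v_ij
    A[3*m:3*m+3, 0:3] = np.eye(3)          # gauge: t_0 = 0
    A[3*m+3, 3*n] = 1.0                    # gauge: s_0 = 1
    b[3*m+3] = 1.0
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3*n].reshape(n, 3)

# synthetic scene: two triangles glued along edge (0, 2), noiseless directions
T_true = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 2.]])
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
dirs = [(T_true[j] - T_true[i]) / np.linalg.norm(T_true[j] - T_true[i])
        for i, j in edges]
T_est = translation_averaging(4, edges, dirs)
# with exact directions, recovery is exact up to the fixed gauge
alpha = np.linalg.norm(T_est[1]) / np.linalg.norm(T_true[1])
```

With noisy directions the same system is solved in a least-squares sense, and the residuals are exactly what consistency-based reweighting schemes can exploit.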
Choice of Cost Function: The second aspect of camera motion estimation considered in this thesis is the choice of different optimization costs to solve for translation averaging. Since the inputs (relative directions) and the outputs (camera translations) live in different spaces, two types of cost functions arise: those comparing directions and those comparing displacements. These cost functions are often relaxed to make the optimization steps simpler. We observe that the distinctly different nature of the cost functions leads to varied behaviour under different baseline conditions between camera pairs and noise levels in input directions, and neither of them performs the best in all scenarios. We argue that translation averaging can benefit from a fusion of the two approaches. We propose a principled approach to recursively fuse the estimates obtained from both the relaxed costs, based on uncertainty computed from the Hessian of the cost functions. Our proposed method improves camera translation estimates compared to the individual costs, showing the usefulness of our approach. Moreover, our method further improves translation estimates when incorporated within the framework for improving relative directions.
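The fusion step can be illustrated with standard inverse-covariance (information-matrix) fusion, where each estimate is weighted by the Hessian of its cost at the optimum, so the fused estimate trusts whichever cost is better conditioned along each direction. This is a generic sketch with made-up numbers, not the exact recursion proposed in the thesis:

```python
import numpy as np

def fuse(x1, H1, x2, H2):
    """Fuse two estimates weighted by their Hessians (information matrices):
    x = (H1 + H2)^{-1} (H1 x1 + H2 x2)."""
    H = H1 + H2
    return np.linalg.solve(H, H1 @ x1 + H2 @ x2), H

# hypothetical 2-D toy: each cost is confident along a different axis
x1 = np.array([1.0, 0.3]); H1 = np.diag([100.0, 1.0])   # precise in axis 0
x2 = np.array([0.7, 0.0]); H2 = np.diag([1.0, 100.0])   # precise in axis 1
x, H = fuse(x1, H1, x2, H2)
# fused estimate is close to (1.0, 0.0): each axis taken from the
# cost that is well conditioned along it
```

The design choice here is that the Hessian at the optimum approximates the inverse covariance of the estimate, so inverse-covariance weighting is the natural way to combine estimates with direction-dependent uncertainty.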
Graph Representation: The third aspect of camera motion estimation considered in this thesis is the viewgraph representation that arises due to relationships between cameras. We show two applications where viewgraphs provide useful information to improve camera motion estimates. In the first application, we introduce the idea of sensitivity in translation averaging, which is equivalent to perturbation analysis of the problem. The existing literature focuses on robustness to outliers and the uniqueness of the solution, whereas sensitivity examines a distinctly different question. We propose two different formulations to theoretically analyze the sensitivity/conditioning of the problem: one based on edges and another based on nodes in viewgraphs, called the edge- and node-based formulations, respectively. We first analyze the smallest solvable problem in each of the two formulations, which identifies the configuration of ill-conditioned directions. We extend the analysis to full graphs and quantify the conditioning of translation averaging solely based on input directions. We propose two efficient algorithms, one for each formulation, to improve the conditioning of translation averaging. Analysis of real data reveals an abundance of ill-conditioned directions and also shows that sensitivity is not correlated with the presence of outliers. Applying our methods significantly improves the translation estimates, revealing the benefits of such an analysis. Moreover, we also compare the two proposed formulations, which suggests that the node-based formulation uses a stronger criterion for quantifying the sensitivity of the problem compared to the edge-based formulation.
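To make the notion of conditioning concrete, one can inspect the singular values of the stacked direction constraints [v_ij]_× (t_j − t_i) = 0: after discounting the four-dimensional gauge (global translation plus scale), the next smallest singular value measures how well the input directions pin down the camera positions. The sketch below is an illustrative proxy for this idea, not the edge- or node-based formulation of the thesis; the near-collinear camera configuration is a toy example of an ill-conditioned input.

```python
import numpy as np

def cross_mat(v):
    """Skew-symmetric matrix [v]_x such that [v]_x @ w = v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def conditioning(T, edges):
    """Proxy for conditioning: 5th-smallest singular value of the stacked
    constraints [v_ij]_x (t_j - t_i) = 0. The 4 smallest singular values
    belong to the gauge (3 translations + 1 scale), so the next one tells
    how strongly the directions constrain the remaining freedoms."""
    n = len(T)
    A = np.zeros((3 * len(edges), 3 * n))
    for e, (i, j) in enumerate(edges):
        v = (T[j] - T[i]) / np.linalg.norm(T[j] - T[i])
        C = cross_mat(v)
        A[3*e:3*e+3, 3*j:3*j+3] = C
        A[3*e:3*e+3, 3*i:3*i+3] = -C
    s = np.linalg.svd(A, compute_uv=False)   # descending order
    return s[-5]

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]   # K4 viewgraph
T_good = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
T_flat = np.array([[0., 0., 0.], [1., 0., 0.],
                   [2., 0.01, 0.], [3., 0., 0.01]])        # nearly collinear
# the near-collinear configuration is far worse conditioned:
# conditioning(T_good, edges) >> conditioning(T_flat, edges)
```

The same graph topology thus yields very different conditioning depending on the geometry of the directions, which is why the analysis in the thesis is based on the input directions rather than on graph connectivity alone.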
The second application that leverages the viewgraph representation solves the tasks of graph sparsification and disambiguation of repeated structures in a unified manner. A well-sparsified viewgraph aids in faster estimation of camera motions, while disambiguation avoids gross camera motion errors for scenes containing repeated structures. We first provide a unified understanding of the two problems. Using this understanding, we arrive at a scoring mechanism to identify redundant and false edges in a viewgraph based on the 3D point connectivity of cameras. Our edge selection procedure is formulated as an optimization problem wherein the optimum can be achieved by a simple thresholding scheme. We propose two efficient algorithms, one for viewgraphs consisting only of camera triplets and another for general viewgraphs. Each algorithm simultaneously solves both the tasks under consideration. Unlike existing methods in the literature, both our algorithms need only a single hyperparameter and do not require access to intermediate reconstructions. They can be applied as a preprocessing step to any SfM pipeline, making them modular. We demonstrate the effectiveness of our methods on generic and ambiguous datasets that cover a range of small, medium and large-scale scenes. Analysis of real data reveals that the distribution of edge scores produced by our two methods is useful for both graph sparsification and the removal of false edges. Applying our methods significantly improves reconstruction time due to graph sparsification and avoids superimposed reconstructions by disambiguating repeated structures. We also compare our two proposed algorithms and observe that the algorithm designed for general viewgraphs reconstructs more cameras and 3D points than the one designed for triplet-only viewgraphs. It is also faster, since it requires no preprocessing.
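As a toy illustration of score-then-threshold edge selection, the sketch below scores each viewgraph edge by the Jaccard overlap of the two cameras' track (3D point) sets; this overlap score is an illustrative stand-in, not the thesis's connectivity-based scoring mechanism, and the track sets and threshold are hypothetical. Redundant edges score high, genuine but informative edges score moderately, and false edges arising from repeated structures tend to share few consistent tracks and score near zero.

```python
def score_edges(tracks_per_cam, edges):
    """Score each edge (i, j) by the Jaccard overlap of the 3D point
    tracks seen by cameras i and j (illustrative stand-in score)."""
    scores = {}
    for i, j in edges:
        inter = len(tracks_per_cam[i] & tracks_per_cam[j])
        union = len(tracks_per_cam[i] | tracks_per_cam[j])
        scores[(i, j)] = inter / union if union else 0.0
    return scores

def sparsify(scores, tau):
    """Single-hyperparameter thresholding: keep edges scoring >= tau."""
    return [e for e, s in scores.items() if s >= tau]

# hypothetical track sets: camera 3 shares no points with camera 0,
# so edge (0, 3) behaves like a false match from a repeated structure
tracks = {0: {1, 2, 3, 4}, 1: {2, 3, 4, 5}, 2: {4, 5, 6, 7}, 3: {100, 101}}
edges = [(0, 1), (1, 2), (0, 3)]
scores = score_edges(tracks, edges)
kept = sparsify(scores, tau=0.3)   # the false edge (0, 3) is dropped
```

In an actual pipeline the kept edges would then be handed to the SfM system as a preprocessed viewgraph, which is what makes such a scheme modular.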
Finally, we conclude the thesis by pointing out some interesting future directions related to the work done in this thesis.