Show simple item record

dc.contributor.advisor: Ramakrishnan, A G
dc.contributor.author: Das, Shome Subhra
dc.date.accessioned: 2022-01-27T10:57:56Z
dc.date.available: 2022-01-27T10:57:56Z
dc.date.submitted: 2021
dc.identifier.uri: https://etd.iisc.ac.in/handle/2005/5604
dc.description.abstract: A significant part of our daily life is spent interacting with various computing devices such as mobile phones. Soon, robots and drones may become common in daily life, as may virtual-reality-based interfaces. Currently, interaction with such devices is through a mouse, touch-pad, joystick, virtual reality wand, etc. Most of these interaction devices (such as the mouse, keyboard and joystick) are immobile and restricted to table-top usage. Some, such as drone controllers, are mobile but need significant training before use. Devices such as virtual reality wands are mobile, but their handheld mode of operation and size inhibit an immersive experience. Moreover, these devices must be touched during operation, making them prone to transmitting viruses and bacteria when used by multiple subjects. To overcome these limitations, the recent trend is to move towards gesture-based user interfaces, which enable more natural interaction. Gestural interfaces facilitate better interaction in 3D, since the user is not constrained to operate on planar surfaces (as with a mouse or touch-pad) or at a fixed location (as with a joystick). Gesture-based interaction also requires minimal training, as human beings use gestures in day-to-day life. Gesture-based interfaces are immersive, as they mainly use the hands rather than cumbersome external devices, and they are non-contact, minimizing the probability of transmitting viruses and bacteria.

All the interaction modes (gestural or device-based) mentioned above are mostly used for selection, pick-and-place and direction-indication tasks, which primarily consist of pointing tasks. Existing gesture-based pointing interfaces suffer from one or more of the following limitations. Techniques using RGB cameras need multiple cameras, thereby imposing constraints on camera placement and on the operational area of the setup; they are also not tolerant to variations in skin color and illumination. The majority of existing depth-sensor-based techniques rely on multiple joint locations (such as the head, shoulder, elbow and wrist) to find the pointing direction. This requires that the entire upper body be visible to the depth sensor, which can lead to occlusion-related problems and constrains the operational area of the setup. A few techniques use only the hand-region data from a depth sensor; however, they either work in a very constrained setting or have very poor accuracy.

This thesis addresses two problems, namely pointing direction estimation (hereafter referred to as PDE) and detection of pointing gestures (a prerequisite for any PDE technique). The proposed techniques use depth images from a single depth sensor, thus avoiding the pitfalls of RGB-based techniques, and use only the hand region, thus avoiding the pitfalls of using multiple body parts. To our knowledge, this is the maiden attempt at creating depth- and orientation-tolerant, accurate methods for estimating the pointing direction using only depth images of the hand region. The proposed methods achieve accuracies comparable to or better than those of existing methods while avoiding their limitations.
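Throughout this abstract, a pointing direction is represented either as a unit vector or as yaw and pitch angles, and methods are compared using net angular, yaw and pitch errors. A minimal sketch of one such convention, purely for illustration (the exact axis and angle conventions of the thesis are not specified in this record, so those used below are assumptions):

    import numpy as np

    def yaw_pitch_to_vector(yaw, pitch):
        # Assumed convention: yaw measured about the vertical axis, pitch above
        # or below the horizontal plane; both angles in radians.
        return np.array([np.cos(pitch) * np.cos(yaw),
                         np.cos(pitch) * np.sin(yaw),
                         np.sin(pitch)])

    def vector_to_yaw_pitch(v):
        # Inverse of the mapping above for a (not necessarily normalized) vector.
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)
        yaw = np.arctan2(v[1], v[0])
        pitch = np.arcsin(np.clip(v[2], -1.0, 1.0))
        return yaw, pitch

    def net_angular_error_deg(v_est, v_gt):
        # Angle (in degrees) between the estimated and ground-truth directions.
        v_est = np.asarray(v_est, dtype=float) / np.linalg.norm(v_est)
        v_gt = np.asarray(v_gt, dtype=float) / np.linalg.norm(v_gt)
        return np.degrees(np.arccos(np.clip(np.dot(v_est, v_gt), -1.0, 1.0)))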
The summary of the key contributions of this thesis follows.

• Proposing an accurate, real-time technique for estimating the pointing direction using a nine-axis inertial measurement unit (IMU) and depth data from an RGB-D sensor. This is the first method to fuse information from an IMU and a depth sensor to obtain the pointing direction by finding the axis vector of the index finger. Further, this is the first method to obtain the ground-truth pointing direction of index-finger-based pointing gestures using only the data from the index-finger region.

• Creation of a large data-set of 107K samples with accurate ground truth for pointing direction estimation from depth images. Each sample consists of the segmented depth image of a hand, the fingertip location (2D and 3D), the pointing vector (as a unit vector and in terms of the yaw and pitch values), and the mean depth of the hand; a hypothetical layout of one such sample is sketched after this list. To the best of our knowledge, this is the first data-set for depth-image, hand-region-based PDE that has accurate ground truth and a large number of samples. The data-set has been made publicly available.

• Proposing a new 3D convolutional neural network-based method to estimate the pointing direction. To the best of our knowledge, this is the first deep-learning-based method for PDE that uses only the depth image of the hand region. It is tolerant to variations in the orientation and depth of the hand with respect to the camera and is suitable for real-time applications.

• Proposing a technique that uses the depth image of the hand region for estimating the pointing direction with the aid of a global registration technique. The pointing direction is estimated by aligning a pointing hand model (captured using a Kinect Fusion-based method) with the point cloud from the test depth data. Unlike the other methods proposed by us, it does not need an accurate segmentation of the hand region as a prerequisite. It is tolerant to variations in the orientation and depth of the hand w.r.t. the RGB-D sensor. It achieves lower net angular, yaw and pitch errors than most hand-region-based PDE techniques in the literature.

• Creation of a large data-set of approximately 100K (46,918 positive and 53,477 negative) samples for the detection of pointing gestures from depth images of the hand region. A deep-learning-based technique is proposed using the created data-set to distinguish pointing gestures from other hand gestures. The proposed method achieves significantly better performance over various metrics (accuracy, precision, recall, true negative rate, false positive rate, false negative rate) w.r.t. the only other existing technique for pointing gesture detection using the hand region and depth image.

• Proposing an accurate, inexpensive setup for calibrating a nine-axis IMU. This calibration is used in some of the works reported in the thesis.

• Proposing a technique to find the absolute orientation of an RGB-D sensor in the North-East-Down (NED) coordinate frame. This absolute orientation is used in some of the works reported in this thesis.
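As a purely illustrative sketch of the per-sample contents listed in the data-set contribution above (the field names and types here are hypothetical and only mirror the listed contents, not the published data-set's actual schema):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class PointingSample:
        # One sample of the PDE data-set as described above.
        depth_image: np.ndarray      # segmented depth image of the hand region
        fingertip_2d: np.ndarray     # fingertip location in image coordinates (u, v)
        fingertip_3d: np.ndarray     # fingertip location in sensor coordinates (x, y, z)
        pointing_vector: np.ndarray  # unit vector giving the pointing direction
        yaw: float                   # pointing direction expressed as a yaw angle
        pitch: float                 # pointing direction expressed as a pitch angle
        mean_depth: float            # mean depth of the hand region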
In Chapter 1, we state the motivation to create a pointing gesture-based interface for natural human interaction with computers, robots, drones and virtual reality setups. We elucidate the limitations of the existing techniques and devices to show the necessity of designing pointing gesture-based interfaces that use a single depth sensor and the image of the hand region only. In Chapter 2, we describe the experimental setups created and used for the work reported in Chapters 3 and 4. First, we propose an inexpensive and accurate setup to calibrate a nine-axis IMU. Then we propose a method to find the orientation of an RGB-D sensor in the NED frame.

In Chapter 3, we propose a technique for estimating the pointing direction using a nine-axis IMU and depth data from an RGB-D sensor. Sensor fusion is applied to the data from the magnetometer, accelerometer and gyroscope to find the pointing direction in the NED frame, and a coordinate transformation is then used to express it in the frame of the RGB-D sensor. The computationally expensive parts of the algorithm are executed on the GPU, due to which it takes only eight milliseconds to process a frame. Thus, the proposed method is suitable for real-time operation, while achieving a mean accuracy of 90.5% over the depth range of the RGB-D sensor. Chapter 4 reports the creation of a data-set for index-finger-based pointing direction estimation with accurate ground truth and a large number of samples; the data-set has been collected using the IMU-based technique proposed in Chapter 3. Chapter 4 also proposes a 3D convolutional neural network-based method to find the pointing direction from depth images of the hand region. This method achieves a mean accuracy of 94.49% over the depth range of the RGB-D sensor, with real-time performance.

In Chapter 5, we propose a global registration-based method to find the pointing direction. A pointing hand model is captured using the Kinect Fusion technique, and the pointing direction is found by aligning this model to the point cloud from the test depth data. We show that our method is tolerant to small differences between the model and the actual hand data. This method achieves a mean accuracy of 86.33% over the depth range of the RGB-D sensor. In Chapter 6, we address the challenge of detecting pointing gestures from depth images of the hand region; identifying pointing gestures among generic gestures is a prerequisite for any pointing gesture-based interface. We have created a data-set of nearly 100K samples consisting of comparable numbers of pointing and non-pointing gestures. A 3D convolutional neural network-based method is proposed using the created data-set to distinguish pointing gestures from other gestures using only the depth images of the hand region. The proposed method achieves significantly better performance over various metrics (accuracy, precision, recall, true negative rate, false positive rate, false negative rate) w.r.t. the only other technique for pointing gesture detection that uses only the hand region and the depth image, and it is suitable for real-time operation. Chapter 7 concludes the thesis by summarizing the contributions from all the chapters and proposing directions for possible future work.
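Chapter 3's pipeline first obtains the pointing direction in the NED frame from the fused magnetometer, accelerometer and gyroscope data, and then re-expresses it in the RGB-D sensor's frame using the sensor's absolute orientation found with the Chapter 2 technique. A minimal sketch of that change of frame, assuming the orientation is available as a rotation matrix (an assumption made for illustration; the thesis's actual representation is not given in this record):

    import numpy as np

    def ned_to_camera(p_ned, R_cam_to_ned):
        # p_ned        : pointing vector expressed in the North-East-Down frame
        #                (the output of the IMU sensor-fusion step).
        # R_cam_to_ned : 3x3 rotation matrix taking camera-frame coordinates to
        #                the NED frame, i.e. the RGB-D sensor's absolute
        #                orientation (assumed available from the Chapter 2 step).
        # A rotation matrix's inverse is its transpose, so NED -> camera is R^T.
        return R_cam_to_ned.T @ np.asarray(p_ned, dtype=float)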
dc.language.iso: en_US
dc.rights: I grant Indian Institute of Science the right to archive and to make available my thesis or dissertation in whole or in part in all forms of media, now or hereafter known. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
dc.subject: Computer vision
dc.subject: gesture recognition
dc.subject: pointing gesture
dc.subject: direction estimation
dc.subject.classification: Research Subject Categories::TECHNOLOGY::Information technology
dc.title: Techniques for estimating the direction of pointing gestures using depth images in the presence of orientation and distance variations from the depth sensor
dc.type: Thesis
dc.degree.name: PhD
dc.degree.level: Doctoral
dc.degree.grantor: Indian Institute of Science
dc.degree.discipline: Engineering

