Trajectory-based Descriptors for Action Recognition in Real-world Videos
Abstract
This thesis explores motion trajectory-based approaches to recognize human actions in
real-world, unconstrained videos. Recognizing actions is an important task in applications
such as video retrieval, surveillance, human-robot interaction, sports video analysis, video summarisation and behaviour monitoring. A considerable amount of research has addressed this task. Earlier work focused on videos captured by static cameras, where actions are relatively easy to recognise. As more videos are captured by moving cameras, recognising actions under irregular camera motion remains a challenge, particularly in unconstrained settings with variations in scale, viewpoint and illumination, with occlusion, and with unrelated motion in the background. Because a growing number of videos are captured by wearable or head-mounted cameras, this thesis also explores action recognition in egocentric videos.
First, effective motion segmentation to identify the camera motion in videos captured by moving cameras is explored. Next, action recognition in videos captured from the usual third-person view (perspective) is discussed. Action recognition approaches for first-person (egocentric) views are then investigated. First-person videos often contain frequent unintended camera motion, because the motion of the head moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Finally, novel feature encoding and subvolume sampling (for “deep” approaches) techniques are explored in the context of action recognition in videos.
The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. The first approach fits curves to the motion trajectories and finds the model that best fits the camera motion. This curve-fitting approach works only when the generated trajectories are sufficiently smooth. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed. By identifying the instantaneous dominant background motion and aggregating scores (denoting the “foregroundness”) along each trajectory accordingly, the motion associated with the camera can be separated from the motion due to foreground objects. Additionally, the segmentation result is used to align videos from moving cameras, producing videos that appear to have been captured by nearly static cameras.
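The scoring idea can be illustrated with a minimal Python sketch. It is not the method developed in the thesis: it assumes that the per-axis median displacement of all trajectories approximates the instantaneous dominant background motion, that the "foregroundness" score is the mean deviation from that motion, and that a fixed threshold separates the two groups. The function names and the threshold value are hypothetical.

```python
from statistics import median


def foreground_scores(trajectories):
    """Mean deviation of each trajectory from the dominant per-frame motion.

    trajectories: list of equal-length lists of (x, y) points, one per frame.
    """
    n_steps = len(trajectories[0]) - 1
    totals = [0.0] * len(trajectories)
    for t in range(n_steps):
        # Displacement of every trajectory between frames t and t + 1.
        disp = [(tr[t + 1][0] - tr[t][0], tr[t + 1][1] - tr[t][1])
                for tr in trajectories]
        # Assumption: the per-axis median displacement stands in for the
        # instantaneous dominant (background) motion.
        bg_dx = median(d[0] for d in disp)
        bg_dy = median(d[1] for d in disp)
        for i, (dx, dy) in enumerate(disp):
            totals[i] += ((dx - bg_dx) ** 2 + (dy - bg_dy) ** 2) ** 0.5
    return [total / n_steps for total in totals]


def split_camera_and_foreground(trajectories, threshold=1.0):
    """Return (camera_indices, foreground_indices) using a fixed threshold."""
    scores = foreground_scores(trajectories)
    camera = [i for i, s in enumerate(scores) if s <= threshold]
    foreground = [i for i, s in enumerate(scores) if s > threshold]
    return camera, foreground
```

With four trajectories that all shift by (2, 0) per frame and one that shifts by (2, 3), the first four are assigned to the camera group and the last to the foreground group. The median is used here only because it is robust as long as background trajectories form the majority.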
In the second part of the thesis, recognising actions in conventional videos captured by third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first is the covariance descriptor adapted to motion trajectories. The covariance descriptor of a trajectory encodes the co-variations of different features along the trajectory’s length. Being a second-order encoding, covariance captures information about the trajectory that first-order encodings do not. The second descriptor is based on Granger causality. This novel causality descriptor encodes the “cause and effect” relationships between the motion trajectories of the actions. Interaction descriptors of this type capture the causal inter-dependencies among the motion trajectories and encode information complementary to that of descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals. We extend this computation to capture dependencies between spatiotemporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts.
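Both descriptor ideas can be sketched in a few lines of Python. The sketch below is illustrative only and makes several assumptions: the covariance descriptor is taken to be the upper triangle of the sample covariance matrix of per-point feature vectors, and the causality score is the traditional pairwise Granger measure on one-dimensional signals with a single lag and no intercept (signals are mean-centred first). It does not implement the generalised spatiotemporal descriptor, and all function names are hypothetical.

```python
import math


def covariance_descriptor(features):
    """Upper triangle of the covariance of feature rows sampled along a trajectory.

    features: list of T rows (T >= 2), each a list of d feature values.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in features]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # The matrix is symmetric, so the upper triangle holds all its entries.
    return [cov[a][b] for a in range(d) for b in range(a, d)]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _rss_one(target, u):
    """Residual sum of squares of the least-squares fit target ~ b * u."""
    suu = _dot(u, u)
    b = _dot(u, target) / suu if suu > 0 else 0.0
    return sum((y - b * a) ** 2 for a, y in zip(u, target))


def _rss_two(target, u, v):
    """Residual sum of squares of the fit target ~ b * u + c * v."""
    suu, svv, suv = _dot(u, u), _dot(v, v), _dot(u, v)
    suy, svy = _dot(u, target), _dot(v, target)
    det = suu * svv - suv * suv
    if abs(det) < 1e-12:  # collinear regressors: the second adds nothing
        return _rss_one(target, u)
    b = (suy * svv - suv * svy) / det
    c = (suu * svy - suv * suy) / det
    return sum((y - b * a - c * w) ** 2 for a, w, y in zip(u, v, target))


def granger_score(cause, effect, eps=1e-12):
    """Log ratio of residuals without and with the lagged candidate cause.

    Larger values mean the past of `cause` improves the prediction of `effect`.
    """
    mean_c = sum(cause) / len(cause)
    mean_e = sum(effect) / len(effect)
    c = [v - mean_c for v in cause]
    e = [v - mean_e for v in effect]
    target, own_past, other_past = e[1:], e[:-1], c[:-1]
    restricted = _rss_one(target, own_past)
    full = _rss_two(target, own_past, other_past)
    return math.log((restricted + eps) / (full + eps))
```

For a pair of signals in which one follows the other with a one-frame delay, `granger_score` is large in the causal direction and close to zero in the reverse direction. A descriptor could then be assembled from such scores over pairs of trajectories; how the pairs are chosen and pooled is not specified here.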
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI). The POI wears a camera while going about his/her activities, and the camera records events and activities as seen by him/her. The POI performing the actions is therefore not seen by the camera he/she wears. Activities performed by the POI are called first-person actions, and third-person actions are those performed by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI. Thus, the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the motion trajectory ranking method of segmentation used as a pre-processing step to identify the camera motion. This motion segmentation step is necessary to remove the unintended head motion (camera motion) introduced during video capture. To recognise actions and the corresponding perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets with eight actions in third-person and first-person perspectives are created. The first is a single-camera dataset with action instances from first-person and third-person views. The second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.
In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector feature encoding is based on embedding the Bag-of-Words encoding into the Fisher Vector encoding. The resulting encoding is simple and effective, and it improves classification performance over state-of-the-art techniques. It can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in “deep” approaches for action recognition in videos, iteratively increases the size of the valid subvolumes in the temporal direction to generate new subvolumes. The proposed sampling needs fewer subvolumes to “better represent” the actions and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than state-of-the-art techniques for action classification in videos.
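One possible reading of the temporal growing step is sketched below in Python. The abstract does not define when a subvolume is "valid", so the sketch assumes a subvolume is valid when it lies inside the video and contains at least `min_points` trajectory points. It also ignores spatial extent and treats a subvolume as a (start frame, length) window. The function name, parameters and validity rule are all hypothetical and are not taken from the thesis.

```python
def grow_subvolumes(point_frames, base_length, step, min_points, max_length):
    """Sample temporal windows by repeatedly lengthening the valid ones.

    point_frames: frame index of every trajectory point in the video.
    Returns a list of (start_frame, length) windows.
    """
    n_frames = max(point_frames) + 1

    def is_valid(start, length):
        # Assumed validity rule: inside the video and dense enough.
        if start + length > n_frames:
            return False
        inside = sum(1 for f in point_frames if start <= f < start + length)
        return inside >= min_points

    # Start from non-overlapping base windows and keep the valid ones.
    current = [(s, base_length)
               for s in range(0, n_frames - base_length + 1, base_length)
               if is_valid(s, base_length)]
    sampled = list(current)

    # Iteratively extend each valid window in time to create new windows.
    while current:
        grown = []
        for start, length in current:
            longer = length + step
            if longer <= max_length and is_valid(start, longer):
                grown.append((start, longer))
        sampled.extend(grown)
        current = grown
    return sampled
```

For an eight-frame clip with one point per frame, `grow_subvolumes(list(range(8)), 4, 2, 4, 8)` returns `[(0, 4), (4, 4), (0, 6), (0, 8)]`: only windows that are already valid are extended, which keeps the number of candidates small compared with enumerating every start and length.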