dc.description.abstract | Video data has historically been known for its unstructured nature, rich semantic content, and storage scalability issues. With advances in computer vision and Deep Neural Networks (DNNs), it is now possible to automatically extract rich semantic information from video data. This has resulted in increased interest in the development and exploration of applications where stored video data can be used to observe and study the world retrospectively. However, recent research has highlighted the compute-intensive nature of such deep models (e.g., accurate object detection models); the resulting high cost limits their applicability for naively analyzing video data retrospectively. Moreover, the development and efficient implementation of such applications often requires co-analyzing video data along with its associated geospatial and temporal metadata, which the research community has acknowledged to be a difficult task due to the associated cognitive load.
There is a growing use of drone cameras for capturing video data due to their mobility
and ease of deployment. These videos are complemented with temporally varying location
and orientation metadata, which eases their exploration. Video query systems are required to
allow intuitive querying over such geospatial, temporal and semantic information associated
with the videos. In this thesis, we develop a geospatial-temporal video query system that supports semantic queries over drone videos by extending an existing spatial-temporal database
and leveraging DNN models. Specifically, we include query abstractions over the level of detail
of visual information captured in the videos, and propose simple heuristics for better reuse
of semantic object detection results from different object detection model configurations. Optimizations needed for such retrospective analysis motivate the proposed novel DDownscale
method, along with an Ingest Pipeline, to efficiently acquire, store and query drone videos in a
Video Repository.
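To make the above query abstraction concrete, the sketch below shows what a combined geospatial, temporal, semantic and level-of-detail query might look like from a client's perspective. This is a purely illustrative Python fragment: the field names, the "coarse" level-of-detail value and the video_repository.search call are assumptions for exposition, not the actual interface developed in this thesis.

    # Illustrative only: a hypothetical client-side query combining spatial,
    # temporal, semantic and level-of-detail predicates over drone videos.
    from datetime import datetime

    query = {
        "region": {"lat": 12.97, "lon": 77.59, "radius_m": 500},   # spatial predicate
        "interval": (datetime(2022, 1, 10, 9, 0),
                     datetime(2022, 1, 10, 18, 0)),                # temporal predicate
        "object_class": "car",                                      # semantic predicate
        "min_level_of_detail": "coarse",                            # level-of-detail abstraction
    }
    # results = video_repository.search(**query)                   # hypothetical call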
A key requirement for this video repository is to conserve storage space and compute
time for semantic video queries. Reducing the resolution of videos will reduce the video size
and the inferencing time during querying. Existing methods to reduce the resolution of video
data for such optimizations often leverage the stationary spatial and temporal characteristics
of videos from static cameras, which are absent in videos from mobile drone fleets. Drones fly at different altitudes, record videos from different viewpoints, and capture visual information at varying levels of detail. Another consideration is that drone videos are typically of short duration, unlike those captured by static cameras. We propose the
novel DDownscale method to dynamically select the downscale factor for a video such that the
level of detail in the video required for effective object detection is not compromised. We model
the relative recall drop caused by downscaling as a function of the object size in the downscaled
video and the downscaling factor used. We observe that for a given object detection DNN
model and class of interest, our method generalizes well to the evaluated test datasets. Using
the above modeling, we derive the DDownscale inequality as a relation between the relative
recall drop of the video and the hyperparameters of DDownscale. This relation is satisfied
by ≈ 98% of the dynamically downscaled videos across different datasets. For user-specified target reductions in recall ranging from 1% to 30%, the proposed DDownscale algorithm helps achieve > 25% reduction in total object detection time and > 31% reduction in storage on average, compared to the baseline of storing and evaluating the videos uniformly at the original resolution, with ≈ 96% of dynamically downscaled videos having a relative recall drop within the user-specified target.
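As a purely illustrative sketch of the idea (not the exact algorithm from this thesis), dynamic downscale-factor selection under a user-specified relative recall-drop budget could be written as follows in Python. The function recall_drop_model is a hypothetical stand-in for the fitted model of relative recall drop as a function of downscaled object size and downscale factor described above; its toy form and the candidate factors are assumptions for exposition only.

    # Illustrative sketch: pick the largest downscale factor whose predicted
    # relative recall drop stays within a user-specified target.

    def recall_drop_model(object_size_px: float, factor: float) -> float:
        # Hypothetical stand-in for the fitted recall-drop model; returns a value in [0, 1].
        downscaled_size = object_size_px / factor
        # Toy monotone form: smaller downscaled objects imply a larger recall drop.
        return min(1.0, factor / max(downscaled_size, 1.0))

    def select_downscale_factor(smallest_object_px: float,
                                target_drop: float,
                                candidate_factors=(1.0, 1.5, 2.0, 3.0, 4.0)) -> float:
        # Return the largest candidate factor predicted to respect the target recall drop.
        feasible = [f for f in candidate_factors
                    if recall_drop_model(smallest_object_px, f) <= target_drop]
        return max(feasible) if feasible else 1.0

    # Example: a 5% target relative recall drop when the smallest object of
    # interest is expected to span roughly 40 pixels in the original video.
    factor = select_downscale_factor(smallest_object_px=40.0, target_drop=0.05)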
Additionally, we explore a simpler specification of the target level of detail; derive a relation between this specification and a statistic of the relative drop in recall of the smallest object of interest when detected by the selected model; and propose a drone video ingest pipeline that preprocesses videos on arrival from the drones, including dynamically downscaling them, before
insertion into the video repository residing on a central server. The pipeline uses scheduling
strategies over a cluster of heterogeneous edge accelerators to reduce the time to ingest the drone videos, making them quickly available for analysis. The pipeline reduces the average turnaround time for ingest by ≈ 66% despite the downscaling overhead, compared to uploading original-resolution videos without downscaling, for the evaluated workload and experimental
setup. | en_US |