Abstract:
|
Human action understanding from videos is one of the foremost challenges in computer
vision. It is the cornerstone of many applications like human-computer interaction and
automatic surveillance. The current state-of-the-art methods for action recognition and
localization mostly rely on Deep Learning. In spite of their strong performance, Deep
Learning approaches require large amounts of labeled training data. Furthermore, standard
action recognition pipelines rely on independent optical flow estimators, which increase
their computational cost. We propose two approaches to improve these aspects. First, we
develop a novel method for efficient, real-time action localization in videos that achieves
performance on par with or better than other, more computationally expensive methods. Second,
we present a self-supervised learning approach for spatiotemporal feature learning that does
not require any annotations. We demonstrate that features learned by our method provide
a very strong prior for the downstream task of action recognition. |