We show that the proposed approach outperforms unsupervised baselines and is competitive with methods trained via reinforcement learning. We demonstrate that it generalizes to different tasks and environments in a streaming fashion, without explicit rewards or training. We perform extensive experiments in both simulated and real-world environments on two tasks: active object tracking and active action localization.
We formulate an energy-based mechanism that combines predictive learning and reactive control to perform active action localization without rewards, which can be sparse or non-existent in real-world environments.
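To make the energy-based mechanism concrete, here is a minimal sketch of the idea: a predictive model forecasts the next observation under each candidate camera motion, an energy function scores how far the action drifts from the center of the field of view, and reactive control picks the motion with the lowest predicted energy. No reward signal is involved. The forward model, energy function, and 1-D "activity map" observation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def predict_next(obs, action):
    # Hypothetical forward model: a camera pan shifts the observed
    # activity map by `action` cells. A learned predictor would replace this.
    return np.roll(obs, action)

def energy(obs):
    # Energy grows as activity drifts away from the center of the view.
    center = len(obs) // 2
    weights = np.abs(np.arange(len(obs)) - center)
    return float(weights @ obs)

def select_action(obs, actions=(-1, 0, 1)):
    # Reactive control: choose the camera motion whose predicted next
    # observation has the lowest energy (no rewards, no training loop).
    return min(actions, key=lambda a: energy(predict_next(obs, a)))

obs = np.zeros(9)
obs[1] = 1.0  # activity near the left edge of the field of view
print(select_action(obs))  # pans toward the activity: prints 1
```

Because the controller only ever minimizes predicted energy, it keeps working in streaming settings where rewards are sparse or absent.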
In this work, we tackle the problem of active action localization: localizing an action while controlling the geometric and physical parameters of an active camera so that the action stays in the field of view, without training data. Visual event perception tasks such as action localization have primarily been studied in supervised learning settings under a static observer, i.e., a camera that is fixed and cannot be controlled by an algorithm. Such methods are often restricted by the quality, quantity, and diversity of their training data and rarely generalize to out-of-domain samples. In contrast, our tracker can even recover once it momentarily loses the target, and it generalizes well to unseen object trajectories, object appearances, backgrounds, and distractor objects.
The resulting tracker automatically attends to the most likely object in the initial frame and tracks it thereafter, without requiring a manual bounding box for initialization. We carry out experiments on the AI research platform ViZDoom. The tracker, regarded as an agent, is trained with the A3C algorithm, where we harness an environment augmentation technique and a customized reward function to encourage robust object tracking. Specifically, a ConvNet-LSTM function approximator is adopted, which takes only visual observations (i.e., frame sequences) as input and directly outputs camera motions (e.g., move forward, turn left). Crucially, tracking and camera control are tackled jointly in an end-to-end manner via reinforcement learning.
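The ConvNet-LSTM approximator described above can be sketched as follows, here in PyTorch: convolutional layers encode each frame, an LSTM cell carries temporal state across the sequence, and separate actor and critic heads produce camera-motion logits and a state value for A3C. Layer sizes, the 84x84 input resolution, and the six-action space are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMPolicy(nn.Module):
    """Sketch of a ConvNet-LSTM actor-critic for active tracking (sizes assumed)."""
    def __init__(self, n_actions=6, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        # 84x84 input -> 20x20 after conv1 -> 9x9 after conv2
        self.lstm = nn.LSTMCell(32 * 9 * 9, hidden)
        self.actor = nn.Linear(hidden, n_actions)  # camera-motion logits
        self.critic = nn.Linear(hidden, 1)         # state value for A3C

    def forward(self, frame, state):
        x = self.conv(frame).flatten(1)
        h, c = self.lstm(x, state)
        return self.actor(h), self.critic(h), (h, c)

policy = ConvLSTMPolicy()
frame = torch.zeros(1, 3, 84, 84)          # one RGB observation
h = c = torch.zeros(1, 64)                 # initial recurrent state
logits, value, (h, c) = policy(frame, (h, c))
print(logits.shape, value.shape)  # torch.Size([1, 6]) torch.Size([1, 1])
```

At each step the agent samples a camera motion from the actor's logits, while the critic's value estimate drives the A3C advantage update; the recurrent state `(h, c)` is what lets the policy integrate motion cues across frames.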
In this paper we propose an active object tracking approach that simultaneously addresses tracking and camera control.