Attention and Fusion of Deep Representations for Computer Vision
In this thesis by articles, we make contributions related to attention and fusion at the intersection of deep learning and computer vision. In our first article, we investigate the design of neural network operators that fuse features extracted from different modalities to make predictions on a single task. We propose a generalized class of multimodal fusion operators for visual question answering (VQA), show that several existing fusion operators are special cases of this class, and demonstrate that specific non-trivial instantiations of our operator achieve superior open-ended accuracy on the VQA task. In our second article, we introduce Transformers to video object segmentation (VOS). We propose a scalable, end-to-end method for VOS called "Sparse Spatiotemporal Transformers" (SST) to address the runtime, scalability, and temporal-dependency issues of prior work. We show that our method achieves competitive results on YouTube-VOS 2019 with increased scalability compared with the state of the art.
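To make the notion of a multimodal fusion operator concrete, the sketch below shows one simple, well-known instantiation: each modality (image and question features) is projected into a shared space and fused by an elementwise product. This is a minimal illustrative example, not the generalized operator proposed in the first article; the dimensions and weight initializations are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(image_feat, question_feat, W_img, W_q):
    """Project each modality to a shared space, then fuse elementwise.

    The elementwise (Hadamard) product is one simple fusion operator;
    many VQA fusion schemes can be seen as generalizations of it.
    """
    z_img = np.tanh(image_feat @ W_img)   # project image features
    z_q = np.tanh(question_feat @ W_q)    # project question features
    return z_img * z_q                    # fused joint representation

# Toy dimensions: 2048-d image features, 300-d question embedding,
# fused into a 512-d joint space (all choices are illustrative).
image_feat = rng.standard_normal(2048)
question_feat = rng.standard_normal(300)
W_img = rng.standard_normal((2048, 512)) * 0.01
W_q = rng.standard_normal((300, 512)) * 0.01

joint = fuse(image_feat, question_feat, W_img, W_q)
print(joint.shape)
```

The fused vector `joint` would then feed a classifier over candidate answers; richer operators (e.g. bilinear or gated variants) replace the elementwise product while keeping this overall project-then-combine structure.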