Adaptive Fusion Techniques for Multimodal Data

Gaurav Sahu, Olga Vechtomova

Machine Learning for NLP (Long Paper)

Gather-3D: Apr 23 (13:00-15:00 UTC)

Abstract: Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide "how" to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.
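To make the Auto-Fusion idea concrete: the fused representation can be pictured as the bottleneck of a small autoencoder over the concatenated per-modality features. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the layer sizes, activation, and MSE reconstruction objective are illustrative assumptions rather than the paper's exact configuration. The reconstruction loss encourages the compressed code to preserve context from every modality instead of discarding it, which is what distinguishes this adaptive fusion from plain concatenation.

```python
import torch
import torch.nn as nn

class AutoFusion(nn.Module):
    """Minimal sketch of an Auto-Fusion-style module (hypothetical sizes):
    compress concatenated multimodal features into a shared latent and
    train it to reconstruct its input, so the fused code retains context
    from each modality."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(input_dim, latent_dim * 2),
            nn.Tanh(),
            nn.Linear(latent_dim * 2, latent_dim),
        )
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, latent_dim * 2),
            nn.Tanh(),
            nn.Linear(latent_dim * 2, input_dim),
        )
        self.criterion = nn.MSELoss()

    def forward(self, features: torch.Tensor):
        # `features` is the concatenation of per-modality vectors, e.g.
        # torch.cat([text_feat, speech_feat, video_feat], dim=-1).
        z = self.encode(features)            # fused latent representation
        recon = self.decode(z)               # reconstruction of the input
        loss = self.criterion(recon, features)
        return z, loss

# Example with hypothetical feature dimensions for three modalities:
text, speech, video = torch.randn(8, 300), torch.randn(8, 128), torch.randn(8, 256)
fusion = AutoFusion(input_dim=300 + 128 + 256, latent_dim=256)
fused, recon_loss = fusion(torch.cat([text, speech, video], dim=-1))
# During training, `recon_loss` would be added to the downstream task loss.
```

GAN-Fusion, by contrast, replaces this reconstruction objective with an adversarial one, regularizing the latent space against context from the complementing modalities; see the paper for its formulation.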

Similar Papers

One-class Text Classification with Multi-modal Deep Support Vector Data Description
Chenlong Hu, Yukun Feng, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Few-shot learning through contextual data augmentation
Farid Arthaud, Rachel Bawden, Alexandra Birch
Augmenting Transformers with KNN-Based Composite Memory for Dialog
Angela Fan, Claire Gardent, Chloe Braud, Antoine Bordes