EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth1,2,*, Xinhao Mei1, Changsheng Zhao1, Varun Nagaraja1, Ernie Chang1, Gregory P. Meyer1, Gael Le Lan1, Yunyang Xiong1, Vikas Chandra1, Yangyang Shi1, Dinesh Manocha2, Zhipeng Cai1,†

1Meta, 2University of Maryland, College Park

*Work done at Meta, †Project Lead

We introduce EgoAVU, a scalable and automated data engine for egocentric audio–visual understanding. EgoAVU enriches existing egocentric narrations by integrating human actions with environmental context, explicitly linking visible objects to the sounds produced by interactions and the surrounding environment. Leveraging this pipeline, we construct EgoAVU-Instruct (3M QAs) and EgoAVU-Bench (3K verified QAs), enabling systematic training and evaluation of MLLMs. Models finetuned with EgoAVU-Instruct exhibit strong audio-visual grounding in egocentric settings.

Abstract

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals a limitation of existing MLLMs: they are heavily biased towards visual signals, often neglecting audio cues or failing to associate sounds with their visual sources. Finetuning MLLMs on EgoAVU-Instruct effectively resolves this issue, yielding performance improvements of up to 113% on EgoAVU-Bench. This benefit also transfers to other benchmarks such as EgoTempo and EgoIllusion, with relative performance gains of up to 28%.

Methodology

EgoAVU methodology overview

(1) Multisensory narration enrichment. For each egocentric video clip, EgoAVU enhances the raw human narration with detailed multisensory context using open-source MLLMs.

(2) Audio–visual diversity filtering. The enriched narrations are then used to select video clips that exhibit diverse and informative audio–visual dynamics.

(3) Multimodal Context Graph (MCG) construction. EgoAVU constructs an MCG using open-source large language models to explicitly capture complex cross-modal relationships among actions, objects, and audio cues (see the sketch after this list). The MCG is parsed together with the enriched narrations to generate coherent and temporally aligned audio–visual narrations.

(4) Audio–visual QA generation. Finally, the generated audio–visual narrations are leveraged to create high-quality audio–visual question–answer pairs, forming both the large-scale instruction-tuning dataset EgoAVU-Instruct and the evaluation benchmark EgoAVU-Bench.
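
To make steps (3) and (4) concrete, below is a minimal Python sketch of how a Multimodal Context Graph could be represented and then queried to produce a Source-Sound Association style QA pair. The node types, relation names, and the toy QA template are illustrative assumptions for exposition, not the actual schema, prompts, or generation pipeline used by EgoAVU.

# Illustrative sketch of a Multimodal Context Graph (MCG); the schema and
# relation names below are assumptions, not EgoAVU's actual data structures.
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str     # "action" | "object" | "audio_cue"
    label: str         # e.g. "chopping vegetables", "knife"
    time_span: tuple   # (start_sec, end_sec) within the clip


@dataclass
class MCG:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, relation, dst_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def audio_sources(self, audio_id):
        # Visual nodes connected to an audio cue via a "produces" edge.
        return [self.nodes[s] for s, r, d in self.edges
                if d == audio_id and r == "produces"]


def ssa_qa(graph, audio_id):
    # Turn a "produces" edge into a toy Source-Sound Association QA pair.
    cue = graph.nodes[audio_id].label
    sources = ", ".join(n.label for n in graph.audio_sources(audio_id))
    return f"What in the video produces the {cue}?", sources or "unknown"


# Toy clip: the camera wearer chops vegetables with a knife.
g = MCG()
g.add_node(Node("act1", "action", "chopping vegetables", (2.0, 8.5)))
g.add_node(Node("obj1", "object", "knife", (0.0, 10.0)))
g.add_node(Node("aud1", "audio_cue", "rhythmic chopping sound", (2.2, 8.4)))
g.link("act1", "uses", "obj1")
g.link("act1", "produces", "aud1")
print(ssa_qa(g, "aud1"))
# ('What in the video produces the rhythmic chopping sound?', 'chopping vegetables')

In practice, the enriched narrations and the graph would be passed to a language model to produce fluent narrations and QA pairs; the direct template above only illustrates how explicit action-object-audio edges make source-sound associations easy to extract.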

Result: Performance on EgoAVU-Bench

EgoAVU main result

We compare seven MLLMs with joint audio–visual understanding capabilities against our fine-tuned models across a diverse set of tasks in EgoAVU-Bench, covering open-ended QA tasks: Source-Sound Association (SSA), Audio-Visual Dense Narration (AVDN), and Audio-Visual Segment Narration (AVSN), as well as closed-ended QA tasks: Temporal Reasoning (TR) and Audio-Visual Hallucination (AVH). For the open-ended tasks, we report LLM-as-Judge (S), METEOR (M), and ROUGE-L (R) scores; for the closed-ended tasks, we report Accuracy (Acc.).
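
As a reference point, the snippet below sketches how the METEOR and ROUGE-L scores for the open-ended tasks could be computed with common open-source libraries (NLTK and rouge-score). It is a minimal sketch under those assumptions; the LLM-as-Judge score (S) and the exact evaluation protocol of EgoAVU-Bench are not reproduced here.

# Sketch of METEOR (M) and ROUGE-L (R) scoring for open-ended answers,
# using NLTK and rouge-score; not the official EgoAVU-Bench evaluation code.
# Requires: pip install nltk rouge-score
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # METEOR relies on WordNet data


def score_open_ended(reference, prediction):
    # Recent NLTK versions expect pre-tokenized references and hypothesis.
    meteor = meteor_score([reference.split()], prediction.split())
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
        .score(reference, prediction)["rougeL"].fmeasure
    return {"METEOR": meteor, "ROUGE-L": rouge_l}


ref = "The person chops vegetables, producing a rhythmic tapping sound on the cutting board."
hyp = "The camera wearer is cutting vegetables while a steady chopping sound is heard."
print(score_open_ended(ref, hyp))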

Result: Performance on other VQA and AVQA Benchmarks

EgoAVU vqa result

Finetuning on EgoAVU-Instruct benefits other egocentric benchmarks such as EgoTempo and EgoIllusion, yielding an accuracy gain of up to 28.1%. Our model also maintains strong performance on exocentric video QA benchmarks such as VideoMME and AVQA.

Qualitative Example

EgoAVU qualitative example

Our model fine-tuned on EgoAVU-Instruct captures significantly denser visual detail than Qwen2.5-Omni and VideoLLaMA2, while also identifying auditory cues tied to human actions and background sounds in the video.

BibTeX

@article{seth2024egoavu,
  title={EgoAVU: Egocentric Audio-Visual Understanding},
  author={Seth, Ashish and Mei, Xinhao and Zhao, Changsheng and Nagaraja, Varun and Chang, Ernie and Meyer, Gregory P. and Le Lan, Gael and Xiong, Yunyang and Chandra, Vikas and Shi, Yangyang and Manocha, Dinesh and Cai, Zhipeng},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}