Tags: Video Question Answering - Paper Library - SwiftScholar

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Published: 4/9/2024

Long-Term Video Understanding, Multimodal Long-Sequencing Model, Video Question Answering, Video Information Storage Mechanism, Memory-Augmented Multimodal Learning

The MA-LMM model is introduced for long-term video understanding. It processes video online and uses a memory bank to store historical video information, overcoming frame limitations. Extensive experiments demonstrate state-of-the-art performance in tasks like video question answer


Self-Chained Image-Language Model for Video Localization and Question Answering

Published: 5/12/2023

Self-Recurrent Video Localization and Question Answering, BLIP-2 Based Vision-Language Model, Video Question Answering, Temporal Keyframe Localization, Unlabeled Video Localization Optimization

The SeViLA framework addresses problems that uniform frame sampling causes in video question answering. Built on the BLIP-2 model, it efficiently combines temporal keyframe localization and QA, significantly improving performance while reducing the need for expen


Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Published: 12/19/2024

Multimodal Large Language Model, Visual-Spatial Intelligence Benchmark, Spatial Reasoning, Video Question Answering, Cognitive Map Generation

This work introduces VSI-Bench to evaluate multimodal large language models' spatial reasoning from videos, revealing emerging spatial awareness and local world models, with cognitive map generation enhancing spatial distance understanding beyond standard linguistic reasoning tec

