Tag Filter
Video Question Answering
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Published: 4/9/2024
Long-Term Video Understanding, Multimodal Long-Sequencing Model, Video Question Answering, Video Information Storage Mechanism, Memory-Augmented Multimodal Learning
The MA-LMM model is introduced for long-term video understanding. It processes video online and keeps historical video information in a memory bank, which overcomes frame-count limitations. Extensive experiments demonstrate state-of-the-art performance in tasks like video question answering…
Self-Chained Image-Language Model for Video Localization and Question Answering
Published: 5/12/2023
Self-Recurrent Video Localization and Question Answering, BLIP-2 Based Vision-Language Model, Video Question Answering, Temporal Keyframe Localization, Unlabeled Video Localization Optimization
The SeViLA framework targets video question answering and addresses the problems caused by uniform frame sampling. Built on the BLIP-2 model, it efficiently combines temporal keyframe localization and QA, significantly improving performance while reducing the need for expensive…
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Published: 12/19/2024
Multimodal Large Language Model, Visual-Spatial Intelligence Benchmark, Spatial Reasoning, Video Question Answering, Cognitive Map Generation
This work introduces VSI-Bench to evaluate multimodal large language models' spatial reasoning from videos. It reveals emerging spatial awareness and local world models, and finds that cognitive map generation enhances spatial distance understanding beyond standard linguistic reasoning techniques…