3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
TL;DR Summary
This study introduces 3DLLM-Mem to enhance long-term spatial-temporal memory in Large Language Models for dynamic 3D environments. It presents 3DMem-Bench for evaluating reasoning capabilities, with experimental results showing significant performance improvements in embodied tasks.
Abstract
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model". This title clearly indicates the paper's focus on enhancing 3D Large Language Models (LLMs) with long-term memory capabilities for embodied agents operating in 3D environments, specifically addressing both spatial and temporal aspects of memory.
1.2. Authors
The authors of the paper are:
- Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Kai-Wei Chang (University of California, Los Angeles)
- Yonatan Bitton, Idan Szpektor (Google Research)
Their research backgrounds are primarily in areas related to Large Language Models, 3D vision, embodied AI, and machine learning, given their affiliations with a prominent university AI research group and a major technology company's research division.
1.3. Journal/Conference
The paper was posted at 2025-05-28T17:59:13 (UTC). While a specific journal or conference is not explicitly mentioned in the provided abstract or first page, the presence of a "NeurIPS Paper Checklist" in the appendices strongly suggests that it is intended for publication at the Conference on Neural Information Processing Systems (NeurIPS). NeurIPS is one of the most prestigious and influential conferences in the field of artificial intelligence and machine learning, known for publishing cutting-edge research.
1.4. Publication Year
The publication timestamp indicates the paper was published on May 28, 2025.
1.5. Abstract
The paper addresses the limitation of current Large Language Models (LLMs) in handling long-term memory for embodied agents in dynamic, multi-room 3D environments. This limitation is attributed to the lack of proper 3D spatial-temporal memory modeling. To tackle this, the authors introduce two main contributions:
- 3DMEM-BENCH: A comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, including question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments.
- 3DLLM-MEM: A novel dynamic memory management and fusion model. This model utilizes working memory tokens (representing current observations) as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory (storing past observations and interactions). This approach aims for task-relevant information focus and memory efficiency in complex, long-horizon environments.

Experimental results demonstrate that 3DLLM-MEM achieves state-of-the-art performance across various tasks on 3DMEM-BENCH, outperforming strong baselines by 16.5% in success rate on challenging embodied tasks.
1.6. Original Source Link
The official source link is: https://arxiv.org/abs/2505.22657
The PDF link is: https://arxiv.org/pdf/2505.22657v2.pdf
This indicates the paper is available as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inability of current Large Language Models (LLMs) to effectively plan and act in complex, dynamic, multi-room 3D environments, especially when tasks require recalling information over extended periods and vast spaces. Humans excel at such tasks by naturally employing long-term memory across both temporal (what happened when) and spatial (where things are) experiences.
This problem is important because embodied AI systems, which are AI agents designed to interact with the physical world, need robust memory capabilities to perform real-world tasks. Current 3D LLMs and 3D Vision-Language-Action models have made strides in perceiving, reasoning, and acting in 3D spaces, but they suffer from several critical limitations:
- Struggle with long-term memory chains: Models fail to maintain coherent memory over multiple visual scenarios (e.g., different rooms in a house) and extended time frames.
- Information density of 3D scenes: Real-world 3D environments are vast and information-dense. Storing dense 3D representations (which capture intricate geometric relationships) is computationally challenging, while selective retrieval (common in LLMs) risks omitting critical information.
- Entanglement of spatial and temporal memory: Agents need to track not only object locations but also how environments and objects change over time due to exploration and interaction. Maintaining coherent representations of previously seen spaces while integrating new information is a significant hurdle.
The paper's entry point and innovative idea is to address this 3D spatial-temporal memory modeling gap directly. It proposes a dual-memory system inspired by human cognition (working memory for current observations, episodic memory for past experiences) and a novel memory fusion module to selectively and efficiently integrate relevant information.
2.2. Main Contributions / Findings
The paper makes two primary contributions:
- 3DMEM-BENCH: A Novel Benchmark for Long-Term Spatial-Temporal Memory Evaluation.
  - Specifically designed to evaluate embodied agents' reasoning, planning, and acting capabilities that require long-term spatial-temporal memory in multi-room 3D environments.
  - Comprises over 26,000 trajectories and 2,892 embodied tasks, including question-answering (EQA) and captioning.
  - Tasks are categorized into simple, medium, and hard difficulty levels, and include "in-the-wild" challenges to test generalization.
  - Addresses a gap in prior benchmarks, which often focus on single-step or short-horizon reasoning, or lack embodied interaction support for long-term exploration.
- 3DLLM-MEM: A Dynamic Memory Management and Fusion Model for Embodied 3D LLMs.
  - Introduces a novel architecture that integrates a dual-memory system: a limited-capacity working memory for current observations and an expandable episodic memory that stores dense 3D representations of past observations and interactions.
  - Its key innovation is a memory fusion module that uses working memory tokens (representing current observations) as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory.
  - This dynamic and selective fusion mechanism allows the model to focus on task-relevant information while maintaining memory efficiency, crucial for complex, long-horizon environments.
Key Findings:
- 3DLLM-MEM achieved state-of-the-art performance across all evaluation categories (embodied tasks, EQA, and captioning) on the 3DMEM-BENCH benchmark.
- It significantly outperformed existing baselines, including common memory management strategies (e.g., Most Recent Memory, Retrieval-Augmented Memory) and other 3D LLMs, by 16.5% in success rate on the most challenging "in-the-wild" embodied tasks.
- The model demonstrated strong generalization capabilities, with its performance remaining robust even in "in-the-wild" settings where other methods saw sharp drops.
- 3DLLM-MEM maintained strong performance (27.8%) on hard "in-the-wild" tasks, while baselines degraded to roughly 5% success rate, showcasing its scalability and effectiveness in managing longer-term memory representations.
- Ablation studies confirmed that initializing the memory fusion module's query with working memory tokens is the most effective design choice.

These findings address the problem of limited long-term spatial-temporal memory in embodied 3D LLMs, enabling agents to perform more complex, multi-room, long-horizon tasks that require recalling and integrating information across space and time.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a beginner needs to grasp several core concepts from artificial intelligence, natural language processing, and computer vision.
- Large Language Models (LLMs): LLMs are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They are designed to understand, generate, and process human language, excelling at tasks like text generation, translation, summarization, and question answering. Their "largeness" refers to their immense number of parameters (billions or even trillions) and the scale of their training data. In this paper, LLMs are extended to perceive and reason about 3D environments.
- Embodied AI: Embodied AI refers to intelligent agents that are situated within and can interact with physical or simulated environments. Unlike disembodied AI (like a chatbot), an embodied agent has a "body" (e.g., a robot, a virtual avatar) through which it perceives its surroundings (e.g., via cameras) and performs actions (e.g., moving, grasping objects). The goal is to develop AI that can learn and operate in complex, dynamic real-world settings, similar to humans.
- 3D Vision-Language Models (3D VLMs) / 3D LLMs: These are LLMs augmented with the ability to process and reason about 3D visual information. Traditional LLMs primarily handle text, while 3D VLMs integrate inputs from 3D data (like point clouds, or multi-view images with depth information) and link them to natural language. This allows them to understand queries about 3D scenes (e.g., "Where is the red chair?") and potentially generate actions in those scenes. The paper builds on LLaVA-3D as its base 3D LLM.
- Working Memory: In cognitive psychology, working memory is a temporary, limited-capacity cognitive system that holds information readily available for processing. It's like a mental scratchpad where we keep track of immediate observations, thoughts, and plans. In the context of 3DLLM-MEM, it represents the agent's current observations and immediate context.
- Episodic Memory: Episodic memory is a type of long-term memory that stores specific personal experiences, including their temporal (when) and spatial (where) context. For an embodied AI, this would include records of past observations, interactions, and locations visited over a longer period.
- Point Clouds: A point cloud is a set of data points in a 3D coordinate system representing the external surface of an object or environment. Each point typically has X, Y, Z coordinates and can also include other attributes like color (RGB), intensity, or normal vectors. Point clouds are a common way to represent 3D visual information and are used in this paper for dense 3D representations.
- Axis-Aligned Bounding Box (AABB): An AABB is the smallest rectangular cuboid (a 3D box) that completely encloses an object and whose faces are aligned with the coordinate axes. It's a simple and common way to represent the spatial extent of objects or rooms in 3D environments, often used for collision detection and spatial queries. The paper uses AABBs for rooms and objects in HM3D.
- Attention Mechanism: A core component of transformer models, the attention mechanism allows a neural network to dynamically weigh the importance of different parts of its input. It calculates attention scores between a query (what you're looking for) and keys (what's available), and then uses these scores to combine values (the information itself), letting the model "focus" on relevant parts of the input. In this paper, attention is central to the memory fusion module, where current observations query past episodic memory. The generic formula for self-attention (a common form of attention) is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$ is the query matrix.
  - $K$ is the key matrix.
  - $V$ is the value matrix.
  - $d_k$ is the dimension of the key vectors, used as a scaling factor to prevent large dot-product values from pushing the softmax into regions with tiny gradients.
  - $QK^\top$ is the dot product of the query and key matrices, calculating similarity scores.
  - $\mathrm{softmax}$ normalizes these scores into probability distributions.
  - The result is a weighted sum of the value vectors.
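To make this concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the function name and toy tensor shapes are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Generic self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # normalize scores into distributions
    return weights @ V                              # weighted sum of the value vectors

# Toy usage: 4 query tokens attending over 10 key/value tokens of dimension 8.
Q, K, V = torch.randn(4, 8), torch.randn(10, 8), torch.randn(10, 8)
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```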
3.2. Previous Works
The paper contextualizes its contributions by referencing several prior benchmarks and models, highlighting their limitations in addressing long-term spatial-temporal memory in embodied 3D environments.
- 3D Large Language Models (3D-LLMs):
  - 3D-LLM (Hong et al., 2023b): A popular 3D LLM that injects 3D world information into LLMs. The paper uses LLaVA-3D (Zhu et al., 2024), which builds on a 2D LLM with multi-view images and 3D position embeddings, as its base model. These models typically excel at perceiving and reasoning about 3D spaces.
  - Limitation: The paper notes that current 3D LLMs struggle with long-term memory chains and with managing the vast information density of 3D scenes, especially for long-horizon tasks across multiple rooms. 3D-LLM itself (as a baseline) shows low performance due to a lack of an explicit memory module and limited context length.
- Embodied AI Simulators and Benchmarks:
  - ALFWorld (Shridhar et al., 2021), Behavior-1K (Li et al., 2024a), VisualAgentBench (Liu et al., 2024), and EmbodiedBench (Yang et al., 2025a) are existing benchmarks for embodied agents.
  - Limitation: The paper argues these benchmarks often focus on high-level planning tasks with short trajectories, typically in single-room settings, thus requiring minimal spatial-temporal memory. EmbodiedBench also lacks comprehensive memory evaluation.
  - Navigation-focused benchmarks like RoboTHOR (Deitke et al., 2020), Habitat-Matterport 3D (HM3D) (Ramakrishnan et al., 2021), and others (Krantz et al., 2022; Khanna et al., 2024) emphasize long-term scene exploration but often lack embodied interaction support.
- Embodied Question Answering (EQA) Benchmarks:
  - EQA benchmarks (Das et al., 2018; Wijmans et al., 2019; Yu et al., 2019) aim to develop goal-driven agents that can perceive their environment and answer questions. Some include memory QA (e.g., OpenEQA by Majumdar et al., 2024, and Yang et al., 2024).
  - Limitation: Existing EQA benchmarks may evaluate episodic memory but often don't jointly target both spatial and episodic memory, especially their changes over time, nor do they fully support complex embodied action tasks.
- Memory Systems in AI:
  - Early work with LLM agents (Shinn et al., 2023; Zhang et al., 2023; Packer et al., 2023; Zhang et al., 2024) uses memory for decision-making in web-based or sandbox environments. Most focus on improving retrieval from an experience pool or memory bank (Zhao et al., 2024; Gao et al., 2024; Xu et al., 2025b).
  - 3D-Mem (Yang et al., 2025b): This framework investigates 3D scene memory for exploration and reasoning by prompting vision-language models using snapshot-based memory.
  - Limitation: These approaches often do not support embodied interaction or action execution (e.g., 3D-Mem), or they don't explicitly focus on dense 3D representations and the intricate challenge of fusing spatial-temporal memory for task execution in dynamic embodied scenarios.
3.3. Technological Evolution
The field has evolved from 2D-based LLMs to 3D-aware LLMs (3D-LLMs, LLaVA-3D) that can perceive and reason about 3D spaces. Concurrently, embodied AI has advanced from basic navigation tasks in simulated environments to complex interaction and planning. Initially, LLMs lacked explicit mechanisms to handle long-term memory beyond their limited context windows. Solutions like Retrieval-Augmented Generation (RAG) emerged to address this by retrieving relevant information from external knowledge bases. However, integrating this with dynamic, dense 3D spatial-temporal information for embodied agents remained a significant challenge.
This paper's work fits within the technological timeline as a crucial step towards developing truly intelligent embodied agents by bridging the gap between 3D perception, LLM reasoning, and long-term memory management. It moves beyond simply retrieving information to actively fusing it, recognizing the unique demands of spatial-temporal memory in dynamic 3D environments.
3.4. Differentiation Analysis
Compared to related work, the core differences and innovations of this paper's approach are:
- Comprehensive 3D spatial-temporal memory focus: Unlike previous embodied benchmarks that primarily target short-horizon tasks or navigation, 3DMEM-BENCH explicitly focuses on long-term spatial-temporal memory through fine-grained embodied tasks, EQA, and captioning that span multiple rooms and require reasoning about memory changes over time and space.
- Dense 3D memory representation: The paper is among the first to explore dense 3D representations (like point clouds) as memory for embodied 3D LLMs. This is a significant departure from approaches that rely on sparse or object-centric representations, providing richer geometric and environmental detail crucial for complex tasks.
- Novel dual-memory system with dynamic memory fusion: 3DLLM-MEM introduces a unique dual-memory system (working memory + episodic memory) combined with a memory fusion module. This module dynamically uses working memory tokens as queries to selectively attend to and fuse relevant features from episodic memory. This is more sophisticated than simple context-window expansion or retrieval-augmented methods, as it actively integrates information based on task relevance and spatial-temporal relationships while also managing memory efficiency.
- Support for embodied interaction and actions: While 3D-Mem (Yang et al., 2025b) investigates 3D scene memory, it does not support embodied interaction or action execution. 3DLLM-MEM explicitly supports these, allowing agents to manipulate objects and navigate based on their fused spatial-temporal memory.
- Robustness in "in-the-wild" scenarios: 3DMEM-BENCH includes "in-the-wild" challenges (unseen objects, unseen memory contexts, novel challenges) that expose the fragility of existing methods. 3DLLM-MEM demonstrates significant robustness and generalization in these challenging scenarios, suggesting a more scalable solution.
4. Methodology
4.1. Principles
The core idea behind 3DLLM-MEM is to equip 3D Large Language Models (3D-LLMs) with a long-term spatial-temporal memory system that mimics human cognition. Humans use a working memory for immediate observations and an episodic memory for storing past experiences. 3DLLM-MEM adopts a similar dual-memory system: a limited-capacity working memory for current observations and an expandable episodic memory that stores past spatial-temporal information as dense 3D representations.
The key theoretical basis or intuition is that for an embodied agent to perform complex, long-horizon tasks in dynamic 3D environments, it needs to:
- Maintain Awareness of Current Context: The working memory ensures the agent is grounded in its immediate surroundings.
- Recall Relevant Past Experiences: The episodic memory stores a rich history of observations and interactions across different rooms and time steps.
- Selectively Integrate Information: Crucially, the agent shouldn't retrieve all past information (which would be computationally overwhelming) but rather selectively fuse only the most task-relevant features from episodic memory with its working memory. This memory fusion process is dynamic and guided by the current task and observations.

By integrating dense 3D representations (which preserve intricate geometric and environmental details) and a dynamic memory fusion module, 3DLLM-MEM aims to overcome the limitations of prior models that struggle with information density, context limits, and the entanglement of spatial and temporal memory.
4.2. Core Methodology In-depth (Layer by Layer)
3DLLM-MEM is built upon LLaVA-3D (Zhu et al., 2024), which provides the foundational 3D vision-language capabilities. The novel contribution of 3DLLM-MEM lies in its memory module that manages and fuses working and episodic memory.
4.2.1. Base Model: LLaVA-3D's 3D Perception
LLaVA-3D extends 2D-LLMs by incorporating 3D awareness. Here's how it processes 3D scenes:
- Multi-view Image Input: For a given 3D scene, LLaVA-3D takes multiple 2D images (views) as input. Each image can be denoted $I \in \mathbb{R}^{3 \times H \times W}$, where 3 is the number of RGB channels, $W$ is the width, and $H$ is the height.
- 2D Patch Encoding: A CLIP encoder (a vision transformer pre-trained with contrastive language-image pre-training) splits each 2D image into patches, which are then encoded into visual tokens.
- 3D Position Embedding: To bring these 2D patches into a 3D spatial context, 3D position embeddings are utilized. For each image, given a known depth image (providing distance information for each pixel), camera intrinsic parameters (describing the camera's optical properties), and camera extrinsic parameters (describing the camera's position and orientation in the 3D world), a 3D world position is obtained for each patch. These 3D positions are directly added to the 2D patch visual tokens.
- Formation of 3D Patches: The encoded 2D patch features, augmented with 3D position embeddings, form pixel-aligned 3D patches $X \in \mathbb{R}^{V \times h \times w \times d}$, where $V$ is the number of views, $d$ is the feature dimension, and $h = H/p$, $w = W/p$ for patch size $p$.
- Redundancy Reduction: To reduce redundancy in these 3D patches and manage computational load, Farthest Point Sampling (FPS) is applied. FPS selects a subset of points from a point cloud such that the selected points are as far apart from each other as possible, ensuring good coverage and a diverse representation of the original data. This downsamples the 3D features into a set of tokens $X_s \in \mathbb{R}^{N \times d}$, where $N$ is the reduced number of tokens and $d$ is the feature dimension. These tokens represent the current observations used by 3DLLM-MEM.
4.2.2. 3DLLM-MEM Memory Module
The 3DLLM-MEM model builds on this LLaVA-3D backbone by introducing a sophisticated memory module inspired by human cognitive processes.
4.2.2.1. Dual-Memory System: Working Memory and Episodic Memory
- Working Memory: This holds the agent's current observations. At time step $t$, the current observation, after being processed by LLaVA-3D's perception module (yielding the downsampled 3D features), can be denoted $O_t$. This is the agent's working memory, actively within its immediate context for processing.
- Episodic Memory: As the agent explores and interacts, past observations and experiences are stored in an episodic memory. This memory stores data from previous time steps up to the current time step $t$, represented conceptually as $\{O_1, \dots, O_{t-1}\}$.
4.2.2.2. Episodic Memory Management
To efficiently manage the episodic memory, a memory feature bank is employed.
- Feature Projection: Each observation $O_i$ at time step $i$ (where $i < t$) is first passed through a multi-layer perceptron (MLP) layer. An MLP is a feedforward artificial neural network that maps input vectors to output vectors. This MLP projects the observation into a memory-specific feature space, and the projected features are then stored in the memory bank.
- Temporal Encoding: To enhance temporal understanding within the episodic memory, sinusoidal positional embeddings are incorporated. These embeddings encode the time step $i$ and are added directly to the corresponding memory feature representations. Sinusoidal positional embeddings are sine and cosine functions of varying frequencies, allowing the model to distinguish positions (or time steps) and their relative order without adding learnable parameters.
4.2.2.3. Memory Fusion Module
The core innovation is the memory fusion module, which enables the agent to dynamically integrate relevant information from episodic memory into its current working memory. This module operates as follows:
- Query Generation: The 3D features from the working memory (the current observation $O_t$) are encoded into a shared memory space. This representation serves as the query feature $Q_m \in \mathbb{R}^{N \times d}$, where $N$ is the number of tokens (the downsampled 3D features from LLaVA-3D) and $d$ is the dimension of the feature space.
- Key and Value Features from Episodic Memory: The episodic memory bank stores corresponding key and value features from all past observations, denoted $K_e, V_e \in \mathbb{R}^{(T \cdot N_m) \times d}$, where $T$ is the total number of past time steps and $N_m$ is the number of memory tokens per time step.
- Memory-Query Attention: The query feature attends to the key features from the episodic memory using a scaled dot-product attention mechanism. This calculates attention scores indicating how relevant each part of the episodic memory is to the current observation and task. The fused query features are given by:
  $ Q_{\text{fused}} = \mathrm{softmax}\!\left(\frac{Q_m K_e^\top}{\sqrt{d}}\right)V_e $
  Where:
  - $Q_m$ is the query feature derived from the current working memory (current observation).
  - $K_e$ represents the key features from episodic memory; $K_e^\top$ denotes its transpose.
  - $\sqrt{d}$ is a scaling factor (the square root of the feature dimension, as in standard attention mechanisms), preventing the softmax from having extremely small gradients when inputs are large.
  - $Q_m K_e^\top / \sqrt{d}$ computes the scaled dot products between the working memory query and all episodic memory keys.
  - $\mathrm{softmax}$ normalizes these scores into a probability distribution over the episodic memory entries.
  - $V_e$ represents the value features from episodic memory.
  - The result, $Q_{\text{fused}}$, is a weighted sum of the episodic memory value features, effectively fusing the most relevant past information with the current context.
- Final Memory-Enhanced Representation: The fused memory feature is then concatenated with the original working memory query feature (concatenation joins two tensors along a particular dimension), creating the final memory-enhanced representation
  $ H = [\,Q_m;\; Q_{\text{fused}}\,], $
  which serves as the input to the LLM decoder for reasoning, planning, and action generation.
4.2.2.4. Memory Update
The memory system is dynamic and updated continuously:
- Working Memory: The working memory (current observation) is dynamic and updated online. As the agent interacts with the environment (e.g., picking up an object, moving to a new room), changes are immediately reflected in the 3D representations that form the working memory.
- Episodic Memory Transfer: When the agent moves to a new environment (e.g., a new room), the content of the previous working memory is transferred to the episodic memory bank.
- Memory Entry Updates: If the agent modifies an environment that already exists in the episodic memory bank (e.g., puts an object down in a previously visited room), the corresponding memory entry is updated to reflect the latest state. This keeps the memory bank consistent with the most recent state of explored environments. The paper notes that environment changes and observations are pre-collected and stored locally to facilitate efficient data loading during training and inference. Figure 3 from the original paper (reproduced below) visually illustrates this dual-memory system and fusion mechanism.
The image is a diagram of the 3DLLM-Mem architecture and its role in long-term memory formation. Part (a) on the left shows how memory from different time steps is used while executing the task "prepare a simple breakfast"; part (b) on the right outlines the memory fusion mechanism.
Figure 3: (a) We propose 3DLLM-MEM, a memory-enhanced 3D embodied agent that gradually forms its long-term memory while executing tasks. Multiple timesteps are shown together but in different colors, with each timestep's memory including the prior one. The task is "prepare a simple breakfast" as shown in Figure 2. (b) Overview of our memory fusion mechanism.
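A toy sketch of the episodic-bank bookkeeping described above (entries keyed by room and overwritten on revisits) could look like this; the class and its interface are hypothetical, not the paper's implementation.

```python
class EpisodicMemoryBank:
    """Keyed store of per-room memory entries; the latest state wins."""
    def __init__(self):
        self.entries = {}   # room_id -> (timestep, feature tensor)

    def commit(self, room_id, timestep, features):
        # Called when working memory is transferred (new room) or when a
        # previously visited room is modified: overwrite with the latest state.
        self.entries[room_id] = (timestep, features)

    def all_features(self):
        # Flatten entries in temporal order for the fusion module's keys/values.
        return [f for _, f in sorted(self.entries.values(), key=lambda e: e[0])]
```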
5. Experimental Setup
5.1. Datasets
The experiments primarily use a novel benchmark introduced in the paper: 3DMEM-BENCH.
5.1.1. Base Environment Construction
- Source: 3DMEM-BENCH is built on top of the Habitat-Matterport 3D (HM3D) semantics dataset (Ramakrishnan et al., 2021). HM3D contains 1,000 3D spaces and 10,600 rooms.
- Preprocessing: HM3D scenes are preprocessed to define axis-aligned bounding boxes (AABBs) for rooms and to use valid semantic label annotations. This filtering yields 182 3D spaces and 2,602 rooms.
- Interactive Objects: Original HM3D objects are not interactive. To enable manipulation tasks, interactive objects from Objaverse (Deitke et al., 2023), which contains 800K 3D objects across rich categories, are added.
- Room and Object Location Derivation:
  - The scenes use a semantic surface mesh where each triangle has a unique hexadecimal color linked to a surface label (e.g., floor, ceiling) and a room identifier.
  - AABBs for rooms are derived by querying the semantic table for floor and ceiling colors. Global candidate floor elevations are aggregated; vertical bounds are determined by the floor's lowest point and the ceiling's highest point (with defaults if one is missing), and horizontal limits are the min/max coordinates of the floor/ceiling points.
  - AABBs for objects within rooms are calculated by gathering the corresponding vertices and computing min/max coordinates.
  - Room-level and object-level AABBs are merged by their shared room index.
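Computing and merging AABBs from mesh vertices amounts to a few lines of NumPy; this sketch (with made-up coordinates) illustrates the min/max logic described above.

```python
import numpy as np

def aabb(vertices):
    """Axis-aligned bounding box of a vertex set: (min corner, max corner)."""
    v = np.asarray(vertices)
    return v.min(axis=0), v.max(axis=0)

def merge_aabb(a, b):
    """Union of two boxes, e.g., combining floor and ceiling extents for a room."""
    return np.minimum(a[0], b[0]), np.maximum(a[1], b[1])

floor = np.array([[0.0, 0.0, 0.0], [4.2, 3.1, 0.0]])     # toy floor vertices
ceiling = np.array([[0.0, 0.0, 2.6], [4.2, 3.1, 2.6]])   # toy ceiling vertices
room_box = merge_aabb(aabb(floor), aabb(ceiling))         # floor low to ceiling high
```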
5.1.2. Generating Task Trajectories
- Instruction Generation: Gemini (Team et al., 2023), a large language model, is prompted to generate diverse tasks. The prompts use the AABBs of rooms and objects.
- Interactive Object Integration: Gemini incorporates interactive objects based on task requirements and their appropriateness in indoor environments.
- Trajectory Simulation Pipeline: A pipeline verifies each trajectory step-by-step:
  - Correctness of the agent's location.
  - Existence and validity of referenced objects.
  - Correctness of pick-up and put-down actions.
  - Executability of high-level actions in the simulator (Szot et al., 2024; Yang et al., 2025a).
- Validation Rate: Approximately 24% of generated trajectories pass this filtering, ensuring correctness and feasibility.
5.1.3. Embodied Data Collection
- Exploration Phase: An embodied agent first performs random exploration to collect RGB-D observations (color and depth images) and corresponding camera poses (position and orientation).
- Task Execution Phase: The agent then follows the task trajectory, incrementally exploring new environments, executing interaction actions, and receiving feedback with new RGB-D observation data.
- Data Storage: All interaction results are recorded, and reconstructed point cloud data is precomputed and stored locally for faster loading during training and inference.
5.1.4. Data Curation
- Task Complexity Categories: Tasks are divided into three subcategories based on the number of multi-room scene settings:
  - Simple: 3 multi-room scene settings.
  - Medium: 5 multi-room scene settings.
  - Hard: 10 multi-room scene settings.
- Scale: A total of 51K trajectories are collected (31K simple, 10K medium, 10K hard). The final benchmark contains over 26,000 trajectories and 1,860 fine-grained long-term-memory embodied tasks.
- In-domain Evaluation Sets: Created by removing training tasks and filtering out instances never shown in the agent's working memory.
- In-the-wild Evaluation Sets: For generalization, these sets include:
  - Instances involving unseen objects.
  - Entirely unseen memory contexts.
  - Novel challenges differing from training.
- EQA Data Curation: After agent exploration, Gemini generates question-answer pairs from complete trajectories. Questions are categorized into spatial reasoning, long-term object navigation, comparative reasoning, multi-room layout understanding, and semantic object counting. These evaluate models on spatial-temporal changes in memory during task execution.
- Long-term Memory Captioning: Data is collected across multiple rooms before and after trajectory execution, enabling comparison and summarization of memory-relevant experiences (targeting semantic episodic memory).
5.1.5. Quality Control
Two procedures ensure quality:
- Automatic Validation: Re-running the trajectory simulation validation pipeline (as described in Section 5.1.2), especially for "in-the-wild" tasks.
- Manual Review: Four student experts manually inspect each benchmark instance. They render multi-view images of the entire scene using the simulator to verify that benchmark annotations accurately correspond to the simulated environment.

The following are examples of how Gemini is prompted to generate tasks for 3DMEM-BENCH and how human annotators review the data quality (Figure 5 from the original paper).
The following are the results from Table 4, Table 5, Table 6 and Figure 5 of the original paper:
System message: You are an AI assistant and task generator for a 3D embodied agent operating in a multi-room environment. The environment provides detailed object instance information, including bounding boxes and IDs. Your goal is to generate a complex task that requires the agent to explore multiple rooms, navigate, and crucially use long-term memory to recall details observed earlier.

Prompt: 1. Environment and Object Information. Object Representation: Each object is given with a bounding box in the format: "<object_name>(num)": [x_min, y_min, z_min], [x_max, y_max, z_max]. Here, (num) indicates the ID, with (0) being the closest to the origin [0,0,0]. IDs reset for each room (e.g., sofa(0) in room 2 and sofa(0) in room 4 if each room has one sofa). Actions Available: <GO TO ROOM(id)>: Navigate to a room that has already been visited. <GO TO NEW ROOM>: Navigate to a new, unexplored room (and unlock its objects); do not use this for rooms that have been visited before. <PICK UP object_name(id) from room(id) in room(id)>: Pick up an object that originally belongs to a specific room while in that same room. <PUT DOWN object_name(id) from room(id) on object_name(id) in room(id)>: Place an object (that originally belongs to a room) onto another object (such as a table or floor) in a room. New Objects: You can add extra objects to diversify the task. Important: Use only object names from the provided new_objects_name_list. If a room already has an object with the same name, the new object should have a new ID (e.g., if lamp(0) exists, the added one should be lamp(1)). These extra objects are only for task design; the agent's trajectory should not mention adding them. 2. Task Design Requirements. Multi-Room Exploration: Design a task that spans several rooms. The room order (given in a Room Order list) should be chosen so that necessary items are distributed across rooms. The agent should explore every room in the specified order. Long-Term Memory and Implicit Cues: Do not simply list all items as a checklist at the start. Instead, provide a vague overall goal (e.g., "prepare a meal"). Later in the trajectory, have the agent recall these earlier observations when the need arises. Ensure the agent must remember something seen long ago rather than simply following an explicit list. Update Memory and make new decisions based on current observations: The agent originally planned to use one object to complete its task but couldn't find it after exploring the rooms; it has to switch to another similar object to complete its task. Inventory and Action Constraints: The agent can only hold one item at a time. Never perform consecutive PICK UP or PUT DOWN actions. If the agent holds an item, it must put it down before picking up another. When temporarily storing an object (e.g., on a table), include a "thought" explaining why the object is being set down and later recalled. 3. Reasoning and Object Comparisons: If your task requires choosing a specific object instance (e.g., selecting table(1) because it is bigger than table(0)), compare their bounding boxes and explain your choice in the trajectory. For clarity, consider these examples: {In-context examples} Here is the scene information: {Input scene information}

Table 4: Prompt template for generating task trajectories for embodied tasks.
Prompt
You are an AI assistant / task generator in the room. All object instances in this 3D scene are given, along with their bounding boxes and IDs. Each object's bounding box is represented by a 3D coordinate "<obj_name>(num)": [x_min, y_min, z_min], [x_max, y_max, z_max] with units of meters, where the two triples represent the left-bottom corner and the right-top corner coordinates.
You will also receive a trajectory composed of the following tokens and reasoning chains. <GO TO ROOM(id)>: navigates back to a specific room (id); this can only be done if the agent has already been to this room. <PICK UP object_name(id) from room(id) in room(id)>: Pick up an object that originally belongs to a specific room while in that same room. <PUT DOWN object_name(id) from room(id) on object_name(id) in room(id)>: Place an object (that originally belongs to a room) onto another object (such as a table or floor) in a room. <GO TO NEW ROOM>: navigates to a new room you haven't explored and unlocks objects there.
This trajectory is what the agent has executed in the past. You need to propose several questions and answers focused on the long-term memory reasoning abilities of the agent. These reasoning questions should focus on what has changed temporally or spatially in the agent's memory; it is important that this change challenges the agent's memory. For example, the questions should cover object counting, spatial relations, comparisons between objects across rooms, long-term multi-room layout, and long-term multi-room object navigation. Remember that spatial memory is important: you should design questions about the 3D object spatial relations and room layout that require the agent to perform hard reasoning for the final answer.
For clarity, consider these examples: {In-context examples} Here is the scene information: {Input scene information} Here is the agent's trajectory: {Input agent's trajectory}
Table 5: Prompt template for generating QA data. {In-context examples} are in-context examples.
{Input scene information} are scene, room and object semantics along with their bounding boxes.
{Input agent's trajectory} is the 3D agent's explored trajectories and action chains.
Prompt
You are provided with a scene description containing multiple rooms. Each room includes a list of objects along with their positions in the room, represented by bounding boxes. Each object's bounding box is defined by a 3D coordinate in the format: <object_name>(num): [x min, y min, z min],[x max, y max, z max] with units in meters (defining the left-bottom and right-top corners). Your task is to generate an object caption for each room in the form of a coherent, descriptive paragraph that conveys the 3D spatial arrangement and relative positions of all objects within that room.
Then, you will receive the object descriptions and caption for the current 3D room you are in. You will also be provided with the previous rooms' captions as well. Your task is to generate new captions covering a summary of the common features across all rooms based on your current room and the important differences based on your current room. The reason for generating the new caption is to remind the agent of which previous-room memories can help it in the current room. The past objects and observations should be related to the current room by examining the summary of common things and differences. For clarity, consider these examples: {In-context examples}
Here is the scene information: {Input scene information} Here is the current room you are in and the previous rooms you went to: {Input agent's location}
Table 6: Prompt template for generating captioning data. {In-context examples} are in-context examples.
{Input scene information} are scene, room and object semantics along with their bounding boxes.
{Input agent's location} is the location for current room in the scene and the past explored rooms.

The image shows multi-view renderings of two rooms (room 8 and room 9) used to assess data quality. Each room is shown from several viewpoints to help human annotators verify the QA and captioning tasks.
Figure 5: Example of human annotators manually check the data quality on QA and captioning tasks through multiple rendered multi-view images from each room.
5.2. Evaluation Metrics
The paper uses different evaluation metrics for each of the three task categories: embodied tasks, Embodied Question Answering (EQA), and captioning.
5.2.1. Embodied Tasks
- Success Rate (SR):
  - Conceptual Definition: Success Rate measures the percentage of tasks in which the agent completes all sub-goals and reaches the desired final state defined by the task instructions. It is a binary metric per task (succeeded or failed).
  - Mathematical Formula:
    $ \mathrm{SR} = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} \times 100\% $
  - Symbol Explanation:
    - $\mathrm{SR}$: Success Rate.
    - Number of successfully completed tasks: count of tasks where the agent performed all required actions and achieved the final goal.
    - Total number of tasks: count of tasks attempted by the agent.
- Sub-Success Rate (Sub-SR):
  - Conceptual Definition: Sub-Success Rate is a finer-grained metric measuring the proportion of individual sub-goals or steps within tasks that the agent completes successfully, even if the overall task fails. This provides insight into partial task completion.
  - Mathematical Formula:
    $ \mathrm{Sub\text{-}SR} = \frac{\text{Number of successfully completed sub-goals}}{\text{Total number of sub-goals}} \times 100\% $
  - Symbol Explanation:
    - $\mathrm{Sub\text{-}SR}$: Sub-Success Rate.
    - Number of successfully completed sub-goals: count of individual steps or intermediate objectives achieved across all tasks.
    - Total number of sub-goals: count of all individual steps or intermediate objectives defined across all tasks.
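Both metrics reduce to simple counting. A small sketch follows; the episode format (a flag plus a list of sub-goal flags) is a hypothetical convention for illustration.

```python
def success_rates(episodes):
    """episodes: list of (task_succeeded: bool, subgoals: list[bool])."""
    sr = 100.0 * sum(ok for ok, _ in episodes) / len(episodes)
    subgoals = [g for _, gs in episodes for g in gs]
    sub_sr = 100.0 * sum(subgoals) / len(subgoals)
    return sr, sub_sr

# Two tasks: one fully solved, one with 1 of 3 sub-goals completed.
sr, sub_sr = success_rates([(True, [True, True]), (False, [True, False, False])])
print(f"SR={sr:.1f}%  Sub-SR={sub_sr:.1f}%")   # SR=50.0%  Sub-SR=60.0%
```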
5.2.2. Embodied Question Answering (EQA)
- Accuracy:
  - Conceptual Definition: For open-ended EQA tasks, accuracy measures the correctness of the agent's answer to a question. Since answers are free-form text, the paper follows the "LLM-as-judge" evaluation protocol: another large language model (in this case, Gemini) is prompted to evaluate whether the agent's generated answer is correct, typically by comparing it to a ground-truth answer and assessing its factual correctness and relevance.
  - Mathematical Formula (implicitly defined by the LLM-as-judge protocol; for classification-like accuracy it would be):
    $ \mathrm{Accuracy} = \frac{\text{Number of correctly answered questions}}{\text{Total number of questions}} \times 100\% $
  - Symbol Explanation:
    - $\mathrm{Accuracy}$: the proportion of questions for which the agent's answer is judged as correct.
    - Number of correctly answered questions: count of EQA questions where the agent's response was deemed correct by the LLM-as-judge.
    - Total number of questions: count of EQA questions.

The following is the prompt template used for the LLM-as-judge protocol for EQA evaluation (Table 7 from the original paper). The accompanying figure shows the evaluation flow for judging answer quality: it includes the system message, the reference answer, and the assistant's answer, and instructs the judge to compare the assistant's answer against the reference.
Table 7: Prompt template for open-ended QA evaluation following standard LLM-as-judge protocol.
5.2.3. Captioning
For captioning tasks, standard natural language generation metrics are used to compare the generated caption with a human-written reference caption.
- BLEU (Bilingual Evaluation Understudy):
  - Conceptual Definition: BLEU is an algorithm for evaluating the quality of machine-translated or machine-generated text. It works by comparing n-grams of the candidate text to the n-grams of one or more reference texts; a higher BLEU score indicates a closer match to the reference. BLEU1 considers unigrams (single words), while BLEU4 considers n-grams up to length four.
  - Mathematical Formula (the full BLEU formula is complex, but its core components are):
    $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $
  - Symbol Explanation:
    - $\mathrm{BP}$: brevity penalty, penalizing candidates that are too short compared to the reference.
    - $N$: maximum n-gram order considered (e.g., 4 for BLEU4).
    - $w_n$: positive weight for each n-gram order (usually uniform, $w_n = 1/N$).
    - $p_n$: modified n-gram precision of order $n$, i.e., the count of candidate n-grams that also appear in the reference, divided by the total number of n-grams in the candidate.
- METEOR (Metric for Evaluation of Translation with Explicit Ordering):
  - Conceptual Definition: METEOR addresses some shortcomings of BLEU by including explicit word-to-word matching based on unigram precision and recall (considering synonymy and stemming) and by factoring in word order. It computes a score from the harmonic mean of precision and recall, with recall weighted more heavily, and then applies a fragmentation penalty.
  - Mathematical Formula (simplified core idea; the recall-weighted $F_{\text{mean}}$ follows the original METEOR formulation):
    $ \mathrm{METEOR} = F_{\text{mean}} \cdot (1 - \mathrm{Penalty}), \qquad F_{\text{mean}} = \frac{10\,P\,R}{R + 9P} $
  - Symbol Explanation:
    - $F_{\text{mean}}$: a weighted harmonic mean of precision and recall.
    - $P$: unigram precision.
    - $R$: unigram recall.
    - $\mathrm{Penalty}$: fragmentation penalty that penalizes non-contiguous matches.
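Both metrics are available in NLTK. A small example follows; the sentences are invented, and `meteor_score` requires NLTK's WordNet data to be downloaded first.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs nltk.download('wordnet')

reference = "a book and a teacup are on the living room table".split()
candidate = "a book and a cup are on the table".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)
bleu4 = sentence_bleu([reference], candidate, weights=(0.25,) * 4,
                      smoothing_function=smooth)
meteor = meteor_score([reference], candidate)

print(f"BLEU1={bleu1:.3f}  BLEU4={bleu4:.3f}  METEOR={meteor:.3f}")
```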
5.3. Baselines
The 3DLLM-MEM model was compared against a broad range of memory management approaches and existing 3D LLMs:
- Everything in Context:
  - Description: This baseline attempts to fit all observations (visual features and textual information) directly into the model's context window.
  - Purpose: Serves as an upper bound for performance when memory capacity is not an issue; applicable only for smaller scenes or simpler tasks where all relevant information can be accommodated.
  - Limitation: Not scalable to long-horizon tasks or information-dense 3D environments due to LLM context-length limitations.
- Most Recent Memory:
  - Description: This baseline retains only the most recent observations within the context window, assuming that the most immediate past is the most relevant.
  - Purpose: A common heuristic for managing context limits in sequential processing.
  - Limitation: Fails when tasks require recalling information from further back in time or from previously visited, non-contiguous spatial locations.
- Retrieval-Augmented Memory (RAG):
  - Description: Inspired by retrieval-based techniques used with LLMs. This approach uses a memory bank to store past observations. During inference, the most relevant memory entries are retrieved (e.g., using a similarity search) and appended to the working memory (current observations) before being fed into the LLM for reasoning.
  - Purpose: To overcome context-length limitations by selectively providing relevant historical context.
  - Limitation: Effectiveness heavily depends on the retrieval mechanism's ability to accurately identify truly relevant information, which is challenging in spatial-temporal reasoning tasks; the paper's results suggest it often misses critical information.
- 3D-LLM (Finetuned):
  - Description: A popular 3D LLM (Hong et al., 2023b) recognized by the community. The authors fine-tune this model on their training data. For evaluation, it uses an "everything in context" strategy with the longest context window it supports. The paper notes modifications to fit multi-scene input and align with specific interaction tokens.
  - Purpose: Represents the state of the art in 3D LLMs without explicit long-term memory modules tailored for dynamic spatial-temporal fusion.
  - Limitation: Primarily struggles to retain long-term memory of semantic observations and is limited by its context length (512 tokens for its T5-FlanXL backbone), performing poorly on tasks requiring long-term spatial-temporal reasoning.
- 3D-Mem (Yang et al., 2025b):
  - Description: A framework designed for 3D scene memory in embodied exploration and reasoning. It uses a snapshot-based architecture with two stores: memory snapshots (compact multi-view RGB-D frames with bounding boxes summarizing inspected areas) and frontier snapshots (boundary views for new information). For evaluation in this paper, the frontier component is disabled, focusing only on memory snapshots captured from room centers.
  - Purpose: To provide a dedicated 3D scene memory for embodied agents.
  - Limitation: Does not support embodied interaction or action execution, and shows limitations in spatial relation, navigation, and object counting tasks, indicating a reliance on aggregated image-centric memories rather than true dense 3D spatial-temporal understanding.
5.4. Implementation Details
- Base Model: 3DLLM-MEM is implemented on top of LLaVA-3D (Zhu et al., 2024).
- Hardware & Framework: Modifications were made for compatibility with Google TPUs using the PyTorch/XLA frameworks (Paszke et al., 2019; XLA team, 2017-2025).
- Context Window: The model's context window is expanded to 8,192 tokens to accommodate long-term memory inputs.
- Fine-tuning:
  - The proposed memory module and the LLM decoder are fine-tuned, initialized from LLaVA-3D's pretrained weights.
  - Training runs on 8 Google Cloud TPU v5p cores with a batch size of 256, taking approximately 1 day for 1,000 steps.
  - Optimizer: Adam, with no weight decay.
  - Learning rate schedule: linear warmup over the initial 3% of steps, followed by a cosine decay schedule.
- Training Data Split: Fine-tuning uses the training split of 3DMEM-BENCH, which contains about 26K trajectories.
- Task-Specific Epochs (due to compute limitations):
  - Captioning task: 15 epochs.
  - Question-answering task: 20 epochs.
  - Embodied task: 75 epochs (allocated the most compute time).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that 3DLLM-MEM significantly outperforms all existing baselines across various tasks in the 3DMEM-BENCH benchmark, particularly excelling in long-horizon embodied tasks and showing strong generalization capabilities in "in-the-wild" settings.
6.1.1. Results on Embodied Tasks
The following are the results from Table 2a of the original paper:
| Model | Simple (ID) | Simple (ITW) | Medium (ID) | Medium (ITW) | Hard (ID) | Hard (ITW) | Average (ID) | Average (ITW) |
|---|---|---|---|---|---|---|---|---|
| 3D-LLM (Finetuned) | 10.4 / 20.3 | 9.1 / 18.5 | - | - | - | - | - | - |
| Everything in Context | 35.5 / 63.9 | 32.4 / 45.2 | - | - | - | - | - | - |
| Most Recent Memory | 32.8 / 62.3 | 23.4 / 38.6 | 20.1 / 34.8 | 12.4 / 25.3 | 10.4 / 20.7 | 5.4 / 12.1 | 21.1 / 39.3 | 13.7 / 25.3 |
| Retrieval-Augmented Memory | 34.2 / 63.0 | 28.3 / 46.2 | 21.8 / 40.2 | 13.7 / 28.0 | 10.8 / 21.6 | 4.8 / 10.6 | 22.3 / 41.6 | 15.6 / 28.3 |
| 3DLLM-MEM (Ours) | 45.5 / 73.4 | 37.0 / 65.4 | 36.8 / 67.8 | 31.6 / 57.4 | 30.5 / 46.2 | 27.8 / 42.1 | 37.6 / 62.5 | 32.1 / 55.0 |

Table 2a: Results on 3DMEM-BENCH embodied tasks. Each cell reports SR / Sub-SR in percent, where SR stands for success rate and Sub-SR for sub-success rate; ID = in-domain, ITW = in-the-wild. Our model outperforms existing approaches by a large margin.
- Overall Superiority: 3DLLM-MEM consistently achieves the highest Success Rate (SR) and Sub-Success Rate (Sub-SR) across all task complexities (simple, medium, hard) and settings (in-domain, in-the-wild).
- Performance Gap: On average, 3DLLM-MEM achieves an SR of 37.6% in-domain and 32.1% in-the-wild, significantly higher than the next best, Retrieval-Augmented Memory (22.3% in-domain, 15.6% in-the-wild). This represents a remarkable 16.5% improvement on in-the-wild tasks.
- Robustness to "In-the-Wild" Challenges: While other methods suffer sharp drops in "in-the-wild" scenarios (e.g., Most Recent Memory drops from 32.8% to 23.4% SR on simple tasks), 3DLLM-MEM remains robust; its average in-the-wild SR of 32.1% is a strong indicator of its generalization capabilities.
- Scalability with Complexity: As task complexity increases from simple to hard, all existing baselines degrade significantly, achieving only around 5% SR on hard "in-the-wild" tasks. In contrast, 3DLLM-MEM sustains a strong 27.8% SR, demonstrating its effectiveness in managing the longer-term memory representations required for complex, long-horizon tasks.
- Baseline Insights:
  - 3D-LLM (Finetuned) shows the lowest performance, even on simple tasks, underscoring the necessity of an explicit memory module.
  - Most Recent Memory and Retrieval-Augmented Memory perform poorly, indicating that simply keeping recent observations or retrieving relevant ones is insufficient for complex spatial-temporal reasoning.
  - Everything in Context performs better than Most Recent Memory and RAG on simple tasks (where all information fits), suggesting that access to all information is beneficial. However, 3DLLM-MEM still outperforms it, highlighting the value of selectively fusing task-relevant memory features.
6.1.2. Results on Long-Term EQA and Captioning
The following are the results from Table 2b of the original paper:
| Model | Embodied (ID) | Embodied (ITW) | Spatial | Nav. | Comparative | Layout | Count | BLEU1 | BLEU4 | METEOR |
|---|---|---|---|---|---|---|---|---|---|---|
| 3D-LLM (Finetuned) | - | - | 2.9 | 5.8 | 0.0 | 7.7 | 0.0 | 42.3 | 12.0 | 30.6 |
| 3D-Mem (GPT-4o) | - | - | 39.9 | 11.0 | 25.8 | 19.1 | 7.8 | 41.7 | 4.7 | 31.8 |
| 3D-Mem (Gemini-2.5-Flash) | - | - | 41.6 | 18.2 | 37.6 | 30.2 | 12.7 | 42.8 | 4.8 | 29.6 |
| 3D-Mem (Gemini-2.5-Pro) | - | - | 39.7 | 27.7 | 36.0 | 35.2 | 16.4 | 41.5 | 3.0 | 28.6 |
| Most Recent Memory | 21.1 | 13.7 | 27.5 | 30.2 | 24.3 | 20.1 | 10.5 | 32.4 | 10.1 | 25.6 |
| Retrieval-Augmented Memory | 22.3 | 15.6 | 38.0 | 33.4 | 31.8 | 29.7 | 15.6 | 40.8 | 11.5 | 29.3 |
| 3DLLM-MEM (Ours) | 37.6 | 32.1 | 62.8 | 40.6 | 41.4 | 39.9 | 26.3 | 58.2 | 18.8 | 37.3 |

Table 2b: Results on all tasks in 3DMEM-BENCH. Average success rate is reported for embodied tasks (ID = in-domain, ITW = in-the-wild). Nav. stands for long-term object navigation. We report accuracy score for open-ended EQA evaluation and follow the standard LLM-as-judge evaluation protocol by prompting Gemini. Evaluation details are provided in Appendix E.
- Superiority Across All Tasks: 3DLLM-MEM consistently outperforms all existing approaches across all EQA subcategories (Spatial, Nav., Comparative, Layout, Count) and captioning metrics (BLEU1, BLEU4, METEOR).
- EQA Performance: 3DLLM-MEM achieves significantly higher accuracy in EQA tasks, such as 62.8% for spatial reasoning, 40.6% for navigation, and 41.4% for comparative reasoning, demonstrating strong long-term spatial-temporal reasoning, which is crucial for answering complex questions about dynamic environments.
- Captioning Performance: 3DLLM-MEM also leads in captioning, with BLEU1 of 58.2, BLEU4 of 18.8, and METEOR of 37.3.
- Baseline Insights for EQA/Captioning:
  - 3D-LLM (Finetuned) shows relatively strong captioning performance (second-best BLEU1) thanks to its ability to summarize object-centric semantic memory, but its EQA performance is very poor (e.g., 0.0% on Comparative and Count), highlighting its limited long-term spatial-temporal reasoning due to restricted context length.
  - 3D-Mem improves EQA over other baselines but falls short in categories like spatial relation, navigation, and object counting, suggesting the limits of its image-centric aggregated memories compared to 3DLLM-MEM's dense 3D fusion.
  - Most Recent Memory and Retrieval-Augmented Memory also lag significantly behind 3DLLM-MEM on EQA and captioning, further validating the effectiveness of the proposed memory fusion technique.
6.1.3. Qualitative Results
The paper provides qualitative examples (Figure 4 and Figure 6) to visually illustrate 3DLLM-MEM's ability to utilize long-term memory and execute complex tasks. Figure 6, in particular, shows a multi-step task ("Prepare a cozy reading nook") where the agent explores, recalls memories of objects in different rooms, navigates back and forth, and performs interactions based on that memory. This vividly supports the quantitative results, showing that the model can integrate observations over time and space to complete long-horizon embodied tasks.
The following are the results from Figure 6 of the original paper:

The image is a schematic of 3DLLM-MEM's task execution. In steps (1) and (2), the agent explores the environment randomly to form an initial memory. After receiving the task instruction, it recalls earlier observations and goes to the bedroom to fetch a book, returns to the living room to continue the task, and finally retrieves a teacup from the kitchen, successfully preparing the cozy reading nook.
Figure 6: Qualitative example of 3DLLM-MEM. The task instruction is: Prepare a cozy reading nook in the living room with two books and a teacup. In images (1) and (2), the agent explores the environment randomly, forming an initial memory of the scene. After receiving the task instruction, it recalls its memory and navigates to the bedroom to pick up a book from the cabinet, as shown in images (3) and (4). The agent then returns to the living room and places the book on the table in front of the sofa (image 5). Unable to recall any additional books, the agent resumes exploration and finds a second book on the bed, which it picks up (image 6) and stacks on top of the first book (image 7). Finally, the agent recalls seeing a teacup in the kitchen, navigates to retrieve it (image 8), and places it on the table in the living room (image 9). The task is successfully completed.
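Read as a program, the episode is a high-level plan executed against recalled memory. The trace below is purely illustrative; the action vocabulary and target names are invented, not the simulator's actual interface:

```python
# Hypothetical high-level action trace for "Prepare a cozy reading nook";
# action names and targets are illustrative, not the simulator's real API.
plan = [
    ("explore", {}),                                  # images (1)-(2): build memory
    ("navigate", {"target": "bedroom_cabinet"}),      # recall book location
    ("pick", {"object": "book_1"}),                   # image (4)
    ("navigate", {"target": "living_room_table"}),
    ("place", {"object": "book_1", "on": "table"}),   # image (5)
    ("explore", {}),                                  # no second book in memory
    ("pick", {"object": "book_2"}),                   # found on the bed, image (6)
    ("navigate", {"target": "living_room_table"}),
    ("place", {"object": "book_2", "on": "book_1"}),  # image (7)
    ("navigate", {"target": "kitchen"}),              # recall teacup location
    ("pick", {"object": "teacup"}),                   # image (8)
    ("navigate", {"target": "living_room_table"}),
    ("place", {"object": "teacup", "on": "table"}),   # image (9)
]
```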
6.2. Ablation Studies / Parameter Analysis
The paper conducts an ablation study on the design choices for initializing the query in its memory fusion module, which is a critical component of 3DLLM-MEM. The goal is to determine the most effective way to derive the query that guides the fusion of episodic memory with working memory.
The following are the results from Table 3 of the original paper:
| Model | Simple In-domain | Simple In-the-wild | Medium In-domain | Medium In-the-wild | Hard In-domain | Hard In-the-wild | Avg. In-domain | Avg. In-the-wild |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 3DLLM-MEM | 45.5 / 73.4 | 37.0 / 65.4 | 36.8 / 67.8 | 31.6 / 57.4 | 30.5 / 46.2 | 27.8 / 42.1 | 37.6 / 62.5 | 32.1 / 55.0 |
| Init with Most Recent Episodic Memory | 42.3 / 69.4 | 28.6 / 50.7 | 32.4 / 58.6 | 23.7 / 45.1 | 22.6 / 37.8 | 15.3 / 31.4 | 32.4 / 55.3 | 22.5 / 42.4 |
| Init with Learnable Zero Parameters | 41.4 / 67.2 | 27.9 / 50.0 | 33.0 / 59.2 | 23.4 / 45.8 | 24.2 / 40.4 | 18.6 / 35.6 | 32.9 / 55.6 | 23.3 / 43.8 |

Table 3: Ablation study of query initialization designs in our memory fusion module. Each cell reports SR / Sub-SR (%).
The ablation study compares three ways to initialize the fusion query (the query term in the memory fusion formula); a minimal sketch of the three variants appears after this list:
- 3DLLM-MEM (Ours): initializes the fusion query with working memory features (i.e., the current observation tokens). This is the proposed method.
- Init with Most Recent Episodic Memory: initializes the fusion query with the features of the most recent entry in episodic memory.
- Init with Learnable Zero Parameters: initializes the fusion query with learnable, zero-initialized parameters, so the model must learn from scratch what constitutes a good query for retrieving relevant memories.
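To make the three variants concrete, here is a minimal PyTorch-style sketch of a cross-attention fusion step. It illustrates the ablated design choices, not the paper's implementation; shapes, module names, and the residual connection are assumptions:

```python
import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    """Sketch of cross-attention memory fusion; `query_init` selects the
    ablated initialization ('working', 'recent', or 'zeros')."""

    def __init__(self, dim: int, n_heads: int = 8,
                 query_init: str = "working", n_query: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.query_init = query_init
        if query_init == "zeros":
            # Learnable, zero-initialized query tokens (third ablation variant).
            self.zero_query = nn.Parameter(torch.zeros(1, n_query, dim))

    def forward(self, working: torch.Tensor, episodic: torch.Tensor) -> torch.Tensor:
        # working:  (B, N, D) current-observation tokens (working memory)
        # episodic: (B, M, D) flattened past spatial-temporal features
        if self.query_init == "working":      # proposed: current obs as query
            q = working
        elif self.query_init == "recent":     # ablation: last N episodic tokens
            q = episodic[:, -working.size(1):, :]
        else:                                 # ablation: learnable zero queries
            q = self.zero_query.expand(working.size(0), -1, -1)
        fused, _ = self.attn(q, episodic, episodic)  # attend over episodic memory
        # Residual onto working memory when shapes line up (an assumption).
        return working + fused if fused.shape == working.shape else fused
```

In the "working" variant the fused output stays aligned with the current observation tokens, which plausibly explains why querying with current observations works best in Table 3.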
Analysis of Results:
- Superiority of Working Memory Initialization: 3DLLM-MEM (initializing with working memory tokens) consistently achieves the highest SR and Sub-SR across all task complexities and settings, confirming that using current observations to query episodic memory is the most effective strategy.
- Performance Degradation of Alternatives: initializing with Most Recent Episodic Memory causes a significant drop relative to 3DLLM-MEM (e.g., average in-the-wild SR falls from 32.1% to 22.5%); initializing with Learnable Zero Parameters shows a similar drop, with an average in-the-wild SR of 23.3%.
- Interaction with Task Complexity: Init with Most Recent Episodic Memory surprisingly outperforms Init with Learnable Zero Parameters in simple settings (e.g., Simple in-domain SR: 42.3% vs. 41.4%), perhaps because relevant information in simple tasks is usually recent, so recent memory gives the query a useful starting point. In hard settings, however, Init with Learnable Zero Parameters edges ahead (Hard in-domain SR: 24.2% vs. 22.6%), suggesting that when relevant information lies far back in episodic memory, a learned, unconstrained query eventually beats one biased toward the recent past, which can mislead long-horizon reasoning.
- Conclusion from Ablation: the results strongly support initializing fusion queries with working memory tokens. This gives the memory fusion module a strong, task-relevant signal (the current observation) with which to retrieve and integrate useful features from episodic memory, yielding superior performance across diverse and challenging scenarios.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper addresses a critical limitation in current Large Language Models (LLMs) for embodied AI: the struggle to manage long-term spatial-temporal memory in dynamic, multi-room 3D environments. It makes two significant contributions:
- 3DMEM-BENCH: a novel, comprehensive benchmark providing over 26,000 trajectories and 2,892 embodied tasks, including question-answering and captioning. It is specifically designed to evaluate an agent's ability to reason over long-term memory across varying complexities (simple to hard) and "in-the-wild" scenarios, filling a crucial gap in existing evaluation methods.
- 3DLLM-MEM: a new embodied 3D LLM equipped with a dual-memory system (working memory for current observations, episodic memory for past experiences) and a novel memory fusion module. The module uses working memory tokens as queries to selectively attend to and fuse task-relevant spatial and temporal features from episodic memory, achieving both memory efficiency and effective spatial-temporal reasoning.

Experimental results on 3DMEM-BENCH demonstrate that 3DLLM-MEM achieves state-of-the-art performance across all tasks, significantly outperforming strong baselines. It is especially robust on challenging "in-the-wild" and hard-complexity tasks, where it vastly surpasses other methods, and the ablation study further validates the memory fusion mechanism and the choice of initializing queries with working memory tokens.
7.2. Limitations & Future Work
The authors explicitly state one limitation of 3DLLM-MEM:
- High-Level Policies, Not Low-Level Control: 3DLLM-MEM currently does not include low-level navigation and control policies; it relies on high-level predefined policies within the simulator to carry out actions. The authors regard this aspect as "orthogonal to our study" and note that it "could be explored and seamlessly integrated into our framework in the future," pointing toward end-to-end embodied agents in which 3DLLM-MEM's high-level reasoning and memory guide more granular locomotion and manipulation actions.
7.3. Personal Insights & Critique
This paper presents a significant advancement in the field of embodied AI and 3D LLMs. The problem of long-term spatial-temporal memory is indeed central to developing truly intelligent and capable agents that can operate in complex real-world environments.
- Innovation of the Benchmark: 3DMEM-BENCH is a highly valuable contribution. Its fine-grained complexity levels and "in-the-wild" scenarios force models to demonstrate true generalization and robust long-term memory, going beyond the simpler, short-horizon tasks of existing benchmarks, and its diverse task types (embodied actions, EQA, captioning) provide a holistic evaluation of spatial-temporal understanding.
- Elegant Memory Architecture: the dual-memory system with a dynamic memory fusion module is an elegant, biologically inspired design. Using working memory tokens as queries over episodic memory is a clever way to ensure task relevance and efficiency, mitigating the combinatorial explosion of information in dense 3D representations; the mechanism could inspire similar designs in other LLM-based agents tackling sequential tasks.
- Dense 3D Representation: the emphasis on dense 3D representations in memory is crucial. Though computationally intensive, they retain more information than sparse or object-centric alternatives, which is often necessary for nuanced spatial reasoning and detailed interaction, and the paper shows how selective fusion keeps this complexity manageable.
- Potential Broader Applications: the principles of 3DLLM-MEM extend beyond embodied AI. Any LLM-based system that must maintain a long-term, dynamically evolving context over time and across different "spaces" (e.g., documents, codebases, web environments) could benefit from such a dual-memory and fusion mechanism; LLM agents for complex software development, scientific discovery, or multi-session human-computer interaction could adapt it to maintain and leverage episodic knowledge.
- Unverified Assumptions/Areas for Improvement:
  - Low-Level Control: the stated reliance on high-level policies is a significant limitation for practical embodied AI. While the authors call it orthogonal, the true test of long-term memory on a physical robot would be seamless integration of this high-level reasoning with low-level motor control and perceptual feedback loops; future work should verify that the memory system stays robust under the uncertainties and continuous-learning demands of real-world interaction.
  - Computational Cost: although memory fusion aims for efficiency, managing dense 3D representations over long horizons could still be demanding, especially as episodic memory grows. A detailed analysis of memory footprint and inference-speed scaling with the number of timesteps in episodic memory would be beneficial.
  - Explainability: how does the memory fusion module decide what is relevant? It uses an attention mechanism, but understanding which specific 3D features from which past episodic-memory entries are fused could offer valuable insight into the model's reasoning and help diagnose failures.
  - Catastrophic Forgetting: as episodic memory is updated, and especially if entries are modified, there is a risk of catastrophic forgetting unless updates are handled carefully. The paper mentions updating entries, but how this preserves older, still-relevant information could be explored further.
  - Memory Eviction Strategy: for truly unbounded long-term operation, an eviction strategy (deciding when to discard old memories) would be needed to keep the memory bank from growing indefinitely. The paper does not explicitly discuss such a mechanism, which would be crucial for indefinite operation; one possible policy is sketched after this list.
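As one purely illustrative possibility (not part of 3DLLM-MEM), a capacity-bounded memory bank could evict the least-recently-attended entry; the fields and the usage heuristic below are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    features: object          # e.g., a tensor of fused 3D features
    step: int                 # timestep when the entry was written
    last_attended: int = 0    # last step this entry received high attention

@dataclass
class BoundedEpisodicMemory:
    """Illustrative capacity-bounded memory bank with LRU-style eviction."""
    capacity: int
    entries: list = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        if len(self.entries) >= self.capacity:
            # Evict the entry that has gone unattended the longest.
            victim = min(self.entries, key=lambda e: e.last_attended)
            self.entries.remove(victim)
        self.entries.append(entry)

    def mark_attended(self, entry: MemoryEntry, step: int) -> None:
        entry.last_attended = step  # call when fusion attends to this entry
```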
Overall, 3DLLM-MEM represents a compelling step forward in endowing embodied LLMs with sophisticated spatial-temporal memory, paving the way for more intelligent and autonomous agents.