
3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Published: 05/29/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces 3DLLM-Mem to enhance long-term spatial-temporal memory in Large Language Models for dynamic 3D environments. It presents 3DMem-Bench for evaluating reasoning capabilities, with experimental results showing significant performance improvements in embodied tasks.

Abstract

Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represents current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model". This title clearly indicates the paper's focus on enhancing 3D Large Language Models (LLMs) with long-term memory capabilities for embodied agents operating in 3D environments, specifically addressing both spatial and temporal aspects of memory.

1.2. Authors

The authors of the paper are:

  • Wenbo Hu, Yining Hong, Yanjun Wang, Leison Gao, Zibu Wei, Xingcheng Yao, Nanyun Peng, Kai-Wei Chang (affiliated with University of California, Los Angeles)

  • Yonatan Bitton, Idan Szpektor (affiliated with Google Research)

    Their research backgrounds are primarily in areas related to Large Language Models, 3D vision, embodied AI, and machine learning, given their affiliations with a prominent university AI research group and a major technology company's research division.

1.3. Journal/Conference

The paper was posted at 2025-05-28T17:59:13 (UTC). While a specific journal or conference is not explicitly mentioned in the provided abstract or first page, the presence of a "NeurIPS Paper Checklist" in the appendices strongly suggests that it is intended for publication at the Conference on Neural Information Processing Systems (NeurIPS). NeurIPS is one of the most prestigious and influential conferences in the field of artificial intelligence and machine learning, known for publishing cutting-edge research.

1.4. Publication Year

The publication timestamp indicates the paper was published on May 28, 2025.

1.5. Abstract

The paper addresses the limitation of current Large Language Models (LLMs) in handling long-term memory for embodied agents in dynamic, multi-room 3D environments. This limitation is attributed to the lack of proper 3D spatial-temporal memory modeling. To tackle this, the authors introduce two main contributions:

  1. 3DMEM-BENCH: A comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, including question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments.

  2. 3DLLM-MEM: A novel dynamic memory management and fusion model. This model utilizes working memory tokens (representing current observations) as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory (storing past observations and interactions). This approach aims for task-relevant information focus and memory efficiency in complex, long-horizon environments.

    Experimental results demonstrate that 3DLLM-MEM achieves state-of-the-art performance across various tasks on 3DMEM-BENCH, outperforming strong baselines by 16.5% in success rate on challenging embodied tasks.

The official source link is https://arxiv.org/abs/2505.22657, and the PDF is available at https://arxiv.org/pdf/2505.22657v2.pdf. This indicates the paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inability of current Large Language Models (LLMs) to effectively plan and act in complex, dynamic, multi-room 3D environments, especially when tasks require recalling information over extended periods and vast spaces. Humans excel at such tasks by naturally employing long-term memory across both temporal (what happened when) and spatial (where things are) experiences.

This problem is important because embodied AI systems, which are AI agents designed to interact with the physical world, need robust memory capabilities to perform real-world tasks. Current 3D LLMs and 3D Vision-Language-Action models have made strides in perceiving, reasoning, and acting in 3D spaces, but they suffer from several critical limitations:

  1. Struggle with long-term memory chains: Models fail to maintain coherent memory over multiple visual scenarios (e.g., different rooms in a house) and extended time frames.

  2. Information density of 3D scenes: Real-world 3D environments are vast and information-dense. Storing dense 3D representations (which capture intricate geometric relationships) is computationally challenging. Selective retrieval (common in LLMs) risks omitting critical information.

  3. Entanglement of spatial and temporal memory: Agents need to track not only object locations but also how environments and objects change over time due to exploration and interaction. Maintaining coherent representations of previously seen spaces while integrating new information is a significant hurdle.

    The paper's entry point and innovative idea is to address this 3D spatial-temporal memory modeling gap directly. It proposes a dual-memory system inspired by human cognition (working memory for current observations, episodic memory for past experiences) and a novel memory fusion module to selectively and efficiently integrate relevant information.

2.2. Main Contributions / Findings

The paper makes two primary contributions:

  1. 3DMEM-BENCH: A Novel Benchmark for Long-Term Spatial-Temporal Memory Evaluation.

    • This benchmark is specifically designed to evaluate embodied agents' reasoning, planning, and acting capabilities that require long-term spatial-temporal memory in multi-room 3D environments.
    • It comprises over 26,000 trajectories and 2,892 embodied tasks, including question-answering (EQA) and captioning.
    • Tasks are categorized into simple, medium, and hard difficulty levels, and include "in-the-wild" challenges to test generalization.
    • This addresses a gap in prior benchmarks which often focus on single-step or short-horizon reasoning, or lack embodied interaction support for long-term exploration.
  2. 3DLLM-MEM: A Dynamic Memory Management and Fusion Model for Embodied 3D LLMs.

    • 3DLLM-MEM introduces a novel architecture that integrates a dual-memory system: a limited-capacity working memory for current observations and an expandable episodic memory for storing dense 3D representations of past observations and interactions.
    • Its key innovation is a memory fusion module that uses working memory tokens (representing current observations) as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory.
    • This dynamic and selective fusion mechanism allows the model to focus on task-relevant information while maintaining memory efficiency, crucial for complex, long-horizon environments.

Key Findings:

  • 3DLLM-MEM achieved state-of-the-art performance across all evaluation categories (embodied tasks, EQA, and captioning) on the 3DMEM-BENCH benchmark.

  • It significantly outperformed existing baselines, including common memory management strategies (e.g., Most Recent Memory, Retrieval-Augmented Memory) and other 3D LLMs, by 16.5% in success rate on the most challenging "in-the-wild" embodied tasks.

  • The model demonstrated strong generalization capabilities, with its performance remaining robust even in "in-the-wild" settings where other methods saw sharp drops.

  • 3DLLM-MEM maintained strong performance (27.8%) on hard "in-the-wild" tasks, while baselines degraded significantly to ~5% success rate, showcasing its scalability and effectiveness in managing longer-term memory representations.

  • Ablation studies confirmed that initializing the memory fusion module's query with working memory tokens is the most effective design choice.

    These findings solve the problem of limited long-term spatial-temporal memory in embodied 3D LLMs, enabling agents to perform more complex, multi-room, and long-horizon tasks that require recalling and integrating information across space and time.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a beginner needs to grasp several core concepts from artificial intelligence, natural language processing, and computer vision.

  • Large Language Models (LLMs): LLMs are advanced artificial intelligence models, typically based on the transformer architecture, that are trained on vast amounts of text data. They are designed to understand, generate, and process human language, excelling at tasks like text generation, translation, summarization, and question answering. Their "largeness" refers to their immense number of parameters (billions or even trillions) and the scale of their training data. In this paper, LLMs are extended to perceive and reason about 3D environments.

  • Embodied AI: Embodied AI refers to intelligent agents that are situated within and can interact with physical or simulated environments. Unlike disembodied AI (like a chatbot), an embodied agent has a "body" (e.g., a robot, a virtual avatar) through which it perceives its surroundings (e.g., via cameras) and performs actions (e.g., moving, grasping objects). The goal is to develop AI that can learn and operate in complex, dynamic real-world settings, similar to humans.

  • 3D Vision-Language Models (3D VLMs) / 3D LLMs: These are LLMs augmented with the ability to process and reason about 3D visual information. Traditional LLMs primarily handle text, while 3D VLMs integrate inputs from 3D data (like point clouds, multi-view images with depth information) and link them to natural language. This allows them to understand queries about 3D scenes (e.g., "Where is the red chair?") and potentially generate actions in those scenes. The paper builds on LLaVA-3D as a base 3D LLM.

  • Working Memory: In cognitive psychology, working memory is a temporary, limited-capacity cognitive system that holds information readily available for processing. It's like a mental scratchpad where we keep track of immediate observations, thoughts, and plans. In the context of 3DLLM-MEM, it represents the agent's current observations and immediate context.

  • Episodic Memory: Episodic memory is a type of long-term memory that stores specific personal experiences, including their temporal (when) and spatial (where) context. It's memory of events that happened. For an embodied AI, this would include records of past observations, interactions, and locations visited over a longer period.

  • Point Clouds: A point cloud is a set of data points in a 3D coordinate system. These points represent the external surface of an object or environment. Each point typically has X, Y, Z coordinates and can also include other attributes like color (RGB), intensity, or normal vectors. Point clouds are a common way to represent 3D visual information and are used in this paper for dense 3D representations.

  • Axis-Aligned Bounding Box (AABB): An AABB is the smallest rectangular cuboid (a 3D box) that completely encloses an object and whose faces are aligned with the coordinate axes. It's a simple and common way to represent the spatial extent of objects or rooms in 3D environments, often used for collision detection and spatial queries. The paper uses AABBs for rooms and objects in HM3D.

  • Attention Mechanism: A core component of transformer models, the attention mechanism allows a neural network to dynamically weigh the importance of different parts of its input when processing information. It calculates attention scores between a query (what you're looking for) and keys (what's available), and then uses these scores to combine values (the information itself). This allows the model to "focus" on relevant parts of the input. In this paper, it's central to the memory fusion module where current observations query past episodic memory. The generic formula for self-attention (a common form of attention) is: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ Where:

    • $Q$ is the query matrix.
    • $K$ is the key matrix.
    • $V$ is the value matrix.
    • $d_k$ is the dimension of the key vectors, used as a scaling factor to prevent large dot-product values from pushing the softmax into regions with tiny gradients.
    • $QK^T$ is the dot product of the query and key matrices, calculating similarity scores.
    • $\mathrm{softmax}$ normalizes these scores into a probability distribution.
    • The result is a weighted sum of the value vectors.
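
For concreteness, here is a minimal NumPy sketch of generic scaled dot-product attention. The shapes and random inputs are purely illustrative and unrelated to the paper's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of the values

# Toy shapes: 4 query tokens, 6 key/value tokens, feature dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 8)
```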

3.2. Previous Works

The paper contextualizes its contributions by referencing several prior benchmarks and models, highlighting their limitations in addressing long-term spatial-temporal memory in embodied 3D environments.

  • 3D Large Language Models (3D-LLMs):

    • 3D-LLM (Hong et al., 2023b): A popular 3D LLM that injects 3D world information into LLMs. The paper uses LLaVA-3D (Zhu et al., 2024), which builds on 2D-LLM with multi-view images and 3D position embeddings, as its base model. These models typically excel at perceiving and reasoning about 3D spaces.
    • Limitation: The paper notes that current 3D LLMs struggle with long-term memory chains and managing the vast information density of 3D scenes, especially for long-horizon tasks across multiple rooms. 3D-LLM itself (as a baseline) shows low performance due to a lack of an explicit memory module and limited context length.
  • Embodied AI Simulators and Benchmarks:

    • ALFWorld (Shridhar et al., 2021), Behavior-1K (Li et al., 2024a), VisualAgentBench (Liu et al., 2024), EmbodiedBench (Yang et al., 2025a): These are existing benchmarks for embodied agents.
    • Limitation: The paper argues these benchmarks often focus on high-level planning tasks with short trajectories, typically in single-room settings, thus requiring minimal spatial-temporal memory. EmbodiedBench also lacks comprehensive memory evaluation.
    • Navigation-focused benchmarks like RoboTHOR (Deitke et al., 2020), Habitat-Matterport 3D (HM3D) (Ramakrishnan et al., 2021), and others (Krantz et al., 2022; Khanna et al., 2024) emphasize long-term scene exploration but often lack embodied interaction support.
  • Embodied Question Answering (EQA) Benchmarks:

    • EQA benchmarks (Das et al., 2018; Wijmans et al., 2019; Yu et al., 2019) aim to develop goal-driven agents that can perceive their environment and answer questions. Some include memory QA (e.g., OpenEQA by Majumdar et al., 2024, and Yang et al., 2024).
    • Limitation: Existing EQA benchmarks might evaluate episodic memory but often don't jointly target both spatial and episodic memory, especially their changes over time, nor do they fully support complex embodied action tasks.
  • Memory Systems in AI:

    • Early work with LLM agents (Shinn et al., 2023; Zhang et al., 2023; Packer et al., 2023; Zhang et al., 2024) uses memory for decision-making in web-based or sandbox environments. Most focus on improving retrieval from an experience pool or memory bank (Zhao et al., 2024; Gao et al., 2024; Xu et al., 2025b).
    • 3D-Mem (Yang et al., 2025b): This framework investigates 3D scene memory for exploration and reasoning by prompting vision-language models using snapshot-based memory.
    • Limitation: These approaches often do not support embodied interaction or action execution (e.g., 3D-Mem), or they don't explicitly focus on dense 3D representations and the intricate challenge of fusing spatial-temporal memory for task execution in dynamic embodied scenarios.

3.3. Technological Evolution

The field has evolved from 2D-based LLMs to 3D-aware LLMs (3D-LLMs, LLaVA-3D) that can perceive and reason about 3D spaces. Concurrently, embodied AI has advanced from basic navigation tasks in simulated environments to complex interaction and planning. Initially, LLMs lacked explicit mechanisms to handle long-term memory beyond their limited context windows. Solutions like Retrieval-Augmented Generation (RAG) emerged to address this by retrieving relevant information from external knowledge bases. However, integrating this with dynamic, dense 3D spatial-temporal information for embodied agents remained a significant challenge.

This paper's work fits within the technological timeline as a crucial step towards developing truly intelligent embodied agents by bridging the gap between 3D perception, LLM reasoning, and long-term memory management. It moves beyond simply retrieving information to actively fusing it, recognizing the unique demands of spatial-temporal memory in dynamic 3D environments.

3.4. Differentiation Analysis

Compared to related work, the core differences and innovations of this paper's approach are:

  • Comprehensive 3D Spatial-Temporal Memory Focus: Unlike previous embodied benchmarks that primarily target short-horizon tasks or navigation, 3DMEM-BENCH explicitly focuses on long-term spatial-temporal memory through fine-grained embodied tasks, EQA, and captioning that span multiple rooms and require reasoning about memory changes over time and space.
  • Dense 3D Memory Representation: The paper is among the first to explore dense 3D representations (like point clouds) as memory for embodied 3D LLMs. This is a significant departure from approaches that might rely on sparse or object-centric representations, providing richer geometric and environmental detail crucial for complex tasks.
  • Novel Dual-Memory System with Dynamic Memory Fusion: 3DLLM-MEM introduces a unique dual-memory system (working memory + episodic memory) combined with a memory fusion module. This module dynamically uses working memory tokens as queries to selectively attend to and fuse relevant features from episodic memory. This is more sophisticated than simple context window expansion or retrieval-augmented methods, as it actively integrates information based on task relevance and spatial-temporal relationships, while also managing memory efficiency.
  • Support for Embodied Interaction and Actions: While 3D-Mem (Yang et al., 2025b) investigates 3D scene memory, it does not support embodied interaction or action execution. 3DLLM-MEM explicitly supports these, allowing agents to manipulate objects and navigate based on their fused spatial-temporal memory.
  • Robustness in "In-the-Wild" Scenarios: The 3DMEM-BENCH includes "in-the-wild" challenges (unseen objects, unseen memory contexts, novel challenges) which expose the fragility of existing methods. 3DLLM-MEM demonstrates significant robustness and generalization capabilities in these challenging scenarios, suggesting a more scalable solution.

4. Methodology

4.1. Principles

The core idea behind 3DLLM-MEM is to equip 3D Large Language Models (3D-LLMs) with a long-term spatial-temporal memory system that mimics human cognition. Humans use a working memory for immediate observations and an episodic memory for storing past experiences. 3DLLM-MEM adopts a similar dual-memory system: a limited-capacity working memory for current observations and an expandable episodic memory that stores past spatial-temporal information as dense 3D representations.

The key theoretical basis or intuition is that for an embodied agent to perform complex, long-horizon tasks in dynamic 3D environments, it needs to:

  1. Maintain Awareness of Current Context: The working memory ensures the agent is grounded in its immediate surroundings.

  2. Recall Relevant Past Experiences: The episodic memory stores a rich history of observations and interactions across different rooms and time steps.

  3. Selectively Integrate Information: Crucially, the agent shouldn't retrieve all past information (which would be computationally overwhelming) but rather selectively fuse only the most task-relevant features from episodic memory with its working memory. This memory fusion process is dynamic and guided by the current task and observations.

    By integrating dense 3D representations (which preserve intricate geometric and environmental details) and a dynamic memory fusion module, 3DLLM-MEM aims to overcome the limitations of prior models that struggle with information density, context limits, and the entanglement of spatial and temporal memory.

4.2. Core Methodology In-depth (Layer by Layer)

3DLLM-MEM is built upon LLaVA-3D (Zhu et al., 2024), which provides the foundational 3D vision-language capabilities. The novel contribution of 3DLLM-MEM lies in its memory module that manages and fuses working and episodic memory.

4.2.1. Base Model: LLaVA-3D's 3D Perception

LLaVA-3D extends 2D-LLMs by incorporating 3D awareness. Here's how it processes 3D scenes:

  1. Multi-view Image Input: For a given 3D scene, LLaVA-3D takes multiple 2D images (views) as input. Each image is denoted as $X \in \mathbb{R}^{3 \times W \times H}$, where 3 is for the RGB channels, $W$ is width, and $H$ is height.
  2. 2D Patch Encoding: A CLIP encoder (a vision transformer pre-trained on contrastive language-image pre-training) splits each 2D image into patches. These patches are then encoded into visual tokens.
  3. 3D Position Embedding: To bring these 2D patches into a 3D spatial context, 3D position embeddings are utilized. For each image, with a known depth image (providing distance information for each pixel), camera intrinsic parameters (describing the camera's optical properties), and camera extrinsic parameters (describing the camera's position and orientation in the 3D world), 3D positions in the 3D world are obtained for each patch. These 3D positions are directly added to the 2D patch visual tokens.
  4. Formation of 3D Patches: The encoded 2D patch features, augmented with 3D position embeddings, form pixel-aligned 3D patches $X_{3D} \in \mathbb{R}^{V \times d \times w \times h}$, where $V$ is the number of views, $d$ is the feature dimension, $w = \lfloor \frac{W}{P} \rfloor$, and $h = \lceil \frac{H}{P} \rceil$ (assuming $P$ is the patch size).
  5. Redundancy Reduction: To reduce redundancy in these 3D patches and manage computational load, Farthest Point Sampling (FPS) is applied. FPS is a technique used to select a subset of points from a point cloud such that the selected points are as far apart from each other as possible, ensuring good coverage and a diverse representation of the original data. This downsamples the 3D features into a set of tokens $X_{3DFeat} \in \mathbb{R}^{N \times d}$, where $N$ is the reduced number of tokens and $d$ is the feature dimension. These tokens represent the current observations used by 3DLLM-MEM.
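
As an illustration of the FPS step, the following is a minimal, unoptimized sketch of farthest point sampling. It is a generic implementation written for this analysis, not LLaVA-3D's actual code, and the point counts are arbitrary.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedily pick n_samples points that are maximally spread out.

    points: (N, 3) array of 3D positions (or any feature vectors).
    Returns the indices of the selected points.
    """
    n = points.shape[0]
    selected = [0]                              # start from an arbitrary point
    dist = np.full(n, np.inf)                   # distance to the nearest selected point
    for _ in range(n_samples - 1):
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)              # update nearest-selected distances
        selected.append(int(dist.argmax()))     # pick the farthest remaining point
    return np.array(selected)

pts = np.random.rand(1024, 3)                   # e.g., centers of pixel-aligned 3D patches
idx = farthest_point_sampling(pts, 256)         # downsample to 256 tokens
print(idx.shape)                                # (256,)
```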

4.2.2. 3DLLM-MEM Memory Module

The 3DLLM-MEM model builds on this LLaVA-3D backbone by introducing a sophisticated memory module inspired by human cognitive processes.

4.2.2.1. Dual-Memory System: Working Memory and Episodic Memory

  • Working Memory: This holds the agent's current observations. At any time step $t=i$, the current observation, after being processed by LLaVA-3D's perception module (resulting in the downsampled 3D features), is denoted as $X^{[t=i]} \in \mathbb{R}^{N \times d}$. This is considered the agent's working memory, which is actively within its immediate context for processing.
  • Episodic Memory: As the agent explores and interacts, past observations and experiences are stored in an episodic memory. This memory stores data from previous time steps, from $t=1$ up to the current time step $T$, represented conceptually as $X^{[t=1:T]} \in \mathbb{R}^{T \times N \times d}$.

4.2.2.2. Episodic Memory Management

To efficiently manage the episodic memory, a memory feature bank is employed.

  • Feature Projection: Each observation $X^{[t=j]}$ at time step $j$ (where $1 \leq j \leq T$) is first passed through a multi-layer perceptron (MLP) layer. An MLP is a type of feedforward artificial neural network that maps sets of input data onto a set of appropriate outputs. This MLP projects the observation into a memory-specific feature space, and the projected features are then stored in the memory bank.
  • Temporal Encoding: To enhance the temporal understanding within the episodic memory, sinusoidal positional embeddings are incorporated. These embeddings encode the time step $t=j$ and are directly added to the corresponding memory feature representations. Sinusoidal positional embeddings are typically sine and cosine functions of varying frequencies, allowing the model to distinguish positions (or time steps) and understand their relative order without adding learnable parameters.
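
The sketch below shows how such an episodic-memory entry could be built, assuming a simple two-layer MLP projection and the standard sinusoidal formulation. The layer sizes and the names `memory_proj` and `sinusoidal_time_embedding` are illustrative assumptions, not the paper's actual architecture.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t, dim):
    """Standard sinusoidal embedding of a scalar time step t (no learned parameters)."""
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, dim, 2) / dim)
    angles = t * freqs
    emb = torch.zeros(dim)
    emb[0::2], emb[1::2] = torch.sin(angles), torch.cos(angles)
    return emb

# Hypothetical projection into a memory-specific feature space.
d, M = 1024, 1024                          # observation / memory feature dims (illustrative)
memory_proj = nn.Sequential(nn.Linear(d, M), nn.GELU(), nn.Linear(M, M))

obs = torch.randn(256, d)                  # X^{[t=j]}: N tokens for one past observation
t = 3                                      # the time step at which it was observed
mem_entry = memory_proj(obs) + sinusoidal_time_embedding(t, M)   # add temporal encoding
```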

4.2.2.3. Memory Fusion Module

The core innovation is the memory fusion module, which enables the agent to dynamically integrate relevant information from episodic memory into its current working memory. This module operates as follows:

  1. Query Generation: The 3D features from the working memory (the current observation $X^{[t=i]}$) are encoded into a shared memory space. This representation then serves as the query feature, denoted as $f_t^Q \in \mathbb{R}^{N \times M}$. Here, $N$ is the number of tokens (downsampled 3D features from LLaVA-3D), and $M$ is the dimension of the feature space.

  2. Key and Value Features from Episodic Memory: The episodic memory bank stores corresponding key and value features from all past observations. These are denoted as $f^K \in \mathbb{R}^{T \times N \times M}$ and $f^V \in \mathbb{R}^{T \times N \times M}$, respectively, where $T$ is the total number of time steps (past observations) and $N$ is the number of memory tokens per time step.

  3. Memory-Query Attention: The query feature $f_t^Q$ attends to the key features $f^K$ from the episodic memory using a scaled dot-product attention mechanism. This calculates attention scores indicating how relevant each part of the episodic memory is to the current observation and task. The formula for the fused query features $f_{\mathrm{fuse}}^Q$ is given as: $f_{\mathrm{fuse}}^Q = \mathrm{Softmax}\left(\frac{f_t^Q (f^K)^T}{\sqrt{C}}\right) f^V$ Where:

    • $f_t^Q \in \mathbb{R}^{N \times M}$ is the query feature derived from the current working memory (current observation).
    • $f^K \in \mathbb{R}^{T \times N \times M}$ represents the key features from episodic memory; $(f^K)^T$ denotes its transpose.
    • $C$ is a scaling factor, likely related to the dimension of the key vectors, similar to $d_k$ in standard attention mechanisms, preventing the softmax function from having extremely small gradients when inputs are large.
    • $\frac{f_t^Q (f^K)^T}{\sqrt{C}}$ computes the scaled dot products between the working memory query and all episodic memory keys.
    • $\mathrm{Softmax}(\cdot)$ normalizes these scores into a probability distribution over the episodic memory entries.
    • $f^V \in \mathbb{R}^{T \times N \times M}$ represents the value features from episodic memory.
    • The result, $f_{\mathrm{fuse}}^Q$, is a weighted sum of the episodic memory value features, effectively fusing the most relevant past information with the current context.
  4. Final Memory-Enhanced Representation: The fused memory feature $f_{\mathrm{fuse}}^Q$ is then concatenated with the original working memory query feature $f_t^Q$. Concatenation means joining two tensors along a particular dimension. This creates the final memory-enhanced representation $f^M$ for the agent: $f^M = \mathrm{Concat}\left[ f_{\mathrm{fuse}}^Q ; f_t^Q \right]$. This $f^M$ then serves as the input to the LLM decoder for reasoning, planning, and action generation.
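
The following sketch mirrors the fusion equations above under two stated assumptions: the $T$ past observations are flattened into a single key/value axis so that one softmax spans all memory tokens, and the scaling factor $C$ is taken to be the feature dimension $M$. It is an illustration of the mechanism, not the authors' code.

```python
import torch
import torch.nn.functional as F

def memory_fusion(f_q, f_k, f_v):
    """Fuse episodic memory into the working-memory query, then concatenate.

    f_q: (N, M)      query features from the current observation (working memory)
    f_k: (T, N, M)   key features of T past observations (episodic memory)
    f_v: (T, N, M)   value features of T past observations
    Returns f_m: (2N, M), the memory-enhanced representation fed to the LLM decoder.
    """
    T, N, M = f_k.shape
    keys = f_k.reshape(T * N, M)                 # flatten time into one key axis (assumption)
    values = f_v.reshape(T * N, M)
    scores = f_q @ keys.T / (M ** 0.5)           # (N, T*N) scaled dot products
    weights = F.softmax(scores, dim=-1)          # attention over all episodic memory tokens
    f_fuse = weights @ values                    # (N, M) fused memory features
    return torch.cat([f_fuse, f_q], dim=0)       # Concat[f_fuse ; f_q]

f_q = torch.randn(256, 1024)
f_k = torch.randn(8, 256, 1024)                  # 8 past time steps in the memory bank
f_v = torch.randn(8, 256, 1024)
print(memory_fusion(f_q, f_k, f_v).shape)        # torch.Size([512, 1024])
```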

4.2.2.4. Memory Update

The memory system is dynamic and updated continuously:

  • Working Memory: The working memory (current observation) is dynamic and updated online. As the agent interacts with the environment (e.g., picking up an object, moving to a new room), changes are immediately reflected in the 3D representations that form the working memory.

  • Episodic Memory Transfer: When the agent moves to a new environment (e.g., a new room), the content of the previous working memory is transferred to the episodic memory bank.

  • Memory Entry Updates: If the agent modifies an environment that already exists in the episodic memory bank (e.g., puts an object down in a previously visited room), the corresponding memory entry in the episodic memory bank is updated to reflect the latest state. This ensures that the memory bank remains consistent and reflects the most recent state of explored environments. The paper notes that environment changes and observations are pre-collected and stored locally to facilitate efficient data loading during training and inference.
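
A toy sketch of the update rules described above, assuming the episodic memory can be keyed by a location identifier so that revisiting and modifying a room overwrites its stale entry; the class and method names are hypothetical.

```python
import torch

class EpisodicMemoryBank:
    """Toy illustration of the memory update rules (not the authors' code)."""

    def __init__(self):
        self.bank = {}                                # location_id -> memory features

    def commit(self, location_id, working_memory_feats):
        # Called when the agent leaves a location: the working memory is transferred
        # into the episodic bank. If the location already has an entry (the agent
        # modified a previously visited room), the entry is overwritten with the
        # latest state so the bank stays consistent.
        self.bank[location_id] = working_memory_feats

    def all_features(self):
        # Entries stacked here would serve as the key/value features for memory fusion.
        return list(self.bank.values())

bank = EpisodicMemoryBank()
bank.commit("room_2", torch.randn(256, 1024))         # leaving room 2: store its features
bank.commit("room_2", torch.randn(256, 1024))         # later edit in room 2 overwrites the entry
```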

    Figure 3 from the original paper (reproduced below) visually illustrates this dual-memory system and fusion mechanism.

    Figure 3: (a) We propose 3DLLM-MEM, a memory-enhanced 3D embodied agent that gradually forms its long-term memory while executing tasks. Multiple timesteps are shown together but in different colors, with each timestep's memory including the prior one. The task is "prepare a simple breakfast" as shown in Figure 2. (b) Overview of our memory fusion mechanism.

5. Experimental Setup

5.1. Datasets

The experiments primarily use a novel benchmark introduced in the paper: 3DMEM-BENCH.

5.1.1. Base Environment Construction

  • Source: 3DMEM-BENCH is built on top of the Habitat-Matterport 3D (HM3D) semantics dataset (Ramakrishnan et al., 2021). HM3D contains 1000 3D spaces and 10,600 rooms.
  • Preprocessing: HM3D scenes undergo preprocessing to define axis-aligned bounding boxes (AABB) for rooms and use valid semantic label annotations. This filtering process yields 182 3D spaces and 2,602 rooms.
  • Interactive Objects: Original HM3D objects are not interactive. To enable manipulation tasks, interactive objects from Objaverse (Deitke et al., 2023) are added. Objaverse consists of 800K 3D objects across rich categories.
  • Room and Object Location Derivation:
    • The scenes use a semantic surface mesh where each triangle has a unique hexadecimal color linked to a surface label (e.g., floor, ceiling) and a room identifier.
    • AABBs for rooms are derived by querying the semantic table for floor and ceiling colors. Global candidate floor elevations are aggregated. Vertical bounds are determined by the floor's lowest point and the ceiling's highest point; if one is missing, defaults are used. Horizontal limits are the min/max coordinates of the floor/ceiling points (a toy sketch of this computation follows this list).
    • AABBs for objects within rooms are calculated by gathering corresponding vertices and computing min/max coordinates.
    • Room-level and object-level AABBs are merged by their shared room index.
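
A minimal sketch of the AABB computation described above, assuming the vertical axis is z and that floor and ceiling vertices are already grouped per room; the axis convention and toy inputs are assumptions, and HM3D's actual convention may differ.

```python
import numpy as np

def aabb_from_points(points):
    """Axis-aligned bounding box: per-axis min and max of a set of 3D points."""
    points = np.asarray(points)
    return points.min(axis=0), points.max(axis=0)

def room_aabb(floor_points, ceiling_points):
    """Vertical bounds from the floor's lowest and the ceiling's highest point,
    horizontal bounds from the min/max over both surfaces (as described above)."""
    pts = np.vstack([floor_points, ceiling_points])
    lo, hi = aabb_from_points(pts)
    lo[2] = floor_points[:, 2].min()        # assumes z is the vertical axis
    hi[2] = ceiling_points[:, 2].max()
    return lo, hi

floor = np.random.rand(500, 3) * [6, 4, 0.05]                  # toy floor mesh vertices
ceiling = np.random.rand(500, 3) * [6, 4, 0.05] + [0, 0, 2.6]  # toy ceiling vertices
print(room_aabb(floor, ceiling))
```

Object-level AABBs follow the same min/max computation over the vertices belonging to each object instance.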

5.1.2. Generating Task Trajectories

  • Instruction Generation: Gemini (Team et al., 2023), a Large Language Model, is prompted to generate diverse tasks. The prompts use the AABBs of rooms and objects.
  • Interactive Object Integration: Gemini incorporates interactive objects based on task requirements and their appropriateness in indoor environments.
  • Trajectory Simulation Pipeline: A pipeline verifies each trajectory step-by-step:
    1. Correctness of agent's location.
    2. Existence and validity of referenced objects.
    3. Correctness of pick-up and put-down actions.
    4. Ensuring high-level actions are executable in the simulator (Szot et al., 2024; Yang et al., 2025a).
  • Validation Rate: Approximately 24% of generated trajectories pass this filtering, ensuring correctness and feasibility.

5.1.3. Embodied Data Collection

  • Exploration Phase: An embodied agent first performs random exploration to collect RGB-D observations (color and depth images) and corresponding camera poses (position and orientation).
  • Task Execution Phase: The agent then follows the task trajectory, incrementally exploring new environments, executing interaction actions, and receiving feedback with new RGB-D observation data.
  • Data Storage: All interaction results are recorded, and reconstructed point cloud data is precomputed and stored locally for faster loading during training and inference.
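
For reference, here is a generic sketch of how RGB-D observations and camera poses can be back-projected into a world-frame point cloud using the standard pinhole model. The intrinsics and image size are illustrative, and the paper's reconstruction pipeline may differ in detail.

```python
import numpy as np

def depth_to_point_cloud(depth, K, cam_to_world):
    """Back-project a depth image into world-frame 3D points (standard pinhole model).

    depth:        (H, W) depth in meters
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera extrinsics (camera-to-world transform)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)     # homogeneous camera coords
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world[z > 0]                                    # drop invalid depth readings

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # toy intrinsics
depth = np.random.uniform(0.5, 4.0, size=(480, 640))          # toy depth image
print(depth_to_point_cloud(depth, K, np.eye(4)).shape)        # (307200, 3)
```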

5.1.4. Data Curation

  • Task Complexity Categories: Tasks are divided into three subcategories based on the number of multi-room scene settings:
    • Simple: 3 multi-room scene settings.
    • Medium: 5 multi-room scene settings.
    • Hard: 10 multi-room scene settings.
  • Scale: A total of 51K trajectories are collected (31K simple, 10K medium, 10K hard). The final benchmark contains over 26,000 trajectories and 1,860 fine-grained long-term memory embodied tasks.
  • In-domain Evaluation Sets: Created by removing training tasks and filtering instances never shown in the agent's working memory.
  • In-the-wild Evaluation Sets: For generalization, these sets include:
    • Instances involving unseen objects.
    • Entirely unseen memory contexts.
    • Novel challenges differing from training.
  • EQA Data Curation: After agent exploration, Gemini generates question-answer pairs from complete trajectories. Questions are categorized into:
    • Spatial reasoning
    • Long-term object navigation
    • Comparative reasoning
    • Multi-room layout understanding
    • Semantic object counting
    These question types evaluate models on spatial-temporal changes in memory during task execution.
  • Long-term Memory Captioning: Data collected across multiple rooms before and after trajectory execution, enabling comparison and summarization of memory-relevant experiences (targeting semantic episodic memory).

5.1.5. Quality Control

Two procedures ensure quality:

  1. Automatic Validation: Re-running the trajectory simulation validation pipeline (as described in Section 5.1.2), especially for "in-the-wild" tasks.

  2. Manual Review: Four student experts manually inspect each benchmark instance. They render multi-view images of the entire scene using the simulator to verify if benchmark annotations accurately correspond to the simulated environment.

    The following are examples of how Gemini is prompted to generate tasks for 3DMEM-BENCH and how human annotators review the data quality (Figure 5 from the original paper).

The following reproduces Table 4, Table 5, Table 6, and Figure 5 from the original paper:

System message: You are an AI assistant and task generator for a 3D embodied agent operating in a multi-room environment. The environment provides detailed object instance information, including bounding boxes and IDs. Your goal is to generate a complex task that requires the agent to explore multiple rooms, navigate, and crucially use long-term memory to recall details observed earlier.

Prompt:

1. Environment and Object Information. Object Representation: Each object is given with a bounding box in the format: "<object_name>(num)": [x_min, y_min, z_min], [x_max, y_max, z_max]. Here, (num) indicates the ID, with (0) being the closest to the origin [0,0,0]. IDs reset for each room (e.g., sofa(0) in room 2 and sofa(0) in room 4 if each room has one sofa). Actions Available: <GO TO ROOM(id)>: Navigate to a room that has already been visited. <GO TO NEW ROOM>: Navigate to a new, unexplored room (and unlock its objects). Do not use this for rooms that have been visited before. <PICK UP object_name(id) from room(id) in room(id)>: Pick up an object that originally belongs to a specific room while in that same room. <PUT DOWN object_name(id) from room(id) on object_name(id) in room(id)>: Place an object (that originally belongs to a room) onto another object (such as a table or floor) in a room. New Objects: You can add extra objects to diversify the task. Important: Use only object names from the provided new_objects_name_list. If a room already has an object with the same name, the new object should have a new ID (e.g., if lamp(0) exists, the added one should be lamp(1)). These extra objects are only for task design; the agent's trajectory should not mention adding them.

2. Task Design Requirements. Multi-Room Exploration: Design a task that spans several rooms. The room order (given in a Room Order list) should be chosen so that necessary items are distributed across rooms. The agent should explore every room in the specified order. Long-Term Memory and Implicit Cues: Do not simply list all items as a checklist at the start. Instead, provide a vague overall goal (e.g., "prepare a meal"). Later in the trajectory, have the agent recall these earlier observations when the need arises. Ensure the agent must remember something seen long ago rather than simply following an explicit list. Update memory and make new decisions based on your current observations: the agent originally planned to use one object for completing its task, but couldn't find it after exploring the rooms; it has to switch to another similar object to complete its task. Inventory and Action Constraints: The agent can only hold one item at a time. Never perform consecutive PICK UP or PUT DOWN actions. If the agent holds an item, it must put it down before picking up another. When temporarily storing an object (e.g., on a table), include a "thought" explaining why the object is being set down and later recalled.

3. Reasoning and Object Comparisons: If your task requires choosing a specific object instance (e.g., selecting table(1) because it is bigger than table(0)), compare their bounding boxes and explain your choice in the trajectory.

For clarity, consider these examples: {In-context examples} Here is the scene information: {Input scene information}

Table 4: Prompt template for generating task trajectories for embodied tasks.

Prompt: You are an AI assistant / task generator in the room. All object instances in this 3D scene are given, along with their bounding boxes and ids. Each object's bounding box is represented by a 3D coordinate '<obj_name>(num)': [x_min, y_min, z_min], [x_max, y_max, z_max] with units of meters, where the two triples are the left-bottom corner and the right-top corner coordinates. You will also receive a trajectory composed of the following tokens and reasoning chains. <GO TO ROOM(id)>: which navigates back to a specific room (id). This can only be done if the agent has already gone to this room. <PICK UP object_name(id) from room(id) in room(id)>: Pick up an object that originally belongs to a specific room while in that same room. <PUT DOWN object_name(id) from room(id) on object_name(id) in room(id)>: Place an object (that originally belongs to a room) onto another object (such as a table or floor) in a room. <GO TO NEW ROOM>: which navigates to a new room you haven't explored and unlocks objects there. This trajectory is what the agent has executed over the past. You need to propose several questions and answers focused on the reasoning abilities of the long-term memory of the agent. These reasoning questions should focus on what has changed temporally or spatially in this agent's memory. It is important that this change challenges the agent's memory. For example, the questions should cover object counting, spatial relations, comparison between objects across rooms, long-term multi-room layout, and long-term multi-room object navigation. Remember spatial memory is important; you should design questions that ask about the 3D object spatial relations and layout in the room and that require the agent to perform hard reasoning for the final answer. For clarity, consider these examples: {In-context examples} Here is the scene information: {Input scene information} Here is the agent's trajectory: {Input agent's trajectory} Table 5: Prompt template for generating QA data. {In-context examples} are in-context examples.
{Input scene information} are scene, room and object semantics along with their bounding boxes.
{Input agent's trajectory} is the 3D agent's explored trajectories and action chains.

Prompt

You are provided with a scene description containing multiple rooms. Each room includes a list of objects along with their positions in the room, represented by bounding boxes. Each object's bounding box is defined by a 3D coordinate in the format: <object_name>(num): [x min, y min, z min],[x max, y max, z max] with units in meters (defining the left-bottom and right-top corners). Your task is to generate an object caption for each room in the form of a coherent, descriptive paragraph that conveys the 3D spatial arrangement and relative positions of all objects within that room.

Then, you will receive the object descriptions and caption for the current 3D room you are in. You will also be provided with the previous rooms' captions. Your task is to generate new captions covering a summarization of the common features across all rooms, as well as the important differences, relative to your current room. The reason for generating the new caption is to remind the agent of what is in its previous rooms' memories that can help it in the current room. The past objects and observations should be related to the current room by examining the summarization of common things and differences. For clarity, consider these examples: {In-context examples}

Here is the scene information: {Input scene information} Here is the current room you are in and the previous rooms you visited: {Input agent's location}

Table 6: Prompt template for generating long-term memory captioning data. {In-context examples} are in-context examples.
{Input scene information} are scene, room and object semantics along with their bounding boxes.
{Input agent's location} is the location of the current room in the scene and of the previously explored rooms.

Figure 5: Example of human annotators manually checking the data quality on QA and captioning tasks through multiple rendered multi-view images from each room (here, multi-view renderings of rooms 8 and 9).

5.2. Evaluation Metrics

The paper uses different evaluation metrics for each of the three task categories: embodied tasks, Embodied Question Answering (EQA), and captioning.

5.2.1. Embodied Tasks

  • Success Rate (SR):

    • Conceptual Definition: Success Rate measures the percentage of tasks where the agent successfully completes all sub-goals and reaches the desired final state as defined by the task instructions. It's a binary metric (task either succeeded or failed).
    • Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} \times 100\% $
    • Symbol Explanation:
      • $\mathrm{SR}$: Success Rate.
      • $\text{Number of successfully completed tasks}$: Count of tasks where the agent performed all required actions and achieved the final goal.
      • $\text{Total number of tasks}$: Total count of tasks attempted by the agent.
  • Sub-Success Rate (Sub-SR):

    • Conceptual Definition: Sub-Success Rate is a finer-grained metric that measures the proportion of individual sub-goals or steps within a task that the agent successfully completes, even if the overall task is not fully successful. This provides insight into partial task completion.
    • Mathematical Formula: $ \mathrm{Sub\text{-}SR} = \frac{\text{Number of successfully completed sub-goals}}{\text{Total number of sub-goals}} \times 100\% $
    • Symbol Explanation:
      • $\mathrm{Sub\text{-}SR}$: Sub-Success Rate.
      • $\text{Number of successfully completed sub-goals}$: Count of individual steps or intermediate objectives achieved by the agent across all tasks.
      • $\text{Total number of sub-goals}$: Total count of all individual steps or intermediate objectives defined across all tasks.
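
A small sketch of how these two metrics can be computed from per-sub-goal outcomes. Whether Sub-SR is micro-averaged over all sub-goals (as here) or averaged per task first is an assumption, since the section does not spell out the aggregation.

```python
def success_rates(episodes):
    """Compute SR and Sub-SR from a list of episodes.

    Each episode is a list of booleans, one per sub-goal; a task counts as a
    success only if every one of its sub-goals succeeds.
    """
    n_tasks = len(episodes)
    sr = 100.0 * sum(all(e) for e in episodes) / n_tasks
    total_sub = sum(len(e) for e in episodes)
    sub_sr = 100.0 * sum(sum(e) for e in episodes) / total_sub
    return sr, sub_sr

# Toy example: 3 tasks with 3-4 sub-goals each.
episodes = [[True, True, True], [True, False, True, True], [False, False, True]]
print(success_rates(episodes))   # (33.33..., 70.0)
```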

5.2.2. Embodied Question Answering (EQA)

  • Accuracy:
    • Conceptual Definition: For open-ended EQA tasks, accuracy measures the correctness of the agent's answer to a question. Since answers are free-form text, the paper follows the "LLM-as-judge" evaluation protocol. This means another Large Language Model (in this case, Gemini) is prompted to evaluate if the agent's generated answer is correct, typically by comparing it to a ground-truth answer and assessing its factual correctness and relevance.
    • Mathematical Formula: (Implicitly defined by the LLM-as-judge protocol for open-ended questions, but for classification-like accuracy it would be): $ \mathrm{Accuracy} = \frac{\text{Number of correctly answered questions}}{\text{Total number of questions}} \times 100\% $
    • Symbol Explanation:
      • $\mathrm{Accuracy}$: The proportion of questions for which the agent's answer is judged as correct.

      • $\text{Number of correctly answered questions}$: Count of EQA questions where the agent's response was deemed correct by the LLM-as-judge.

      • $\text{Total number of questions}$: Total count of EQA questions.

        The following is the prompt template used for the LLM-as-judge protocol for EQA evaluation (Table 7 from the original paper):

        Table 7: Prompt template for open-ended QA evaluation following the standard LLM-as-judge protocol. (The prompt contains a system message, the reference answer, and the assistant's answer, and instructs the judge to compare the assistant's answer against the reference.)

5.2.3. Captioning

For captioning tasks, standard natural language generation metrics are used to compare the generated caption with a human-written reference caption.

  • BLEU (Bilingual Evaluation Understudy):

    • Conceptual Definition: BLEU is an algorithm for evaluating the quality of text which has been machine-translated or machine-generated. It works by comparing N-grams of the candidate text to the N-grams of one or more reference texts. A higher BLEU score indicates a closer match to the reference text. BLEU1 considers unigrams (single words), and BLEU4 considers N-grams up to four words.
    • Mathematical Formula: (The full BLEU formula is complex, but its core components are): $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $ Where:
      • $\mathrm{BP}$ is the brevity penalty (a penalty for translations that are too short compared to the reference).
      • $N$ is the maximum N-gram order (e.g., 4 for BLEU4).
      • $w_n$ are positive weights for each N-gram order (usually $1/N$).
      • $p_n$ is the precision for N-grams of order $n$, calculated as the count of matching N-grams in the candidate that also appear in the reference, divided by the total number of N-grams in the candidate.
    • Symbol Explanation:
      • $\mathrm{BLEU}$: Bilingual Evaluation Understudy score.
      • $\mathrm{BP}$: Brevity Penalty, to penalize short generated captions.
      • $N$: Maximum N-gram length considered.
      • $w_n$: Weight for the n-gram precision.
      • $p_n$: Modified n-gram precision.
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering):

    • Conceptual Definition: METEOR is another metric for evaluating the quality of text generation. It addresses some shortcomings of BLEU by including explicit word-to-word matching based on unigram precision and recall, considering synonymy and stemming, and factoring in word order. It computes a score based on the harmonic mean of precision and recall, with recall given more weight, and then applies a fragmentation penalty.
    • Mathematical Formula: (Simplified core idea): $ \mathrm{METEOR} = \mathrm{Fmean} \cdot (1 - \mathrm{Penalty}) $ Where:
      • $\mathrm{Fmean} = \frac{10 \cdot P \cdot R}{R + 9P}$ (a weighted harmonic mean of precision $P$ and recall $R$).
      • $\mathrm{Penalty}$ is a fragmentation penalty that penalizes non-contiguous matches.
    • Symbol Explanation:
      • $\mathrm{METEOR}$: Metric for Evaluation of Translation with Explicit Ordering.
      • $\mathrm{Fmean}$: F-mean, a weighted harmonic mean of precision and recall.
      • $P$: Unigram precision.
      • $R$: Unigram recall.
      • $\mathrm{Penalty}$: Fragmentation penalty.
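
Both metrics have standard implementations, for example in NLTK; the snippet below is a generic usage example with made-up captions, not the paper's evaluation script.

```python
# pip install nltk; METEOR additionally requires: nltk.download('wordnet')
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = "a wooden table sits between two chairs near the window".split()
candidate = "a wooden table is between two chairs by the window".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([reference], candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
meteor = meteor_score([reference], candidate)   # unigram matching with stems and synonyms

print(f"BLEU-1={bleu1:.3f}  BLEU-4={bleu4:.3f}  METEOR={meteor:.3f}")
```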

5.3. Baselines

The 3DLLM-MEM model was compared against a broad range of memory management approaches and existing 3D LLMs:

  • Everything in Context:

    • Description: This baseline attempts to fit all observations (visual features and textual information) directly into the model's context window.
    • Purpose: Serves as an upper bound for performance when memory capacity is not an issue, applicable only for smaller scenes or simpler tasks where all relevant information can be accommodated.
    • Limitation: Not scalable to long-horizon tasks or information-dense 3D environments due to LLM context length limitations.
  • Most Recent Memory:

    • Description: This baseline retains only the most recent observations within the context window, assuming that the most immediate past is the most relevant.
    • Purpose: A common heuristic for managing context limits in sequential processing.
    • Limitation: Fails when tasks require recalling information from further back in time or from previously visited, non-contiguous spatial locations.
  • Retrieval-Augmented Memory (RAG):

    • Description: Inspired by retrieval-based techniques used in LLMs. This approach uses a memory bank to store past observations. During inference, the most relevant memory entries are retrieved (e.g., using a similarity search) and then appended to the working memory (current observations) before being fed into the LLM for reasoning. A toy sketch of this retrieval step appears after this list of baselines.
    • Purpose: To overcome context length limitations by selectively providing relevant historical context.
    • Limitation: The effectiveness heavily depends on the retrieval mechanism's ability to accurately identify and extract truly relevant information, which can be challenging in spatial-temporal reasoning tasks. The paper's results suggest it often misses critical information.
  • 3D-LLM (Finetuned):

    • Description: A popular 3D LLM (Hong et al., 2023b) recognized by the community. The authors fine-tune this model on their training data. For evaluation, it uses an "everything in context" strategy with the longest context window it supports. The paper notes modifications to fit multi-scene input and align with specific interaction tokens.
    • Purpose: Represents the state-of-the-art in 3D LLMs without explicit long-term memory modules tailored for dynamic spatial-temporal fusion.
    • Limitation: Primarily struggles with retaining long-term memory of semantic observations and is limited by its context length (512 tokens for its Flan-T5-XL backbone), performing poorly on tasks requiring long-term spatial-temporal reasoning.
  • 3D-Mem (Yang et al., 2025b):

    • Description: A framework designed for 3D scene memory in embodied exploration and reasoning. It uses a snapshot-based architecture with two stores: memory snapshots (compact multi-view RGB-D frames with bounding boxes summarizing inspected areas) and frontier snapshots (boundary views for new information). For evaluation in this paper, the frontier component is disabled, focusing only on memory snapshots captured from room centers.
    • Purpose: To provide a dedicated 3D scene memory for embodied agents.
    • Limitation: This method does not support embodied interaction or action execution and shows limitations in spatial relation, navigation, and object counting tasks, indicating a reliance on aggregated image-centric memories rather than true dense 3D spatial-temporal understanding.
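
For illustration, here is a toy version of the retrieval step in the Retrieval-Augmented Memory baseline, assuming mean-pooled features and cosine-similarity top-k retrieval; the pooling, similarity measure, and k are assumptions, since the paper does not specify the exact retriever.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_feats, memory_bank, k=2):
    """Illustrative retrieval step for a retrieval-augmented memory baseline.

    query_feats: (N, d)    features of the current observation
    memory_bank: (T, N, d) stored features of T past observations
    Returns the k memory entries whose pooled features are most similar to the query.
    """
    q = F.normalize(query_feats.mean(dim=0), dim=0)        # pool the current observation
    m = F.normalize(memory_bank.mean(dim=1), dim=-1)       # pool each stored memory entry
    sims = m @ q                                           # (T,) cosine similarities
    top = sims.topk(min(k, len(sims))).indices
    return memory_bank[top]                                # would be appended to working memory

memory_bank = torch.randn(8, 256, 1024)                    # 8 past observations
query = torch.randn(256, 1024)
print(retrieve_top_k(query, memory_bank, k=2).shape)       # torch.Size([2, 256, 1024])
```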

5.4. Implementation Details

  • Base Model: The 3DLLM-MEM model is implemented based on LLaVA-3D (Zhu et al., 2024).
  • Hardware & Framework: Modifications were made for compatibility with Google TPUs using PyTorch/XLA frameworks (Paszke et al., 2019; XLA team, 2017-2025).
  • Context Window: The model's context window is expanded to 8192 tokens to accommodate long-term memory inputs.
  • Fine-tuning:
    • The proposed memory module along with the LLM decoder are fine-tuned.
    • Initialization is done from LLaVA-3D's pretrained weights.
    • Training is conducted on 8 Google Cloud TPU v5p cores.
    • Batch size: 256.
    • Training duration: Approximately 1 day for 1000 steps.
    • Optimizer: Adam optimizer.
    • Learning Rate: $2 \times 10^{-5}$ (no weight decay).
    • Learning Rate Schedule: Linear warmup for the initial 3% of steps (from $10^{-8}$ to $10^{-5}$), followed by a cosine decay scheduler (a toy sketch of this schedule appears after this list).
  • Training Data Split: Fine-tuning uses the training split of 3DMEM-BENCH, which contains about 26K trajectories.
  • Task-Specific Epochs (Due to compute limitations):
    • Captioning task: 15 epochs.
    • Question-Answering task: 20 epochs.
    • Embodied task: 75 epochs (allocated most compute time).
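
A generic sketch of the warmup-plus-cosine schedule described above; note the text reports a warmup target of $10^{-5}$ while the peak learning rate is $2 \times 10^{-5}$, so the exact values here are assumptions rather than the paper's implementation.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, start_lr=1e-8, warmup_frac=0.03):
    """Linear warmup for the first warmup_frac of steps, then cosine decay toward zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 1000                                  # the reported number of training steps
for s in (0, 15, 30, 500, 999):
    print(s, f"{lr_at_step(s, total):.2e}")
```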

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that 3DLLM-MEM significantly outperforms all existing baselines across various tasks in the 3DMEM-BENCH benchmark, particularly excelling in long-horizon embodied tasks and showing strong generalization capabilities in "in-the-wild" settings.

6.1.1. Results on Embodied Tasks

The following are the results from Table 2a of the original paper:

| Model | Simple, In-domain (SR / Sub-SR) | Simple, In-the-wild (SR / Sub-SR) | Medium, In-domain (SR / Sub-SR) | Medium, In-the-wild (SR / Sub-SR) | Hard, In-domain (SR / Sub-SR) | Hard, In-the-wild (SR / Sub-SR) | Average, In-domain (SR / Sub-SR) | Average, In-the-wild (SR / Sub-SR) |
|---|---|---|---|---|---|---|---|---|
| 3D-LLM (Finetuned) | 10.4 / 20.3 | 9.1 / 18.5 | - | - | - | - | - | - |
| Everything in Context | 35.5 / 63.9 | 32.4 / 45.2 | - | - | - | - | - | - |
| Most Recent Memory | 32.8 / 62.3 | 23.4 / 38.6 | 20.1 / 34.8 | 12.4 / 25.3 | 10.4 / 20.7 | 5.4 / 12.1 | 21.1 / 39.3 | 13.7 / 25.3 |
| Retrieval-Augmented Memory | 34.2 / 63.0 | 28.3 / 46.2 | 21.8 / 40.2 | 13.7 / 28.0 | 10.8 / 21.6 | 4.8 / 10.6 | 22.3 / 41.6 | 15.6 / 28.3 |
| 3DLLM-MEM (Ours) | 45.5 / 73.4 | 37.0 / 65.4 | 36.8 / 67.8 | 31.6 / 57.4 | 30.5 / 46.2 | 27.8 / 42.1 | 37.6 / 62.5 | 32.1 / 55.0 |

Table 2a: Results on 3DMEM-BENCH embodied tasks. SR stands for success rate. Sub-SR stands for sub-success rate. Our model outperforms existing approaches by a large margin.

  • Overall Superiority: 3DLLM-MEM consistently achieves the highest Success Rate (SR) and Sub-Success Rate (Sub-SR) across all task complexities (simple, medium, hard) and settings (in-domain, in-the-wild).
  • Performance Gap: On average, 3DLLM-MEM achieves an SR of 37.6% in-domain and 32.1% in-the-wild, significantly higher than the next best Retrieval-Augmented Memory (22.3% in-domain, 15.6% in-the-wild). This represents a remarkable 16.5% improvement in in-the-wild tasks.
  • Robustness to "In-the-Wild" Challenges: While other methods experience a sharp performance drop in "in-the-wild" scenarios (e.g., Most Recent Memory drops from 32.8% to 23.4% SR for simple tasks), 3DLLM-MEM maintains robustness. Its average in-the-wild SR of 32.1% is a strong indicator of its generalization capabilities.
  • Scalability with Complexity: As task complexity increases from simple to hard, all existing baselines degrade significantly, achieving only around 5% SR on hard "in-the-wild" tasks. In contrast, 3DLLM-MEM sustains a strong SR of 27.8% for hard "in-the-wild" tasks, demonstrating its effectiveness in managing longer-term memory representations required for complex, long-horizon tasks.
  • Baseline Insights:
    • 3D-LLM (Finetuned) shows the lowest performance, even in simple tasks, underscoring the necessity of an explicit memory module.
    • Most Recent Memory and Retrieval-Augmented Memory perform poorly, indicating that simply keeping recent observations or retrieving relevant ones is insufficient for complex spatial-temporal reasoning.
    • Everything in Context performs better than Most Recent Memory and RAG in simple tasks (where it fits all information), suggesting that access to all information is beneficial. However, 3DLLM-MEM still outperforms it, highlighting the value of selectively fusing task-relevant memory features.

6.1.2. Results on Long-Term EQA and Captioning

The following are the results from Table 2b of the original paper:

| Model | Embodied Task (In-domain) | Embodied Task (In-the-wild) | EQA Spatial | EQA Nav. | EQA Comparative | EQA Layout | EQA Count | BLEU1 | BLEU4 | METEOR |
|---|---|---|---|---|---|---|---|---|---|---|
| 3D-LLM (Finetuned) | - | - | 2.9 | 5.8 | 0.0 | 7.7 | 0.0 | 42.3 | 12.0 | 30.6 |
| 3D-Mem (GPT-4o) | - | - | 39.9 | 11.0 | 25.8 | 19.1 | 7.8 | 41.7 | 4.7 | 31.8 |
| 3D-Mem (Gemini-2.5-Flash) | - | - | 41.6 | 18.2 | 37.6 | 30.2 | 12.7 | 42.8 | 4.8 | 29.6 |
| 3D-Mem (Gemini-2.5-Pro) | - | - | 39.7 | 27.7 | 36.0 | 35.2 | 16.4 | 41.5 | 3.0 | 28.6 |
| Most Recent Memory | 21.1 | 13.7 | 27.5 | 30.2 | 24.3 | 20.1 | 10.5 | 32.4 | 10.1 | 25.6 |
| Retrieval-Augmented Memory | 22.3 | 15.6 | 38.0 | 33.4 | 31.8 | 29.7 | 15.6 | 40.8 | 11.5 | 29.3 |
| 3DLLM-MEM (Ours) | 37.6 | 32.1 | 62.8 | 40.6 | 41.4 | 39.9 | 26.3 | 58.2 | 18.8 | 37.3 |

Table 2b: Results on all tasks in 3DMEM-BENCH. Average success rate is reported for embodied tasks. Nav. stands for long-term object navigation. We report accuracy score for open-ended EQA evaluation and follow the standard LLM-as-judge evaluation protocol by prompting Gemini. Evaluation details are provided in Appendix E.
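
As a point of reference for the captioning columns, the following is a minimal sketch of how BLEU-1, BLEU-4, and METEOR can be computed, assuming the standard NLTK implementations; the paper's exact evaluation scripts, tokenization, and corpus-level aggregation may differ (see its Appendix E), and the EQA accuracy scores come from an LLM-as-judge protocol not reproduced here.

```python
# Requires: pip install nltk, plus nltk.download('wordnet') for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# Toy reference/hypothesis pair (pre-tokenized).
reference = "a wooden table with two books and a teacup".split()
hypothesis = "a table holding two books and a cup of tea".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
meteor = meteor_score([reference], hypothesis)

print(f"BLEU-1={bleu1:.3f}  BLEU-4={bleu4:.3f}  METEOR={meteor:.3f}")
```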

  • Superiority Across All Tasks: 3DLLM-MEM consistently outperforms all existing approaches across all subcategories of EQA (Spatial, Nav., Comparative, Layout, Count) and captioning metrics (BLEU1, BLEU4, METEOR).
  • EQA Performance: 3DLLM-MEM achieves significantly higher accuracy scores in EQA tasks, such as 62.8% for Spatial reasoning, 40.6% for Navigation, and 41.4% for Comparative reasoning. This demonstrates its strong capability for long-term spatial-temporal reasoning, which is crucial for answering complex questions about dynamic environments.
  • Captioning Performance: 3DLLM-MEM also leads in captioning, with BLEU1 of 58.2%, BLEU4 of 18.8%, and METEOR of 37.3%.
  • Baseline Insights for EQA/Captioning:
    • 3D-LLM (Finetuned) shows relatively strong performance in captioning (second best BLEU1) due to its ability to summarize object-centric semantic memory. However, its performance on EQA is very poor (e.g., 0.0% for Comparative and Count), highlighting its limitation in long-term spatial-temporal reasoning due to a limited context length.
    • 3D-Mem shows improved EQA performance over other baselines, but it falls short in specific categories like Spatial relation, Navigation, and Object counting, suggesting limitations of its image-centric aggregated memories compared to 3DLLM-MEM's dense 3D fusion.
    • Most Recent Memory and Retrieval-Augmented Memory also lag significantly behind 3DLLM-MEM in EQA and captioning, further validating the effectiveness of the proposed memory fusion technique.

6.1.3. Qualitative Results

The paper provides qualitative examples (Figure 4 and Figure 6) to visually illustrate 3DLLM-MEM's ability to utilize long-term memory and execute complex tasks. Figure 6, in particular, shows a multi-step task ("Prepare a cozy reading nook") where the agent explores, recalls memories of objects in different rooms, navigates back and forth, and performs interactions based on that memory. This vividly supports the quantitative results, showing that the model can integrate observations over time and space to complete long-horizon embodied tasks.

The following are the results from Figure 6 of the original paper:

Figure 6: Qualitative example of 3DLLM-MEM. The task instruction is: Prepare a cozy reading nook in the living room with two books and a teacup. In images (1) and (2), the agent explores the environment randomly, forming an initial memory of the scene. After receiving the task instruction, it recalls its memory and navigates to the bedroom to pick up a book from the cabinet, as shown in images (3) and (4). The agent then returns to the living room and places the book on the table in front of the sofa (image 5). Unable to recall any additional books, the agent resumes exploration and finds a second book on the bed, which it picks up (image 6) and stacks on top of the first book (image 7). Finally, the agent recalls seeing a teacup in the kitchen, navigates to retrieve it (image 8), and places it on the table in the living room (image 9). The task is successfully completed.

6.2. Ablation Studies / Parameter Analysis

The paper conducts an ablation study on the design choices for initializing the query in its memory fusion module, which is a critical component of 3DLLM-MEM. The goal is to determine the most effective way to derive the query that guides the fusion of episodic memory with working memory.

The following are the results from Table 3 of the original paper:

Each cell reports SR / Sub-SR in %.

| Model | Simple (In-domain) | Simple (In-the-wild) | Medium (In-domain) | Medium (In-the-wild) | Hard (In-domain) | Hard (In-the-wild) | Average (In-domain) | Average (In-the-wild) |
|---|---|---|---|---|---|---|---|---|
| 3DLLM-MEM | 45.5 / 73.4 | 37.0 / 65.4 | 36.8 / 67.8 | 31.6 / 57.4 | 30.5 / 46.2 | 27.8 / 42.1 | 37.6 / 62.5 | 32.1 / 55.0 |
| Init with Most Recent Episodic Memory | 42.3 / 69.4 | 28.6 / 50.7 | 32.4 / 58.6 | 23.7 / 45.1 | 22.6 / 37.8 | 15.3 / 31.4 | 32.4 / 55.3 | 22.5 / 42.4 |
| Init with Learnable Zero Parameters | 41.4 / 67.2 | 27.9 / 50.0 | 33.0 / 59.2 | 23.4 / 45.8 | 24.2 / 40.4 | 18.6 / 35.6 | 32.9 / 55.6 | 23.3 / 43.8 |

Table 3: Ablation study of query initialization designs in our memory fusion module.

The ablation study compares three ways to initialize the fusion query (the $f_t^Q$ in the memory fusion formula); a minimal sketch of these alternatives follows the list below:

  1. 3DLLM-MEM (Ours): Initializes the fusion query with working memory features (i.e., the current observation tokens). This is the proposed method.
  2. Init with Most Recent Episodic Memory: Uses the features from the most recent entry in the episodic memory to initialize the fusion query.
  3. Init with Learnable Zero Parameters: Initializes the fusion query with learnable zero parameters, meaning the model learns from scratch what constitutes a good query to retrieve relevant memories.
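
The following PyTorch sketch illustrates the three initialization choices, assuming a single cross-attention fusion layer; the hypothetical `MemoryFusion` module, tensor shapes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MemoryFusion(nn.Module):
    # Hypothetical fusion layer: query tokens attend over episodic-memory tokens.
    def __init__(self, dim: int, init: str = "working"):
        super().__init__()
        self.init = init
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        if init == "zeros":
            # "Init with Learnable Zero Parameters": query learned from scratch.
            self.learned_query = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, working: torch.Tensor, episodic: torch.Tensor) -> torch.Tensor:
        # working:  (B, N_w, dim) current-observation (working memory) tokens
        # episodic: (B, N_e, dim) concatenated past-observation tokens
        if self.init == "working":          # proposed: working memory tokens as queries
            query = working
        elif self.init == "recent":         # most recent episodic entry as queries
            query = episodic[:, -working.shape[1]:, :]
        else:                               # learnable zero-initialized queries
            query = self.learned_query.expand(working.shape[0], working.shape[1], -1)
        fused, _ = self.attn(query, episodic, episodic)
        return fused  # fused memory features passed on to the LLM decoder

# Toy usage with random features.
fusion = MemoryFusion(dim=64, init="working")
out = fusion(torch.randn(2, 16, 64), torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```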

Analysis of Results:

  • Superiority of Working Memory Initialization: 3DLLM-MEM (initializing with working memory tokens) consistently achieves the highest SR and Sub-SR across all task complexities and settings. This confirms that using current observations to query episodic memory is the most effective strategy.
  • Performance Degradation of Alternatives:
    • Initializing with Most Recent Episodic Memory leads to a significant drop in performance compared to 3DLLM-MEM. For instance, average in-the-wild SR drops from 32.1% to 22.5%.
    • Initializing with Learnable Zero Parameters also shows a similar performance drop, with an average in-the-wild SR of 23.3%.
  • Interaction with Task Complexity:
    • Init with Most Recent Episodic Memory surprisingly outperforms Init with Learnable Zero Parameters in simple settings (e.g., Simple in-domain SR: 42.3% vs. 41.4%). This might be because for simple tasks, relevant information is often recent, and using recent memory provides a good starting point for the query, leading to faster convergence or easier learning.
    • However, in hard settings, Init with Learnable Zero Parameters performs slightly better than Init with Most Recent Episodic Memory (e.g., Hard in-domain SR: 24.2% vs. 22.6%). This suggests that for more complex tasks, where relevant information might be distributed far back in episodic memory, a learned, unconstrained query (from zero parameters) might eventually become more effective than one biased towards only the most recent past, which could be misleading for long-horizon reasoning.
  • Conclusion from Ablation: The results strongly support the design choice of initializing fusion queries with working memory tokens. This approach provides the memory fusion module with a strong, task-relevant signal (the current observation) to effectively and robustly retrieve and integrate useful features from episodic memory, leading to superior performance across diverse and challenging scenarios.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper addresses a critical limitation in current Large Language Models (LLMs) for embodied AI: the struggle to manage long-term spatial-temporal memory in dynamic, multi-room 3D environments. It makes two significant contributions:

  1. 3DMEM-BENCH: A novel, comprehensive benchmark that provides over 26,000 trajectories and 2,892 embodied tasks, including question-answering and captioning. This benchmark is specifically designed to evaluate an agent's ability to reason over long-term memory across varying complexities (simple to hard) and "in-the-wild" scenarios, filling a crucial gap in existing evaluation methods.

  2. 3DLLM-MEM: A new embodied 3D LLM equipped with a dual-memory system (working memory for current observations and episodic memory for past experiences) and a novel memory fusion module. This module uses working memory tokens as queries to selectively attend to and fuse task-relevant spatial and temporal features from episodic memory, thereby achieving both memory efficiency and effective spatial-temporal reasoning.

    Experimental results on 3DMEM-BENCH demonstrate that 3DLLM-MEM achieves state-of-the-art performance across all tasks, significantly outperforming strong baselines. It shows remarkable robustness and scalability, especially in challenging "in-the-wild" and hard-complexity tasks, where it vastly surpasses other methods. The ablation study further validates the effectiveness of its memory fusion mechanism and the choice of initializing queries with working memory tokens.

7.2. Limitations & Future Work

The authors explicitly state one limitation of 3DLLM-MEM:

  • High-Level Policies, Not Low-Level Control: Currently, 3DLLM-MEM does not include low-level navigation or control policies. Instead, it relies on high-level predefined policies within the simulator to carry out actions.

    The authors suggest that this aspect (integrating low-level control) is "orthogonal to our study" and "could be explored and seamlessly integrated into our framework in the future." This implies a future research direction focusing on end-to-end embodied agents where 3DLLM-MEM's high-level reasoning and memory capabilities could guide more granular locomotion and manipulation actions.

7.3. Personal Insights & Critique

This paper presents a significant advancement in the field of embodied AI and 3D LLMs. The problem of long-term spatial-temporal memory is indeed central to developing truly intelligent and capable agents that can operate in complex real-world environments.

  • Innovation of the Benchmark: The 3DMEM-BENCH is a highly valuable contribution. The inclusion of fine-grained complexity levels and "in-the-wild" scenarios forces models to demonstrate true generalization and robust long-term memory, going beyond the simpler, short-horizon tasks often found in existing benchmarks. The diverse task types (embodied actions, EQA, captioning) provide a holistic evaluation of spatial-temporal understanding.
  • Elegant Memory Architecture: The dual-memory system with a dynamic memory fusion module is an elegant and biologically inspired solution. The use of working memory tokens as queries for episodic memory is a clever way to ensure task-relevance and efficiency, mitigating the combinatorial explosion of information in dense 3D representations. This mechanism could inspire similar designs in other LLM-based agents tackling sequential tasks.
  • Dense 3D Representation: The emphasis on dense 3D representations in memory is crucial. While computationally intensive, it retains more information than sparse or object-centric methods, which is often necessary for nuanced spatial reasoning and detailed interaction. The paper effectively shows how to manage this complexity through selective fusion.
  • Potential Broader Applications: The principles of 3DLLM-MEM extend beyond embodied AI. Any LLM-based system that needs to maintain long-term, dynamically evolving context over time and across different "spaces" (e.g., documents, codebases, web environments) could benefit from such a dual-memory and fusion mechanism. For example, LLM agents for complex software development, scientific discovery, or multi-session human-computer interaction could adapt this approach to maintain and leverage episodic knowledge.
  • Unverified Assumptions/Areas for Improvement:
    • Low-Level Control: The stated limitation about high-level policies is a significant one for practical embodied AI. While the authors claim it's orthogonal, the true test of long-term memory in a physical robot would involve seamlessly integrating this high-level reasoning with low-level motor control and perceptual feedback loops. Future work should focus on this integration to see if the memory system remains robust when facing the uncertainties and continuous learning demands of real-world interaction.

    • Computational Cost: While the memory fusion aims for efficiency, managing dense 3D representations over long horizons could still be computationally demanding, especially as episodic memory grows. A detailed analysis of memory footprint and inference speed scaling with $T$ (the number of timesteps in episodic memory) would be beneficial.

    • Explainability: How does the memory fusion module decide what is relevant? While it uses an attention mechanism, understanding which specific 3D features from which past episodic memory entries are being fused could offer valuable insights into the model's reasoning process and help diagnose failures.

    • Catastrophic Forgetting: As the episodic memory is updated, especially if entries are modified, there's a potential for catastrophic forgetting if not handled carefully. The paper mentions updating entries, but the specifics of how this prevents loss of older, potentially still relevant, information could be explored further.

    • Memory Eviction Strategy: For truly unbounded long-term memory, an eviction strategy (deciding when to discard old memories) would be necessary to prevent the memory bank from growing indefinitely. This paper doesn't explicitly discuss such a mechanism, which would be crucial for indefinite operation.

      Overall, 3DLLM-MEM represents a compelling step forward in endowing embodied LLMs with sophisticated spatial-temporal memory, paving the way for more intelligent and autonomous agents.
