Paper Library - AiPaper · SwiftScholar

This paper presents Med-PaLM 2, a significant advancement in medical question answering that scores up to 86.5% on the MedQA dataset, improving on Med-PaLM by over 19% through base LLM improvements, medical domain fine-tuning, and novel prompting strategies.

Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Published:10/4/2025

Autoregressive Video Diffusion Models · Minecraft Scene Generation · Spatio-Temporal Memory Framework · Geometry-Indexed Spatial Memory · Incremental 3D Reconstruction

The paper introduces the 'Memory Forcing' framework, which combines spatiotemporal memory for consistent scene generation in Minecraft. It features hybrid training and chained forward training to guide the model in utilizing temporal memory during exploration and spatial memory

WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

Published:12/2/2025

Video World Models · Long-Term Sequential Modeling · Compressed Memory Mechanism · Minecraft LoopNav Benchmark · Spatial Consistency Improvement

WorldPack is a video world model that uses compressed memory to enhance spatial consistency and fidelity in long-term generation, outperforming state-of-the-art models on the LoopNav benchmark within Minecraft.

TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model

Published:1/1/2026

4D World Model · Dynamic Multimodal Synthesis · Video Generation and Reconstruction · Long-Horizon Consistency Modeling · Autoregressive Diffusion Video Model

The paper presents TeleWorld, a real-time multimodal 4D world model that addresses limitations in video generation by integrating video synthesis and dynamic scene reconstruction within a closed-loop framework, utilizing a novel generation-reconstruction-guidance paradigm for c

SANet: Multi-Scale Dynamic Aggregation for Chinese Handwriting Recognition

Published:9/15/2025

Chinese Handwriting Recognition · Multi-Scale Dynamic Aggregation · Star Attention-based Network · Feature Extraction and Generalization · CASIA-HWDB Dataset

This paper introduces SANet, a Star Attention-based Network utilizing Multi-Scale Dynamic Aggregation for Chinese handwriting recognition, achieving 98.12% character-level accuracy on CASIA-HWDB with improved feature extraction and robustness through a lightweight design and synt

MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation

Published:8/5/2025

Multi-Source Retrieval-Augmented Generation · Knowledge-Guided Approach · Hallucination Mitigation · Logical Relationship Graph Construction · Multi-Level Confidence Calculation Mechanism

MultiRAG is a knowledge-guided framework designed to mitigate hallucination in multi-source retrieval-augmented generation. By constructing logical relationships with multi-source line graphs and a multi-level confidence mechanism, it effectively reduces challenges related to inf

ClosureX: Compiler Support for Correct Persistent Fuzzing

Published:2/6/2025

Persistent Fuzzing · Compiler Support · Software Testing Techniques

ClosureX introduces a novel fuzz testing mechanism that addresses semantic inconsistencies in persistent fuzzing. It achieves near-persistent performance with fine-grained state restoration, increasing test-case execution rates by over 3.5 times while enhancing bug discovery capab

Finite Scalar Quantization: VQ-VAE Made Simple

Published:9/27/2023

Finite Scalar Quantization · Simplified VQ-VAE Method · Autoregressive Image Generation · Multimodal Generation · Depth Estimation and Image Classification

This paper introduces Finite Scalar Quantization (FSQ) as a simpler alternative to VQ in VQ-VAEs, enabling implicit codebook creation. FSQ achieves competitive performance across tasks while avoiding codebook collapse and reducing complexity.

Rock Classification Based on Residual Networks

Published:2/19/2024

Residual Networks · Rock Classification · Data Augmentation Methods · Multi-Head Self Attention · Bottleneck Transformer Block Influence

This study proposes two methods for rock classification using residual networks, achieving 70.1% and 73.7% accuracy with modifications to ResNet34 and multi-head self-attention. It also explores the impact of bottleneck transformer blocks on performance.

ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills

Published:2/3/2025

Alignment of Simulation and Real-World Physics · Humanoid Whole-Body Skill Learning · Delta Action Compensation Model · Retargeted Human Motion Data · Dynamic Transfer Evaluation

The ASAP framework addresses the dynamics mismatch for humanoid robots by utilizing a two-stage approach: pretraining motion tracking policies in simulation and fine-tuning them with real-world data to achieve agile whole-body skills.

Animation Engine for Believable Interactive User-Interface Robots

Published:4/1/2005

Animation Engine for User-Interface Robots · Interactive Robot Behavior Synthesis · Family Companion Robot · Smooth Transition Filter · Animation and Expression Synthesis Techniques

This paper presents an animation engine for interactive user-interface robots, integrating believable behaviors with animations through three software components. A case study showcases its use in the family companion robot iCat, enhancing user interaction.

Rock Classification through Knowledge-Enhanced Deep Learning: A Hybrid Mineral-Based Approach

Published:10/16/2025

Knowledge-Enhanced Rock Classification · Mineral Composition Analysis · 1D Convolutional Neural Network Application · Deep Learning in Geology · Rock Type Identification

This study introduces a knowledge-enhanced deep learning approach for rock classification, integrating geological expertise with spectral analysis. Using 1D-CNN, accuracy rates reached 98.37% and 97.75%. Results highlighted optimal limestone classification, revealing challenges f

ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Published:1/8/2026

Video Diffusion Models · Transformer Architecture · Hybrid Attention Mechanism · Efficient Attention Mechanism · Video Generation

ReHyAt introduces a Recurrent Hybrid Attention mechanism for video diffusion transformers that reduces attention complexity to linear, enhancing scalability for long sequences. It achieves efficient distillation from existing models at significantly lower training costs, while ma

CreativeVR: Diffusion-Prior-Guided Approach for Structure and Motion Restoration in Generative and Real Videos

Published:12/13/2025

Diffusion Model Video Restoration · Video Super-Resolution and Restoration · Structure and Motion Restoration · AIGC Video Processing · Temporal Coherence Module

CreativeVR is a diffusion-prior-guided video restoration framework addressing structural and temporal artifacts in both generative and real videos. Utilizing a deep-adapter approach, it offers flexible precision control, balancing restoration quality and corrective behavior, sign

MoMa: Skinned motion retargeting using masked pose modeling

Published:9/14/2024

Shape-aware Motion Retargeting · Skeleton-aware Motion Retargeting · Transformer-based Auto-Encoder · Motion Transfer · Mixamo Dataset

MoMa introduces a novel skinned motion retargeting method that integrates skeleton-aware and shape-aware capabilities, effectively transferring animations across characters with different structures using a transformer-based autoencoder and a face-based optimizer.

Self-Adapting Improvement Loops for Robotic Learning

Published:6/7/2025

Self-Adapting Improvement Loop · Online Video Learning · Robotic Task Planning · MetaWorld Tasks · Self-collected Behavior Enhancement

This paper introduces the Self-Adapting Improvement Loop (SAIL) that enhances robotic agents' performance on new tasks through self-collected online experiences. It leverages in-domain and internet-scale pretrained video models, showing continuous performance improvements over it

3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Published:5/29/2025

Spatial-Temporal Memory for Large Language Models · 3DMem-Bench Benchmark · Dynamic Memory Management and Fusion Model · Embodied Tasks in Multi-Room 3D Environments · Long-Term Memory Reasoning

This study introduces 3DLLM-Mem to enhance long-term spatial-temporal memory in Large Language Models for dynamic 3D environments. It presents 3DMem-Bench for evaluating reasoning capabilities, with experimental results showing significant performance improvements in embodied tas

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Published:6/2/2025

Vision-Language-Action Model · Efficient Robotics Model · Compact Model Design · Consumer-Grade Hardware Deployment

SmolVLA is a compact and efficient vision-language-action model that achieves competitive performance at reduced computational costs, enabling deployment on consumer-grade hardware and promoting broader participation in robotics research through community-driven dataset pretraini

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Published:1/11/2023

Knowledge Editing in Language Models · Representation Denoising and Causal Tracing · In-Parameter Editing of Models · Fact Storage and Parameter Localization · Understanding Mechanisms in Language Models

This study examines the relationship between knowledge localization and model editing in language models. It reveals that optimal editing locations may differ from suggested knowledge storage points, challenging prior causal tracing assumptions. Ultimately, the choice of editing

AlpaGasus: Training a Better Alpaca with Fewer Data

Published:7/17/2023

Data Selection Strategy Based on Large Language Models · Large Language Model Fine-Tuning · High-Quality Data Filtering · Enhanced Instruction Fine-Tuning Capability · AlpaGasus Model

The AlpaGasus model improves performance by using a data selection strategy that filters 9,000 high-quality samples from the original 52,000. It outperforms the original Alpaca and achieves 5.7 times faster training, highlighting the importance of data quality over quantity.
