Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Published: 06/18/2026

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The central topic of the paper is the development, deployment, and evaluation of an Agentic Clinical Information Extraction (ACIE) system. The title, "Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why," highlights the focus on using agentic retrieval-augmented generation (RAG) to extract structured information from complex, real-world clinical data, specifically analyzing the architectural decisions necessitated by the poor quality of clinical metadata.

1.2. Authors

The authors are a multidisciplinary team of researchers and clinicians primarily affiliated with the Institute for Artificial Intelligence in Medicine (IKIM) at University Medicine Essen, Germany.

  • Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim: Researchers at IKIM and the Faculty of Computer Science, University of Duisburg-Essen.
  • Stephan Settelmeier: Department of Cardiology and Vascular Medicine, University Hospital Essen.
  • Shigeyasu Sugawara: Advanced Clinical Research Center, Fukushima Medical University, Japan.
  • Fabian Freisleben, Felix Nensa, Jens Kleesiek: Core faculty at IKIM, with affiliations to TU Dortmund University and the Lamarr Institute.

1.3. Journal/Conference

The paper is currently a preprint available on arXiv (arXiv:2606.19602). It was published on June 17, 2026. As a preprint, it has not yet undergone peer review by a specific journal or conference, but arXiv is a reputable repository for scientific preprints in the fields of computer science, artificial intelligence, and medicine.

1.4. Publication Year

2026

1.5. Abstract

The paper addresses the challenge of extracting structured clinical data from patient records that span hundreds of heterogeneous documents and thousands of structured data points. The core problem is that metadata required for standard AI retrieval is often absent or incomplete. The authors deploy ACIE, an on-premise agentic RAG pipeline, which reasons over complete patient contexts and grounds answers in source passages for verification. The study quantifies the "metadata gap," traces the architectural decisions made to address it, and evaluates the system in a retrospective lymphoma registry study. In this evaluation, nuclear-medicine physicians verified 7,326 extracted values, accepting 96.5% of them.

The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

Clinical workflows, such as enrolling patients in studies, require compiling structured data from vast, unstructured patient records. This is currently a manual, error-prone process. While Large Language Models (LLMs) offer a solution for automated extraction, two major barriers exist:

  1. Privacy/Regulatory: Patient data cannot be sent to external cloud servers, necessitating on-premise deployment.

  2. Data Reality: Real clinical records are messy. Standard Retrieval-Augmented Generation (RAG) relies on metadata (like dates, document types, or encounter IDs) to filter and retrieve relevant information. However, in real-world hospital systems (specifically FHIR repositories), this metadata is often missing, duplicated, or incorrect. Consequently, standard RAG fails because it cannot reliably locate the correct documents or reason across them.

    The paper's entry point is the recognition that "the data dictates the architecture." Instead of trying to fix the data, the authors propose an agentic architecture that can reason over the raw content of documents to compensate for the lack of reliable metadata.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. Clinician-Verified Evaluation: It presents a rigorous evaluation of the ACIE system alongside an independent retrospective lymphoma registry study. This involved 74 clinician-configured fields across 99 patients, resulting in 7,326 individual judgments verified by nuclear-medicine physicians.
  2. Quantification of the Metadata Gap: The authors provide a detailed analysis of data quality in a large-scale FHIR repository (nearly 2 billion resources), quantifying the sparsity of metadata (e.g., document relationships, authorship, timestamps) that AI systems typically rely on.
  3. Architectural Insights: It traces specific design decisions—such as using agentic retrieval over static filtering, generating query-relevant document summaries, and using Markdown serialization—to the specific failures of clinical data quality.

Key Findings:

  • ACIE achieved a 96.5% clinician acceptance rate.
  • No hallucinations (fabricated content without source support) were observed.
  • The remaining errors were primarily driven by complex temporal reasoning tasks (dates and tabular timelines), which are difficult even for agentic systems.
  • The study demonstrates that shifting the clinician's role from compiling data to verifying AI-generated citations can significantly speed up workflows (reportedly roughly three times faster).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, one must grasp several key technologies and standards:

  • FHIR (Fast Healthcare Interoperability Resources): This is a standard for exchanging healthcare information electronically. It describes data formats and elements (called "resources") and an application programming interface (API) for exchanging them. In this paper, FHIR serves as the central repository for all patient data, including documents (like PDFs of letters) and structured data (like lab values).
  • RAG (Retrieval-Augmented Generation): A technique used in Large Language Models where the model retrieves relevant documents from an external database and uses them as context to generate an answer. This helps the model answer questions based on specific, up-to-date data without retraining the model.
  • Agentic AI / Agentic RAG: This goes beyond standard RAG. An "agent" is an LLM empowered to use "tools" (like a search engine or a database query tool) and iterate on its thought process. Instead of a one-shot retrieve-then-read pipeline, an agent can reason: "I need info on X, let me search. The results mention Y, let me inspect that specific document. Now I have enough info to answer."
  • OCR (Optical Character Recognition): Technology that converts scanned paper documents, image-only PDF files, or photographs of documents into editable, searchable text. The paper uses PaddleOCR-VL to process scanned clinical documents.
  • Grounding / Attribution: The practice of linking specific parts of an AI's generated output back to the specific source document or passage that supports that claim. This is crucial for safety in clinical settings, allowing doctors to verify the AI's work.
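The difference between one-shot retrieval and the agentic loop described in the list above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the document store `DOCS`, the tools `search` and `inspect`, and the `scripted_policy` function (a hard-coded stand-in for the LLM's decisions) are all hypothetical.

```python
from typing import Callable

# Hypothetical in-memory "patient context"; a real system would query an index.
DOCS = {
    "letter-1": "Follow-up visit. No new findings.",
    "path-7": "Histology confirms diffuse large B-cell lymphoma.",
}


def search(query: str) -> list[str]:
    """Toy keyword search: ids of documents containing any query word."""
    words = query.lower().split()
    return [d for d, text in DOCS.items() if any(w in text.lower() for w in words)]


def inspect(doc_id: str) -> str:
    """Return the full text of one document."""
    return DOCS[doc_id]


TOOLS: dict[str, Callable] = {"search": search, "inspect": inspect}


def scripted_policy(history: list) -> tuple:
    """Stand-in for the LLM: choose the next action from observations so far."""
    if not history:
        return ("search", "lymphoma")
    last_action, last_result = history[-1]
    if last_action == "search" and last_result:
        return ("inspect", last_result[0])
    if last_action == "inspect":
        # Answer with the inspected text and the id of the document it came from.
        return ("answer", {"value": last_result, "source": history[-2][1][0]})
    return ("answer", {"value": None, "source": None})


def run_agent(policy: Callable, max_steps: int = 5) -> dict:
    """Alternate between choosing an action and observing the tool result."""
    history = []
    for _ in range(max_steps):
        action, arg = policy(history)
        if action == "answer":
            return arg
        history.append((action, TOOLS[action](arg)))
    return {"value": None, "source": None}  # abstain if the step budget runs out


result = run_agent(scripted_policy)
```

The point of the sketch is the control flow: each tool result feeds the next decision, and the final answer carries the id of the document that supports it.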

3.2. Previous Works

The paper situates itself within the evolution of Clinical Information Extraction (IE):

  1. Rule-Based Systems (e.g., cTAKES): Early systems relied on hand-crafted rules defined by developers. They were precise but required immense effort to adapt to new tasks.
  2. Domain-Specific Pretraining (e.g., BioBERT, GatorTron): Models pre-trained on vast amounts of medical text to improve understanding. While accurate, they still required fine-tuning for specific tasks.
  3. LLMs for Extraction: Recent work showed that general-purpose LLMs (like GPT-4 or Llama) can perform extraction without task-specific training (zero- or few-shot prompting). However, most of this work remained in research settings and did not address the deployment challenges of real-world, messy hospital data.
  4. Agentic RAG in Medicine: The authors mention i-MedRAG (iterative retrieval for medical QA) and CLINES (structuring clinical concepts). However, the authors note that prior work typically used researcher-defined targets on clean benchmark datasets, whereas this paper deploys a system with clinician-configured targets on raw, messy hospital data.

3.3. Technological Evolution

The field has evolved from rigid, developer-centric rules to flexible, user-centric LLMs. The current frontier is Agentic RAG, which moves beyond simple retrieval to iterative reasoning. This paper represents the next step: applying these advanced agentic architectures to the "wild" of production hospital systems, revealing that the theoretical elegance of RAG often clashes with the messy reality of clinical metadata.

3.4. Differentiation Analysis

The core differentiator of this work is its focus on the "metadata gap." Most RAG systems assume they can filter documents by date, type, or encounter ID. This paper shows that in a real hospital (even one with a massive FHIR repository), those filters are unreliable because the underlying fields are often unpopulated or inaccurate. Consequently, the authors' solution (ACIE) differs from standard RAG by:

  • Content-Based Triage: Instead of filtering by metadata, the agent reads summaries of documents to decide if they are relevant.
  • On-Premise Deployment: Unlike many cloud-based LLM solutions, ACIE runs entirely within the hospital firewall.
  • Clinician-in-the-Loop Configuration: The extraction schema is defined by doctors, not programmers.

4. Methodology

4.1. Principles

The core principle of ACIE is that retrieval must be robust to metadata failure. Since standard filters (e.g., "give me documents from 2023") are unreliable, the system must employ an agent that actively inspects the content of the patient's context to find the information it needs. The architecture is designed to maximize verifiability—every extracted value must be grounded in a specific text passage so a human clinician can check it.

4.2. Core Methodology In-depth (Layer by Layer)

Patient Context Ingestion

The first stage involves gathering all data for a specific patient from the hospital's FHIR server. This includes structured data (lab results, medications) and unstructured documents (PDFs).

  • OCR Processing: Non-machine-readable documents (scanned images) are processed using PaddleOCR-VL. Documents that fail a quality threshold are excluded.
  • Chunking: Documents are split into "chunks" to be stored and retrieved. The authors use two granularities:
    1. Coarse Retrieval Chunks: Larger segments that preserve context for the initial search.
    2. Fine-Grained Citation Passages: Smaller segments used as the atomic unit for citing sources in the final answer.
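The two chunk granularities can be pictured with a minimal sketch. The summary above does not state how the paper splits documents, so the fixed character windows and the sizes `coarse_size` and `fine_size` below are assumptions chosen only for illustration; the `Passage` type is likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    coarse_index: int  # which retrieval chunk this passage belongs to
    start: int         # character offset in the original document
    text: str


def split_fixed(text: str, size: int) -> list[tuple[int, str]]:
    """Split text into consecutive windows of at most `size` characters."""
    return [(i, text[i:i + size]) for i in range(0, len(text), size)]


def two_level_chunks(doc_id: str, text: str,
                     coarse_size: int = 2000, fine_size: int = 300):
    """Return (coarse retrieval chunks, fine citation passages).

    Sizes are illustrative assumptions, not values from the paper.
    """
    coarse = split_fixed(text, coarse_size)
    passages = []
    for idx, (offset, chunk) in enumerate(coarse):
        for rel, piece in split_fixed(chunk, fine_size):
            passages.append(Passage(doc_id, idx, offset + rel, piece))
    return [chunk for _, chunk in coarse], passages
```

Keeping the parent index and character offset on each passage is what lets a citation point back to an exact location in the source document.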

Length-Penalized Retrieval

A critical technical challenge identified is that dense retrievers (which use vector embeddings) often favor short, concise text chunks because they are semantically "dense" and specific. However, in clinical notes, short fragments often lack the necessary context to be meaningful. To counter this bias, the authors apply a length-penalized retrieval score.

When calculating the similarity score $s$ between a query $q$ and a chunk $c$, the system scales the standard cosine similarity by a length-dependent factor:

$$ s = \operatorname{sim}(q, c) \cdot p(\ell), \quad p(\ell) = \min\left(\frac{\ell}{\tau}, 1\right) \cdot \frac{2}{3} + \frac{1}{3} $$

Where:

  • $\operatorname{sim}(q, c)$ is the cosine similarity between the query vector and the chunk vector. This measures how semantically similar the text is.
  • $\ell$ is the length of the chunk in characters.
  • $\tau$ is a threshold parameter, fixed at 40 characters in this study.
  • $p(\ell)$ is the penalty function.

Explanation of the Formula: The function $p(\ell)$ dampens the scores of very short fragments and leaves chunks at or above the threshold unchanged. It never raises a score above the raw cosine similarity.

  1. The term $\frac{\ell}{\tau}$ calculates the ratio of the chunk length to the threshold (40).
  2. $\min(\dots, 1)$ caps this ratio at 1. For chunks of 40 characters or more, this part becomes 1. For shorter chunks, it is a fraction (e.g., a 20-character chunk gives 0.5).
  3. The result is multiplied by $\frac{2}{3}$ and then $\frac{1}{3}$ is added, so $p(\ell)$ ranges from $\frac{1}{3}$ to 1.
    • If the chunk is very short ($\ell \to 0$), the score is multiplied by roughly $\frac{1}{3}$ (severe penalty).
    • For a 20-character chunk, the score is multiplied by $0.5 \cdot \frac{2}{3} + \frac{1}{3} = \frac{2}{3}$.
    • If the chunk is long ($\ell \ge 40$), the score is multiplied by $1 \cdot \frac{2}{3} + \frac{1}{3} = 1$ (no penalty). This ensures that short fragments are not unfairly ranked above longer, more informative passages.
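The length penalty is simple enough to write out directly. The sketch below follows the formula as given, with the threshold of 40 characters and the 2/3 and 1/3 weights taken from the text; the function names are illustrative, and the cosine similarity is assumed to be computed elsewhere.

```python
def length_penalty(length_chars: int, tau: int = 40) -> float:
    """p(l) = min(l / tau, 1) * 2/3 + 1/3."""
    return min(length_chars / tau, 1.0) * (2.0 / 3.0) + (1.0 / 3.0)


def penalized_score(similarity: float, chunk_text: str, tau: int = 40) -> float:
    """s = sim(q, c) * p(l), with l measured in characters."""
    return similarity * length_penalty(len(chunk_text), tau)


if __name__ == "__main__":
    # Penalty factor for an empty, a short, a threshold-length and a long chunk.
    for n in (0, 20, 40, 400):
        print(n, round(length_penalty(n), 3))
```

With these definitions, a 10-character fragment with similarity 0.9 scores 0.45, so it ranks below a full passage with similarity 0.5.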

Agentic Extraction Pipeline

Once the data is ingested and indexed, the extraction process begins. This is driven by a tool-calling agent based on the ReAct paradigm (Reasoning + Acting).

  1. Schema Configuration: Clinicians define "extraction targets" (e.g., "Date of Diagnosis," "Chemotherapy regimen") using a typed schema. No code is written by developers.
  2. Agent Tools: The agent is equipped with a set of tools to interact with the patient context:
    • Semantic Search: Search across all documents using the length-penalized scoring described above.
    • List Documents: Get a list of documents relevant to the query.
    • Query-Relevant Summaries: Since reading hundreds of full documents is too expensive and slow, the system generates summaries on the fly. It assembles the highest-scoring citation chunks for a document (in document order) until it accumulates at least 200 words. This provides a "preview" of the document's content relevant to the specific query.
    • Inspect Document: Read the full text of a specific document.
    • Query Structured Data: Directly query the FHIR structured data (e.g., lab values) using code.
  3. Iterative Reasoning: The agent loops through these tools. For example, it might search for "lymphoma diagnosis," list the results, read the summaries of a few promising letters, inspect the full text of the best one, and then query the structured lab data to confirm a specific marker.
  4. Grounded Output: Finally, the agent returns the extracted value. Crucially, every value must include a citation pointing to the specific source passage (chunk) that supports it.
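The query-relevant summary step from the tool list above can be sketched as follows. This is one plausible reading of the description (take chunks in descending score order until at least 200 words are collected, then present them in document order); the `ScoredChunk` type and function name are hypothetical, and the paper's actual selection logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredChunk:
    position: int  # index of the chunk within its document
    text: str
    score: float   # query relevance, e.g. the length-penalized score


def query_relevant_summary(chunks: list[ScoredChunk], min_words: int = 200) -> str:
    """Pick top-scoring chunks until min_words is reached, then emit them in document order."""
    selected: list[ScoredChunk] = []
    words = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if words >= min_words:
            break
        selected.append(chunk)
        words += len(chunk.text.split())
    # Restore reading order so the preview stays coherent.
    selected.sort(key=lambda c: c.position)
    return "\n".join(c.text for c in selected)
```

Because the summary is built from existing citation chunks rather than generated text, it costs no extra model call and cannot introduce content that is absent from the document.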

Deployment Infrastructure

The system is designed for strict privacy and compliance.

  • On-Premise: Runs entirely on the hospital's Kubernetes cluster. Data never leaves the network.
  • Models:
    • Extraction Model: Qwen 3.6 35B-A3B (a Mixture-of-Experts model).
    • OCR Model: PaddleOCR-VL 1.5.
  • Hardware: The extraction model runs on 4 × H100 GPUs; OCR runs on a single H100.

Architectural Adaptation to Data Quality

The authors highlight a specific failure mode and its fix:

  • The Failure: When patient metadata was serialized in JSON format for the LLM, specific patients consistently triggered malformed tool calls (the agent failed to use tools correctly).

  • The Fix: Switching the serialization format to Markdown eliminated these failures entirely. This suggests that the complex, nested structure of JSON metadata confused the model's parsing, whereas Markdown's linear structure was more robust.

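    The following figure (Figure 1 from the original paper) illustrates the workflow of the ACIE system described above:

    img-0.jpeg Schematic of the ACIE workflow. The patient context is assembled via the FHIR server from data across multiple hospital systems; the extraction agent searches the record according to the extraction schema and returns answers with citations for verification, which the clinician accepts or rejects.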
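To make the JSON-to-Markdown switch concrete, the sketch below serializes the same metadata both ways. The field names and values are invented for illustration, and `to_markdown` is a hypothetical helper; the summary does not describe the paper's actual Markdown layout.

```python
import json


def to_markdown(value: dict, depth: int = 0) -> list[str]:
    """Flatten a nested dict into indented Markdown bullet lines."""
    pad = "  " * depth
    lines = []
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}- {key}:")
            lines.extend(to_markdown(item, depth + 1))
        elif isinstance(item, list):
            lines.append(f"{pad}- {key}: " + ", ".join(str(x) for x in item))
        else:
            lines.append(f"{pad}- {key}: {item}")
    return lines


# Hypothetical document metadata, not taken from the paper.
metadata = {
    "document": {"id": "doc-001", "type": "discharge letter"},
    "dates": {"report": "2021-03-04"},
    "tags": ["oncology", "PET"],
}

as_json = json.dumps(metadata, indent=2)
as_markdown = "\n".join(to_markdown(metadata))
```

The Markdown form carries the same information without braces, quotes, or escape sequences, which is one possible reason it interferes less with the model's own JSON-formatted tool calls.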

5. Experimental Setup

5.1. Datasets

The evaluation relies on two distinct data contexts: the general hospital corpus and a specific clinical study cohort.

  1. General FHIR Corpus (University Medicine Essen):

    • Scale: One of the largest clinical FHIR repositories in Europe.
    • Volume: ~1.97 billion resources (1.7 million patients).
    • Composition: Includes 84M clinical documents, 675M lab values, 77M medication orders, etc.
    • Purpose: Used to analyze the "metadata gap" and data quality challenges (Section 4.1 of the paper).
  2. Lymphoma Registry Study Cohort:

    • Context: An independent retrospective registry study of lymphoma patients undergoing molecular imaging.
    • Design: The study's electronic Case Report Form (eCRF) was designed independently by physicians before ACIE existed, ensuring the extraction targets were not biased toward the AI's capabilities.
    • Schema: 74 fields covering classification, histology, cytogenetics, treatment timelines, and outcomes.
    • Sample Size: 99 patients.
    • Judgments: 7,326 individual data points (99 patients × 74 fields).

5.2. Evaluation Metrics

The primary evaluation metric is Clinician Acceptance Rate. This is a human-in-the-loop metric where subject matter experts (nuclear-medicine physicians) act as the ground truth.

  1. Conceptual Definition: The acceptance rate measures the percentage of AI-extracted values that a clinician approves for entry into the clinical record. It captures not just factual correctness, but also clinical usefulness and safety. If the AI returns a value, the clinician checks it against the source text. If the AI returns nothing (abstains), the clinician checks if the information is truly absent.

  2. Mathematical Formula: While the paper does not provide a single algebraic formula for "Acceptance," it can be derived from the counts provided. Let $N$ be the total number of judgments (fields × patients) and $A$ the number of accepted judgments:

     $$ \text{Acceptance Rate} = \frac{A}{N} $$

     The paper further breaks this down into Precision (acceptance of non-empty values) and Abstention Reliability (acceptance of empty values):

     $$ \text{Precision} = \frac{\text{Accepted Non-Empty Values}}{\text{Total Non-Empty Values Generated}} $$

     $$ \text{Abstention Reliability} = \frac{\text{Accepted Empty Values}}{\text{Total Empty Values Generated}} $$

  3. Symbol Explanation:

    • $A$: The count of fields marked as "Accepted" by the reviewer.
    • $N$: The total count of fields evaluated (7,326).
    • Non-Empty Values: Fields where the AI extracted a specific text or number.
    • Empty Values: Fields where the AI returned "null" or abstained from answering.
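The three rates defined above can be computed from per-field judgments as in the sketch below. The `Judgment` type and `summarize` function are illustrative. The overall figure is reproduced from counts reported in this analysis (7,326 judgments, 253 rejections), under the assumption that every judgment is either accepted or rejected.

```python
from typing import Iterable, NamedTuple


class Judgment(NamedTuple):
    non_empty: bool  # True if the system returned a value
    accepted: bool   # True if the reviewer accepted the field


def _fraction(rows: list) -> float:
    """Share of accepted judgments; NaN when the subset is empty."""
    return sum(j.accepted for j in rows) / len(rows) if rows else float("nan")


def summarize(judgments: Iterable[Judgment]) -> dict[str, float]:
    """Acceptance rate, precision and abstention reliability."""
    rows = list(judgments)
    return {
        "acceptance": _fraction(rows),
        "precision": _fraction([j for j in rows if j.non_empty]),
        "abstention_reliability": _fraction([j for j in rows if not j.non_empty]),
    }


# Overall acceptance from reported counts (assumes accepted = total - rejected).
N = 7326
REJECTED = 253
ACCEPTED = N - REJECTED
OVERALL = ACCEPTED / N
print(f"{OVERALL:.1%}")
```

All three metrics are the same ratio applied to different subsets, which is why a single helper suffices.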

5.3. Baselines

The paper explicitly states that it did not compare against a non-agentic or commercial baseline in the primary evaluation. The focus was on validating the deployed system against the clinical standard of care (manual verification) rather than benchmarking against other algorithms. The "baseline" is effectively the status quo of manual data extraction by clinicians.

6. Results & Analysis

6.1. Core Results Analysis

The primary result is a 96.5% overall acceptance rate across 7,326 judgments. This high validation rate confirms that the agentic RAG approach is viable for complex clinical data extraction.

  • Precision vs. Abstention: The system is balanced. It achieved 96.4% precision on values it did extract, and 96.8% reliability on values it didn't extract (correctly identifying when information is missing).
  • Safety: There were zero hallucinations (0.0% in Table 4): no rejected field was classified as fabricated content lacking source support. The dominant error type was "Incorrect value" (87.4% of rejections), a comparatively benign failure mode under human review, because the reviewer sees the cited passage and corrects the value.
  • Failure Modes: Performance varied significantly by data type.
    • Strong Types: Categorical (98.6%), Numerical (98.3%), Boolean (98.6%). These are facts easily found in text.
    • Weak Types: Dates (84.3%) and Tabular fields (79.8%).
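    • Analysis: Dates and tables require temporal reasoning, that is, assembling a timeline from multiple, sometimes conflicting documents. This is the hardest task for the system. For example, "Date of death or last follow-up" was the single most rejected field. Table 6 shows where each type fails: for dates, the weakness is mostly in abstentions (only 69.8% of empty date fields were accepted), whereas for tabular fields it is in returned values (71.2% accepted).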

6.2. Data Presentation (Tables)

Metadata Gap and Data Quality

The following are the results from Table 1 of the original paper, showing the scale of the FHIR repository:

| Category | Count |
| --- | --- |
| Patient records | 5,598,272 |
| Unique individuals | 1,747,135 |
| Orders and requests | 852M |
| Lab values and observations | 675M |
| Clinical documents | 84M |
| Medication orders | 77M |
| Diagnoses and conditions | 40M |
| Medication administrations | 33M |
| Clinical encounters | 29M |
| Other | 182M |
| Total FHIR resources | 1.97B |

The following are the results from Table 2 of the original paper, showing per-patient statistics:

| Metric | Med. (IQR) | P99 | Max |
| --- | --- | --- | --- |
| Docs | 52 (14–140) | 937 | 2,542 |
| Dedup. (%) | 33.5 (20.0–43.0) | - | 54.6 |
| Struct. resources | 406 (38–1,922) | 37,074 | 119,191 |
| Encounters | 18 (6–46) | 323 | 1,207 |
| OCR rej. (%) | 10.3 (6.8–17.8) | - | 52.0 |

Clinical Study Evaluation Results

The following are the results from Table 3 of the original paper, showing clinician acceptance by field type:

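| Field type | n | Acc.% | Empty% |
| --- | --- | --- | --- |
| Categorical | 4,455 | 98.6 | 35.1 |
| Numerical | 891 | 98.3 | 76.1 |
| Boolean | 792 | 98.6 | 21.8 |
| Free text | 297 | 96.0 | 59.9 |
| Date | 594 | 84.3 | 34.0 |
| Tabular | 297 | 79.8 | 31.0 |
| Total | 7,326 | 96.5 | 39.4 |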

The following are the results from Table 4 of the original paper, categorizing the 253 rejected fields:

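| Group | Rejection category | Count (%) |
| --- | --- | --- |
| Extraction errors | Incorrect value (needs correction) | 221 (87.4) |
| Extraction errors | Fully incorrect (replaced) | 7 (2.8) |
| Extraction errors | Missed extraction (false negative) | 9 (3.6) |
| Extraction errors | Missing reference | 3 (1.2) |
| Extraction errors | Extraneous extraction (false positive) | 1 (0.4) |
| Extraction errors | Hallucinated | 0 (0.0) |
| Editorial adjustments | Missing information | 2 (0.8) |
| Editorial adjustments | Excess information | 1 (0.4) |
| Form configuration | Configuration error | 9 (3.6) |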

Detailed Error Analysis

The following are the results from Table 5 of the original paper, listing the 74 AI-extracted fields:

| Type (n) | Subgroup (n) | Fields |
| --- | --- | --- |
| Categorical (45) | WHO classification & subtyping (9) | Primary lymphoma category; entity-specific subtype for LBCL, DLBCL cell-of-origin, follicular, Hodgkin, mantle-cell, marginal-zone, Burkitt, and peripheral T-/NK-cell lymphoma |
| Categorical (45) | Histology & qualitative IHC (9) | Biopsy sites; necrosis; positive/negative immunohistochemistry for p53, PD-L1 (tumour cells), MYC, BCL2, CD20, CD30, and EBV (EBER ISH) |
| Categorical (45) | Cytogenetics & FISH (9) | Cytogenetics performed; karyotype available; complex karyotype; FISH status; MYC / BCL2 / BCL6 rearrangement; del(17p)/TP53; 9p24.1 amplification |
| Categorical (45) | NGS mutation status (17) | Overall NGS status; TP53, MYD88 (L265P), NOTCH1, NOTCH2, EZH2, CD79B, ARID1A, MEF2B, EP300, FOXO1, CREBBP, CARD11, RHOA (G17V), TET2, DNMT3A, IDH2 (R172) |
| Categorical (45) | Outcome (1) | Overall survival event |
| Numerical (9) | - | Height; weight at diagnosis; months from ASCT to relapse or progression; Ki-67 index; IHC expression percentage for p53, PD-L1 (tumour cells), PD-L1 (immune cells), MYC, and BCL2 |
| Boolean (8) | - | Significant comorbidities (CIRS); B-symptoms; progression event; relapse event; prior ASCT before relapse/progression; histological transformation; bone-marrow involvement; large-cell component |
| Date (6) | - | Diagnosis; most recent biopsy; transformation; progression or last follow-up; relapse or last follow-up; death or last follow-up |
| Free text (3) | - | Original pathology-report diagnosis; extranodal site specification; biopsied body-fluid specification |
| Tabular (3) | - | Treatment timeline (each therapy line, transplant, surgery, and radiation with dates and cycles); PET examinations (dates, indication, tracer); MRD assessments (method, result) |

The following are the results from Table 6 of the original paper, splitting acceptance by returned status:

| Field type | Returned a value: n | Returned a value: Acc.% | Returned empty: n | Returned empty: Acc.% |
| --- | --- | --- | --- | --- |
| Categorical | 2,892 | 98.3 | 1,563 | 99.1 |
| Numerical | 213 | 98.1 | 678 | 98.4 |
| Boolean | 619 | 98.4 | 173 | 99.4 |
| Free text | 119 | 93.3 | 178 | 97.8 |
| Date | 392 | 91.8 | 202 | 69.8 |
| Tabular | 205 | 71.2 | 92 | 98.9 |
| Total | 4,440 | 96.4 | 2,886 | 96.8 |

The following are the results from Table 7 of the original paper, showing rejection categories by field type:

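| Field type | Rej. | Incorrect | Fully incorr. | Miss. info | Excess | Missed | Extran. | Miss. ref. | Config. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Categorical | 62 | 48 | 1 | 1 | - | 3 | 1 | 1 | 7 |
| Numerical | 15 | 12 | - | - | - | 2 | - | 1 | - |
| Boolean | 11 | 6 | 2 | 1 | - | - | - | - | 2 |
| Date | 93 | 89 | 2 | - | - | 1 | - | 1 | - |
| Free text | 12 | 11 | - | - | - | 1 | - | - | - |
| Tabular | 60 | 55 | 2 | - | 1 | 2 | - | - | - |
| Total | 253 | 221 | 7 | 2 | 1 | 9 | 1 | 3 | 9 |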
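As a cross-check on the claim that temporal fields drive most of the remaining errors, the counts in Table 3 (judgments per field type) and Table 7 (rejections per field type) can be combined. The snippet below only rearranges those reported numbers; the variable and function names are illustrative.

```python
# Counts copied from Table 3 (judgments) and Table 7 (rejections).
judgments = {"Categorical": 4455, "Numerical": 891, "Boolean": 792,
             "Free text": 297, "Date": 594, "Tabular": 297}
rejections = {"Categorical": 62, "Numerical": 15, "Boolean": 11,
              "Free text": 12, "Date": 93, "Tabular": 60}


def share(types: tuple, counts: dict) -> float:
    """Fraction of all counts that falls on the given field types."""
    return sum(counts[t] for t in types) / sum(counts.values())


def acceptance(field_type: str) -> float:
    """Per-type acceptance rate implied by the two tables."""
    return 1 - rejections[field_type] / judgments[field_type]


temporal = ("Date", "Tabular")
print(f"share of judgments:  {share(temporal, judgments):.1%}")   # 12.2%
print(f"share of rejections: {share(temporal, rejections):.1%}")  # 60.5%
```

Date and tabular fields make up about one eighth of all judgments but account for roughly three fifths of all rejections, which is consistent with the per-type acceptance rates in Table 3.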

The following are the results from Table 8 of the original paper, listing the ten most-rejected fields:

| Field | Type | Rej. |
| --- | --- | --- |
| Date of death / last follow-up | Date | 44 |
| Treatment timeline | Tabular | 40 |
| Date of progression / last PET | Date | 22 |
| Overall survival event | Categorical | 16 |
| MRD assessments | Tabular | 15 |
| Weight at diagnosis | Numerical | 12 |
| Date of relapse / last PET | Date | 11 |
| Biopsy sites | Categorical | 8 |
| Diagnosis date | Date | 7 |
| Extranodal site specification | Free text | 7 |

Statistical Distributions

The following are the results from Table 9 of the original paper, showing full per-patient context statistics:

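| Metric | Min | P1 | Q1 | Med. | Q3 | P99 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Raw docs | 1 | 2 | 22 | 77 | 193 | 1,269 | 3,029 | 164.6 |
| Deduped | 1 | 1 | 14 | 52 | 140 | 937 | 2,542 | 120.4 |
| Encounters | 0 | 3 | 6 | 18 | 46 | 323 | 1,207 | 42.0 |
| Structured | 0 | 1 | 38 | 406 | 1,922 | 37,074 | 119,191 | 2,804.1 |
| History (days) | 0 | 15 | 330 | 1,305 | 4,145 | 9,775 | 739,726 | 2,699.7 |
| Doc. length (chars)† | 24 | - | 1,387 | 2,915 | 11,035 | - | 907,179 | 10,788 |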

The following are the results from Table 10 of the original paper, showing encounter coverage:

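| Case enc. | n | Min | P1 | Q1 | Median | Q3 | P99 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1,930 | 9.5 | 42.9 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 96.6 |
| 2-5 | 3,334 | 2.1 | 23.8 | 46.7 | 59.0 | 75.0 | 100.0 | 100.0 | 61.3 |
| 6-20 | 3,099 | 6.9 | 12.0 | 24.3 | 32.6 | 44.0 | 80.2 | 100.0 | 35.5 |
| 20+ | 1,637 | 2.5 | 3.9 | 10.0 | 14.7 | 21.1 | 53.5 | 85.5 | 17.1 |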

The following are the results from Table 11 of the original paper, showing concentration index:

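| Metric | Min | P1 | Q1 | Med. | Q3 | P99 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Top-encounter coverage (%) | 2.1 | 6.3 | 26.7 | 47.5 | 80.0 | 100.0 | 100.0 | 52.9 |
| Concentration index | 0.04 | 0.73 | 1.27 | 2.24 | 3.75 | 14.83 | 105.1 | 3.07 |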

The following are the results from Table 12 of the original paper, showing deduplicated documents per patient by history length:

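| History | n | Min | P1 | Q1 | Med. | Q3 | P99 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| < 1 yr | 2,709 | 1 | 1 | 4 | 10 | 26 | 190 | 497 | 23.7 |
| 1–3 yr | 1,995 | 1 | 2 | 18 | 49 | 102 | 572 | 1,476 | 86.0 |
| 3–5 yr | 934 | 2 | 6 | 34 | 82 | 162 | 777 | 1,360 | 132.6 |
| 5–10 yr | 1,602 | 1 | 6 | 50 | 107 | 235 | 1,217 | 2,542 | 187.9 |
| 10+ yr | 2,760 | 1 | 3 | 43 | 114 | 252 | 1,174 | 2,439 | 196.8 |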

The following are the results from Table 13 of the original paper, showing total FHIR resources per patient by history length:

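| History | n | Min | P1 | Q1 | Med. | Q3 | P99 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| < 1 yr | 2,709 | 1 | 5 | 17 | 52 | 291 | 13,040 | 52,462 | 711.4 |
| 1–3 yr | 1,995 | 6 | 8 | 105 | 490 | 1,645 | 29,889 | 102,412 | 2,241.7 |
| 3–5 yr | 934 | 12 | 20 | 225 | 931 | 2,820 | 30,848 | 83,760 | 2,996.0 |
| 5–10 yr | 1,602 | 5 | 19 | 352 | 1,177 | 4,061 | 47,994 | 100,673 | 4,288.7 |
| 10+ yr | 2,760 | 6 | 18 | 365 | 1,303 | 4,939 | 48,929 | 121,542 | 5,086.6 |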

The following are the results from Table 14 of the original paper, showing document metadata population rates:

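| Group | Resource | Metadata field | Pop. % |
| --- | --- | --- | --- |
| Identification and relationships | DocRef | Unique identifier | 27.8 |
| Identification and relationships | DocRef | Related documents | 0.52 |
| Authorship and provenance | DocRef | Author | 1.9 |
| Authorship and provenance | DocRef | Authenticator | 16.2 |
| Authorship and provenance | DocRef | Custodian | 100.0 |
| Authorship and provenance | DiagRep | Performer | 27.1 |
| Authorship and provenance | DiagRep | Results interpreter | 97.5 |
| Content description | DocRef | Description | 99.7 |
| Content description | DocRef | Attachment title | 0.0 |
| Content description | DocRef | Content format | 98.1 |
| Content description | DiagRep | Title | 60.6 |
| Content description | DiagRep | Structured conclusion | 0.45 |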

The following are the results from Table 15 of the original paper, showing timestamp field population rates:

| Resource | Timestamp field | Pop. % |
| --- | --- | --- |
| DocRef | Report date | 99.99 |
| DocRef | File creation date | 100.0 |
| DocRef | Encounter period | 0.0 |
| DocRef | Release date | 76.5 |
| DocRef | Print date | 24.7 |
| DocRef | Record last updated | 100.0 |
| DiagRep | Effective date | 95.2 |
| DiagRep | Issued date | 78.0 |
| DiagRep | File creation date | 97.7 |
| DiagRep | Record last updated | 100.0 |

The following are the results from Table 16 of the original paper, showing alignment between FHIR metadata timestamps and clinical dates:

| FHIR field | n | Same day | ±1 day | >1 day |
| --- | --- | --- | --- | --- |
| All fields | 15,142 | 58.8% | 4.8% | 36.5% |
| Report date | 8,037 | 58.6% | 5.2% | 36.2% |
| Effective date | 6,240 | 59.0% | 3.7% | 37.3% |
| Issued date | 859 | 59.1% | 7.9% | 32.9% |
| File creation | 6 | 50.0% | 0.0% | 50.0% |
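The three alignment columns in Table 16 amount to bucketing the gap between a metadata timestamp and the clinically relevant date. A minimal sketch of that bucketing follows, assuming that "±1 day" means exactly one day off (the three columns sum to about 100%, so the buckets appear to be disjoint); the function names are illustrative.

```python
from collections import Counter
from datetime import date

BUCKETS = ("same day", "±1 day", ">1 day")


def alignment_bucket(metadata_date: date, clinical_date: date) -> str:
    """Classify the absolute gap in days between the two dates."""
    gap = abs((metadata_date - clinical_date).days)
    if gap == 0:
        return "same day"
    if gap == 1:
        return "±1 day"
    return ">1 day"


def alignment_rates(pairs: list[tuple[date, date]]) -> dict[str, float]:
    """Share of (metadata, clinical) date pairs falling in each bucket."""
    if not pairs:
        raise ValueError("no date pairs given")
    counts = Counter(alignment_bucket(m, c) for m, c in pairs)
    return {b: counts[b] / len(pairs) for b in BUCKETS}
```

A retrieval filter such as "documents from March 2021" is only as good as these rates: every pair in the last bucket is a document that a date filter could place in the wrong window.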

6.3. Ablation Studies / Parameter Analysis

The paper does not present formal ablation studies in the traditional sense (e.g., "with vs. without agent"). However, the "Lessons from Deployment" section serves as a qualitative analysis of architectural choices:

  • Agentic vs. Static Filtering: The authors initially tried a standard RAG pipeline with metadata filters. It failed. The switch to an agentic approach was the "ablation" that made the system work.
  • Markdown vs. JSON: The switch from JSON to Markdown for metadata serialization was a specific fix that resolved deterministic tool-calling failures.
  • Length Penalty ($\tau$): The authors fixed the threshold $\tau = 40$ and the blend weights ($\frac{2}{3}$, $\frac{1}{3}$) on a development subset. This parameter controls the bias against short text fragments.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper demonstrates that Agentic Clinical Information Extraction (ACIE) is a viable solution for extracting structured data from messy, real-world electronic health records. By deploying an on-premise agentic RAG system, the authors achieved a 96.5% clinician acceptance rate with zero hallucinations. The key insight is that because clinical metadata is fundamentally unreliable (the "metadata gap"), AI architectures must shift from metadata-based filtering to content-based reasoning. The system successfully shifts the clinician's burden from manual compilation to verification of grounded citations.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Scope: The evaluation is a single-site, retrospective study in one language (German), focusing on one disease area (lymphoma). Generalization to other settings is untested.
  • Comparison: There was no comparison against a non-agentic baseline or commercial system, so the specific contribution of the "agentic" design is isolated by logical argument rather than direct benchmarking.
  • Reviewer Bias: Each field was graded by a single expert; inter-rater reliability was not measured.
  • Model Constraints: Performance is bounded by the capabilities of the on-premise model (Qwen 3.6 35B-A3B).
  • Temporal Reasoning: The system struggles with complex temporal reasoning (dates and tables), which remains a residual challenge.

7.3. Personal Insights & Critique

This paper provides a crucial reality check for the AI-in-medicine field. Much research focuses on benchmark performance on clean, curated datasets. This paper reveals that in the "wild," the fundamental assumptions of standard RAG (reliable metadata) are often false.

  • Strengths: The rigorous quantification of the metadata gap (e.g., showing that 36.5% of the compared FHIR metadata timestamps differ from the clinical date by more than a day) is a significant contribution that benefits the entire community. The architectural decision to use "query-relevant summaries" to triage documents is a clever, practical solution to the context-length problem.
  • Transferability: The lessons learned here are likely universal. Any hospital deploying RAG will face similar metadata sparsity. The "Agentic" approach—treating retrieval as a reasoning problem rather than a search problem—is likely the correct path forward for complex domains beyond medicine, such as legal or financial discovery.
  • Potential Issues: The reliance on a large 35B model (even if MoE) requires substantial hardware (4 H100s), which may be a barrier to adoption for smaller hospitals. Additionally, while 96.5% acceptance is high, the 3.5% error rate in a clinical setting (even with human review) requires careful workflow design to ensure the "human in the loop" doesn't become complacent (automation bias).