Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Published: 06/18/2026



1. Bibliographic Information

1.1. Title

The title of the paper is "Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents". This title signals the paper's central concern: moving away from traditional, static ranking systems for Large Language Model (LLM) agents towards a more robust evaluation metric based on how well rankings predict performance in new, unseen situations.

1.2. Authors

The paper has a large list of authors, led by Dhaval C. Patel. The full list includes Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le, Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, and Byeolah Kwon.

The extensive author list suggests a large collaborative project, plausibly with many contributors responsible for the individual implementation studies. Based on the affiliations implied by the references (e.g., "Columbia University"), it appears to be a multi-institutional effort, possibly involving both industry research labs and academic institutions.

1.3. Journal/Conference

The paper is listed as being "Published at (UTC): 2026-06-18T02:02:46.000Z" and is available on arXiv (arXiv:2606.19704). The abstract mentions that the benchmark used (AssetOpsBench) has been accepted at EMNLP 2025, NeurIPS 2025, and AAAI 2026. This suggests the paper itself is a preprint that may be destined for a top-tier AI conference in 2026, such as NeurIPS, ICLR, or ACL. The mention of EMNLP and AAAI indicates the work spans both NLP and AI application domains. The benchmark's community traction (GitHub stars and forks) and its acceptance at multiple top venues indicate that the work builds on a recognized foundation in the field.

1.4. Publication Year

The paper was published in 2026.

1.5. Abstract

The paper's research objective is to address the inadequacy of current aggregate-score leaderboards for evaluating LLM agents. The authors argue that these leaderboards systematically underspecify the evaluation dimensions relevant to real-world deployment. The core methodology involves aggregating findings from fourteen parallel implementation studies extending one specific industrial-agent benchmark (AssetOpsBench) and consolidating these with seven prior benchmarks. The main results show that rankings derived from aggregate scores do not transfer well to out-of-distribution (OOD) settings, evidenced by rank instability in public-to-hidden competitions. The key conclusion is that the field should rank configurations by "predictive validity"—the correlation between in-sample and out-of-sample rank—rather than just in-sample mean performance. The paper proposes a twelve-tier measurement apparatus to expose deployment-relevant dimensions that current frameworks like HELM collapse, and provides a pre-registered pilot design for future validation.

The paper is available as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the "rank instability" of LLM agents when evaluated using traditional, static leaderboards. Current benchmarks often rely on a single aggregate score (like Pass@1) to rank agents. However, LLM agents are complex systems that plan, call tools, reuse artifacts, and coordinate. A single score hides the trade-offs between different dimensions of performance, such as reasoning quality vs. cost, or retrieval accuracy vs. latency.

This problem is critically important because real-world deployment of agents involves high stakes (operator time, capital, safety). If a leaderboard ranking does not predict how an agent will perform in a new, slightly different deployment scenario (an out-of-distribution setting), then the leaderboard is misleading for decision-makers. The paper cites a specific competition where the correlation between public and hidden rankings was statistically indistinguishable from zero ($\rho = -0.13$) for the execution track, demonstrating that optimizing for a public leaderboard does not guarantee good performance on unseen data.

The paper's entry point is the observation that existing benchmarks (like HELM) and new agent benchmarks only touch a small subset of the dimensions that matter in deployment. The innovative idea is to shift the evaluation paradigm from "scoring high on a known test set" to "demonstrating that your ranking is stable and predictive across different, unseen test sets."

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. A Synthesis of Evaluation Dimensions: It consolidates metrics from seven source benchmarks and fourteen implementation studies into a "twelve-tier measurement apparatus." This apparatus exposes dimensions (like deployment infrastructure, multi-turn dialog, reasoning mode adaptivity) that are currently collapsed or ignored by standard leaderboards.

  2. A Position Argument: It formally argues that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. It claims that the field would benefit from a multi-tier measurement apparatus ranked by predictive validity rather than in-sample mean.

  3. Operationalization of Predictive Validity: It proposes three concrete, falsifiable criteria for testing out-of-distribution (OOD) behavior: Held-Out Scenarios, Cross-Subset Transfer, and Adversarial Perturbation. It also proposes a formula for a Predictive-Validity (PV) score.

  4. A Research Agenda: It provides a pre-registered pilot design and a field-level vision for the next generation of agentic benchmarks, including specific proposals for leaderboard structure (declared configuration columns, layered presentation).

The key findings are that different architectural choices (e.g., reasoning-on vs. off, RAG vs. Knowledge Plugin) lead to qualitatively different behaviors that aggregate scores hide. Furthermore, LLM-as-a-judge evaluation is reflexive (the judge's biases affect the ranking) and lacks an external anchor, making it unstable. The paper concludes that ranking by predictive validity is a more useful criterion for deployment decisions than ranking by in-sample mean.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, one must be familiar with several key concepts:

  • LLM Agents: Unlike a standard Large Language Model (LLM) that simply generates text based on a prompt, an LLM agent is a system that uses an LLM as a reasoning engine to interact with its environment. This interaction often involves using tools (e.g., a calculator, a database query function, a web search API) to perform actions. The agent can plan a sequence of steps, execute a tool, observe the result, and then decide on the next step, often over multiple turns of conversation.
  • MCP (Model Context Protocol): This is a specific standard mentioned frequently in the paper. It is an open protocol that standardizes how LLM applications connect to external data sources and tools. Think of it as a universal "plug-and-play" standard for AI assistants. If an agent follows MCP, it can easily connect to any MCP-compliant server (like a file system, a database, or a company's internal API) without needing custom code for each one. The paper focuses on "MCP-based" agents, meaning agents that use this protocol to access industrial tools and data.
  • Benchmark: A benchmark is a standardized set of tasks and data used to evaluate and compare the performance of different AI models or systems. For agents, a benchmark typically consists of scenarios (e.g., "Diagnose why this HVAC chiller is failing") along with the necessary tools and a correct answer or evaluation criteria.
  • Leaderboard: A leaderboard is a ranked list that displays the performance scores of different models or systems on a benchmark. It's a common way to track progress in AI research (e.g., the "ImageNet leaderboard" or "MMLU leaderboard").
  • Out-of-Distribution (OOD) Generalization: This is a fundamental concept in machine learning. It refers to a model's ability to perform well on data that is different from the data it was trained or evaluated on. For example, a model trained on pictures of dogs might fail (perform poorly) on an "out-of-distribution" test set consisting of cartoon drawings of dogs. The paper argues that agent evaluation must focus on OOD performance because real-world deployment is always an OOD scenario relative to the fixed benchmark set.
  • Spearman Correlation ($\rho$): This is a statistical measure used to assess the strength and direction of the monotonic relationship between two ranked variables. Unlike Pearson correlation, which measures linear relationships on raw values, Spearman correlation works on the ranks of the data. If you rank 10 agents on a public leaderboard and then rank the same 10 agents on a hidden test set, Spearman correlation tells you how similar those two rankings are. A value of 1 means the rankings are identical, -1 means they are perfectly reversed, and 0 means there is no relationship. The paper uses this to quantify "rank instability."
  • LLM-as-a-Judge: This is an evaluation methodology where one LLM is used to grade the output of another LLM. For example, to check if an agent's answer is correct, you might use a powerful model like GPT-4 to read the question, the agent's answer, and the ground truth, and then ask GPT-4 to provide a score or a pass/fail judgment. This is scalable but introduces potential biases from the judge model itself.
  • Pass@k: This is a common metric for code generation and task completion. It measures the probability that at least one of the top $k$ generated solutions (e.g., the agent's first $k$ attempts or plans) is correct. Pass@1 means only the very first attempt is considered.
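The Pass@k definition above can be made concrete with a minimal sketch. It assumes each scenario comes with an ordered list of attempt outcomes; the function name `pass_at_k` and the sample data are hypothetical and not taken from the paper. This is the simple "first k attempts" reading used in the text, not the sampling-based estimator sometimes used in code-generation work.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of scenarios where any of the first k attempts is correct."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("need at least one scenario")
    solved = sum(1 for outcomes in attempts if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: one row per scenario, one entry per ordered attempt.
outcomes = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
print(pass_at_k(outcomes, 1))  # 0.25
print(pass_at_k(outcomes, 3))  # 0.75
```

With $k = 1$ the metric collapses to the plain pass rate, which is why Pass@1 serves as the floor metric in the tiers discussed later.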

3.2. Previous Works

The paper synthesizes and builds upon several streams of prior work:

  1. Agent Benchmarks: The paper cites a wave of recent benchmarks designed for agents, such as SWE-bench (for resolving GitHub issues), TaskBench (for task automation), and MCP-Bench (for tool-using agents). Each of these benchmarks surfaces a different facet of evaluation (e.g., trajectory-level evaluation, tool-use accuracy) but still relies on aggregate scores for ranking. The paper positions itself as a critique and synthesis of this fragmented landscape.
  2. Critiques of Single-Score Ranking: The paper draws on a longer history of NLP research critiquing the over-reliance on single metrics. It cites works like "Utility is in the eye of the user" (Ethayarajh and Jurafsky, 2020), which argues that leaderboard rankings reflect a specific user's utility function only by accident. It also mentions the "benchmark lottery" (Dehghani et al., 2021), which showed that the choice of data subsample can reorder rankings. These works provide the theoretical foundation for the paper's argument against aggregate scores.
  3. Multi-Dimensional Frameworks: The paper references HELM (Holistic Evaluation of Language Models), a framework that broadens the dimensions on which single-shot models are measured (e.g., accuracy, fairness, robustness). The paper argues that while HELM is a step forward for models, agents introduce new, orthogonal axes (like orchestration, tool-call hygiene) that HELM does not score. It positions its twelve-tier apparatus as the successor to HELM for the agent era.
  4. Behavioral Testing: The paper mentions behavioral testing (e.g., CheckList), which tests models on specific behavioral capabilities (e.g., robustness to negation). This is related to the paper's "adversarial perturbation" criterion for OOD testing.
  5. Specific Benchmarks Cited:
    • AssetOpsBench: The central benchmark used in the paper's analysis. It's an industrial-agent benchmark for asset operations and maintenance. The paper chose it for its community traction and extensibility.
    • ARE/Gaia2: A benchmark that checks executed trajectories against human-annotated oracle DAGs (Directed Acyclic Graphs) with hard, causality, and timing checks. It provides a method for judge-independent verification, which the paper champions.
    • Condition Insight: An industrial framework that compares LLM-assigned classifications to a parallel rule-based pipeline, providing another example of judge-independent measurement.

3.3. Technological Evolution

The field of AI evaluation has evolved from simple accuracy metrics on static datasets (e.g., ImageNet classification accuracy) to more holistic evaluations for language models (e.g., HELM, which considers multiple metrics and scenarios). With the rise of LLM agents, the complexity has increased further. Evaluation now needs to account for multi-step reasoning, tool use, and interaction with dynamic environments.

Current agent benchmarks represent a first step in this new era, creating scenarios that require tool use. However, they have largely inherited the "single-score ranking" paradigm from the previous era. The paper's work fits into this timeline as the next critical step: moving from "can we create a benchmark for agents?" to "how can we create a robust and predictive evaluation framework for complex, deployed agents?" It represents a shift from capability evaluation (can it do the task?) to reliability and generalizability evaluation (will it do the task consistently in the real world?).

3.4. Differentiation Analysis

The core difference between this paper's approach and prior methods is its ranking criterion.

  • Prior Methods: Most prior benchmarks and leaderboards rank systems based on an in-sample mean score (e.g., average Pass@1 on the benchmark's test set). They assume that a higher score on the benchmark means a better system for deployment. They often report multiple metrics but still use a single aggregate score for the final ranking.

  • This Paper's Approach: The paper proposes ranking systems based on predictive validity. This is a fundamentally different criterion. Instead of asking "what is the average score?", it asks "how stable is the ranking if we change the test set?". A system with a slightly lower in-sample mean but a very stable ranking (high predictive validity) would be ranked higher than a system with a high in-sample mean but an unstable ranking. This directly addresses the deployment question of "can I trust this ranking to hold in my specific environment?"

Another key differentiator is the multi-tier measurement apparatus. While frameworks like HELM are multi-dimensional for models, this paper introduces dimensions specific to agents (like orchestration, tool hygiene, multi-turn artifact reuse) that are entirely absent from previous frameworks. It argues that these dimensions are orthogonal and non-redundant, providing a more complete picture of an agent's deployment profile.

4. Methodology

4.1. Principles

The core principle of the proposed methodology is that predictive validity is a better ranking criterion for deployment decisions than in-sample mean performance.

The theoretical basis is that deployed systems operate in an out-of-distribution (OOD) environment. They will never encounter the exact same scenarios as those in the training set or the fixed benchmark set. Therefore, an evaluation metric that only measures performance on the benchmark (in-sample) is a poor proxy for real-world utility. A more useful metric is one that measures how well the system's ranking relative to other systems holds up when the evaluation data is changed. If System A is ranked #1 on the public leaderboard, we want to be confident it will also be ranked #1 (or close to it) on a private, held-out set of scenarios that more closely resembles a specific deployment environment.

The intuition is that a system which achieves a high score by overfitting to the specific quirks of the benchmark data will have a high in-sample mean but a low predictive validity (its ranking will collapse on new data). Conversely, a robust system might have a slightly lower in-sample mean but a much higher predictive validity, making it a safer choice for deployment.

4.2. Core Methodology In-depth (Layer by Layer)

The paper's methodology is a comprehensive framework that defines what to measure (the twelve-tier apparatus) and how to rank systems based on those measurements (the predictive validity score).

Step 1: Defining the Measurement Space (The Twelve-Tier Apparatus)

The first step is to establish a comprehensive set of dimensions that characterize an LLM agent's performance in deployment. The paper does not invent these dimensions but synthesizes them from seven prior benchmarks and fourteen new implementation studies. The result is a twelve-tier apparatus, divided into two groups.

The Seven Core Tiers (T1–T7): These consolidate metrics from established benchmarks.

  • T1 Success: The most basic metric, like a pass/fail rate (e.g., Pass@1). It's a floor metric, useful but not very informative on its own.

  • T2 Tool-Call Hygiene: This measures the correctness of the agent's interaction with tools. It checks if tool names are valid, if the parameters match the schema, and if the execution was successful. A high success rate (T1) can be achieved with poor tool hygiene (e.g., getting the right answer but making many invalid calls along the way), so this tier is non-redundant.

  • T3 Planning-Process Quality: This evaluates the quality of the agent's plan before execution. It might use LLM-judged metrics to assess if the plan is clear, logical, and efficient. Two agents can achieve the same final success (T1) via very different planning processes (one might be concise, the other verbose and convoluted).

  • T4 Capability Axes: This breaks down performance into specific capabilities like execution, search, adaptability to time constraints, handling ambiguity, and noise. It provides a capability profile rather than a single score.

  • T5 Cost-Efficiency Pareto: This measures the trade-off between performance and resource cost. It includes metrics like cost per scenario, latency, and step count. A deployment decision often depends on finding the best performance for a given budget, making this a critical axis.

  • T6 Failure-Mode Taxonomies: Instead of just reporting a failure, this tier categorizes how the agent failed (e.g., "hallucinated a tool name," "misunderstood the query," "retrieved wrong data"). This detailed failure analysis is crucial for debugging and improving the system.

  • T7 Integrity & Reproducibility: This measures the reliability of the evaluation itself. It includes metrics like multi-run variance (does the score change if we run it again?), inter-rater agreement between judges, and the distinction between validation and test sets.
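To illustrate why T5 is reported as a Pareto trade-off rather than a single number, here is a small sketch that filters configurations down to the non-dominated set on success rate and cost. The `Config` type, the field names, and all values are hypothetical; the paper does not prescribe this code.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    success: float  # higher is better (e.g., Pass@1)
    cost: float     # lower is better (e.g., cost per scenario)


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep configurations that no other configuration dominates."""
    front = []
    for c in configs:
        dominated = any(
            o.success >= c.success and o.cost <= c.cost
            and (o.success > c.success or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


# Hypothetical configurations, not results from the paper.
configs = [
    Config("A", 0.80, 0.50),
    Config("B", 0.78, 0.10),
    Config("C", 0.60, 0.20),
    Config("D", 0.82, 0.90),
]
print([c.name for c in pareto_front(configs)])  # ['B', 'A', 'D']
```

A ranking by success alone would put D first, yet B delivers nearly the same success at a fraction of the cost; the frontier keeps both visible so the choice can depend on the deployment budget.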

The Five Deployment-Extension Tiers (T8–T12): These tiers are surfaced by the paper's analysis of the fourteen implementation studies and address dimensions specific to deployed agents.

  • T8 Deployment Infrastructure: This measures performance overhead related to the deployment stack, such as latency decomposition (how much time is spent on the model vs. network vs. tool execution), protocol overhead (e.g., MCP-stdio overhead), and the impact of infrastructure optimizations.

  • T9 Multi-Turn Dialog: This evaluates the agent's behavior over a conversation, not just a single turn. It includes metrics for cross-turn artifact reuse (does the agent remember and reuse results from previous turns?), per-turn cost dynamics, and dialog-level success rates.

  • T10 Reasoning-Mode Adaptivity: This assesses the agent's use of different reasoning modes (e.g., "extended thinking" or chain-of-thought). It measures the cost of reasoning (latency, tokens) and its benefit (quality improvement) on a per-rubric basis. It helps answer "when is it worth it to use expensive reasoning?"

  • T11 Knowledge Augmentation: This evaluates the agent's ability to use external knowledge. It includes metrics for retrieval recall, multi-hop reasoning depth, and the quality of the embedding index used for retrieval.

  • T12 Evidence Grounding & Verification: This is a critical tier for trustworthiness. It proposes judge-independent verification methods. Instead of relying solely on an LLM judge, it uses rule-based pipelines (e.g., checking if a condition classification matches a rule-based system) or DAG oracles (checking if the executed trajectory matches a pre-defined correct sequence of steps). This provides an anchor to detect drift in the LLM judge's performance.
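As a rough illustration of the judge-independent verification in T12, the sketch below checks an executed trajectory against ordering constraints from an oracle DAG. It is a simplification that covers only step presence and precedence, not the causality and timing checks attributed to ARE/Gaia2. The function name and the tool names are hypothetical.

```python
def check_trajectory(trajectory: list[str],
                     oracle_edges: list[tuple[str, str]]) -> list[str]:
    """Return a list of violations; an empty list means the trajectory passes.

    Each edge (before, after) requires that the first call of `before`
    happens earlier than the first call of `after`.
    """
    first_seen: dict[str, int] = {}
    for idx, step in enumerate(trajectory):
        first_seen.setdefault(step, idx)

    violations = []
    required = {step for edge in oracle_edges for step in edge}
    for step in sorted(required):
        if step not in first_seen:
            violations.append(f"missing step: {step}")
    for before, after in oracle_edges:
        if before in first_seen and after in first_seen:
            if first_seen[before] >= first_seen[after]:
                violations.append(
                    f"order violated: {before} must precede {after}")
    return violations


# Hypothetical oracle for a diagnosis scenario (tool names are illustrative).
edges = [("list_sensors", "get_history"), ("get_history", "detect_anomaly")]
good = ["list_sensors", "get_history", "detect_anomaly"]
bad = ["get_history", "list_sensors"]
print(check_trajectory(good, edges))  # []
print(check_trajectory(bad, edges))
```

Because the check is deterministic, its verdict does not move when the LLM judge changes, which is what makes it usable as an anchor.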

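The following figure (Figure 2 from the original paper) illustrates the coverage of various benchmarks across these twelve tiers, highlighting the gap the paper aims to fill.

[Figure 2 (img-1.jpeg): Chart showing the coverage of different benchmarks across the core capability tiers (T1–T7) and the deployment-extension tiers (T8–T12). Shaded blocks of varying intensity mark each benchmark's coverage per tier, showing that almost all existing benchmarks lack coverage of the deployment-extension tiers.]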

Step 2: Operationalizing Predictive Validity (The Three OOD Criteria)

Once the measurement space is defined, the next step is to define how to calculate the "predictive validity" score. The paper proposes three operationalizations of Out-of-Distribution (OOD) shift, which serve as tests for rank stability.

Criterion A: Held-Out Scenarios (Mild Shift) This is the simplest test. The benchmark's data is split into a "public" subset and a "held-out" subset using a stratified random split.

  • Procedure: Rank all agent configurations on the public subset. Then, evaluate the same configurations on the held-out subset and rank them again. Calculate the Spearman rank correlation ($\rho$) between the two rankings.
  • Interpretation: A high correlation (e.g., $\rho > 0.85$) suggests the ranking is stable even on unseen data from the same distribution. A low correlation indicates rank instability. A pass is uninformative (it just means the benchmark is easy), but a failure is damning evidence that the public ranking is not to be trusted.
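The Criterion A procedure can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: scores are stored as `scores[config][scenario_id]`, each scenario has a stratum label (for example an asset class), configurations are ranked by mean score, and ties share the average rank. All names and sample values are hypothetical.

```python
import random
from collections import defaultdict


def average_ranks(values):
    """Descending ranks (1 = best); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one ranking is entirely tied
    return cov / (vx * vy) ** 0.5


def stratified_split(scenario_strata, held_out_fraction, seed):
    """Split scenario ids into (public, held_out), sampling per stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for sid, stratum in scenario_strata.items():
        by_stratum[stratum].append(sid)
    public, held_out = [], []
    for ids in by_stratum.values():
        ids = sorted(ids)
        rng.shuffle(ids)
        cut = round(len(ids) * held_out_fraction)
        held_out += ids[:cut]
        public += ids[cut:]
    return public, held_out


def criterion_a(scores, scenario_strata, held_out_fraction=0.5, seed=0):
    """Return rho between public-subset and held-out-subset rankings."""
    public, held_out = stratified_split(scenario_strata, held_out_fraction, seed)
    configs = sorted(scores)
    pub = [sum(scores[c][s] for s in public) / len(public) for c in configs]
    held = [sum(scores[c][s] for s in held_out) / len(held_out) for c in configs]
    return spearman(pub, held)


# Hypothetical data: three configurations, four scenarios, two strata.
strata = {"s1": "chiller", "s2": "chiller", "s3": "pump", "s4": "pump"}
scores = {
    "cfg_a": {"s1": 0.9, "s2": 0.8, "s3": 0.4, "s4": 0.7},
    "cfg_b": {"s1": 0.6, "s2": 0.5, "s3": 0.9, "s4": 0.6},
    "cfg_c": {"s1": 0.3, "s2": 0.2, "s3": 0.8, "s4": 0.1},
}
print(criterion_a(scores, strata))
```

In practice one would repeat the split over many seeds and report the distribution of $\rho$, since a single split of a small benchmark is itself noisy.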

Criterion B: Cross-Subset Transfer (Moderate Shift) This is a more realistic test for deployment, where the target domain might be different from the source domain.

  • Procedure: If the benchmark has distinct subsets (e.g., AssetOpsBench has subsets for HVAC chillers, compressors, hydraulic pumps, etc.), perform a "leave-one-out" cross-validation. With $K$ subsets, for each subset $k$, rank the agents on the other $K-1$ subsets and then test that ranking on the held-out subset $k$. This produces a matrix of rank correlations (e.g., a $6 \times 6$ matrix for six subsets).
  • Interpretation: This tests the question: "If I rank agents on chillers, will that ranking transfer to hydraulic pumps?" It measures the generalizability of the ranking across different but related task distributions.
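A minimal sketch of the leave-one-subset-out reading of Criterion B is shown below. It assumes per-subset mean scores are already available as `subset_means[subset][config]` and pools the source subsets with an unweighted mean, which is an assumption of this sketch. It returns one correlation per held-out subset; a pairwise source-to-target matrix would be a straightforward variation. Names and numbers are hypothetical.

```python
def average_ranks(values):
    """Descending ranks (1 = best); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one ranking is entirely tied
    return cov / (vx * vy) ** 0.5


def leave_one_subset_out(subset_means):
    """Map each held-out subset to rho(rank on the rest, rank on it)."""
    subsets = sorted(subset_means)
    configs = sorted(subset_means[subsets[0]])
    result = {}
    for held in subsets:
        rest = [s for s in subsets if s != held]
        source = [sum(subset_means[s][c] for s in rest) / len(rest)
                  for c in configs]
        target = [subset_means[held][c] for c in configs]
        result[held] = spearman(source, target)
    return result


# Hypothetical per-subset mean scores for three configurations.
subset_means = {
    "chiller":    {"A": 0.9, "B": 0.7, "C": 0.5},
    "compressor": {"A": 0.8, "B": 0.6, "C": 0.4},
    "pump":       {"A": 0.6, "B": 0.9, "C": 0.3},
}
transfer = leave_one_subset_out(subset_means)
for subset, rho in transfer.items():
    print(subset, round(rho, 2))
```

In this toy data the pump subset swaps the top two configurations, so every held-out correlation falls to 0.5 even though each subset's own ranking looks clean.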

Criterion C: Adversarial Perturbation (Strongest Shift) This is the most rigorous test, designed to probe the agent's robustness to superficial changes in the input that should not affect the correct answer.

  • Procedure: Take the base scenarios and create semantically equivalent but syntactically different versions. The paper suggests four classes of perturbation:
    1. Paraphrase: Rewrite the query using different words while preserving the intent.
    2. Identifier Renaming: Change the names of entities (e.g., "Chiller 6" becomes "Unit-CHX-06").
    3. Time-Window Shifting: Change temporal references (e.g., "last week" becomes "two weeks ago to one week ago").
    4. Distractor Injection: Append irrelevant operational context to the query.
  • Interpretation: A robust agent should perform equivalently on the base and perturbed versions. A large drop in score or a change in ranking indicates that the agent is overfitting to the specific phrasing of the benchmark and lacks true understanding or robustness.
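Two of the four perturbation classes are mechanical enough to sketch directly: identifier renaming and distractor injection. Paraphrasing and time-window shifting need semantic handling (and a check that the expected answer is unchanged), so they are left out here. The helper names, the alias mapping, and the example query are hypothetical.

```python
import re


def rename_identifiers(query: str, mapping: dict[str, str]) -> str:
    """Replace each entity name with its alias (whole phrase, case-sensitive).

    Renames are applied one after another, so aliases should not contain
    other names from the mapping.
    """
    for old, new in mapping.items():
        # A function replacement avoids backslash handling in the alias.
        query = re.sub(rf"\b{re.escape(old)}\b", lambda m, new=new: new, query)
    return query


def inject_distractor(query: str, distractor: str) -> str:
    """Append irrelevant operational context after the original query."""
    return f"{query.rstrip()} {distractor}"


def score_drop(base_scores: list[float], perturbed_scores: list[float]) -> float:
    """Mean score on base scenarios minus mean on perturbed (positive = worse)."""
    base = sum(base_scores) / len(base_scores)
    perturbed = sum(perturbed_scores) / len(perturbed_scores)
    return base - perturbed


base_query = "What was the average load on Chiller 6 last week?"
renamed = rename_identifiers(base_query, {"Chiller 6": "Unit-CHX-06"})
noisy = inject_distractor(base_query, "The cafeteria fridge was serviced on Monday.")
print(renamed)
print(noisy)
```

The ground truth for each perturbed scenario must be carried over (or remapped, in the case of renamed identifiers) so that any score drop reflects the agent rather than the grader.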

Step 3: Calculating the Predictive-Validity (PV) Score

The paper proposes a composite score, $\mathrm{PV}(c)$, for a configuration $c$, which combines its in-sample performance with a penalty for OOD instability.

The formula for the Predictive-Validity score is:

$$\mathrm{PV}(c) = \alpha \bar{Y}_c - \beta \sigma_{Y_c, \mathrm{OOD}} - \gamma \mathrm{IQR}(Y_c)$$

Where:

  • $\bar{Y}_c$ is the in-sample mean score for configuration $c$ (e.g., its average Pass@1 on the public set).

  • $\sigma_{Y_c, \mathrm{OOD}}$ is the cross-OOD-criterion standard deviation of the rank position for configuration $c$. This measures how unstable the configuration's ranking is across the three OOD criteria (A, B, and C). A high standard deviation means the ranking fluctuates wildly depending on the test set, which is bad.

  • $\mathrm{IQR}(Y_c)$ is the interquartile range of the per-scenario scores for configuration $c$. This measures the consistency of the configuration's performance across individual scenarios. A high IQR means the agent is very inconsistent (aces some scenarios, fails others), which is undesirable for deployment.

  • $\alpha, \beta, \gamma$ are weights that determine the relative importance of each component.

The logic of the formula is to reward high in-sample performance ($\alpha \bar{Y}_c$) but penalize the configuration if its ranking is unstable across OOD tests ($\beta \sigma_{Y_c, \mathrm{OOD}}$) or if its performance is inconsistent across scenarios ($\gamma \mathrm{IQR}(Y_c)$).

The paper does not finalize the weights ($\alpha, \beta, \gamma$) in this position paper. Instead, it proposes that they be fit on Criterion-A holdouts to maximize the Spearman correlation between the PV rank and the ranks obtained from the more stringent Criterion-B and Criterion-C tests. This is a meta-optimization step: find the weights that make the PV score the best predictor of OOD performance.
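A minimal sketch of the PV computation for one configuration follows. Since the paper leaves the weights open, the values used here are placeholders. The sketch also assumes a population standard deviation over the rank positions from the three OOD criteria and inclusive quartiles for the IQR; neither choice is specified in the source. Note that the mean and the rank deviation live on different scales, so the weights also have to absorb that difference.

```python
from statistics import mean, pstdev, quantiles


def pv_score(scenario_scores, ood_ranks, alpha=1.0, beta=1.0, gamma=1.0):
    """PV(c) = alpha * mean - beta * std(OOD rank positions) - gamma * IQR."""
    in_sample_mean = mean(scenario_scores)
    rank_std = pstdev(ood_ranks)  # spread of rank across criteria A, B, C
    q1, _, q3 = quantiles(scenario_scores, n=4, method="inclusive")
    return alpha * in_sample_mean - beta * rank_std - gamma * (q3 - q1)


# Hypothetical per-scenario scores shared by two configurations.
scenario_scores = [0.2, 0.4, 0.6, 0.8, 1.0]

# Same in-sample scores, different rank stability across the OOD criteria.
unstable = pv_score(scenario_scores, ood_ranks=[2, 4, 6], beta=0.1, gamma=0.5)
stable = pv_score(scenario_scores, ood_ranks=[3, 3, 3], beta=0.1, gamma=0.5)
print(round(unstable, 4), round(stable, 4))
```

Both configurations share the same in-sample mean of 0.6, so an aggregate-score leaderboard would tie them, while the PV score separates them by rank stability alone.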

Step 4: Falsification of the Position

The paper makes its position falsifiable by defining explicit conditions that must hold for the position to be considered well-supported. These conditions serve as a research agenda for future empirical validation.

The position is considered refuted if a controlled experiment produces the following outcomes:

  1. The Spearman rank correlation between in-sample and OOD rankings, $\rho$, is not less than 0.85 across at least two of the three OOD criteria. If generalization is consistently high ($\rho \geq 0.85$), then the concern about rank instability is moot.

  2. The top-3 in-sample configurations do not leave the out-of-sample top-5 in $\geq 10\%$ of holdout splits. If the top performers are stable, then the current leaderboard is adequate.

  3. The Pearson correlation between mean performance and OOD variance, $\rho_{\text{Pearson}}$, is not greater than 0.2. This condition checks whether high-performing configurations are disproportionately unstable. If there is no such link, then picking the highest mean score involves no performance-risk trade-off.

  4. The top-10 configurations ranked by the proposed PV score do not differ meaningfully from the top-10 ranked by the in-sample mean. Specifically, the position requires the Jaccard overlap between the two top-10 lists to be $< 0.85$. If the lists are almost identical, then the new methodology offers no different deployment guidance.

The paper notes that the first condition is already partially supported by existing evidence from a prior competition (CODS-2025), where the execution-track public-to-hidden Spearman correlation was $\rho = -0.13$, well below the 0.85 threshold. The remaining conditions require the controlled study outlined in the paper's future work section.
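Two of the falsification checks reduce to simple set operations on ranked lists, sketched below. The sketch assumes each ranking is a list of configuration names ordered best-first, and it reads "leave the top-5" as "at least one of the in-sample top-3 falls outside the out-of-sample top-5"; that reading is an interpretation, not something the source spells out. All names and data are hypothetical.

```python
def jaccard_top_k(order_a, order_b, k):
    """Jaccard overlap between the top-k sets of two best-first rankings."""
    a, b = set(order_a[:k]), set(order_b[:k])
    return len(a & b) / len(a | b)


def top3_exit_rate(in_sample_order, holdout_orders):
    """Share of holdout splits where an in-sample top-3 config leaves the top-5."""
    top3 = set(in_sample_order[:3])
    exits = sum(1 for order in holdout_orders if not top3 <= set(order[:5]))
    return exits / len(holdout_orders)


# Hypothetical rankings over eight configurations.
in_sample = ["A", "B", "C", "D", "E", "F", "G", "H"]
holdouts = [
    ["B", "A", "C", "D", "E", "F", "G", "H"],  # top-3 stays inside the top-5
    ["D", "E", "F", "G", "A", "B", "C", "H"],  # B and C fall out of the top-5
]
pv_order = ["A", "B", "D", "C", "E", "F", "G", "H"]

exit_rate = top3_exit_rate(in_sample, holdouts)
overlap = jaccard_top_k(in_sample, pv_order, k=3)
print(exit_rate, overlap)  # 0.5 0.5
```

Under the thresholds quoted above, an exit rate of at least 0.10 and a top-10 Jaccard overlap below 0.85 would count in favor of the paper's position.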

5. Experimental Setup

5.1. Datasets

The experiments and analysis in the paper are not based on a single new dataset but rather on a synthesis of findings from multiple sources, all anchored around the AssetOpsBench benchmark.

  • AssetOpsBench: This is the primary benchmark used. It is an industrial-agent benchmark for asset operations and maintenance. It contains scenarios involving tasks like diagnosing HVAC chillers, compressors, and hydraulic pumps. The benchmark includes the necessary MCP server tooling and Docker artifacts. The original version has 141 scenarios, but the fourteen implementation studies extended it with new asset classes (batteries, transformers, visual inspection) and new scenarios.
  • Extended Scenarios: The fourteen implementation studies created new data to test specific dimensions.
    • Multi-modal Visual Extension: One study added 22 scenarios involving visual inspection of assets (pumps, transformers, motors, wind-turbine blades) using public image datasets.

    • New Asset Classes: Other studies added scenarios for battery analytics (using NASA PCoE cycling data) and smart grid transformers.

    • Adversarial Perturbations: The paper proposes creating perturbed versions of the scenarios (paraphrases, renamed identifiers, shifted time windows, injected distractors) for Criterion C testing, though this is part of the proposed future work.

The choice of AssetOpsBench is strategic. It is a serious, deployment-oriented benchmark with high community traction (over 1,600 GitHub stars), cross-venue uptake (accepted at top conferences), and demonstrated extensibility. Its industrial domain is chosen because the stakes of mis-evaluation are concrete (operator time, capital, safety), making the problem of rank instability particularly relevant.

5.2. Evaluation Metrics

The paper proposes and analyzes a wide range of metrics, organized into the twelve-tier apparatus. Here are the key metrics:

  1. Conceptual Definition: Pass@1 (or Pass Rate) is the most fundamental metric. It measures whether the agent's first attempt or final output is correct according to a ground truth or judge. It provides a binary success/fail signal for each scenario. Mathematical Formula: For a set of NN scenarios, let yiy_i be 1 if the agent passes scenario ii and 0 otherwise. The Pass@1 rate is: Pass@1=1Ni=1Nyi \text{Pass@1} = \frac{1}{N} \sum_{i=1}^{N} y_i Symbol Explanation: NN is the total number of scenarios. yiy_i is the binary outcome for the ii-th scenario.

  2. Conceptual Definition: Spearman's Rank Correlation Coefficient (ρ\rho) is a non-parametric measure of the strength and direction of the association between two ranked variables. In this paper, it is used to measure the stability of rankings between different test sets (e.g., public vs. hidden). Mathematical Formula: Given two rankings of nn items, convert the raw scores to ranks RXiR_{X_i} and RYiR_{Y_i}. The Spearman correlation is the Pearson correlation between the rank variables: ρ=16di2n(n21) \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} where di=RXiRYid_i = R_{X_i} - R_{Y_i} is the difference between the two ranks for the ii-th item. Symbol Explanation: nn is the number of items being ranked. did_i is the difference in ranks for the ii-th item. ρ\rho ranges from -1 (perfectly reversed) to 1 (identical rankings).

  3. Conceptual Definition: Condition Agreement Rate (CAR) is a judge-independent metric used in the "Condition Insight" framework. It measures the agreement between an LLM's classification and a parallel rule-based pipeline. Mathematical Formula: For a set of classifications, CAR is the number of agreements divided by the total number of classifications. CAR=Number of AgreementsTotal Classifications \text{CAR} = \frac{\text{Number of Agreements}}{\text{Total Classifications}} Symbol Explanation: This is a simple accuracy metric, but its key feature is that one of the "judges" is a deterministic rule-based system, not an LLM. This provides an anchor against judge drift.

  4. Conceptual Definition: Krippendorff's Alpha ($\alpha$) is a statistical measure of inter-rater agreement. It is used in the paper to measure the reliability of the LLM-as-a-judge. Mathematical Formula: In its general form, it is defined as:

     $$\alpha = 1 - \frac{D_o}{D_e}$$

     where $D_o$ is the observed disagreement and $D_e$ is the disagreement expected by chance. Symbol Explanation: $\alpha = 1$ indicates perfect agreement and $\alpha = 0$ indicates agreement no better than chance; negative values indicate systematic disagreement. The paper reports $\alpha = 0.61$ for the LLM-as-judge, below the $\alpha \in [0.74, 0.82]$ range for human-human agreement, indicating the LLM judge is less reliable.

  5. Conceptual Definition: Predictive-Validity Score (PV) is the composite score proposed by the paper to rank configurations. Mathematical Formula:

     $$\mathrm{PV}(c) = \alpha \bar{Y}_c - \beta \sigma_{Y_c, \mathrm{OOD}} - \gamma \mathrm{IQR}(Y_c)$$

     Symbol Explanation: As detailed in the methodology section, this combines the in-sample mean ($\bar{Y}_c$) with penalties for OOD rank instability ($\sigma_{Y_c, \mathrm{OOD}}$) and score inconsistency ($\mathrm{IQR}(Y_c)$). Here $\alpha$, $\beta$, and $\gamma$ are weights; this $\alpha$ is unrelated to Krippendorff's $\alpha$ above.
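The first three metrics above are simple enough to sketch directly. The following is a minimal illustration in plain Python, not code from the paper; the function names and the example scores are hypothetical. Spearman's $\rho$ is computed as the Pearson correlation of average ranks, so it stays valid when scores are tied (as on a saturated leaderboard), where the $d_i^2$ shortcut does not apply.

```python
from statistics import mean


def pass_at_1(outcomes):
    """Fraction of scenarios passed; outcomes are the 0/1 values y_i."""
    return sum(outcomes) / len(outcomes)


def average_ranks(scores):
    """Rank scores 1..n (1 = lowest); tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions in the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed scores.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def condition_agreement_rate(llm_labels, rule_labels):
    """Share of items where the LLM label equals the rule-based label."""
    agreements = sum(a == b for a, b in zip(llm_labels, rule_labels))
    return agreements / len(llm_labels)


# Hypothetical scores for four teams on a public and a hidden split.
public = [0.9, 0.8, 0.7, 0.6]
hidden = [0.5, 0.7, 0.6, 0.8]
print(pass_at_1([1, 0, 1, 1]))
print(spearman_rho(public, hidden))
print(condition_agreement_rate(["fault", "ok", "ok"], ["fault", "ok", "fault"]))
```

In this toy example the public and hidden rankings are nearly reversed, so the correlation is strongly negative even though every team has a plausible public score.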

5.3. Baselines

The paper's "baselines" are not other specific models but rather the current evaluation practices that the paper critiques. The paper compares its proposed "predictive validity" approach against the status quo of aggregate-score leaderboards.

The paper uses the following as evidence for the limitations of the baseline approach:

  • Prior Competition Results: It cites the CODS-2025 AssetOpsBench challenge results as a baseline. The public leaderboard rankings (based on aggregate scores) showed a Spearman correlation of $\rho = -0.13$ with the hidden evaluation rankings, demonstrating the failure of the baseline method to predict deployment performance.

  • Existing Benchmarks: It compares the coverage of its twelve-tier apparatus against seven prior benchmarks (AssetOpsBench, MCP-Bench, MCP-Universe, ARE/Gaia2, TaskBench, Exgentic, Condition Insight). The analysis shows that no single prior benchmark reports more than four or five of the twelve tiers, highlighting the underspecification of the baseline methods.

The paper also uses the fourteen implementation studies as a form of "convergent architectural sensitivity" analysis. Each study modifies one variable of a base agent (e.g., turning reasoning on/off, changing retrieval strategy) and shows that while the aggregate score might not change much, the performance on specific tiers (like latency, tool hygiene, or per-rubric clarity) changes dramatically. This serves as a series of micro-baselines demonstrating that the aggregate score hides critical information.

6. Results & Analysis

6.1. Core Results Analysis

The paper's results are not from a single new experiment but from a synthesis of existing evidence and a convergent analysis of fourteen implementation studies. The core findings support the paper's position, although the predictive-validity claim itself has not yet been tested in a controlled experiment (see Section 7.2).

  1. Validation of Rank Instability: The paper provides strong evidence that aggregate-score rankings are unstable. The most direct evidence comes from the CODS-2025 competition retrospective, where the execution track showed a public-to-hidden Spearman correlation of $\rho = -0.13$, which is statistically indistinguishable from zero. This means the public leaderboard was effectively random with respect to the hidden test set. Even the planning track, which had a positive correlation ($\rho = 0.69$), saw mean private scores fall 11.3 points below public scores and showed saturation (only 8 unique values across 20 teams). This confirms that optimizing for a public leaderboard does not guarantee robust performance.

  2. Validation of the Twelve-Tier Apparatus: The fourteen implementation studies provide convergent evidence that the twelve tiers are non-redundant and deployment-relevant. Each study modified a single architectural variable and observed changes on specific tiers that were invisible in the aggregate score.

    • Reasoning Mode: Toggling "extended thinking" improved clarity by 31 percentage points and reduced hallucinations but increased planning latency by 41.9%. The aggregate score hid this trade-off.
    • Retrieval Strategy: Comparing single-pass RAG vs. a Knowledge Plugin showed a trade-off: RAG was faster (8.9-20s) but less accurate (50-68%), while the Plugin was slower (114-146s) but more accurate (~90%). An aggregate score of accuracy would favor the Plugin, but a deployment with latency constraints would prefer RAG. The twelve-tier apparatus (T5 Cost-Efficiency) exposes this trade-off.
    • Evaluation Methodology: Studies showed that LLM-as-a-judge reliability is low ($\alpha = 0.61$) compared to humans. Judge-independent methods like CAR (Condition Agreement Rate) and DAG oracles provided stable, high-precision verification (0.99 precision), validating the need for Tier T12.
  3. Validation of Predictive Validity: The paper's proposed PV score directly addresses the issues found in the baselines. By penalizing OOD variance ($\sigma_{Y_c, \mathrm{OOD}}$), it would down-weight the configurations that overfit to the public leaderboard (the ones responsible for the rank instability). By including the IQR term, it penalizes inconsistent agents. The paper argues that ranking by this PV score would provide more deployment-relevant guidance than ranking by in-sample mean.

The following figure (Figure 3 from the original paper) summarizes the headline improvement reported by each of the fourteen teams against their own baseline. The logarithmic scale highlights the wide range of improvements possible when optimizing for specific tiers (like latency or cost) rather than just a single aggregate score.

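img-2.jpeg: A chart showing each team's headline improvement relative to its own baseline, on a logarithmic scale, with colors distinguishing the extension axes. Data for G9, G19, and G30 are omitted; their non-magnitude results are detailed in Table 1.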

The following figure (Figure 4 from the original paper) provides an evidence heatmap, mapping the fourteen extension studies to four high-level evaluation dimensions (robustness, trustworthiness, generalization, and longitudinal/drift behavior). It visually demonstrates that all four dimensions receive convergent evidence from multiple independent teams, reinforcing the paper's claim that these are critical, orthogonal dimensions.

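img-3.jpeg: A table showing the fourteen extension studies across four evaluation dimensions. It lists the name of each method and its corresponding robustness (Rob.), trustworthiness (Trust.), generalization (Gen.), and longitudinal (Long.) entries, supporting comparison across the techniques.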

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper, which details the setup and magnitude of improvement for each of the fourteen implementation studies.

The following are the results from Table 2 of the original paper, which summarizes the conclusion and forward-looking suggestion for each study.

6.3. Ablation Studies / Parameter Analysis

The fourteen implementation studies effectively serve as a massive, coordinated ablation study. Each study "ablates" or modifies one specific component of a base agent architecture and measures the impact.

  • Reasoning Mode Ablation: One study ablated the "extended thinking" mode. The result was a clear trade-off: quality improved (clarity +31pp, hallucination -7pp) but cost increased (latency +21.5%, planning latency +41.9%). This ablation shows that "reasoning" is not a free lunch and its value depends on the deployment context.

  • Retrieval Strategy Ablation: Another study ablated the retrieval strategy, swapping single-pass RAG for a multi-hop Knowledge Plugin. The ablation revealed a Pareto frontier: neither dominated the other. The Plugin was more accurate but much slower and more expensive. This ablation indicates that "retrieval strategy" must be a declared configuration property, as the right choice is deployment-dependent.

  • Infrastructure Ablation: Several studies ablated infrastructure components. One found that replacing MCP tool execution with text-RAG over telemetry evidence caused a catastrophic failure (pass rate 100% -> 20%). Another found that using a persistent server vs. stdio transport removed a massive latency floor. These ablations highlight that infrastructure choices (Tier T8) are not just implementation details but critical factors affecting success and cost.

  • Quantization Ablation: In the visual inspection extension, an ablation of quantization strategies found that a specific configuration (AWQ W4A16 with domain-specific calibration) improved both speed and accuracy, while a more aggressive "bundle" (FP8 KV-cache) caused a complete response collapse. This shows that optimization choices can have non-linear, even catastrophic, effects on performance, reinforcing the need for rigorous, multi-tier evaluation.
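The "neither dominated the other" claim in the retrieval ablation can be made concrete with a small Pareto-dominance check. This is an illustrative sketch, not the paper's analysis: the `Config` class and function names are hypothetical, the RAG and Plugin numbers are midpoints of the ranges reported in Section 6.1 (an assumption made for illustration), and the third configuration is invented only to show what a dominated point looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    accuracy: float   # higher is better
    latency_s: float  # lower is better


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_s <= b.latency_s
    strictly_better = a.accuracy > b.accuracy or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_front(configs):
    """Configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Midpoints of the reported ranges (assumed for illustration, not paper data points).
rag = Config("single-pass RAG", accuracy=0.59, latency_s=14.45)
plugin = Config("Knowledge Plugin", accuracy=0.90, latency_s=130.0)
# Hypothetical configuration, added only to demonstrate a dominated point.
slow_rag = Config("hypothetical slow RAG", accuracy=0.55, latency_s=40.0)

front = pareto_front([rag, plugin, slow_rag])
print([c.name for c in front])
```

Both real configurations stay on the front, so a single accuracy-only ranking cannot say which one a latency-constrained deployment should pick.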

The following figure (Figure 5 from the original paper) aggregates the published rank-correlation evidence, plotting point estimates with 95% confidence intervals against the falsification threshold of $\rho = 0.85$. It visually summarizes the empirical support for the paper's position: the execution track correlation is near zero, the planning track is positive but its upper bound hits the threshold, and cross-benchmark correlations are variable. This chart serves as a meta-ablation of the "ranking stability" property across different domains and benchmarks.

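img-4.jpeg: A chart showing public-to-hidden Spearman rank correlations. It includes the correlation coefficients of the different experimental groups with their confidence intervals; the dashed red line marks the falsification threshold of $\rho = 0.85$. Data come from the CODS planning and execution tracks and the proposed pilot studies.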
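To see how a point estimate and a sample size translate into an interval like those in the figure, here is a rough sketch using the Fisher z-transform. This is an assumption about method: the analysis does not say how the paper computed its intervals, and the Fisher approximation is only approximate for Spearman correlations at small $n$. The sample size $n = 20$ matches the 20 teams mentioned for the planning track and is assumed for the execution track.

```python
import math


def fisher_ci(rho, n, z_crit=1.959964):
    """Approximate 95% CI for a correlation via the Fisher z-transform."""
    z = math.atanh(rho)
    half_width = z_crit / math.sqrt(n - 3)  # standard error is 1/sqrt(n-3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


THRESHOLD = 0.85  # falsification threshold quoted in the text

# n = 20 is an assumption for the execution track (see lead-in).
for track, rho in [("execution", -0.13), ("planning", 0.69)]:
    low, high = fisher_ci(rho, n=20)
    print(
        f"{track}: rho={rho:+.2f}, CI=({low:.2f}, {high:.2f}), "
        f"contains zero: {low <= 0 <= high}, reaches threshold: {high >= THRESHOLD}"
    )
```

With so few teams the intervals are wide, which is why a moderately positive correlation can still have an upper bound near the threshold.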

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper concludes that the current practice of ranking LLM agents by aggregate scores on static leaderboards is fundamentally flawed for deployment decisions. The core issue is rank instability: a high score on a public benchmark does not guarantee a high rank on a private, out-of-distribution test set. The paper supports this with empirical evidence from a major competition and a synthesis of fourteen implementation studies.

The main contribution is a proposal for a new evaluation paradigm:

  1. Measure More Dimensions: Adopt a twelve-tier measurement apparatus that exposes deployment-relevant dimensions (tool hygiene, planning quality, cost, failure modes, etc.) that are currently hidden.

  2. Rank by Predictive Validity: Instead of ranking by in-sample mean, rank by a Predictive-Validity (PV) score. This score rewards high performance but penalizes configurations whose rankings are unstable across out-of-distribution tests (held-out, cross-subset, adversarial).

  3. Require Judge-Independent Verification: Every leaderboard should include a deterministic, rule-based verifier to provide an anchor against LLM-judge drift.

The paper provides a concrete, falsifiable research agenda to validate this position and proposes specific changes to leaderboard design (declared configuration columns, layered presentation) to operationalize its vision.

7.2. Limitations & Future Work

The authors are explicit about the limitations of their work:

  • Empirical Validation is Future Work: The paper is a position paper and synthesis. The predictive-validity claim has not been tested at scale by the authors. They have specified the experiment but not run it. The evidence is convergent architectural sensitivity, not a controlled randomized trial.

  • Domain Specificity: All evidence is drawn from the industrial asset operations domain (AssetOpsB). Whether the twelve-tier apparatus generalizes to other domains (scientific assistants, customer service) is an open question.

  • Tier Independence is Asserted, Not Tested: The paper claims the twelve tiers are roughly orthogonal but has not empirically tested this.

  • Industrial Deployment Validity Gap: The proposed criteria are still internal to the benchmark. There is no data yet linking framework rankings to real-world outcomes (incident reduction, false-alarm rate).

The primary future work is the controlled empirical validation of the predictive-validity position using the three OOD criteria and the proposed PV score formula. The authors also call for community artifacts: a shared rule pipeline for judge-independent verification and a reference adversarial-perturbation suite.

7.3. Personal Insights & Critique

This paper makes a profound and timely contribution to the field of AI evaluation. Its greatest strength is its rigorous grounding in real-world deployment concerns. By moving the discussion from "can we get a higher score?" to "can we trust this score?", it addresses a critical gap as AI systems move from research labs to production environments.

The synthesis of fourteen implementation studies is particularly powerful. It provides a level of evidence—convergent architectural sensitivity—that is often more convincing than a single, isolated experiment. The fact that independent teams, modifying different parts of the system, all converged on the same set of missing measurement dimensions is a strong argument for the twelve-tier apparatus.

The proposal of "predictive validity" as a ranking criterion is innovative and directly addresses the deployment question. It shifts the focus from absolute performance to relative stability, which is often what matters most when choosing a system for a specific, constrained environment. The formula for the PV score is a pragmatic first step, balancing performance with risk.

A potential critique is that the proposed methodology is significantly more complex and expensive than current leaderboards. Requiring OOD tests, judge-independent verification, and reporting on twelve tiers is a heavy lift for benchmark maintainers and participants. However, the authors rightly flag this as a field-organization concern and suggest community artifacts to mitigate the cost. The high stakes of industrial deployment, as highlighted by the paper, justify this increased cost.

Another area for improvement could be the exploration of the weights ($\alpha, \beta, \gamma$) in the PV score. The paper leaves this for future work, but different deployment contexts might require different weights (e.g., a safety-critical system might prioritize low IQR/consistency over raw mean performance). A "configurable PV score" could be a valuable extension.
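A configurable PV score of this kind might look like the sketch below. Several things here are assumptions rather than the paper's specification: $\sigma_{Y_c, \mathrm{OOD}}$ is taken to be the population standard deviation of a configuration's mean scores across OOD splits, the IQR is computed over in-sample per-scenario scores, and the two configurations, their scores, and the weight settings are all invented for illustration.

```python
from statistics import mean, pstdev, quantiles


def pv_score(in_sample, ood_means, alpha=1.0, beta=1.0, gamma=1.0):
    """PV = alpha * mean - beta * sigma_OOD - gamma * IQR.

    Assumptions: sigma_OOD is the population std. dev. of per-split OOD
    means; IQR is taken over the in-sample per-scenario scores.
    """
    q1, _, q3 = quantiles(in_sample, n=4)
    return alpha * mean(in_sample) - beta * pstdev(ood_means) - gamma * (q3 - q1)


def rank_configs(configs, **weights):
    """Configuration names sorted from highest to lowest PV score."""
    scores = {
        name: pv_score(data["in_sample"], data["ood_means"], **weights)
        for name, data in configs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: one configuration with a higher but erratic score,
# one with a slightly lower but stable score.
configs = {
    "leaderboard_tuned": {
        "in_sample": [0.95, 0.90, 0.92, 0.60, 0.98, 0.93, 0.91, 0.61],
        "ood_means": [0.80, 0.50, 0.40],
    },
    "steady": {
        "in_sample": [0.80, 0.82, 0.79, 0.81, 0.78, 0.80, 0.83, 0.77],
        "ood_means": [0.78, 0.76, 0.75],
    },
}

mean_only = rank_configs(configs, alpha=1.0, beta=0.0, gamma=0.0)
safety_critical = rank_configs(configs, alpha=1.0, beta=1.0, gamma=1.0)
print("mean only:      ", mean_only)
print("with penalties: ", safety_critical)
```

Setting $\beta = \gamma = 0$ recovers ranking by in-sample mean, so the status-quo leaderboard is one point in the weight space; turning the penalties on flips the order in this toy example.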

Overall, this paper provides a clear, actionable, and well-argued roadmap for the next generation of agent evaluation. Its insights are highly transferable to any domain where robust, reliable AI performance is more important than achieving a state-of-the-art score on a static benchmark.