UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems
TL;DR Summary
The UserSimCRS v2 toolkit addresses the scarcity of simulation-based evaluation resources for conversational recommender systems by incorporating an enhanced agenda-based user simulator, large language model-based simulators, broader CRS and dataset integration, and LLM-as-a-judge evaluation utilities, all demonstrated through a case study.
Abstract
Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade aligning the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, introduction of large language model-based simulators, integration for a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems". It focuses on presenting an upgraded toolkit for evaluating Conversational Recommender Systems (CRSs) using various simulation techniques.
1.2. Authors
The authors of this paper are:
- Nolwenn Bernard, affiliated with TH Köln, Köln, Germany.
- Krisztian Balog, affiliated with the University of Stavanger, Stavanger, Norway.

Their research backgrounds are likely in information retrieval, recommender systems, conversational AI, and natural language processing, given the paper's focus on CRSs, user simulation, and evaluation.
1.3. Journal/Conference
The paper was posted on 2025-12-04 (UTC). Based on the provided link (https://arxiv.org/abs/2512.04588), it is a preprint published on arXiv, an open-access repository for electronic preprints, primarily in the fields of mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. As a preprint, it has not yet undergone formal peer review in a journal or conference. However, arXiv is a highly influential platform for disseminating early research findings in the academic community.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
Resources for simulation-based evaluation of Conversational Recommender Systems (CRSs) are scarce. The UserSimCRS toolkit was initially introduced to address this gap. This paper presents UserSimCRS v2, a significant upgrade that aligns the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, the introduction of large language model (LLM)-based simulators, integration for a wider range of CRSs and datasets, and new LLM-as-a-Judge evaluation utilities. These extensions are demonstrated in a practical case study.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2512.04588
- PDF Link: https://arxiv.org/pdf/2512.04588v1.pdf
- Publication Status: This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the scarcity of robust, open-source resources for simulation-based evaluation of Conversational Recommender Systems (CRSs). CRSs are designed to assist users in discovering items that match their needs through interactive, multi-turn dialogues. Despite their growing interest, evaluating CRSs remains a significant challenge.
Why this problem is important:
- Limitations of Traditional Evaluation: Offline test collections, while common, fail to capture the dynamic, interactive nature of CRSs. Human evaluations are often employed but suffer from critical drawbacks such as poor reproducibility, high cost, and limited scalability.
- Benefits of User Simulation: User simulation offers a valuable alternative or preliminary step. It allows researchers to refine and pre-select high-performing systems before costly human evaluations, thereby optimizing resource usage.
- Gap in Existing Toolkits: While comprehensive toolkits exist for building CRSs (e.g., CRSLab, RecWizard), they typically lack robust and reusable resources for simulation. Toolkits for task-oriented dialogue systems include user simulators, but these are ill-suited for the recommendation domain, as they don't model crucial recommendation-specific constructs (e.g., assessing items based on historical preferences) and often lack support for standard recommendation benchmark datasets (e.g., ReDial, INSPIRED).
- Limitations of UserSimCRS v1: The original UserSimCRS toolkit (v1) provided an agenda-based user simulator and basic metrics, but it was limited. It lacked the generative flexibility of state-of-the-art LLM-based simulators, had basic components for its agenda-based simulator, didn't explicitly define information needs, lacked support for popular benchmark datasets, and offered a limited selection of evaluation measures.
- Interoperability Challenge: A major practical hurdle for adoption is the interoperability between a simulator and various CRSs, which are often implemented in different frameworks and trained on diverse datasets.

The paper's entry point, or innovative idea, is to significantly upgrade the existing UserSimCRS toolkit to address these limitations. It aims to create a more holistic and modernized framework that aligns with current state-of-the-art research in user simulation and CRS evaluation, particularly by incorporating Large Language Models (LLMs).
2.2. Main Contributions / Findings
The primary contributions of UserSimCRS v2 are:
- Enhanced Agenda-based Simulator: The classic agenda-based user simulator is significantly upgraded by incorporating LLM-based components for dialogue act extraction (Natural Language Understanding, NLU) and natural language generation (NLG), alongside a revised dialogue policy guided by the user's information need.
- Introduction of LLM-based Simulators: The toolkit now includes two end-to-end LLM-based user simulators, a single-prompt simulator and a dual-prompt simulator, reflecting the current state of the art in user simulation.
- Wider CRS and Dataset Integration: A new communication interface is introduced to integrate commonly used CRS models available in CRS Arena. Furthermore, a unified data format is provided, with conversion and LLM-powered augmentation tools for widely-used benchmark datasets such as ReDial, INSPIRED, and IARD.
- Advanced Conversational Quality Evaluation: The toolkit expands its evaluation capabilities with a new LLM-as-a-Judge utility that assesses conversational quality across five aspects: recommendation relevance, communication style, fluency, conversational flow, and overall satisfaction.
- Demonstration through Case Study: The paper demonstrates these extensions in a case study on movie recommendation that evaluates a selection of CRSs using the different types of user simulators and datasets now supported in UserSimCRS v2.

The key findings from the case study, though illustrative of the toolkit's capabilities rather than definitive CRS performance benchmarks, highlight several important points:
- Simulator Disagreement: Different simulators (e.g., ABUS vs. LLM-DP) can significantly disagree on the ranking of CRS performance, and even on the magnitude of performance when they agree on the top system.
- Distinct Simulator Characteristics: ABUS tends to be more "opinionated", with a wider range of scores, while LLM simulators provide scores within a more compressed range. LLM-DP often yields slightly higher average scores than LLM-SP.
- Dataset Dependence: CRS performance is highly dependent on the dataset used for evaluation, with some systems showing extreme variability and others high consistency across datasets.

These findings underscore the importance of selecting appropriate simulators and datasets for CRS evaluation and open up new research directions enabled by UserSimCRS v2.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following fundamental concepts:
- Conversational Recommender Systems (CRSs): At its core, a CRS is a system that helps users find items (e.g., movies, products) through natural language dialogue. Unlike traditional recommender systems that might just show a list of items, CRSs engage in a multi-turn conversation to understand user preferences, clarify needs, and provide recommendations. They integrate elements of natural language processing (NLP), dialogue management, and recommendation algorithms. The interaction is dynamic; the CRS might ask follow-up questions, accept user feedback, or explain recommendations.
- User Simulation: This refers to the process of using automated software agents (simulators) to mimic the behavior of human users. In the context of CRSs, a user simulator interacts with the CRS as if it were a real person, providing inputs and responding to the system's utterances. The purpose is to evaluate the CRS without needing actual human testers, which saves time and money and allows for more reproducible and scalable evaluations. While simulations may not perfectly replicate human behavior, they serve as a valuable preliminary step before human evaluation.
- Agenda-based User Simulator (ABUS): This is a traditional, modular approach to user simulation, often rule-based. It operates with a predefined "agenda" of goals or tasks that the simulated user needs to achieve (e.g., finding a movie of a specific genre). As illustrated in Figure 1a (from the original paper), an agenda-based user simulator typically consists of three main components:
  - Natural Language Understanding (NLU): This component processes the incoming CRS utterance (text) and converts it into a structured, machine-understandable representation, often in the form of dialogue acts.
  - Dialogue Policy: Based on the current dialogue state (the history of the conversation and the user's progress towards their agenda) and the output of the NLU, this component decides the next action or dialogue act the user simulator should perform.
  - Natural Language Generation (NLG): This component takes the structured dialogue act chosen by the dialogue policy and converts it into a natural language text utterance that is then sent back to the CRS.
  Optional components, like user modeling (to store preferences) and external resources (e.g., web, datasets), can also be involved.
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on deep neural networks (like Transformers), that have been trained on vast amounts of text data. They are capable of understanding, generating, and processing human language with remarkable fluency and coherence. In this paper, LLMs are used in several ways:
  - As components within an agenda-based simulator (e.g., for NLU and NLG) to provide more diverse and human-like text processing.
  - As end-to-end user simulators that directly generate user responses given the conversation history and a prompt.
  - As LLM-as-a-Judge evaluators to assess the quality of conversations, mimicking human evaluators.
- Dialogue Acts: A dialogue act is a linguistic unit that describes the communicative function or intention of an utterance in a dialogue. Instead of just raw text, a dialogue act provides a structured representation. For example, a user's utterance like "I'm looking for a sci-fi movie" could be represented as Request(genre="sci-fi", item_type="movie"). It often consists of an intent (e.g., Request, Confirm, Inform) and a set of slot-value pairs (e.g., genre="sci-fi").
- Information Need: In the context of recommendation, a user's information need represents their underlying goal, preferences, and constraints for what they are looking for. It guides the user simulator's behavior throughout the conversation. The paper defines it as a set of constraints (C) and requests (R), in addition to target items. Constraints are conditions the desired item must meet (e.g., genre = pop). Requests are pieces of information the user wants to know about the item (e.g., release_year = ?). Target items are the specific items the user is looking for.
- LLM-as-a-Judge: This is an evaluation paradigm where an LLM is used to assess the quality of system outputs, often in a conversational context. Instead of human evaluators, the LLM is given criteria or rubrics and scores the conversation or specific aspects of it. While efficient, it is important to acknowledge potential biases and limitations compared to human judgment.

The following figure (Figure 1 from the original paper) shows the two architectures for user simulators discussed above:
Figure 1 (schematic from the original paper): the two user simulator architectures, agenda-based (top) and end-to-end (bottom). The agenda-based architecture includes natural language understanding and generation modules, a dialogue manager, and additional modules (e.g., user modeling), while the end-to-end architecture collapses these into a single internal process; both can draw on external resources (e.g., the web and datasets).
3.2. Previous Works
The paper contextualizes UserSimCRS v2 by referencing several prior works, primarily highlighting gaps they aim to fill or building upon existing foundations:
- UserSimCRS v1 [1]: The direct predecessor to this work. UserSimCRS was initially introduced by Afzali et al. [1] as a toolkit specifically for simulation-based evaluation of CRSs. It provided an agenda-based user simulator and basic evaluation metrics. However, as noted in the paper, its limitations included a restricted selection of simulators (only agenda-based), reliance on supervised NLU and template-based NLG requiring training data, lack of an explicit information need definition, limited support for benchmark datasets, and a narrow set of evaluation measures. UserSimCRS v2 directly addresses these limitations.
- CRS Building Toolkits (e.g., CRSLab [43], RecWizard [39], FORCE [25]): These toolkits are designed to facilitate the development of CRSs, supporting various architectures such as deep neural networks (CRSLab), rule-based systems (FORCE), or LLM-based CRSs (RecWizard). While they offer interfaces for direct interaction and some offline component-level evaluation (e.g., BLEU for NLG), the paper points out that they "generally lack robust and reusable resources for simulation." This gap motivates the existence of UserSimCRS.
- Task-Oriented Dialogue System Toolkits (e.g., ConvLab [21, 45, 46], PyDial [31]): These comprehensive toolkits for dialogue systems (a generalization of CRSs to other tasks) do include user simulators for training and evaluation. However, the paper argues they are "ill-suited for the recommendation domain" because they are not designed to model recommendation-specific constructs (like assessing item preferences based on historical interactions) and often lack support for standard recommendation benchmark datasets such as ReDial [22] and INSPIRED [15].
- LLM-based User Simulation Research [20, 35-37]: The paper acknowledges a decisive shift in research focus towards LLM-based user simulators as the state of the art. These models offer advanced natural language understanding and generation, leading to more fluent, diverse, and human-like interactions compared to rule-based approaches. UserSimCRS v2 directly integrates this trend by introducing LLM-based simulators. For instance, the single-prompt simulator is inspired by Terragni et al. [29] on in-context learning user simulators for task-oriented dialogue systems.
- Existing Simulation-based Evaluation Methodologies (iEvaLM [35], CONCEPT [16]): These initiatives share similar objectives to UserSimCRS in facilitating simulation-based evaluation of CRSs using LLM-based user simulators, and they consider various objective and subjective metrics. However, their implementations exclusively support CRSs implemented in CRSLab [43] and trained on either ReDial [22] or OpenDialKG [24], making them less broadly applicable than the wider integration goals of UserSimCRS v2.
- CRS Arena [6]: This platform focuses on crowdsourced human evaluation, benchmarking CRSs through pairwise battles. UserSimCRS v2 leverages this by introducing an interface to interact with CRSs available in CRS Arena, such as KBRD [10], BARCOR [34], and ChatCRS [35], thus facilitating broader CRS integration.
- LLM-as-a-Judge Evaluation [42]: The trend of using LLMs as surrogates for humans in evaluation is highlighted, with applications across diverse tasks including relevance assessment in information retrieval [11, 12, 30] and multi-faceted evaluation of CRSs [16, 35]. UserSimCRS v2 adopts this approach for conversational quality evaluation, while also acknowledging ongoing questions regarding biases and manipulation [2, 3, 33].
3.3. Technological Evolution
The evolution of conversational recommender systems and their evaluation methods can be summarized as follows:
- Early Recommender Systems: Initially, recommender systems focused on offline evaluation using metrics like Mean Reciprocal Rank (MRR) or Precision@k. These systems were often not conversational.
- Emergence of CRSs: With the rise of conversational AI, CRSs emerged, necessitating interactive evaluation.
- Human Evaluation for CRSs: Human evaluation became a primary method to capture the interactive nature of these systems. However, its high cost, lack of scalability, and reproducibility issues became apparent.
- Rule-based User Simulation: To mitigate the drawbacks of human evaluation, user simulation gained traction. Early approaches, such as agenda-based user simulators, were often rule-based and modular. While useful, they were limited by their rigidness, predictability, and difficulty in generating diverse, human-like interactions.
- Neural Network-based CRSs and Component-level Evaluation: Toolkits like CRSLab supported the development of deep neural network-based CRSs. Evaluation often focused on individual components using metrics like BLEU for NLG, which do not capture end-to-end conversational quality.
- LLM Era (User Simulation and Evaluation): The advent of Large Language Models (LLMs) brought a significant leap. LLMs enable more fluent, diverse, and human-like user simulations, leading to more realistic evaluation scenarios. Simultaneously, LLMs began to be employed as "judges" to automate the qualitative assessment of conversational quality, bridging the gap between automated metrics and human perception.
- Unified Simulation Frameworks: The current state, as exemplified by UserSimCRS v2, aims to integrate these advancements into a comprehensive and flexible toolkit. This includes enhancing traditional simulators with LLM capabilities, introducing end-to-end LLM-based simulators, and providing LLM-as-a-Judge utilities, all while supporting a wider range of CRSs and benchmark datasets.

This paper's work fits within the latest stage of this evolution, providing a state-of-the-art open-source toolkit that encapsulates the shift towards LLM-powered simulation and evaluation and addresses the long-standing challenges of CRS evaluation.
3.4. Differentiation Analysis
Compared to the main methods and toolkits in related work, UserSimCRS v2 presents several core differences and innovations:
- Comprehensive Simulator Portfolio: Unlike UserSimCRS v1, which offered only an agenda-based simulator, or iEvaLM/CONCEPT, which focus exclusively on LLM-based simulators, UserSimCRS v2 provides a holistic approach. It offers an enhanced agenda-based simulator (upgraded with LLM components for NLU/NLG) and introduces two distinct LLM-based user simulators (single-prompt and dual-prompt). This allows researchers to compare different simulation paradigms directly within a single framework.
- Broad Interoperability with CRSs: A major innovation is the new communication interface that allows UserSimCRS v2 to interact with a wider range of CRSs, particularly those benchmarked in CRS Arena [6]. This addresses the interoperability challenge that previously hindered widespread adoption and comparative studies, as many existing simulation efforts were tied to specific CRS implementations (e.g., iEvaLM/CONCEPT are limited to CRSLab [43] CRSs). By treating the CRS as a "black box" that interacts via DialogueKit, UserSimCRS v2 offers much greater flexibility.
- Unified Data Handling and Augmentation: UserSimCRS v2 tackles the burden of data preparation by providing a unified data format and tools for converting widely-used benchmark datasets (INSPIRED, ReDial, IARD). Crucially, it incorporates LLM-powered tools for data augmentation, such as dialogue act annotation and information need extraction for datasets that lack these annotations. This is a significant improvement over previous versions and other toolkits that leave such preparation to the user.
- Advanced LLM-as-a-Judge Evaluation: While LLM-as-a-Judge is a growing trend, UserSimCRS v2 integrates a dedicated utility to assess conversational quality across multiple dimensions (relevance, style, fluency, flow, satisfaction). This moves beyond the limited metrics of UserSimCRS v1 and provides a more nuanced, automated evaluation than the component-specific metrics common in other CRS toolkits.
- Specific to the Recommendation Domain: Unlike general task-oriented dialogue system toolkits (e.g., ConvLab, PyDial), whose simulators are not designed for recommendation-specific constructs (e.g., preference modeling based on historical interactions), UserSimCRS v2 is explicitly tailored for CRSs. Its dialogue policy and information need structures are designed to reflect the nuances of recommendation dialogues.

In essence, UserSimCRS v2 innovates by creating a more comprehensive, flexible, and state-of-the-art ecosystem for simulation-based evaluation that addresses the key limitations of its predecessor and other existing tools, by embracing LLMs for both simulation and evaluation and by significantly improving interoperability and data handling.
4. Methodology
4.1. Principles
The core idea behind UserSimCRS v2 is to establish a comprehensive and flexible framework for simulation-based evaluation of Conversational Recommender Systems (CRSs). It aims to facilitate the comparison of different CRSs and investigate various user simulators by aligning with the current state-of-the-art in research, particularly the advancements brought by Large Language Models (LLMs). The theoretical basis or intuition is that by providing diverse, LLM-enhanced user simulators and sophisticated LLM-based evaluation metrics within an easily integrable framework, researchers can perform more realistic, scalable, and reproducible CRS evaluations, ultimately optimizing the use of human resources for testing. The toolkit maintains the principle of treating the CRS as a "black box," interacting with it via a standardized interface (DialogueKit) without needing access to its internal code.
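To make the black-box principle concrete, the sketch below shows the turn-taking loop between a user simulator and a CRS that only exposes a response-generation call. This is a minimal illustration under assumptions: the class and method names (`Dialogue`, `generate_response`, `wants_to_stop`) are hypothetical and do not correspond to the actual UserSimCRS/DialogueKit API.

```python
# Minimal sketch of simulation-based evaluation with a black-box CRS.
# Names are illustrative, not the actual UserSimCRS/DialogueKit API.
from dataclasses import dataclass, field


@dataclass
class Dialogue:
    """Alternating (speaker, utterance) pairs collected during one simulation."""
    turns: list = field(default_factory=list)

    def add(self, speaker: str, utterance: str) -> None:
        self.turns.append((speaker, utterance))


def simulate_dialogue(simulator, crs, max_turns: int = 20) -> Dialogue:
    """Run one synthetic conversation between a user simulator and a CRS.

    Both objects only need a `generate_response(history) -> str` method,
    so the CRS stays a black box (e.g., a wrapper around an HTTP endpoint).
    """
    dialogue = Dialogue()
    user_utterance = simulator.generate_response(dialogue.turns)  # opening turn
    dialogue.add("USER", user_utterance)
    for _ in range(max_turns):
        crs_utterance = crs.generate_response(dialogue.turns)
        dialogue.add("CRS", crs_utterance)
        user_utterance = simulator.generate_response(dialogue.turns)
        dialogue.add("USER", user_utterance)
        if simulator.wants_to_stop(dialogue.turns):  # e.g., goal fulfilled
            break
    return dialogue
```

Because the loop only depends on this small surface, any simulator type (agenda-based or LLM-based) can be paired with any CRS integration.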
4.2. Core Methodology In-depth (Layer by Layer)
The UserSimCRS v2 architecture (Figure 2 in the original paper) builds upon its predecessor by retaining core components while introducing significant upgrades and new modules.
The following figure (Figure 2 from the original paper) provides an overview of the UserSimCRS v2 architecture:
Figure 2 (schematic from the original paper): the UserSimCRS v2 architecture, including key components such as the agenda-based simulator, the LLM-based simulators, user modeling, item ratings, and CRS evaluation. Differently colored regions indicate inherited versus newly added functional modules and clarify the relationships between the various tools and datasets.
The grey components are inherited from UserSimCRS v1 [1], hashed components correspond to updated elements, and purple components represent newly added functionalities.
The toolkit's functionality can be broken down into several key areas:
4.2.1. Unified Data Format and Benchmark Datasets
UserSimCRS v2 addresses the challenge of data scarcity and heterogeneity by introducing a unified data format and providing tools for widely-used conversational datasets.
- Unified Data Format: The proposed format is based on DialogueKit, where utterances are annotated with dialogue acts. A dialogue act is more detailed than in UserSimCRS v1, comprising an intent and, optionally, a set of associated slot-value pairs, which allows for a richer representation of complex interactions.
  - Example: A CRS utterance "What size and color do you prefer for your new skates?" would be represented by the dialogue acts Elicit(size) and Elicit(color).
- Information Need: A crucial addition is the explicit notion of an information need, which guides the conversation and user responses and informs the user's preferences and persona. It consists of:
  - Constraints ($C$): A set of attributes and their desired values that the target item must satisfy.
  - Requests ($R$): A set of attributes for which the user wants to know the value.
  - Target Items: The specific items the user is looking for.

  For example, a user looking for the song 'Happy' by Pharrell Williams and wanting to know its release year would have an information need represented as:
  $$
  C = \big[\,\text{genre} = \text{pop},\ \text{artist} = \text{Pharrell Williams}\,\big], \quad
  R = \big[\,\text{release\_year} = ?\,\big], \quad
  \text{target} = \text{Happy}
  $$
  Here, genre = pop and artist = Pharrell Williams are constraints in $C$, release_year = ? is a request in $R$, and 'Happy' is the target item.
- Benchmark Datasets: UserSimCRS v2 includes conversion tools and support for INSPIRED [15] (movie domain, 1,001 dialogues), ReDial [22] (movie domain, 10,006 dialogues), and IARD* [8] (an augmented version of a subset of the original IARD dataset; movie domain, 77 dialogues). These datasets are summarized in Table 1 of the original paper.
- Data Augmentation: For datasets lacking the necessary annotations (e.g., dialogue acts in ReDial), UserSimCRS v2 provides LLM-powered tools to perform automatic dialogue act annotation and information need extraction. This leverages LLMs as an alternative to human annotators in data-scarce scenarios, though the paper acknowledges the biases and limitations of LLMs in this context.
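A concrete way to picture the unified format is as annotated utterances plus an explicit information need. The sketch below uses plain Python dictionaries with illustrative field names; it is an assumption-based example, not the exact DialogueKit/UserSimCRS v2 schema.

```python
# Illustrative representation of an annotated utterance and an information need;
# field names are assumptions, not the exact UserSimCRS v2 schema.
annotated_utterance = {
    "participant": "CRS",
    "text": "What size and color do you prefer for your new skates?",
    "dialogue_acts": [
        {"intent": "Elicit", "slot_values": [("size", None)]},
        {"intent": "Elicit", "slot_values": [("color", None)]},
    ],
}

information_need = {
    # Constraints C: attributes the target item must satisfy.
    "constraints": {"genre": "pop", "artist": "Pharrell Williams"},
    # Requests R: attributes whose values the user wants to learn.
    "requests": ["release_year"],
    # Target item(s) the simulated user is ultimately looking for.
    "target_items": ["Happy"],
}
```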
4.2.2. Enhanced Agenda-based User Simulator (ABUS)
The agenda-based user simulator from UserSimCRS v1 is significantly upgraded by incorporating LLM-based components and a revised dialogue policy.
- Natural Language Understanding (NLU):
  - Objective: To extract dialogue acts from incoming CRS utterances.
  - Mechanism: An LLM-based NLU component is used. It is initialized with a list of possible intents and slots for dialogue acts, and a prompt that includes a placeholder for the incoming CRS utterance.
  - Process: The CRS utterance is sent to an LLM (e.g., hosted on Ollama) with this prompt. The LLM's output is then parsed to extract the dialogue acts in a predefined format: intent1(slot="value", slot, ...) | intent2().
  - Configuration: A default few-shot prompt and configuration, built using examples from the augmented IARD dataset, are provided for the movie domain.
- Dialogue Policy:
  - Objective: To determine the dialogue acts of the next user utterance based on the current agenda and dialogue state.
  - Mechanism: The interaction model within DialogueKit is revised to align with the new data format, which explicitly uses dialogue acts and the information need to guide the conversation.
  - Agenda Initialization: The agenda is now initialized based on the information need. For each constraint in $C$, a disclosure dialogue act is added to the agenda; for each request in $R$, an inquiry dialogue act is added.
  - Agenda Update Process: The update process is modified to reflect the natural flow of recommendation dialogues, inspired by Lyu et al. [23]. It explicitly handles four cases:
    - Elicitation: When the CRS elicits information, a disclosure dialogue act is pushed to the agenda.
    - Recommendation: When the CRS makes a recommendation.
    - Inquiry: When the CRS makes an inquiry.
    - Other: For other CRS utterances.
    For the first three cases, a specific type of dialogue act is pushed to the agenda; in the "other" case, the next dialogue act is either taken from the agenda (if coherent with the current dialogue state) or sampled based on historical data.
  - Slot-Value Derivation: All slot-value pairs within the dialogue acts are derived from the information need, or from the user's preferences when not explicitly present in the information need.
- Natural Language Generation (NLG):
  - Objective: To generate human-like textual utterances from the dialogue acts provided by the dialogue policy.
  - Mechanism: An LLM-based NLG component (relying on an LLM hosted on Ollama) is introduced, aiming to increase the diversity of generated responses compared to the template-based NLG of UserSimCRS v1.
  - Process: This component uses a prompt that includes placeholders for the dialogue acts and, optionally, additional annotations (e.g., emotion).
  - Configuration: A default few-shot prompt, using examples from the augmented IARD dataset, is provided for the movie domain.
4.2.3. Large Language Model-based User Simulators
UserSimCRS v2 introduces two LLM-based user simulators to align with the state-of-the-art trend. These are end-to-end architectures (as shown in Figure 1b from the original paper).
- Single-Prompt Simulator:
  - Mechanism: This simulator generates the next user utterance in an end-to-end manner using a single, comprehensive prompt.
  - Prompt Composition: The prompt is inspired by Terragni et al. [29] and includes:
    - Task description: Instructions on what the simulator should do.
    - Optional persona: Character traits or background information for the simulated user.
    - Information need: The user's specific goals and preferences.
    - Current conversation history: The preceding turns of the dialogue.
  - Approach: The implementation follows a zero-shot approach by default (no examples in the prompt), but it can be configured as few-shot by including examples in the task description.
- Dual-Prompt Simulator:
  - Mechanism: This extends the single-prompt simulator by introducing an initial decision-making step.
  - Stopping Prompt: A separate prompt (the stopping prompt) is used first to decide whether the conversation should continue (a binary decision). This prompt is structured similarly to the main generation prompt but is reframed to focus on the continuation decision.
  - Generation Prompt: If the stopping prompt decides to continue, the main generation prompt (identical to the one used by the single-prompt simulator) is then used to generate the next utterance.
  - Termination: If the stopping prompt decides to stop, a default utterance for ending the conversation is sent.
4.2.4. Integration with Large Language Models
To manage interactions with various LLM backends, UserSimCRS v2 introduces a common interface.
- Common Interface: All LLM-based modules (e.g., NLU, NLG, the LLM-based simulators, and LLM-as-a-Judge) use this interface to communicate with LLM servers.
- Implementations: UserSimCRS v2 provides two initial implementations: one for OpenAI and another for an Ollama server.
- Extensibility: This design ensures high extensibility, allowing interfaces for other LLM backends to be added easily through inheritance.
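The extensibility claim boils down to a small abstract interface with one implementation per backend. The sketch below illustrates that pattern with hypothetical class names and a made-up `generate` signature; only the existence of OpenAI and Ollama backends is taken from the paper, and the default model names are arbitrary examples.

```python
# Hypothetical sketch of a common LLM interface with pluggable backends.
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Shared contract used by NLU, NLG, LLM simulators, and LLM-as-a-Judge."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for the given prompt."""


class OpenAIInterface(LLMInterface):
    def __init__(self, client, model: str = "gpt-4o-mini"):
        self._client = client  # e.g., an openai.OpenAI() instance
        self._model = model

    def generate(self, prompt: str) -> str:
        response = self._client.chat.completions.create(
            model=self._model, messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content


class OllamaInterface(LLMInterface):
    def __init__(self, host: str = "http://localhost:11434", model: str = "llama3"):
        self._host, self._model = host, model

    def generate(self, prompt: str) -> str:
        import requests  # assumes a local Ollama server exposing /api/generate
        r = requests.post(f"{self._host}/api/generate",
                          json={"model": self._model, "prompt": prompt, "stream": False})
        return r.json()["response"]
```

Adding another backend would then only require a new subclass, which is the inheritance-based extensibility described above.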
4.2.5. Integration with Existing CRSs
- Black Box Principle: UserSimCRS maintains its design principle of treating the CRS as a "black box," requiring only an interface for interaction (via DialogueKit).
- New Interface for CRS Arena: UserSimCRS v2 introduces a new interface to interact with CRSs available in CRS Arena [6], a platform for benchmarking CRSs. This includes models such as KBRD [10], BARCOR [34], and ChatCRS [35].
- Benefits: This integration significantly reduces the technical burden for users and facilitates experimentation with a more diverse and commonly used set of CRSs. The original UserSimCRS v1 only supported IAI MovieBot [14].
4.2.6. LLM-as-a-Judge for Conversation Quality Evaluation
UserSimCRS v2 expands its evaluation capabilities beyond basic metrics by introducing an LLM-based evaluator ("LLM-as-a-Judge").
- Evaluation Aspects: Conversation quality is assessed across five key aspects, inspired by Huang et al. [16]:
  - Recommendation relevance: Measures how well the recommended items match the user's preferences and information need.
  - Communication style: Assesses the conciseness and clarity of the CRS's responses.
  - Fluency: Evaluates the naturalness and human-likeness of the CRS's generated responses.
  - Conversational flow: Examines the coherence and consistency of the entire dialogue.
  - Overall satisfaction: Captures the user's holistic experience with the CRS.
- Scoring: Each aspect is assigned a score between 1 and 5, guided by a detailed grading rubric. For example, for recommendation relevance, a score of 1 indicates irrelevant recommendations, while a score of 5 signifies highly relevant recommendations.
- Acknowledged Limitations: The paper explicitly acknowledges the ongoing questions and potential issues associated with LLM-based evaluators, including their susceptibility to biases and the variability in their correlation with human judgments [2, 3, 33, 42].
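Conceptually, the judge utility renders the conversation and a rubric into a prompt and parses a 1-5 score per aspect. The following is a minimal sketch under assumptions: the generic `llm.complete()` call, the prompt wording, and the rubric dictionary are illustrative and may differ from the toolkit's actual prompts and parsing.

```python
import re

ASPECTS = ["recommendation relevance", "communication style", "fluency",
           "conversational flow", "overall satisfaction"]

JUDGE_PROMPT = (
    "You are evaluating a conversation between a user and a recommender system.\n"
    "Rate the {aspect} on a scale from 1 (very poor) to 5 (excellent), following\n"
    "this grading rubric: {rubric}\n\nConversation:\n{conversation}\n\n"
    "Answer with a single integer between 1 and 5."
)


def judge_conversation(llm, conversation: str, rubrics: dict) -> dict:
    """Return an aspect -> score dictionary using an LLM as the judge."""
    scores = {}
    for aspect in ASPECTS:
        reply = llm.complete(JUDGE_PROMPT.format(
            aspect=aspect, rubric=rubrics[aspect], conversation=conversation))
        match = re.search(r"[1-5]", reply)
        scores[aspect] = int(match.group()) if match else None  # None = unparsable
    return scores
```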
5. Experimental Setup
5.1. Datasets
The experiments in the case study primarily utilize datasets from the movie recommendation domain. UserSimCRS v2 aims to reduce adoption barriers by supporting widely-used conversational datasets.
The following is Table 1 from the original paper, which summarizes the supported datasets:
| Dataset | Domain | # Dialogues | Dialogue annotations |
|---|---|---|---|
| IARD* [8] | Movie | 77 | Dialogue acts, recommendations accepted |
| INSPIRED [15] | Movie | 1,001 | Social strategies, entity mentioned, item feedback |
| ReDial [22] | Movie | 10,006 | Movie references, item feedback |
- IARD* [8]: An augmented version of a subset of the original IARD dataset, in the movie domain, containing 77 dialogues. Its annotations include dialogue acts and whether recommendations were accepted, which are crucial for training agenda-based user simulators and for evaluation. The asterisk indicates the augmented version (likely produced with the LLM-powered tools for dialogue act and information need extraction where original annotations were missing).
- INSPIRED [15]: Also in the movie domain, featuring 1,001 dialogues. It includes annotations for social strategies, entities mentioned, and item feedback. These rich annotations make it suitable for evaluating CRSs that engage in more sociable and nuanced conversations.
- ReDial [22]: A larger movie-domain dataset comprising 10,006 dialogues. It contains movie references and item feedback. Notably, ReDial utterances are not originally annotated with dialogue acts, making the dataset a prime candidate for LLM-powered data augmentation within UserSimCRS v2 so that it can be used with dialogue act-dependent user simulators.

These datasets were chosen because they are widely recognized benchmarks in the field of conversational recommender systems. Their diversity in size and annotation types allows for comprehensive validation of the toolkit's capabilities, including the data augmentation tools and the different types of user simulators, and they provide realistic conversational contexts and preferences for movie recommendations. The paper does not provide a concrete example of a data sample (e.g., a specific dialogue turn) from these datasets within its main text.
5.2. Evaluation Metrics
The paper describes several evaluation metrics used to assess the performance of CRSs within UserSimCRS v2.
5.2.1. User Satisfaction
- Conceptual Definition: User satisfaction quantifies how pleased the simulated user is with the overall recommendation experience and outcomes. It is a subjective measure intended to capture the utility and enjoyment derived from interacting with the CRS. In UserSimCRS v1, this was one of the limited metrics.
- Mathematical Formula: The paper does not provide a specific formula for user satisfaction but implies it is a score assigned (likely by the LLM-as-a-Judge in v2) on a scale, often 1-5.
- Symbol Explanation: Not applicable, as no formula is provided.
5.2.2. Average Number of Turns
- Conceptual Definition: This metric measures the efficiency of the conversation, indicating how many dialogue turns were required for the simulated user to achieve their information need or to conclude the conversation. A lower number of turns often suggests a more efficient CRS.
- Mathematical Formula: The paper does not provide a specific formula, but the metric is typically calculated as the sum of the number of turns over all conversations divided by the total number of conversations. Let $N$ be the total number of synthetic dialogues and $T_i$ the number of turns in dialogue $i$:
  $$ \text{Average Number of Turns} = \frac{\sum_{i=1}^{N} T_i}{N} $$
- Symbol Explanation:
  - $N$: The total number of synthetic dialogues generated.
  - $T_i$: The number of turns in the $i$-th dialogue.
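For completeness, the same statistic can be computed in a few lines over the generated dialogues; `dialogues` here is assumed to be a list of turn lists rather than any specific toolkit data structure.

```python
def average_number_of_turns(dialogues: list[list]) -> float:
    """Mean number of turns per synthetic dialogue: (sum_i T_i) / N."""
    if not dialogues:
        return 0.0
    return sum(len(turns) for turns in dialogues) / len(dialogues)
```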
5.2.3. LLM-as-a-Judge Conversation Quality Metrics
UserSimCRS v2 introduces a utility to assess conversation quality using an LLM-based evaluator, assigning a score between 1 and 5 for each of five specific aspects:
- Recommendation Relevance
  - Conceptual Definition: Assesses how well the items recommended by the CRS align with the simulated user's stated and inferred preferences and information need. It directly measures the quality of the recommendations in satisfying the user's objective. A score of 1 indicates irrelevant recommendations, while a score of 5 signifies highly relevant recommendations.
  - Mathematical Formula: Not provided; this is a qualitative score assigned by an LLM based on a predefined rubric. The reported values are averages of these scores.
  - Symbol Explanation: Not applicable.
- Communication Style
  - Conceptual Definition: Evaluates the clarity, conciseness, and appropriateness of the CRS's responses. It measures how effectively the CRS communicates, considering factors like grammar, directness, and absence of ambiguity.
  - Mathematical Formula: Not provided; a qualitative score assigned by an LLM based on a predefined rubric.
  - Symbol Explanation: Not applicable.
- Fluency
  - Conceptual Definition: Measures the naturalness and grammatical correctness of the CRS's generated utterances, comparing them to how a human would typically phrase responses. It assesses how human-like and easy to understand the language generation is.
  - Mathematical Formula: Not provided; a qualitative score assigned by an LLM based on a predefined rubric.
  - Symbol Explanation: Not applicable.
- Conversational Flow
  - Conceptual Definition: Evaluates the overall coherence, consistency, and logical progression of the dialogue. It assesses whether the conversation feels natural, follows a logical sequence, and avoids abrupt topic shifts or repetitions.
  - Mathematical Formula: Not provided; a qualitative score assigned by an LLM based on a predefined rubric.
  - Symbol Explanation: Not applicable.
- Overall Satisfaction
  - Conceptual Definition: Encapsulates the simulated user's holistic experience with the CRS. It is a comprehensive subjective score reflecting the user's general contentment with the entire interaction, considering all aspects of the conversation and the recommendations received.
  - Mathematical Formula: Not provided; a qualitative score assigned by an LLM based on a predefined rubric.
  - Symbol Explanation: Not applicable.
5.3. Baselines
The paper's case study demonstrates the capabilities of UserSimCRS v2 by evaluating several CRSs using different types of user simulators. The "baselines" in this context are the other CRSs and user simulators against which performance is compared, showcasing the toolkit's comparative evaluation functionalities.
User Simulators (implemented within UserSimCRS v2) compared:
- ABUS (Agenda-based User Simulator): The enhanced agenda-based user simulator from UserSimCRS v2, featuring LLM-based NLU and NLG components. This represents an upgraded version of the traditional rule-based simulation approach.
- LLM-SP (Single-Prompt LLM-based User Simulator): One of the two end-to-end LLM-based simulators introduced in UserSimCRS v2. It generates responses using a single comprehensive prompt.
- LLM-DP (Dual-Prompt LLM-based User Simulator): The other end-to-end LLM-based simulator in UserSimCRS v2. It first uses a separate prompt to decide whether to continue the conversation before generating a response.

Conversational Recommender Systems (CRSs) evaluated:
These CRSs are selected from CRS Arena [6] or are the original CRS supported by UserSimCRS v1. They represent various approaches to building CRSs.
- BARCOR_OpenDialKG [34]: A CRS based on the BARCOR framework, likely trained or evaluated using the OpenDialKG dataset.
- BARCOR_ReDial [34]: Another BARCOR-based CRS, likely trained or evaluated using the ReDial dataset.
- KBRD_ReDial [10]: A Knowledge-Based Recommender Dialogue system (KBRD), likely trained or evaluated using the ReDial dataset.
- UniCRS_OpenDialKG: A CRS model (likely a universal or unified CRS), likely trained or evaluated using the OpenDialKG dataset.
- IAI MovieBot [14]: The original conversational movie recommender system that UserSimCRS v1 was designed to interact with.

These CRSs were chosen because they are commonly used and representative models in the research community, providing a good testbed to demonstrate UserSimCRS v2's ability to integrate with diverse systems and evaluate them using different simulators and datasets. The comparison among these systems, facilitated by the toolkit, helps illustrate its utility in exploring CRS performance under various simulation conditions.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a case study in movie recommendation to demonstrate the functionalities of UserSimCRS v2. The evaluation involves generating 100 synthetic dialogues for each pair of user simulator and CRS, and then computing evaluation metrics. The reported metrics are user satisfaction, fluency, and recommendation relevance, all scored on a 1-5 scale by the LLM-as-a-Judge utility.
The following are the results from Table 2 of the original paper:
| CRS | User satisfaction (ABUS) | User satisfaction (LLM-SP) | User satisfaction (LLM-DP) | Fluency (ABUS) | Fluency (LLM-SP) | Fluency (LLM-DP) | Rec. relevance (ABUS) | Rec. relevance (LLM-SP) | Rec. relevance (LLM-DP) | | | |
| Dataset: MovieBot | ||||||||||||
| BARCOR_OpenDialKG | 1.9 ±0.6 | 1.9 ±0.6 | 1.9 ±0.6 | 2.8 ±0.5 | 2.6 ±0.6 | 3.0 ±0.5 | 1.4 ±0.6 | 1.5 ±0.9 | 2.0 ±0.9 | |||
| BARCOR−ReDial | 2.0 ±0.5 | 1.8 ±0.5 | 2.0 ±0.6 | 3.3 ±0.8 | 2.6 ±0.6 | 2.9 ±0.6 | 2.4 ±1.6 | 1.5 ±0.9 | 1.9 ±0.9 | |||
| KBRD_ReDial | 1.9 ±0.4 | 1.9 ±0.4 | 1.9 ±0.4 | 2.5 ±0.6 | 2.4 ±0.6 | 2.6 ±0.6 | 1.2 ±0.6 | 1.3 ±0.7 | 1.5 ±0.8 | |||
| UniCRS_OpenDialKG | 2.3 ±0.8 | 1.8 ±0.6 | 1.9 ±0.6 | 2.4 ±0.5 | 2.5 ±0.6 | 2.8 ±0.5 | 1.0 ±0.2 | 1.5 ±1.0 | 1.5 ±0.7 | |||
| IAI MovieBot | 1.5 ±0.7 | 1.7 ±0.8 | 2.5 ±0.7 | 3.4 ±0.6 | 3.1 ±0.7 | 3.3 ±1.0 | 2.1 ±1.1 | 2.3 ±1.4 | 2.9 ±1.7 | |||
| Dataset: INSPIRED | ||||||||||||
| BARCOR_OpenDialKG | 2.2 ±0.7 | 2.0 ±0.6 | 2.1 ±0.6 | 2.6 ±0.6 | 2.7 ±0.5 | 3.0 ±0.6 | 1.6 ±0.6 | 1.4 ±0.6 | 1.9 ±1.0 | |||
| BARCOR−ReDial | 2.1 ±0.5 | 2.0 ±0.5 | 2.0 ±0.5 | 3.5 ±0.5 | 2.6 ±0.6 | 2.9 ±0.6 | 2.3 ±0.8 | 1.4 ±0.9 | 1.6 ±0.9 | |||
| KBRD_ReDial | 2.0 ±0.3 | 2.0 ±0.5 | 2.1 ±0.6 | 3.3 ±0.6 | 2.3 ±0.5 | 2.6 ±0.6 | 1.9 ±0.9 | 1.2 ±0.6 | 1.4 ±0.7 | |||
| UniCRS_OpenDialKG | 1.9 ±0.8 | 2.1 ±0.6 | 2.1 ±0.6 | 2.9 ±0.5 | 2.4 ±0.5 | 2.7 ±0.5 | 1.3 ±0.6 | 1.1 ±0.4 | 1.5 ±0.7 | |||
| IAI MovieBot | 1.4 ±0.8 | 2.2 ±0.8 | 2.2 ±0.8 | 3.3 ±0.5 | 2.9 ±1.0 | 3.3 ±1.0 | 2.1 ±0.8 | 2.7 ±1.9 | 3.1 ±1.8 | |||
| Dataset: ReDial | ||||||||||||
| BARCOR_OpenDialKG | 2.2 ±0.8 | 1.9 ±0.8 | 2.1 ±0.7 | 3.0 ±0.6 | 2.6 ±0.5 | 3.0 ±0.5 | 1.9 ±0.8 | 1.4 ±0.6 | 1.8 ±0.9 | |||
| BARCOR_ReDial | 2.0 ±0.1 | 2.2 ±0.6 | 2.3 ±0.8 | 3.4 ±0.6 | 2.4 ±0.5 | 2.4 ±0.6 | 2.3 ±0.9 | 1.2 ±0.6 | 1.3 ±0.6 | |||
| KBRD_ReDial | 1.9 ±0.6 | 2.1 ±0.5 | 2.1 ±0.8 | 3.0 ±0.5 | 2.1 ±0.4 | 2.3 ±0.5 | 1.7 ±0.8 | 1.0 ±0.1 | 1.2 ±0.6 | |||
| UniCRS_OpenDialKG | 2.0 ±0.8 | 1.9 ±0.8 | 1.9 ±0.8 | 2.7 ±0.5 | 2.3 ±0.5 | 2.8 ±0.6 | 1.3 ±0.5 | 1.1 ±0.3 | 1.5 ±0.9 | |||
| IAI MovieBot | 2.0 ±0.2 | 2.3 ±0.9 | 2.2 ±0.9 | 4.0 ±0.2 | 2.9 ±1.0 | 3.2 ±1.0 | 4.7 ±0.8 | 2.4 ±1.7 | 2.8 ±1.6 | |||
6.1.1. General CRS Performance
The results generally confirm that CRS performance remains a significant challenge, with most scores falling below 3 on a 5-point scale across all metrics and simulators. This aligns with observations from other studies [5, 6]. This suggests that there is substantial room for improvement in CRS development.
6.1.2. Simulator Disagreement on System Ranking
A crucial finding is the significant disagreement among different user simulators regarding the ranking of CRS performance.
- Example: On the MovieBot dataset, ABUS ranks IAI MovieBot last for user satisfaction (1.5 ±0.7), while LLM-DP ranks it first (2.5 ±0.7).
- Magnitude Divergence: Even when simulators agree on the top-performing system, they often diverge on the magnitude of its performance. For example, for recommendation relevance on ReDial, IAI MovieBot is ranked highest by ABUS (4.7 ±0.8), but the LLM simulators give it much lower scores (2.4 ±1.7 for LLM-SP and 2.8 ±1.6 for LLM-DP). This highlights that the choice of user simulator can profoundly impact the perceived performance and ranking of CRSs, emphasizing the need for careful consideration and potentially multi-faceted evaluation with diverse simulators.
6.1.3. Distinct Simulator Characteristics
The different user simulators exhibit distinct characteristics:
- ABUS ("most opinionated"): The agenda-based user simulator (ABUS) appears to be the "most opinionated," as it assigns both the highest score in the entire table (4.7 for IAI MovieBot's recommendation relevance on ReDial) and some of the lowest (1.0 for UniCRS_OpenDialKG's recommendation relevance on MovieBot). This suggests ABUS might be more sensitive, or have stricter criteria for judging, leading to a wider spread of scores.
- LLM Simulators (compressed range): In contrast, the LLM simulators (LLM-SP and LLM-DP) generally operate within a more compressed range, rarely awarding scores above 3.3. This could imply a tendency towards more conservative or averaged judgments compared to ABUS.
- LLM-SP vs. LLM-DP: The two LLM simulators are not interchangeable. While they sometimes produce similar results, LLM-DP consistently yields equal or higher average scores than LLM-SP on the INSPIRED dataset across all three metrics. This suggests that the dual-prompt mechanism, which includes an explicit stopping decision, might lead to slightly more favorable or complete conversations from the LLM judge's perspective.
6.1.4. CRS Performance is Dataset-Dependent
The performance of CRSs is highly dependent on the dataset used for evaluation:
- Extreme Variability: Some systems show extreme variability. For instance, IAI MovieBot's recommendation relevance (as judged by ABUS) peaks at 4.7 on ReDial, but drops to a mediocre 2.1 on the MovieBot and INSPIRED datasets. This suggests that CRSs can be highly tuned to, or perform very differently depending on, the specific characteristics and nuances of the underlying data.
- High Consistency: In contrast, other systems demonstrate high consistency. BARCOR_ReDial's fluency score from ABUS, for example, remains stable and high across all three datasets (3.3, 3.5, 3.4). This indicates robustness in certain aspects for some CRSs.
6.1.5. Enabled Research Directions
The results and the synthetic dialogues generated by UserSimCRS v2 pave the way for various research avenues:
- Influence of Simulator Types: Researchers can investigate how different user simulator types (e.g., agenda-based vs. LLM-based) and configurations (e.g., different LLMs, prompts, datasets, and interaction models) influence evaluation outcomes.
- CRS Performance Comparison: The toolkit enables robust comparative analysis of CRS performance under controlled simulation conditions.
- Dialogue Analysis: The generated synthetic dialogues can be analyzed for their discourse structure, characteristics, or instances of conversational breakdowns, providing insights into CRS weaknesses and strengths.

In summary, the case study effectively demonstrates the advanced capabilities of UserSimCRS v2 for simulation-based evaluation. It reveals the complexity of CRS evaluation, highlighting the varying perspectives of different user simulators and the significant impact of dataset choice on perceived CRS performance. This underscores the toolkit's value as a flexible and comprehensive platform for future CRS research.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces UserSimCRS v2, a significant upgrade to the existing UserSimCRS toolkit designed to facilitate more comprehensive and flexible simulation-based evaluation for Conversational Recommender Systems (CRSs). The new version aligns the toolkit with state-of-the-art research by incorporating several key extensions:
- Enhanced Agenda-based User Simulator (ABUS): The classical ABUS is upgraded with LLM-powered components for Natural Language Understanding (NLU) and Natural Language Generation (NLG), along with a revised dialogue policy.
- Introduction of LLM-based Simulators: Two new end-to-end LLM-based user simulators (single-prompt and dual-prompt) are integrated.
- Wider CRS and Dataset Support: The toolkit now supports integration with a broader range of CRSs from CRS Arena and provides a unified data format with LLM-powered augmentation tools for popular benchmark datasets (ReDial, INSPIRED, IARD).
- LLM-as-a-Judge Evaluation Utilities: New LLM-based utilities are added for assessing conversational quality across multiple aspects such as recommendation relevance, fluency, and overall satisfaction.

A case study in movie recommendation demonstrates these extensions, showcasing the toolkit's ability to evaluate various CRSs under different simulation conditions. The results highlight the differing perspectives of the various user simulators and the impact of datasets on CRS performance.
7.2. Limitations & Future Work
The authors acknowledge several areas for future development and implicitly point to some current limitations:
- User Modeling Enhancements: Future work will focus on enhancing the integration of user modeling components. This is crucial for better supporting the simulation of diverse user populations, allowing for more realistic and varied user behaviors and preferences.
- Novel Evaluation Metrics: The toolkit aims to explore novel evaluation metrics beyond the current set. This suggests that, while the LLM-as-a-Judge utility is an advancement, there might be other dimensions or more robust ways to quantify CRS performance and conversational quality.
- Reducing the Technical Entry Barrier: A continuous goal is to further reduce the technical entry barrier to encourage widespread adoption of the toolkit by the research community. This implies that, while improvements have been made, there might still be complexities for new users.
- Acknowledged LLM-as-a-Judge Limitations: The paper explicitly mentions that "significant open questions remain, including susceptibility to biases and manipulation" for LLM-as-a-Judge [2, 3, 33], and that "correlation with human judgments varies across different studies" [11]. This is an important limitation of the new evaluation utility, despite its benefits.
- Simulator Disagreement: The case study itself reveals a limitation: different user simulators can significantly disagree on CRS rankings and performance magnitudes. While this enables research, it also implies that any single simulation result must be interpreted with caution, as its validity might depend on the chosen simulator.
7.3. Personal Insights & Critique
This paper presents a timely and valuable contribution to the field of Conversational Recommender Systems. The UserSimCRS v2 toolkit addresses critical gaps in CRS evaluation, moving beyond the limitations of manual human studies and rigid rule-based simulations.
Personal Insights:
- Holistic Approach: The most compelling aspect is the holistic integration of various state-of-the-art components. By combining enhanced agenda-based simulators with LLM-based end-to-end simulators and LLM-as-a-Judge evaluation, the toolkit provides a powerful and versatile platform. This multi-pronged approach acknowledges the strengths and weaknesses of different simulation paradigms.
- Interoperability Solved: The focus on interoperability, by providing a robust interface to CRS Arena models, is a significant practical step forward. This lowers the barrier for comparative research and fosters a more standardized evaluation environment, which is desperately needed in this complex domain.
- Data Augmentation Utility: The LLM-powered data augmentation tools are ingenious. Many valuable datasets lack the specific annotations required for advanced dialogue act-based or information need-driven simulations. Automating this process, even with acknowledged LLM biases, provides an accessible path to leveraging more diverse data.
- Realism vs. Control: UserSimCRS v2 strikes a good balance between the highly controlled, interpretable nature of agenda-based simulation and the more realistic, human-like generative capabilities of LLM-based simulation. This allows researchers to choose the level of realism and control needed for specific experiments.
Critique & Areas for Improvement:
- Simulator Validation: The observed "significant disagreement on system ranking" among simulators is a critical finding. While the paper frames this as an "enabled research direction," it also points to a fundamental question: which simulator's judgment is more "correct" or representative of real human users? Future work should prioritize rigorous validation of these simulators against human judgments, perhaps even developing methods to calibrate or combine simulator outputs to obtain more reliable insights.
- Quantifying LLM-as-a-Judge Biases: While the paper acknowledges LLM-as-a-Judge biases, it doesn't offer insights into the magnitude or nature of these biases within UserSimCRS v2. Quantifying this (e.g., through correlation studies with limited human judgments) would provide greater confidence in the LLM-derived scores.
- Transparency of LLM Prompts: The paper mentions default prompts for NLU, NLG, and the LLM-based simulators. Making these prompts highly configurable and transparent (e.g., providing a prompt engineering guide) would be beneficial, as prompt design significantly impacts LLM behavior and thus simulation outcomes.
- Error Analysis and Breakdowns: The paper suggests analyzing conversational breakdowns. While LLM-as-a-Judge provides scores, a deeper diagnostic capability to pinpoint why a conversation broke down or why recommendations were poor (e.g., specific dialogue act misinterpretations, NLG errors) would be invaluable for CRS developers.
- Beyond the Movie Domain: While the movie domain is popular, CRSs operate in many domains. Demonstrating the toolkit's adaptability to other domains (e.g., fashion, travel) would further highlight its versatility. This might involve discussing how to adapt the information need structure or LLM prompts for new domains.

Overall, UserSimCRS v2 is a robust and forward-thinking toolkit. It is poised to significantly accelerate research and development in Conversational Recommender Systems by providing much-needed tools for scalable, flexible, and state-of-the-art evaluation. The identified limitations are common in the rapidly evolving LLM landscape and represent exciting avenues for future research.