Towards Expert-Level Medical Question Answering with Large Language Models
TL;DR Summary
This paper presents Med-PaLM 2, a significant advance in medical question answering: by combining base LLM improvements (PaLM 2), medical domain finetuning, and novel prompting strategies such as ensemble refinement, it achieves 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%.
Abstract
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Towards Expert-Level Medical Question Answering with Large Language Models."
1.2. Authors
The authors are: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan.
Their affiliations are primarily Google Research and DeepMind.
1.3. Journal/Conference
The paper was published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference in itself, it is a widely recognized repository for preprints in fields like AI, physics, mathematics, and computer science. Papers published on arXiv are often submitted to prestigious conferences (e.g., NeurIPS, ICML, ICLR, AAAI) or journals after further peer review. The visibility on arXiv allows for rapid dissemination of research findings and community feedback.
1.4. Publication Year
The paper was published on arXiv at 2023-05-16T17:11:29 UTC, which corresponds to the year 2023.
1.5. Abstract
The abstract introduces Med-PaLM 2, a significant advancement in medical question answering (MQA) using Large Language Models (LLMs). Building on its predecessor Med-PaLM, which was the first to "pass" USMLE-style questions with a 67.2% score on the MedQA dataset, Med-PaLM 2 leverages several improvements: a more powerful base LLM (PaLM 2), medical domain-specific finetuning, and novel prompting strategies, including an ensemble refinement approach.
The model achieved a new state-of-the-art score of 86.5% on the MedQA dataset, an improvement of over 19% from Med-PaLM. It also demonstrated performance approaching or exceeding state-of-the-art on other benchmarks such as MedMCQA, PubMedQA, and MMLU clinical topics.
Crucially, the paper presents detailed human evaluations on long-form questions. In a pairwise comparison of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers over physician-generated answers on eight of nine axes related to clinical utility (e.g., factuality, reasoning, low harm likelihood) (p < 0.001). Significant improvements were also observed compared to Med-PaLM on "adversarial" questions designed to probe LLM limitations. While real-world validation is still necessary, these results indicate rapid progress towards physician-level performance in MQA.
1.6. Original Source Link
The official abstract page is https://arxiv.org/abs/2305.09617v1, and the PDF is available at https://arxiv.org/pdf/2305.09617v1.pdf. This is a preprint publication on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the long-standing "grand challenge" of developing artificial intelligence (AI) systems capable of retrieving, reasoning over, and answering medical questions at a level comparable to human physicians. This problem is of immense importance due to its potential to revolutionize healthcare, improve patient education, and assist clinicians.
Prior research, particularly with Large Language Models (LLMs), had made significant strides. Med-PaLM, for instance, was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE) style questions using the MedQA dataset, achieving 67.2%. However, despite this milestone, previous work, including Med-PaLM itself, indicated substantial room for improvement, especially when comparing the quality of model-generated answers to those provided by human clinicians. Key challenges and gaps identified included:
- Performance on Benchmarks: While Med-PaLM achieved state-of-the-art results on multiple-choice benchmarks, these scores still left room for improvement.
- Long-form Answer Quality: The ability to generate factual, safe, and nuanced long-form responses for open-ended questions, typical in real-world medical scenarios, remained a significant hurdle. Human evaluations revealed that AI outputs, particularly long-form answers, needed further refinement to ensure safety and alignment with human values and expectations in a safety-critical domain like medicine.
- Robustness to Adversarial Questions: LLMs often struggle with complex, tricky, or potentially biased questions, necessitating specific probes into their limitations.

The paper's entry point, or innovative idea, is to bridge these gaps by combining a new, more powerful base LLM (PaLM 2) with targeted medical domain finetuning and novel prompting strategies, including a new method called ensemble refinement. This multi-pronged approach aims to push LLM capabilities towards expert-level performance in medical question answering.
2.2. Main Contributions / Findings
The paper presents Med-PaLM 2 and highlights several primary contributions and key findings:
- State-of-the-Art Performance: Med-PaLM 2 achieved an accuracy of up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. It also approached or exceeded state-of-the-art performance across the MedMCQA, PubMedQA, and MMLU clinical topics datasets.
- Novel Methodology: The development of Med-PaLM 2 involved:
  - Leveraging PaLM 2 as an improved base LLM.
  - Implementing targeted medical domain-specific finetuning using MultiMedQA datasets.
  - Introducing ensemble refinement as a new prompting strategy to enhance LLM reasoning by aggregating multiple reasoning paths.
- Superior Long-Form Answer Quality (Human Evaluation): In detailed human evaluations by physicians on 1066 consumer medical questions:
  - Physicians preferred Med-PaLM 2 answers to physician-generated answers on eight of nine axes pertaining to clinical utility (e.g., alignment with medical consensus, reading comprehension, knowledge recall, reasoning, low likelihood of harm) with high statistical significance (p < 0.001).
  - Med-PaLM 2 answers were judged to better reflect medical consensus 72.9% of the time compared to physician answers.
- Improved Robustness on Adversarial Questions: On newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations, Med-PaLM 2 showed significant improvements over Med-PaLM across every evaluation axis (p < 0.001), including a much lower perceived risk of harm (90.6% of Med-PaLM 2 answers rated as having a low likelihood of harm vs. 79.4% for Med-PaLM).
- Comprehensive Evaluation Framework: The work reinforces the importance of a comprehensive benchmark (MultiMedQA), detailed human evaluation rubrics, and the introduction of adversarial datasets for rigorously assessing LLM performance in safety-critical domains.

These findings collectively demonstrate rapid progress towards achieving physician-level performance in medical question answering, addressing key shortcomings of previous LLM approaches.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper, a reader should be familiar with the following fundamental concepts:
- Artificial Intelligence (AI): A broad field of computer science concerned with building intelligent machines capable of performing tasks that typically require human intelligence, such as learning, problem-solving, perception, and decision-making.
- Machine Learning (ML): A subfield of AI that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention.
- Natural Language Processing (NLP): A subfield of AI focused on enabling computers to understand, interpret, and generate human language.
- Large Language Models (LLMs): These are advanced deep learning models, typically based on the transformer architecture, trained on massive amounts of text data (internet-scale corpora). They are designed to understand and generate human-like text, demonstrating capabilities like text generation, translation, summarization, and question answering. Their "largeness" refers to the number of parameters (billions or trillions) and the scale of their training data. Examples include GPT-3, PaLM, and PaLM 2.
- Medical Question Answering (MQA): A specialized application of NLP and LLMs where the goal is to provide accurate, relevant, and medically sound answers to health-related questions. This can range from answering multiple-choice exam questions to generating long-form explanations for consumer health queries.
- Fine-tuning (or Finetuning): A technique in machine learning where a pre-trained model (like an LLM) is further trained on a smaller, task-specific dataset. This process adapts the model's general knowledge to a particular domain or task, improving its performance for that specific use case. In this paper, Med-PaLM 2 is fine-tuned on medical domain-specific data.
- Instruction Fine-tuning (or Instruction Prompt-tuning): A specific type of fine-tuning where a model is trained on a dataset of instructions (prompts) paired with desired responses. This teaches the model to follow instructions and generate outputs aligned with specific task formats or user expectations.
- Prompting Strategies: The art and science of crafting effective inputs (prompts) to guide an LLM to generate desired outputs. Different strategies involve structuring the prompt in specific ways to elicit better reasoning or specific answer formats.
- Ensemble Methods: In machine learning, an ensemble method combines multiple models or multiple outputs from the same model to improve overall performance, improve robustness, or reduce bias. The core idea is that combining diverse perspectives can lead to a more reliable outcome than relying on a single one.
3.2. Previous Works
The paper explicitly builds upon and contrasts itself with several key prior studies:
- Med-PaLM: The direct predecessor to Med-PaLM 2. Introduced in prior work by Singhal et al. [1], Med-PaLM was the first model to exceed a "passing" score (67.2%) on USMLE-style questions in the MedQA dataset. It utilized Flan-PaLM as its base model and leveraged instruction prompt-tuning for medical domain alignment.
  - Limitation: While a significant achievement, human evaluations revealed that Med-PaLM's long-form answers still had "key shortfalls" compared to physician answers, and its multiple-choice scores, though state-of-the-art, left "room for improvement."
- MultiMedQA: Introduced in the Med-PaLM paper [1], MultiMedQA is a comprehensive benchmark for medical question answering. It is diverse, spanning medical exams, consumer health, and medical research, and includes a human evaluation rubric.
- Flan-PaLM: A powerful general-purpose LLM developed by Google [20, 21]. It is an instruction-finetuned version of PaLM; Med-PaLM was built upon Flan-PaLM.
- GPT Family of Models (GPT-3, GPT-3.5, GPT-4): These models from OpenAI have also shown rapid progress in medical QA. The paper specifically cites:
  - GPT-3.5 achieving 60.2% on MedQA [3].
  - GPT-4-base achieving 86.1% on MedQA [2, 45].
  - Studies evaluating their clinical knowledge without specific medical alignment, such as diagnostic and triage accuracies [22], and performance in genetics, surgery, and ophthalmology [23-25].
  - Ayers et al. [26], who found ChatGPT responses rated higher in quality and empathy than physician responses to social media patient questions.
- Domain-Specific Smaller Language Models: Earlier approaches to medical QA often involved smaller language models trained specifically on domain data, such as BioLinkBERT [11], DRAGON [12], PubMedGPT [13], PubMedBERT [14], and BioGPT [15]. These models steadily improved performance on benchmarks like MedQA, MedMCQA, and PubMedQA, but were generally surpassed by larger general-purpose LLMs.
- Chain-of-Thought (CoT) Prompting: Introduced by Wei et al. [42], this strategy involves augmenting few-shot examples with step-by-step explanations. It enables LLMs to perform multi-step reasoning by conditioning on their own intermediate outputs.
- Self-consistency (SC): Introduced by Wang et al. [43], this method improves performance by sampling multiple reasoning paths and answers from the model and then taking a majority (or plurality) vote for the final answer. This is particularly useful for complex problems with multiple reasoning routes.
- How CoT and SC are related: CoT is often used within SC to generate the diverse reasoning paths.
- Other Related Prompting/Refinement Techniques: The paper also mentions recitation-augmentation [28], self-refine [29], and dialogue-enabled reasoning [30] as related to their ensemble refinement approach; all of these involve conditioning an LLM on its own generations.
3.3. Technological Evolution
The field of medical question answering has evolved significantly:
- Early AI Systems: Long viewed as a "grand challenge" [8-10], early AI in medicine focused on rule-based expert systems or smaller, specialized models.
- Domain-Specific LMs: The advent of transformers [5] spurred the development of language models (LMs) specifically trained on biomedical text (e.g., PubMedBERT, BioGPT). These showed steady improvements on medical benchmarks.
- General-Purpose LLMs: With massive compute and internet-scale corpora, general-purpose LLMs like GPT-3 and Flan-PaLM emerged, demonstrating "leapfrog improvements" on medical benchmarks even without specific medical alignment. This highlighted the power of scale.
- Specialized LLMs for Medicine: Recognizing the critical need for safety and alignment in healthcare, the next phase involved taking these powerful general-purpose LLMs and adapting them to the medical domain. Med-PaLM was a pioneer here, using instruction fine-tuning to align Flan-PaLM to medical requirements.
- Advanced LLMs with Enhanced Reasoning and Evaluation: Med-PaLM 2 represents the current pinnacle of this evolution. It integrates an even stronger base model (PaLM 2), deeper domain-specific fine-tuning, and sophisticated prompting strategies like ensemble refinement to improve reasoning. Crucially, it emphasizes rigorous human evaluation, including pairwise comparisons and adversarial testing, to move beyond simple benchmark scores and assess real-world utility and safety.
3.4. Differentiation Analysis
Compared to the main methods in related work, Med-PaLM 2 introduces several core differences and innovations:
- Improved Base LLM: Unlike Med-PaLM, which was built on Flan-PaLM (an instruction-finetuned PaLM), Med-PaLM 2 leverages PaLM 2 [4], described as a "new iteration of Google's large language model with substantial performance improvements on multiple LLM benchmark tasks." This stronger foundation inherently provides a boost in capabilities.
- Novel Prompting Strategy: Ensemble Refinement (ER): While prior work utilized Chain-of-Thought (CoT) and Self-consistency (SC), Med-PaLM 2 introduces ensemble refinement. This approach generalizes SC by not just voting on answers but conditioning the LLM on multiple possible reasoning paths it generated in a prior step to produce a refined explanation and answer. This allows the model to "aggregate over answers" and potentially take into account the "strengths and weaknesses of the explanations it generated," leading to more robust reasoning.
- Comprehensive Human Evaluation Focus: The paper extends the human evaluation framework established by Med-PaLM. It introduces:
  - Pairwise Comparative Ranking: Directly comparing model answers against physician answers (and Med-PaLM answers) across nine clinically relevant axes for a large set of consumer medical questions. This provides a more nuanced understanding of relative quality than independent ratings.
  - Adversarial Question Datasets: Specifically curated datasets designed to probe the limitations, safety, and potential biases of LLMs in challenging scenarios (e.g., health equity, misinformation). This is crucial for assessing robustness in safety-critical applications.
- "Best of Both Worlds" Approach: Med-PaLM 2 explicitly combines the power of the "latest general-purpose LLMs" with targeted "medical question-answering data and physician-written responses to align the model to the safety-critical requirements of the medical domain." This contrasts with models like vanilla GPT-4 which, while powerful, are not specifically aligned to the medical domain's safety needs out of the box. The paper notes a performance drop between GPT-4-base and the aligned (production) GPT-4 model on multiple-choice benchmarks, whereas Med-PaLM 2 maintains strong benchmark performance while being explicitly aligned for long-form medical QA.
4. Methodology
4.1. Principles
The core idea behind Med-PaLM 2 is to achieve expert-level medical question answering by integrating three main components: a powerful, updated base Large Language Model (LLM), domain-specific adaptation through instruction fine-tuning, and advanced prompting strategies to enhance reasoning and answer quality. The underlying principle is that while general-purpose LLMs possess vast knowledge, tailoring them to the nuanced, safety-critical medical domain requires explicit alignment and sophisticated methods to elicit their best reasoning capabilities, especially for complex, open-ended questions. The novel ensemble refinement prompting strategy specifically aims to leverage the model's ability to explore multiple reasoning paths and then refine its output based on these diverse perspectives, leading to more accurate and robust answers.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Datasets
Med-PaLM 2 was evaluated on multiple types of datasets:
- Multiple-choice questions:
  - MedQA (USMLE) [16]: Contains questions representing general medical knowledge from the US medical licensing exam.
  - MedMCQA [17]: Focuses on general medical knowledge from Indian medical entrance exams.
  - PubMedQA [18]: A closed-domain question-answering dataset where answers are derived from PubMed abstracts.
  - MMLU clinical topics [31]: A collection of multiple-choice questions covering various clinical knowledge areas, including MMLU Medical genetics, MMLU Anatomy, MMLU Professional medicine, MMLU College biology, and MMLU College medicine.
- Long-form questions:
  - MultiMedQA 140: A curated sample of 140 questions from the HealthSearchQA, LiveQA [32], and MedicationQA [33] datasets, as used in previous Med-PaLM work.
  - MultiMedQA 1066: An expanded sample of 1066 questions from the same sources as MultiMedQA 140.
- Adversarial questions: Two new datasets specifically curated to probe LLM limitations and potential for harm/bias.
  - Adversarial (General): Covers issues like health equity, drug use, alcohol, mental health, COVID-19, obesity, suicide, and medical misinformation, including health disparities and racial bias in clinical calculators.
  - Adversarial (Health equity): Prioritizes use cases and sensitive characteristics relevant to health equity considerations in healthcare access, quality, and social/environmental factors.

No concrete examples of data samples from these datasets are provided in the paper's main body.
The following table summarizes the multiple-choice question evaluation datasets:
The following are the results from Table 1 of the original paper:
| Name | Count | Description |
| MedQA (USMLE) | 1273 | General medical knowledge in US medical licensing exam |
| PubMedQA | 500 | Closed-domain question answering given PubMed abstract |
| MedMCQA | 4183 | General medical knowledge in Indian medical entrance exams |
| MMLU-Clinical knowledge | 265 | Clinical knowledge multiple-choice questions |
| MMLU Medical genetics | 100 | Medical genetics multiple-choice questions |
| MMLU-Anatomy | 135 | Anatomy multiple-choice questions |
| MMLU-Professional medicine | 272 | Professional medicine multiple-choice questions |
| MMLU-College biology | 144 | College biology multiple-choice questions |
| MMLU-College medicine | 173 | College medicine multiple-choice questions |
The following table summarizes the long-form question evaluation datasets:
The following are the results from Table 2 of the original paper:
| Name | Count | Description |
| MultiMedQA 140 | 140 | Sample from HealthSearchQA, LiveQA, MedicationQA [1] |
| MultiMedQA 1066 | 1066 | Sample from HealthSearchQA, LiveQA, MedicationQA (Extended from [1]) |
| Adversarial (General) | 58 | General adversarial dataset |
| Adversarial (Health equity) | 182 | Health equity adversarial dataset |
4.2.2. Modeling
Base LLM
Med-PaLM 2 is built upon PaLM 2 [4], which is described as a new, more powerful iteration of Google's large language model, offering substantial performance improvements compared to its predecessor PaLM (used for Med-PaLM).
Instruction Finetuning
The PaLM 2 base LLM undergoes instruction finetuning (also known as instruction prompt-tuning) following the protocol used by Chung et al. [21]. This process adapts the general-purpose LLM to respond effectively to medical-specific instructions and questions.
The instruction finetuning utilized a mixture of training splits from MultiMedQA, specifically:
- MedQA
- MedMCQA
- HealthSearchQA
- LiveQA
- MedicationQA

A "unified" model is trained, optimized for performance across all of these datasets. The specific dataset mixture ratios (proportions of each dataset) were empirically determined, as shown in the table below (a minimal sketch of mixture-weighted sampling follows Table 3). Unless otherwise specified, Med-PaLM 2 refers to this unified model. A variant of Med-PaLM 2 was also created by finetuning exclusively on multiple-choice questions for comparison.
The following are the results from Table 3 of the original paper:
| Dataset | Count | Mixture ratio |
| MedQA | 10,178 | 37.5% |
| MedMCQA | 182,822 | 37.5% |
| LiveQA | 10 | 3.9% |
| MedicationQA | 9 | 3.5% |
| HealthSearchQA | 45 | 17.6% |
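To illustrate how such mixture ratios could drive example sampling during finetuning, here is a minimal sketch under assumed data structures; it is not the authors' training pipeline, and the placeholder examples are invented. Each training example is drawn from a dataset chosen according to the Table 3 weights:

```python
import random

# Mixture ratios from Table 3 (proportion of each dataset in the finetuning mix).
MIXTURE = {
    "MedQA": 0.375,
    "MedMCQA": 0.375,
    "LiveQA": 0.039,
    "MedicationQA": 0.035,
    "HealthSearchQA": 0.176,
}

def sample_batch(datasets: dict[str, list[str]], batch_size: int, seed: int = 0) -> list[str]:
    """Draw a finetuning batch by first picking a dataset according to the
    mixture weights, then picking a random example from that dataset."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[name]))
    return batch

# Toy example with placeholder training examples.
toy = {name: [f"{name} example {i}" for i in range(3)] for name in MIXTURE}
print(sample_batch(toy, batch_size=5))
```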
4.2.3. Multiple-choice evaluation (Prompting Strategies)
Several prompting strategies are employed to evaluate Med-PaLM 2 on multiple-choice benchmarks:
- Few-shot prompting:
  - Concept: This strategy involves providing the LLM with a few examples of input-output pairs before the actual question it needs to answer. This helps the model understand the desired task format and reasoning style without explicit instruction.
  - Implementation: The paper uses the same few-shot prompts as in Singhal et al. [1].
- Chain-of-Thought (CoT):
  - Concept: CoT [42] enhances few-shot prompting by including a step-by-step explanation or reasoning process for each example within the prompt, leading to the final answer. This allows the LLM to condition on its own intermediate reasoning steps, which is particularly beneficial for complex multi-step problems common in medical questions.
  - Implementation: The authors crafted CoT prompts to provide clear demonstrations of appropriate medical question answering (examples are provided in Section A.3.1 of the Appendix).
- Self-consistency (SC):
  - Concept: SC [43] improves performance by generating multiple diverse reasoning paths and corresponding answers from the LLM. Instead of relying on a single reasoning path, SC samples multiple explanations stochastically, and the final answer is determined by a majority (or plurality) vote among the sampled answers. For complex domains like medicine, where multiple valid reasoning routes can lead to a correct answer, marginalizing over these paths can yield a more accurate and robust result.
  - Implementation: In this work, SC is performed with 11 samplings using CoT prompting, consistent with Singhal et al. [1].
- Ensemble refinement (ER):
  - Concept: This is a novel prompting strategy developed in this work, building upon CoT and SC and related to techniques like self-refine [29]. ER operates in a two-stage process to improve LLM reasoning: it allows the LLM to aggregate information from its own multiple generated responses before producing a final, refined answer (a brief illustrative sketch follows the figure below).
  - Implementation:
    - First stage: Given a few-shot CoT prompt and a question, the model stochastically produces multiple possible generations (e.g., 11 samplings in this work) via temperature sampling. Each generation consists of an explanation and an answer for a multiple-choice question.
    - Second stage: The model is then conditioned on the original prompt, the question, and all the concatenated generations from the first stage, and is prompted to produce a refined explanation and a refined answer. This can be interpreted as a generalization of SC, where the LLM aggregates over the answers and explanations from the first stage rather than just taking a simple vote, allowing it to leverage the strengths and weaknesses observed in its initial diverse generations. The second stage is performed multiple times (e.g., 33 samplings in this work), and a plurality vote over these refined answers determines the final answer.
  - Generality: While applied here for multiple-choice evaluation, ER can in principle be used to produce improved long-form generations by having an LLM condition on multiple possible responses to generate a refined final answer.
  - Resource Cost: Due to the computational cost of repeated samplings, ER is applied only for multiple-choice evaluation in this work.

The following figure (Figure 2 from the original paper) illustrates the Ensemble Refinement process:

VLM Description: The image is a diagram illustrating the reasoning process of the Med-PaLM 2 model. Input data is processed through multiple reasoning paths (Reasoning Path 1, K, N) and then fed back into Med-PaLM 2 to generate the final answer.
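To make the two-stage procedure concrete, here is a minimal, illustrative sketch in Python. The `generate` function is a hypothetical stand-in for an LLM sampling call (stubbed so the example runs); the sampling counts follow the numbers reported in the paper, but the prompt formats are simplified assumptions rather than the authors' actual templates.

```python
import random
from collections import Counter

def generate(prompt: str, temperature: float) -> tuple[str, str]:
    """Hypothetical stand-in for an LLM call, returning (explanation, answer).

    A real implementation would sample from the medical LLM; here we return a
    toy explanation and a random option so the sketch is runnable."""
    answer = random.choice(["A", "B", "C", "D"]) if temperature > 0 else "A"
    return f"Reasoning sketch for: {prompt[:40]}...", answer

def ensemble_refinement(cot_prompt: str, question: str,
                        n_first: int = 11, n_second: int = 33) -> str:
    base = f"{cot_prompt}\n\nQuestion: {question}\n"

    # Stage 1: sample multiple (explanation, answer) generations via
    # temperature sampling, as in chain-of-thought + self-consistency.
    first_stage = [generate(base, temperature=0.7) for _ in range(n_first)]

    # Condition the model on its own concatenated first-stage generations.
    context = base + "\nCandidate reasoning paths:\n" + "\n".join(
        f"[{i + 1}] {expl} Answer: {ans}"
        for i, (expl, ans) in enumerate(first_stage)
    ) + "\nConsidering the paths above, give a refined explanation and answer."

    # Stage 2: sample refined answers several times, then plurality-vote.
    refined_answers = [generate(context, temperature=0.7)[1] for _ in range(n_second)]
    return Counter(refined_answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(ensemble_refinement("(few-shot CoT examples would go here)",
                              "Which vitamin deficiency causes scurvy?"))
```

Plain self-consistency corresponds to skipping the second stage and voting directly over the first-stage answers.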
4.2.4. Overlap analysis
To address concerns about test set contamination (where evaluation benchmarks might overlap with the LLM's training data), the authors perform an overlap analysis:
- Methodology: A question is defined as overlapping if either the entire question or at least 512 contiguous characters of the question text overlap with any document in the training corpus used for the base LLM underlying Med-PaLM 2. Multiple-choice options or answers are not included in this check, to prevent underestimation of overlap due to formatting variations. This approach is considered conservative, as it also treats questions where only the question text, but not the answer, appears in the training data as overlapping. A simplified sketch of such a check is shown below.
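As a rough illustration of this kind of contamination check (not the authors' actual tooling, which operates over the full pretraining corpus with more efficient indexing), the sketch below flags a question as overlapping when the whole question or any 512-character contiguous segment of it appears verbatim in a set of training documents; the brute-force scan and toy corpus are simplifying assumptions.

```python
def is_overlapping(question: str, training_docs: list[str], seg_len: int = 512) -> bool:
    """Flag a test question as contaminated if its full text, or any contiguous
    segment of at least `seg_len` characters, occurs verbatim in any training
    document. Answer options are deliberately ignored, as in the paper."""
    if any(question in doc for doc in training_docs):
        return True
    if len(question) < seg_len:
        return False
    for start in range(len(question) - seg_len + 1):
        segment = question[start:start + seg_len]
        if any(segment in doc for doc in training_docs):
            return True
    return False

# Example with a short segment length purely for demonstration.
docs = ["... A 34-year-old woman presents with fatigue and joint pain ..."]
print(is_overlapping("A 34-year-old woman presents with fatigue and joint pain",
                     docs, seg_len=40))
```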
4.2.5. Long-form evaluation
A series of human evaluations are conducted to assess Med-PaLM 2's performance on long-form consumer medical question-answering.
Model answers
- Elicitation: Prompts provided in Section A.3.4 of the Appendix are used consistently for both Med-PaLM and Med-PaLM 2.
- Sampling: Answers are sampled from the models with a temperature of 0.0, as in Singhal et al. [1]. A temperature of 0.0 makes the model's output deterministic and generally yields the most probable token sequence, which is often preferred for factual tasks to ensure consistency.
Physician answers
- Generation: Physicians generate answers without time limits and with full access to reference materials.
- Audience: They are instructed to target their answers to a lay-person of average reading comprehension.
- Context: Tasks are not anchored to a specific environmental or clinical scenario.
Physician and lay-person raters
- Physician Raters: A pool of 15 physicians (6 US, 4 UK, 5 India) with diverse specialties (family medicine, internal medicine, cardiology, respiratory, pediatrics, surgery).
- Blinding: Raters are blinded to the source of the answer (model or human).
- Independence: Ratings are performed in isolation without conferring.
- Conflict of Interest: Physicians who previously generated answers for MultiMedQA questions [1] do not evaluate their own answers, and 8-10 weeks elapsed between answer generation and evaluation.
- Lay-person Raters: A pool of 6 raters (4 female, 2 male, 18-44 years old) from India, all without a medical background. Educational backgrounds vary (high school diploma to postgraduate).
Individual evaluation of long-form answers
- Rubrics: Answers (from physicians, Med-PaLM, and Med-PaLM 2) are independently rated by physician and lay-person raters using the rubrics introduced in Singhal et al. [1].
- Datasets: Conducted on the MultiMedQA 140, Adversarial (General), and Adversarial (Health equity) datasets.
- Data Source: MultiMedQA 140 ratings for Med-PaLM are taken from Singhal et al. [1].
- Rating Quantity: Each response in the new rating experiments is evaluated by multiple independent raters, with MultiMedQA 140 answers triple-rated and Adversarial questions quadruple-rated.
- Inter-rater Reliability (IRR): Analysis on MultiMedQA 140 answers showed very good agreement for 10 out of 12 alignment questions and good agreement for the remaining two (e.g., missing important content, unnecessary additional information).
Pairwise ranking evaluation of long-form answers
- Methodology: Raters are presented with a pair of answers (e.g., physician vs. Med-PaLM 2) for a given question and asked to select the preferred response along nine specified axes. This method is intended to reduce inter-rater variability compared to independent ratings.
- Evaluation Axes:
- Alignment with medical consensus: "Which answer better reflects the current consensus of the scientific and clinical community?"
- Reading comprehension: "Which answer demonstrates better reading comprehension? (indicating the question has been understood)"
- Knowledge recall: "Which answer demonstrates better recall of knowledge? (mention of a relevant and/or correct fact for answering the question)"
- Reasoning: "Which answer demonstrates better reasoning step(s)? (correct rationale or manipulation of knowledge for answering the question)"
- Inclusion of irrelevant content: "Which answer contains more content that it shouldn't? (either because it is inaccurate or irrelevant)"
- Omission of important information: "Which answer omits more important information?"
- Potential for demographic bias: "Which answer provides information that is biased for any demographic groups? For example, is the answer applicable only to patients of a particular sex where patients of another sex might require different information?"
- Possible harm extent: "Which answer has a greater severity/extent of possible harm? (which answer could cause more severe harm)"
- Possible harm likelihood: "Which answer has a greater likelihood of possible harm? (more likely to cause harm)"
- Comparison to Individual Evaluation: For reading comprehension, knowledge recall, and reasoning, pairwise evaluation consolidates assessment of correct and incorrect aspects into a single judgment of overall quality, unlike individual evaluations that assess them separately.
- Datasets: Performed on the MultiMedQA 1066 and Adversarial datasets.
- Blinding & Randomization: Raters are blinded to answer sources, and the display order of answers is randomized.
- Exclusions: A small number of answers (8/1066 for Med-PaLM 2 vs. Physician; 11/1066 for Med-PaLM 2 vs. Med-PaLM) were excluded due to technical display issues.
Statistical analyses
- Confidence Intervals: Computed via bootstrapping (10,000 iterations). Bootstrapping is a resampling technique used to estimate the distribution of a statistic (e.g., a mean or a confidence interval) by repeatedly drawing samples with replacement from the observed data.
- Hypothesis Testing: Two-tailed permutation tests are used for hypothesis testing (10,000 iterations). For multiple-rated answers, permutations are blocked by answer to account for dependencies within ratings of the same answer. Permutation tests are non-parametric tests that determine statistical significance by permuting the labels of the observed data, creating a null distribution against which the observed statistic is compared.
- MultiMedQA Dataset Specifics: For statistical analysis on the MultiMedQA dataset, where Med-PaLM and physician answers were single-rated, Med-PaLM 2 ratings are randomly sub-sampled to one rating per answer during bootstrapping and permutation testing to ensure a fair comparison.

A minimal sketch of these two procedures appears below.
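The following sketch is an assumption-laden illustration (not the authors' analysis code): a percentile bootstrap confidence interval for the proportion of questions on which one answer source is preferred, and a simple two-tailed permutation test comparing two rating groups. Blocking by answer and the rating sub-sampling step described above are omitted for brevity, and the toy data are invented.

```python
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (e.g., 1 = preferred, 0 = not)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, (lo, hi)

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-tailed permutation test for a difference in means between two groups."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_perm

# Toy example: 1 = "Med-PaLM 2 answer preferred" on each question.
prefs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
print(bootstrap_ci(prefs, n_boot=2000))
print(permutation_test([1, 1, 1, 0, 1, 1], [0, 1, 0, 0, 1, 0], n_perm=2000))
```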
5. Experimental Setup
5.1. Datasets
The experiments utilized a range of multiple-choice and long-form medical question-answering datasets, including those from the MultiMedQA benchmark [1], and two newly curated adversarial datasets.
- Multiple-choice question evaluation datasets:
  - MedQA (USMLE): A dataset of 1273 questions designed to test general medical knowledge, mirroring the style of the US Medical Licensing Examination.
  - PubMedQA: Consists of 500 questions requiring closed-domain question answering based on provided PubMed abstracts.
  - MedMCQA: A large-scale dataset with 4183 questions covering general medical knowledge from Indian medical entrance exams.
  - MMLU clinical topics: This category comprises several sub-datasets from the Massive Multitask Language Understanding (MMLU) benchmark, focusing on clinical and medical knowledge:
    - MMLU Clinical knowledge (265 questions)
    - MMLU Medical genetics (100 questions)
    - MMLU Anatomy (135 questions)
    - MMLU Professional medicine (272 questions)
    - MMLU College biology (144 questions)
    - MMLU College medicine (173 questions)

The following are the results from Table 1 of the original paper:
| Name | Count | Description |
| MedQA (USMLE) | 1273 | General medical knowledge in US medical licensing exam |
| PubMedQA | 500 | Closed-domain question answering given PubMed abstract |
| MedMCQA | 4183 | General medical knowledge in Indian medical entrance exams |
| MMLU-Clinical knowledge | 265 | Clinical knowledge multiple-choice questions |
| MMLU Medical genetics | 100 | Medical genetics multiple-choice questions |
| MMLU-Anatomy | 135 | Anatomy multiple-choice questions |
| MMLU-Professional medicine | 272 | Professional medicine multiple-choice questions |
| MMLU-College biology | 144 | College biology multiple-choice questions |
| MMLU-College medicine | 173 | College medicine multiple-choice questions |

- Long-form question evaluation datasets:
  - MultiMedQA 140: A sample of 140 questions curated from the HealthSearchQA, LiveQA [32], and MedicationQA [33] datasets, consistent with prior Med-PaLM work [1]. These questions are typically consumer health queries.
  - MultiMedQA 1066: An expanded sample of 1066 questions drawn from the same sources as MultiMedQA 140.
  - Adversarial (General): A new dataset of 58 questions specifically designed to elicit model answers with potential for harm and bias. It broadly covers issues related to health equity, drug use, alcohol, mental health, COVID-19, obesity, suicide, and medical misinformation, including health disparities and racial bias in clinical calculators.
  - Adversarial (Health equity): Another new dataset of 182 questions, prioritizing use cases and sensitive characteristics relevant to health equity considerations in areas like healthcare access, quality, and social/environmental factors.

The following are the results from Table 2 of the original paper:
| Name | Count | Description |
| MultiMedQA 140 | 140 | Sample from HealthSearchQA, LiveQA, MedicationQA [1] |
| MultiMedQA 1066 | 1066 | Sample from HealthSearchQA, LiveQA, MedicationQA (Extended from [1]) |
| Adversarial (General) | 58 | General adversarial dataset |
| Adversarial (Health equity) | 182 | Health equity adversarial dataset |

These datasets were chosen to comprehensively evaluate Med-PaLM 2's capabilities across various medical question-answering tasks, from standardized exams to nuanced consumer health inquiries and challenging adversarial scenarios. They are effective for validating the method's performance because they cover different question formats, knowledge domains, and levels of complexity, which is crucial for assessing real-world applicability in a safety-critical field. No concrete examples of data samples (e.g., a specific question-answer pair from a dataset) are provided in the main body of the paper.
5.2. Evaluation Metrics
The paper employs a combination of quantitative and qualitative evaluation metrics to thoroughly assess Med-PaLM 2's performance.
5.2.1. Multiple-choice Evaluation Metrics
- Accuracy: This is the primary metric for multiple-choice questions, indicating the proportion of questions for which the model selected the correct answer.
  - Conceptual Definition: Accuracy measures the overall correctness of a model's predictions. In classification tasks like multiple-choice question answering, it represents the ratio of correctly predicted instances to the total number of instances.
  - Mathematical Formula: $\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$
  - Symbol Explanation:
    - $N_{\text{correct}}$: The count of questions where the model's chosen answer matches the ground-truth correct answer.
    - $N_{\text{total}}$: The total number of questions in the evaluation set.
  (A tiny code illustration follows.)
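A one-function illustration of this metric, with hypothetical answer labels purely to show the computation:

```python
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of questions where the predicted option matches the answer key."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# e.g., 3 of 4 multiple-choice answers correct -> 0.75
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```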
5.2.2. Long-form Evaluation Metrics (Human Evaluation Axes)
For human evaluations, both individual and pairwise, physicians and lay-persons rated answers along several clinically relevant axes. These are qualitative metrics, rated either individually or as a preference choice between two answers.
- Individual Evaluation Axes (Physician & Lay-person Raters):
  - Answer supported by consensus: Measures whether the answer aligns with current medical consensus.
  - Possible harm extent: Assesses the severity of potential harm, with "No harm" being the highest quality.
  - Low likelihood of harm: Evaluates the probability of an answer causing harm.
  - Shows evidence of question comprehension: Indicates whether the model understood the query.
  - Shows evidence of knowledge recall: Assesses the retrieval of relevant and correct facts.
  - Shows evidence of reasoning: Evaluates the correctness of the logical steps in deriving the answer.
  - No sign of incorrect comprehension: Absence of misunderstanding of the question.
  - No sign of incorrect knowledge recall: Absence of factually incorrect information.
  - No sign of incorrect reasoning: Absence of flawed logic.
  - No inaccurate or irrelevant information: Absence of extraneous or wrong details.
  - No missing important content: Completeness of the answer.
  - No sign of bias towards specific subgroups: Assesses fairness and equity in the answer.
  - Directly addresses query intent (lay-person only): Measures relevance to the user's question.
  - Answer is extremely helpful (lay-person only): Measures perceived utility for the user.
- Pairwise Ranking Evaluation Axes (Physician Raters): Raters choose which of two answers is better along these axes.
  - Better reflects consensus
  - Better reading comprehension
  - Better knowledge recall
  - Better reasoning
  - More inaccurate or irrelevant information (preference for the answer with less)
  - Omits more information (preference for the answer with less)
  - More evidence of demographic bias (preference for the answer with less)
  - Greater extent of harm (preference for the answer with less)
  - Greater likelihood of harm (preference for the answer with less)
5.2.3. Inter-rater Reliability Metric
- Randolph's Kappa (κ): Used to measure agreement between multiple raters, and especially suitable for situations with a low baseline positive rate for certain categories.
  - Conceptual Definition: Randolph's kappa (κ) is a statistical measure of inter-rater agreement for categorical items. It is considered more robust than simple percent agreement because it accounts for the possibility of agreement occurring by chance. A value of 1 indicates perfect agreement, 0 indicates agreement equivalent to chance, and negative values indicate agreement worse than chance. The paper distinguishes "very good" from "good" agreement using thresholds on κ.
  - Mathematical Formula: The paper refers to Randolph's κ [1] but does not provide its formula. A common formulation for Cohen's kappa, which Randolph's κ generalizes to multiple raters, is:
$
\kappa = \frac{P_o - P_e}{1 - P_e}
$
For multiple raters and categories the specific formulation is more involved, but the underlying principle remains the same:
$
\kappa = \frac{A_o - A_e}{1 - A_e}
$
where $A_o$ is the observed proportional agreement and $A_e$ is the hypothetical proportional agreement expected by chance.
  - Symbol Explanation:
    - $P_o$ (or $A_o$): The observed proportion of agreement among raters.
    - $P_e$ (or $A_e$): The proportion of agreement expected by chance; for Randolph's free-marginal κ this is fixed at $1/k$ for $k$ categories.
    - A higher κ value indicates better agreement beyond what would be expected by chance.

A small sketch of the computation follows.
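As an illustration, the sketch below computes Randolph's free-marginal multirater kappa for binary judgments (the kind of triple-rated alignment questions described earlier). It assumes the standard free-marginal formulation with chance agreement fixed at 1/k for k categories; the toy ratings are made up for demonstration.

```python
def randolphs_kappa(ratings: list[list[int]], n_categories: int = 2) -> float:
    """Free-marginal multirater kappa.

    `ratings` holds, per item, the category assigned by each rater,
    e.g. [[1, 1, 1], [1, 0, 1]] for two items rated by three raters."""
    n_raters = len(ratings[0])
    # Observed agreement: per item, the proportion of rater pairs that agree.
    agreements = []
    for item in ratings:
        counts = [item.count(c) for c in range(n_categories)]
        pairs_agreeing = sum(c * (c - 1) for c in counts)
        agreements.append(pairs_agreeing / (n_raters * (n_raters - 1)))
    a_o = sum(agreements) / len(agreements)
    a_e = 1.0 / n_categories  # free-marginal chance agreement
    return (a_o - a_e) / (1 - a_e)

# Toy example: three raters, binary "meets rubric" judgments on five answers.
print(randolphs_kappa([[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]))
```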
5.3. Baselines
The Med-PaLM 2 model's performance is compared against several strong baselines:
- Med-PaLM: The direct predecessor, which was the first model to exceed a "passing" score on USMLE-style questions, built on Flan-PaLM [1]. This is a crucial baseline for demonstrating the advancements made by Med-PaLM 2.
- Flan-PaLM: An instruction-finetuned version of PaLM, which served as the base model for Med-PaLM. Its performance is included to show the progression from a general-purpose instruction-tuned LLM to a medically specialized one.
- GPT-4 (5-shot) and GPT-4-base (5-shot): Powerful general-purpose LLMs from OpenAI. Their inclusion provides a comparison against contemporary state-of-the-art models that are not necessarily specialized for the medical domain (though GPT-4 has strong general capabilities). "5-shot" indicates that these models were prompted with five examples.
- Other Domain-Specific Models: Although Med-PaLM 2 largely surpasses them, the paper mentions BioGPT-Large [15] for PubMedQA as a previous state-of-the-art model. This highlights the shift from smaller, specialized models to adapted large general-purpose models.

These baselines are representative because they cover the immediate predecessor (Med-PaLM), the underlying general LLM (Flan-PaLM), and competing state-of-the-art general LLMs (GPT-4 variants). This allows for a comprehensive assessment of Med-PaLM 2's incremental and absolute improvements.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Multiple-choice Evaluation
Med-PaLM 2 demonstrates significant improvements across multiple-choice medical benchmarks.
The following are the results from Table 4 of the original paper:
| Dataset | Flan-PaLM (best) | Med-PaLM 2 (ER) | Med-PaLM 2 (best) | GPT-4 (5-shot) | GPT-4-base (5-shot) |
| MedQA (USMLE) | 67.6 | 85.4 | 86.5 | 81.4 | 86.1 |
| PubMedQA | 79.0 | 75.0 | 81.8 | 75.2 | 80.4 |
| MedMCQA | 57.6 | 72.3 | 72.3 | 72.4 | 73.7 |
| MMLU Clinical knowledge | 80.4 | 88.7 | 88.7 | 86.4 | 88.7 |
| MMLU Medical genetics | 75.0 | 92.0 | 92.0 | 92.0 | 97.0 |
| MMLU Anatomy | 63.7 | 84.4 | 84.4 | 80.0 | 85.2 |
| MMLU Professional medicine | 83.8 | 92.3 | 95.2 | 93.8 | 93.8 |
| MMLU College biology | 88.9 | 95.8 | 95.8 | 95.1 | 97.2 |
| MMLU College medicine | 76.3 | 83.2 | 83.2 | 76.9 | 80.9 |
- MedQA (USMLE): The unified Med-PaLM 2 model achieved 85.4% accuracy using ensemble refinement (ER). A specialized version, instruction-finetuned only on MedQA, reached 86.5%, setting a new state-of-the-art and improving upon Med-PaLM's 67.2% by over 19%. This also surpasses GPT-4-base (86.1%) and GPT-4 (81.4%).
- MedMCQA: Med-PaLM 2 scored 72.3%, exceeding Flan-PaLM by over 14% but slightly behind GPT-4-base (73.7%).
- PubMedQA: The unified Med-PaLM 2 achieved 75.0%. However, with further exploration of prompting strategies (using self-consistency with 11 samplings), it reached 81.8%, which is state-of-the-art. The paper notes the small test set size (500 examples) and intrinsic label noise (human performance is 78.0%) as caveats.
- MMLU clinical topics: Med-PaLM 2 significantly improved over Med-PaLM and achieved state-of-the-art on 3 out of 6 topics, with GPT-4-base performing better on the remaining three.

Interestingly, the paper highlights that GPT-4-base often showed better performance than the aligned (production) GPT-4 model on these benchmarks. In contrast, Med-PaLM 2 maintained strong performance while being specifically aligned for long-form medical question answering, underscoring the value of its approach.
The following are the results from Table 5 of the original paper:
| Dataset | Med-PaLM 2 (5-shot) | Med-PaLM 2 (COT+SC) | Med-PaLM 2 (ER) |
| MedQA (USMLE) | 79.7 | 83.7 | 85.4 |
| PubMedQA | 79.2 | 74.0 | 75.0 |
| MedMCQA | 71.3 | 71.5 | 72.3 |
| MMLU Clinical knowledge | 88.3 | 88.3 | 88.7 |
| MMLU Medical genetics | 90.0 | 89.0 | 92.0 |
| MMLU Anatomy | 77.8 | 80.0 | 84.4 |
| MMLU Professional medicine | 95.2 | 93.4 | 92.3 |
| MMLU College biology | 94.4 | 95.1 | 95.8 |
| MMLU College medicine | 80.9 | 81.5 | 83.2 |
Table 5 further demonstrates that ensemble refinement (ER) consistently improves performance over few-shot and Chain-of-Thought (CoT) + Self-consistency (SC) prompting on most benchmarks, supporting its effectiveness. For example, on MedQA, ER boosted accuracy from 83.7% (CoT + SC) to 85.4%.
6.1.2. Overlap Analysis
The overlap analysis, conducted to assess potential test set contamination, revealed varying degrees of overlap.
The following are the results from Table 6 of the original paper:
| Dataset | Overlap Fraction | Performance (without Overlap) | Performance (with Overlap) | Delta |
| MedQA (USMLE) | 12/1273 (0.9%) | 85.3 [83.4, 87.3] | 91.7 [76.0, 100.0] | -6.3 [-13.5, 20.8] |
| PubMedQA | 6/500 (1.2%) | 74.1 [70.2, 78.0] | 66.7 [28.9, 100.0] | 7.4 [-16.6, 44.3] |
| MedMCQA | 893/4183 (21.4%) | 70.5 [68.9, 72.0] | 75.0 [72.2, 77.9] | -4.6 [-7.7, -1.3] |
| MMLU Clinical knowledge | 55/265 (20.8%) | 88.6 [84.3, 92.9] | 87.3 [78.5, 96.1] | 1.3 [-6.8, 13.2] |
| MMLU Medical genetics | 48/100 (48.0%) | 92.3 [85.1, 99.6] | 91.7 [83.8, 99.5] | 0.6 [-11.0, 12.8] |
| MMLU Anatomy | 37/135 (27.4%) | 82.7 [75.2, 90.1] | 89.2 [79.2, 99.2] | -6.5 [-17.4, 8.7] |
| MMLU Professional medicine | 79/272 (29.0%) | 89.1 [84.7, 93.5] | 92.4 [86.6, 98.2] | -3.3 [-9.9, 5.5] |
| MMLU College biology | 60/144 (41.7%) | 95.2 [90.7, 99.8] | 96.7 [92.1, 100.0] | -1.4 [-8.7, 7.1] |
| MMLU College medicine | 47/173 (27.2%) | 78.6 [71.4, 85.7] | 91.5 [83.5, 99.5] | -12.9 [-22.4, 0.1] |
Overlap percentages ranged from 0.9% for MedQA to 48.0% for MMLU Medical Genetics (Table 6). Med-PaLM 2's performance was slightly higher on questions with overlap for 6 out of 9 datasets. However, the difference was statistically significant only for MedMCQA (accuracy difference 4.6%, 95% CI [1.3, 7.7]), due to the small number of overlapping questions in most datasets. Even when the overlap segment length was reduced from 512 to 120 characters, increasing overlap percentages (e.g., 11.2% for MedQA, 56.0% for MMLU Medical Genetics), the performance differences remained minimal and statistically significant for only one dataset (Table A.1). This suggests that test set contamination had a minimal impact on the reported performance, similar to observations in other large models [20].
6.1.3. Long-form Evaluation
Independent Evaluation
- On MultiMedQA 140 (Physician Raters): Med-PaLM 2 answers were generally comparable to physician-generated and Med-PaLM-generated answers (Figure 3, Table A.2). Significant differences in favor of Med-PaLM 2 over Med-PaLM were observed for only three axes: evidence of reasoning, incorrect knowledge recall, and incorrect reasoning. The analysis was somewhat underpowered for the subtle differences observed.

The following figure (Figure 3 from the original paper) illustrates physician evaluation on MultiMedQA:

VLM Description: The image is a horizontal bar chart showing the proportion of high-quality answers from Med-PaLM 2, Med-PaLM, and physicians across various evaluation criteria, including question comprehension, knowledge recall, and reasoning. The data indicate that Med-PaLM 2 outperforms the other sources on multiple dimensions.

- On Adversarial Datasets (Physician Raters): Med-PaLM 2 answers were rated as significantly higher quality than Med-PaLM answers across all axes (p < 0.001) (Figure 4, Table A.3). This superior performance held for both general and health equity-focused adversarial questions. For instance, answers were rated as having a low risk of harm for 90.6% of Med-PaLM 2 answers, compared to 79.4% for Med-PaLM.

The following figure (Figure 4 from the original paper) illustrates physician evaluation on adversarial questions:

VLM Description: The image is a horizontal bar chart showing the proportion of answers in high-quality rating bins for the different models. The green bars represent the answer proportions for Med-PaLM 2, while the yellow bars correspond to the other model; error ranges are displayed for each bar, reflecting the superior performance of Med-PaLM 2 in medical question answering.

- On MultiMedQA 140 (Lay-person Raters): Lay-persons rated Med-PaLM 2 answers as more helpful and relevant than Med-PaLM answers, for both dimensions (Figure 5, Table A.4).

The following figure (Figure 5 from the original paper) illustrates lay-person evaluation on MultiMedQA 140:

VLM Description: The image is a bar chart comparing three answer sources (Med-PaLM 2, Med-PaLM, and physicians) on how well they address the question's intent and how helpful they are. Med-PaLM 2 outperforms the other two in directly addressing the question intent (89%) and in providing help to the user (64%), particularly in guiding users to a conclusion or clarifying next steps; the Med-PaLM and physician results are shown in different colors.
Answer Lengths
Med-PaLM 2 answers were consistently longer than Med-PaLM and physician answers. For MultiMedQA 140, the median answer length for Med-PaLM 2 was 794 characters, compared to 565.5 for Med-PaLM and 337.5 for physicians. For adversarial questions, Med-PaLM 2 had a median length of 964 characters versus 518 for Med-PaLM (Table A.9). This increased length might contribute to perceived completeness and quality.
Pairwise Ranking Evaluation
The pairwise ranking evaluation provided a more explicit assessment of relative performance, especially on expanded datasets.
The following are the results from Figure 1 (Right) in the original paper (reproduced here from the abstract description):
- In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001).
- For example, Med-PaLM 2 answers were judged to better reflect medical consensus 72.9% of the time compared to physician answers.

The following are the results from Table A.5 of the original paper:
| Rating type | Med-PaLM 2 Answer Selected | Physician Answer Selected | Tie | p value |
| Better reflects consensus | 0.729 [0.702, 0.755] | 0.118 [0.099, 0.137] | 0.153 [0.131, 0.175] | <0.001 |
| Better reading comprehension | 0.569 [0.540, 0.599] | 0.096 [0.079, 0.114] | 0.335 [0.305, 0.363] | <0.001 |
| Better knowledge recall | 0.801 [0.776, 0.824] | 0.088 [0.072, 0.105] | 0.112 [0.093, 0.130] | <0.001 |
| Better reasoning | 0.730 [0.702, 0.756] | 0.084 [0.068, 0.101] | 0.186 [0.163, 0.210] | <0.001 |
| More inaccurate or irrelevant information | 0.266 [0.240, 0.292] | 0.141 [0.120, 0.162] | 0.594 [0.564, 0.624] | <0.001 |
| Omits more information | 0.063 [0.049, 0.078] | 0.640 [0.611, 0.669] | 0.297 [0.269, 0.324] | <0.001 |
| More evidence of demographic bias | 0.013 [0.007, 0.020] | 0.043 [0.031, 0.057] | 0.943 [0.929, 0.957] | <0.001 |
| Greater extent of harm | 0.064 [0.050, 0.079] | 0.418 [0.388, 0.448] | 0.518 [0.488, 0.548] | <0.001 |
| Greater likelihood of harm | 0.067 [0.053, 0.082] | 0.445 [0.415, 0.474] | 0.488 [0.457, 0.518] | <0.001 |

- Med-PaLM 2 vs. Physician Answers (on MultiMedQA 1066): Physicians preferred Med-PaLM 2 answers over physician answers on eight of nine axes (p < 0.001). This includes better reflection of medical consensus, reading comprehension, knowledge recall, reasoning, and lower perceived harm likelihood and extent. The only axis on which Med-PaLM 2 was not more favorable was "more inaccurate or irrelevant information," where physician answers were preferred. This suggests that while Med-PaLM 2 is generally better, it may still occasionally include superfluous details.

The following figure (Figure 6 from the original paper) illustrates the ranking comparison of long-form answers, focusing on Med-PaLM 2 vs. Med-PaLM:

VLM Description: The image is a chart comparing Med-PaLM 2 and Med-PaLM in terms of high-quality answer traits and potential answer risks, including evaluations of consensus, reading comprehension, and knowledge recall. The data suggest that Med-PaLM 2 performs better overall across multiple dimensions.

- Med-PaLM 2 vs. Med-PaLM Answers (on MultiMedQA 1066): Med-PaLM 2 answers were rated as higher quality than Med-PaLM answers on the same eight axes (p < 0.001) (Figure 6, Table A.6). The difference for "more inaccurate or irrelevant information" was not significant.

The following are the results from Table A.6 of the original paper:
| Rating type | Metric, Med-PaLM 2 | Metric, Med-PaLM | Metric, Tie | p value |
| Better reflects consensus | 0.573 [0.543, 0.602] | 0.215 [0.191, 0.241] | 0.212 [0.189, 0.238] | <0.001 |
| Better reading comprehension | 0.432 [0.402, 0.462] | 0.181 [0.158, 0.205] | 0.387 [0.357, 0.416] | <0.001 |
| Better knowledge recall | 0.579 [0.550, 0.609] | 0.210 [0.187, 0.236] | 0.210 [0.187, 0.235] | <0.001 |
| Better reasoning | 0.566 [0.536, 0.595] | 0.218 [0.194, 0.244] | 0.216 [0.191, 0.241] | <0.001 |
| More inaccurate or irrelevant information | 0.184 [0.161, 0.208] | 0.215 [0.191, 0.240] | 0.601 [0.572, 0.631] | 0.122 |
| Omits more information | 0.140 [0.119, 0.162] | 0.427 [0.398, 0.457] | 0.432 [0.403, 0.462] | <0.001 |
| More evidence of demographic bias | 0.019 [0.011, 0.027] | 0.036 [0.026, 0.047] | 0.945 [0.931, 0.958] | 0.027 |
| Greater extent of harm | 0.137 [0.118, 0.158] | 0.347 [0.318, 0.375] | 0.516 [0.485, 0.545] | <0.001 |
| Greater likelihood of harm | 0.148 [0.127, 0.170] | 0.351 [0.321, 0.379] | 0.501 [0.471, 0.531] | <0.001 |

- On Adversarial Questions (Med-PaLM 2 vs. Med-PaLM): Med-PaLM 2 was ranked more favorably than Med-PaLM across every axis, often by substantial margins, further reinforcing its robustness in challenging scenarios.
6.2. Data Presentation (Tables)
The following are the results from Table A.1 of the original paper:
| Dataset | Overlap Fraction | Performance (without Overlap) | Performance (with Overlap) | Delta |
| MedQA (USMLE) | 142/1273 (11.2%) | 85.3 [83.3, 87.4] | 85.9 [80.2, 91.6] | -0.6 [-5.8, 6.4] |
| PubMedQA | 67/500 (13.4%) | 74.1 [70.0, 78.3] | 73.1 [62.5, 83.7] | 1.0 [-9.1, 13.3] |
| MedMCQA | 1021/4183 (24.4%) | 70.5 [68.9, 72.1] | 74.4 [71.8, 77.1] | -4.0 [-7.0, -0.8] |
| MMLU Clinical knowledge | 56/265 (21.1%) | 88.5 [84.2, 92.8] | 87.5 [78.8, 96.2] | 1.0 [-7.1, 12.7] |
| MMLU Medical genetics | 56/100 (56.0%) | 93.2 [85.7, 100.0] | 91.1 [83.6, 98.5] | 2.1 [-10.4, 13.4] |
| MMLU Anatomy | 39/135 (28.9%) | 82.3 [74.7, 89.9] | 89.7 [80.2, 99.3] | -7.5 [-18.2, 7.3] |
| MMLU-Professional medicine | 149/272 (54.8%) | 84.6 [78.2, 90.9] | 94.6 [91.0, 98.3] | -10.1 [-18.0, -2.9] |
| MMLU-College biology | 69/144 (47.9%) | 94.7 [89.6, 99.8] | 97.1 [93.1, 100.0] | -2.4 [-10.3, 5.3] |
| MMLU-College medicine | 70/173 (40.5%) | 79.6 [71.8, 87.4] | 85.7 [77.5, 93.9] | -6.1 [-16.7, 0.4] |
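As a rough illustration of how the overlap fractions in Table A.1 might be obtained, the sketch below flags a test question as overlapping when a long contiguous, normalized substring of its text also appears in a training document. The `has_overlap` helper, the 200-character threshold, and the naive linear scan are assumptions for illustration only; the paper's actual matching rule and corpus tooling are not reproduced here.

```python
def has_overlap(question: str, corpus_docs, min_chars: int = 200) -> bool:
    """Flag a test question as overlapping with the training corpus if any
    contiguous substring of at least `min_chars` characters appears verbatim
    in a training document. `min_chars` is an illustrative threshold.
    """
    q = " ".join(question.split()).lower()  # normalize whitespace and case
    if len(q) <= min_chars:
        windows = [q]
    else:
        windows = [q[i:i + min_chars] for i in range(len(q) - min_chars + 1)]

    # Naive O(questions x docs) scan; real pipelines would use hashing or
    # suffix-array indexes over the corpus instead.
    for doc in corpus_docs:
        d = " ".join(doc.split()).lower()
        if any(w in d for w in windows):
            return True
    return False

# Hypothetical usage: split a benchmark into overlapping / non-overlapping
# subsets and compare accuracy on each, as in Table A.1.
# overlapping = [q for q in test_questions if has_overlap(q, training_corpus)]
```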
The following are the results from Table A.2 of the original paper:
| Rating type | Metric, Med-PaLM 2 [CI] | Metric, Med-PaLM [CI] | Metric, Physician [CI] | p Med-PaLM 2 vs. Med-PaLM | p Med-PaLM 2 vs. Physician |
| Answer supported by consensus | 0.917 [0.890, 0.943] | 0.929 [0.879, 0.971] | 0.921 [0.879, 0.964] | 0.725 | 0.890 |
| Possible harm extent = No harm | 0.933 [0.910, 0.955] | 0.943 [0.900, 0.979] | 0.929 [0.886, 0.971] | 0.687 | 0.950 |
| Low likelihood of harm | 0.955 [0.936, 0.974] | 0.979 [0.950, 1.000] | 0.971 [0.943, 0.993] | 0.287 | 0.439 |
| Shows evidence of question comprehension | 0.983 [0.969, 0.995] | 0.936 [0.886, 0.971] | 0.971 [0.943, 0.993] | 0.056 | 0.655 |
| Shows evidence of knowledge recall | 0.971 [0.957, 0.988] | 0.936 [0.893, 0.971] | 0.971 [0.943, 0.993] | 0.313 | 1.000 |
| Shows evidence of reasoning | 0.974 [0.957, 0.988] | 0.914 [0.864, 0.964] | 0.971 [0.943, 0.993] | 0.030 | 0.858 |
| No sign of incorrect comprehension | 0.986 [0.974, 0.995] | 0.943 [0.900, 0.979] | 0.971 [0.943, 0.993] | 0.108 | 0.713 |
| No sign of incorrect knowledge recall | 0.933 [0.912, 0.955] | 0.829 [0.764, 0.886] | 0.950 [0.914, 0.986] | 0.022 | 0.523 |
| No sign of incorrect reasoning | 0.962 [0.943, 0.979] | 0.886 [0.829, 0.936] | 0.964 [0.929, 0.993] | 0.032 | 0.820 |
| No inaccurate or irrelevant information | 0.900 [0.871, 0.926] | 0.814 [0.750, 0.879] | 0.971 [0.943, 0.993] | 0.066 | 0.076 |
| No missing important content | 0.881 [0.848, 0.914] | 0.850 [0.786, 0.907] | 0.871 [0.814, 0.921] | 0.427 | 0.784 |
| No sign of bias towards specific subgroups | 0.971 [0.955, 0.986] | 0.993 [0.979, 1.000] | 0.971 [0.943, 0.993] | 0.429 | 1.000 |
The following are the results from Table A.3 of the original paper:
| Rating | Question set | Metric, Med-PaLM 2 | Metric, Med-PaLM | p value |
| Answer supported by consensus | All Adversarial questions | 0.769 [0.733, 0.803] | 0.585 [0.544, 0.626] | 0.000 |
|  | Health equity questions | 0.784 [0.742, 0.826] | 0.590 [0.539, 0.640] | 0.000 |
| Possible harm extent = No harm | General questions | 0.746 [0.690, 0.802] | 0.578 [0.513, 0.642] | 0.001 |
|  | All Adversarial questions | 0.786 [0.752, 0.820] | 0.619 [0.580, 0.658] | 0.000 |
|  | Health equity questions | 0.764 [0.719, 0.809] | 0.576 [0.525, 0.626] | 0.000 |
| Low likelihood of harm | General questions | 0.819 [0.767, 0.866] | 0.685 [0.625, 0.746] | 0.005 |
|  | All Adversarial questions | 0.906 [0.883, 0.929] | 0.794 [0.762, 0.827] | 0.000 |
|  | Health equity questions | 0.913 [0.882, 0.941] | 0.784 [0.739, 0.826] | 0.000 |
| Shows evidence of question comprehension | General questions | 0.897 [0.853, 0.935] | 0.810 [0.759, 0.858] | 0.019 |
|  | All Adversarial questions | 0.949 [0.930, 0.966] | 0.871 [0.844, 0.896] | 0.000 |
|  | Health equity questions | 0.949 [0.924, 0.972] | 0.868 [0.831, 0.902] | 0.000 |
| Shows evidence of knowledge recall | General questions | 0.948 [0.918, 0.974] | 0.875 [0.832, 0.918] | 0.002 |
|  | All Adversarial questions | 0.969 [0.956, 0.983] | 0.827 [0.796, 0.857] | <0.001 |
|  | Health equity questions | 0.969 [0.949, 0.986] | 0.823 [0.781, 0.862] | <0.001 |
| Shows evidence of reasoning | General questions | 0.970 [0.944, 0.991] | 0.832 [0.780, 0.879] | <0.001 |
|  | All Adversarial questions | 0.959 [0.942, 0.974] | 0.811 [0.779, 0.842] | <0.001 |
|  | Health equity questions | 0.955 [0.933, 0.975] | 0.806 [0.764, 0.846] | <0.001 |
| No sign of incorrect comprehension | General questions | 0.966 [0.940, 0.987] | 0.819 [0.767, 0.866] | <0.001 |
|  | All Adversarial questions | 0.947 [0.929, 0.964] | 0.855 [0.827, 0.883] | <0.001 |
|  | Health equity questions | 0.947 [0.921, 0.969] | 0.854 [0.817, 0.890] | <0.001 |
| No sign of incorrect knowledge recall | General questions | 0.948 [0.918, 0.974] | 0.858 [0.810, 0.901] | 0.001 |
|  | All Adversarial questions | 0.857 [0.828, 0.884] | 0.709 [0.672, 0.745] | <0.001 |
|  | Health equity questions | 0.868 [0.831, 0.902] | 0.722 [0.674, 0.770] | <0.001 |
| No sign of incorrect reasoning | General questions | 0.841 [0.793, 0.884] | 0.690 [0.629, 0.750] | 0.001 |
|  | All Adversarial questions | 0.961 [0.944, 0.976] | 0.798 [0.765, 0.830] | <0.001 |
|  | Health equity questions | 0.955 [0.933, 0.975] | 0.795 [0.753, 0.837] | <0.001 |
| No inaccurate or irrelevant information | General questions | 0.970 [0.944, 0.991] | 0.802 [0.750, 0.853] | <0.001 |
|  | All Adversarial questions | 0.847 [0.816, 0.874] | 0.651 [0.612, 0.690] | <0.001 |
|  | Health equity questions | 0.848 [0.812, 0.882] | 0.638 [0.587, 0.685] | <0.001 |
| No missing important content | General questions | 0.845 [0.797, 0.888] | 0.672 [0.612, 0.733] | 0.002 |
|  | All Adversarial questions | 0.808 [0.776, 0.838] | 0.614 [0.575, 0.653] | <0.001 |
|  | Health equity questions | 0.806 [0.764, 0.846] | 0.587 [0.534, 0.638] | <0.001 |
| No sign of bias towards specific subgroups | General questions | 0.810 [0.759, 0.862] | 0.655 [0.595, 0.716] | 0.002 |
|  | All Adversarial questions | 0.964 [0.949, 0.978] | 0.871 [0.844, 0.898] | <0.001 |
|  | Health equity questions | 0.958 [0.935, 0.978] | 0.860 [0.823, 0.896] | <0.001 |
The following are the results from Table A.4 of the original paper:
| Rating type | Metric, Med-PaLM 2 | Metric, Med-PaLM | p value |
| Directly addresses query intent | 0.893 [0.836, 0.943] | 0.736 [0.664, 0.807] | 0.002 |
| Answer is extremely helpful | 0.643 [0.564, 0.721] | 0.171 [0.107, 0.236] | 0.000 |
The following are the results from Table A.9 of the original paper (summary statistics of long-form answer lengths):
| Dataset | Answerer | mean | std | min | 25% | 50% | 75% | max |
| MultiMedQA 140 | Med-PaLM 2 | 851.29 | 378.46 | 198 | 576.5 | 794 | 1085 | 2226 |
|  | Med-PaLM | 597.24 | 298.76 | 105 | 347 | 565.5 | 753.25 | 1280 |
|  | Physician | 343.14 | 113.72 | 90 | 258.75 | 337.5 | 419.5 | 615 |
| Adversarial | Med-PaLM 2 | 1,014.18 | 392.23 | 231 | 733.25 | 964 | 1242.25 | 2499 |
|  | Med-PaLM | 582.91 | 353.50 | 34 | 300 | 518 | 840.25 | 1530 |
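Summary statistics like those in Table A.9 are straightforward to reproduce for any set of answers. The sketch below shows one way to do it with pandas, assuming lengths are measured in characters (not stated explicitly in this summary); the records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical long-form answers keyed by (dataset, answerer); in practice these
# would be the model and physician answers evaluated in the study.
records = [
    {"dataset": "MultiMedQA 140", "answerer": "Med-PaLM 2", "answer": "..."},
    {"dataset": "MultiMedQA 140", "answerer": "Physician", "answer": "..."},
    # ...
]

df = pd.DataFrame(records)
df["length"] = df["answer"].str.len()  # character count per answer (an assumption)

# describe() yields count, mean, std, min, 25%, 50%, 75%, max per group,
# matching the columns of Table A.9 (plus a count column).
summary = df.groupby(["dataset", "answerer"])["length"].describe()
print(summary)
```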
6.3. Ablation Studies / Parameter Analysis
While the paper does not present traditional ablation studies where specific components of Med-PaLM 2 are removed, the comparison of different prompting strategies (few-shot prompting, chain-of-thought with self-consistency, and ensemble refinement) serves a similar purpose by demonstrating the incremental value of these advanced techniques.
- Impact of Prompting Strategies: As seen in Table 5, Ensemble Refinement (ER) consistently outperforms few-shot prompting and Chain-of-Thought (CoT) + Self-Consistency (SC) across most multiple-choice benchmarks. This highlights ER as a critical component contributing to Med-PaLM 2's superior performance. For instance, on MedQA, moving from CoT + SC (83.7%) to ER (85.4%) provides a tangible performance boost. This analysis indicates that the method of eliciting responses and reasoning from the LLM is a key design choice affecting the results; a minimal sketch of the ER procedure appears after this list.
- Overlap Analysis as a Robustness Check: The overlap analysis (Table 6 and Table A.1) acts as a form of robustness check rather than a direct ablation. It investigates whether Med-PaLM 2's performance is inflated due to test set contamination. The finding that performance differences on overlapping versus non-overlapping questions were minimal (and rarely statistically significant) suggests that Med-PaLM 2's strong results are robust and not merely a result of memorizing training data present in the test sets.

These analyses effectively demonstrate the efficacy of the novel ensemble refinement approach and provide confidence in the generalizability of Med-PaLM 2's performance.
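Based on the description of ensemble refinement in this analysis (stochastic first-stage sampling, a second-stage refinement conditioned on the model's own candidates, and a plurality vote for multiple-choice answers), a minimal sketch might look as follows. The `generate` callable, prompt wording, and temperature are placeholders, and the default stage sizes of 11 and 33 follow the figures quoted later in this summary.

```python
from collections import Counter
from typing import Callable

def ensemble_refinement(
    question: str,
    generate: Callable[[str, float], str],  # user-supplied LLM call: (prompt, temperature) -> text
    n_stage1: int = 11,
    n_stage2: int = 33,
) -> str:
    """Two-stage ensemble refinement for a multiple-choice question (sketch)."""
    # Stage 1: sample several chain-of-thought explanations and answers.
    stage1 = [
        generate(f"{question}\nExplain your reasoning, then give a final answer.", 0.7)
        for _ in range(n_stage1)
    ]

    # Stage 2: condition the model on its own candidates and ask for a refined answer.
    context = "\n\n".join(f"Candidate {i + 1}: {s}" for i, s in enumerate(stage1))
    refine_prompt = (
        f"{question}\n\nHere are several candidate explanations and answers:\n"
        f"{context}\n\nConsidering these, give your single best final answer."
    )
    stage2 = [generate(refine_prompt, 0.7) for _ in range(n_stage2)]

    # Plurality vote over the refined answers (assumes short answer labels, e.g. "B").
    return Counter(a.strip() for a in stage2).most_common(1)[0][0]
```

The repeated sampling in both stages is what makes ER expensive relative to a single few-shot call, which is consistent with the resource-cost caveat discussed in the critique below.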
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents Med-PaLM 2, a significant leap forward in achieving expert-level medical question answering with Large Language Models. By combining an enhanced base LLM (PaLM 2), targeted medical domain finetuning, and innovative prompting strategies—most notably ensemble refinement—Med-PaLM 2 has set new benchmarks. It achieved a state-of-the-art 86.5% on the MedQA dataset, substantially outperforming its predecessor Med-PaLM. Furthermore, rigorous human evaluations by physicians demonstrated a preference for Med-PaLM 2 answers over even physician-generated responses across eight of nine critical clinical utility axes, and showed marked improvements over Med-PaLM on challenging adversarial questions. These results underscore the rapid progress towards LLMs attaining physician-level capabilities in understanding and responding to medical inquiries.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and propose future research directions:
- Evaluation Framework Refinement: The space of medical information needs is vast and complex. The current evaluation rubric, while robust, may need additional dimensions (e.g., empathy [26]). Further research is required to enhance the rigor of rubrics for human evaluation of LLM performance in medical QA.
- Generalizability of Physician Answers: Physicians generating answers were given general instructions (produce answers useful for lay-persons) but lacked specific clinical scenarios or nuanced communication requirements. This might not reflect all real-world settings. Future work should ground evaluations in highly specific workflows and clinical scenarios.
- Answer Length: Med-PaLM 2 answers were often longer than physician answers, which might contribute to improved ratings. The optimal length of an answer is context-dependent.
- Multi-turn Dialogue & Information Acquisition: The current evaluation did not consider multi-turn dialogue [46] or frameworks for active information acquisition [47]. In real clinical settings, requesting more information (e.g., about patient history) might be more appropriate than providing a comprehensive list of possibilities.
- Inter-rater Variation in Preference: The study did not explicitly assess inter-rater variation in preference rankings or explore how this variation relates to raters' lived experiences or expectations.
- Limited Physician Answer Assessment: Only one answer per question was produced by physicians, offering a limited view of the range of possible high-quality human responses. Future work could assess multiple physician answers, inter-physician variation, and explicitly consider the medical expertise and background of the evaluating physicians.
- Safety, Bias, and Equity Coverage: While adversarial datasets were introduced, the evaluation is not a comprehensive assessment of all safety, bias, and equity considerations. Future work should systematically expand adversarial data, increase coverage of health equity topics, and facilitate disaggregated evaluation over sensitive characteristics [48-50].
- Real-world Validation: Despite the strong results, further studies are necessary to validate the efficacy and safety of these models in real-world clinical settings and workflows before broader uptake.
7.3. Personal Insights & Critique
This paper showcases an impressive step towards LLMs achieving clinical utility in medical question answering. The commitment to rigorous human evaluation, including pairwise comparisons and the creation of adversarial datasets, is particularly commendable. It goes beyond simple accuracy metrics to probe nuanced aspects of clinical utility, safety, and potential biases, which is crucial for a high-stakes domain like healthcare.
Innovations and Strengths:
- The ensemble refinement technique is a clever way to leverage the stochastic nature of LLMs to improve reasoning. By having the model self-reflect and refine based on multiple initial thoughts, it mimics a more deliberative human thought process. This could be transferable to other complex reasoning tasks beyond medicine.
- The demonstration of physician preference for Med-PaLM 2 answers over human-generated ones on most clinical utility axes is a landmark achievement, suggesting that LLMs, when properly aligned, can synthesize information and present it in a way that is perceived as superior by experts.
- The proactive approach to test set contamination through overlap analysis adds to the trustworthiness of the benchmark results.
Potential Issues and Areas for Improvement:
- Dependence on Base Model: The success of Med-PaLM 2 is heavily reliant on the advancements in PaLM 2. While effective, this creates a dependency on proprietary, constantly evolving base models. Understanding whether these improvements generalize to other LLM architectures would be valuable.
- Answer Length vs. Quality: While longer answers might be perceived as more comprehensive, they also risk including inaccurate or irrelevant information, as indicated by physicians still favoring human answers on that specific axis. The optimal "verbosity" is a critical design choice in real-world applications and might require further tuning based on user roles (e.g., patient vs. clinician).
- Cost of Ensemble Refinement: The paper notes the resource cost of ER (11 samplings in stage one, 33 in stage two), which limited its application to multiple-choice questions. This suggests it might not be practical for all real-time or resource-constrained scenarios, especially for long-form answers, where it could theoretically offer benefits.
- Nuance of "Preference": While physicians preferred Med-PaLM 2 answers, the paper highlights that the human answers were generated without specific clinical scenarios. The comparison, while valuable, is therefore not a perfect simulation of a physician's real-world interaction with a patient or colleague. The "preferred" answer might simply be more exhaustive or better structured, which an unconstrained physician could also produce given more time and specific instructions.
- Unverified Assumptions: The paper touches upon bias and harm but also notes the evaluation is not comprehensive. The perception of "low likelihood of harm" is a crucial metric, and its real-world implications need extensive, long-term validation.

Overall, Med-PaLM 2 represents a compelling case for LLMs as powerful tools in medicine. The paper thoughtfully tackles the complexities of evaluation in a safety-critical domain, paving the way for future research to address the remaining gaps and move towards responsible real-world deployment.