Towards Expert-Level Medical Question Answering with Large Language Models
TL;DR Summary
This paper presents Med-PaLM 2, a significant advance in medical question answering: by combining base LLM improvements (PaLM 2), medical domain finetuning, and novel prompting strategies such as ensemble refinement, it achieves 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%.
Abstract
Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Towards Expert-Level Medical Question Answering with Large Language Models."
1.2. Authors
The authors are: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan.
Their affiliations are primarily Google Research and DeepMind.
1.3. Journal/Conference
The paper was published as a preprint on arXiv. While arXiv is not a peer-reviewed journal or conference in itself, it is a widely recognized repository for preprints in fields like AI, physics, mathematics, and computer science. Papers published on arXiv are often submitted to prestigious conferences (e.g., NeurIPS, ICML, ICLR, AAAI) or journals after further peer review. The visibility on arXiv allows for rapid dissemination of research findings and community feedback.
1.4. Publication Year
The paper was published on arXiv at 2023-05-16T17:11:29 UTC, which corresponds to the year 2023.
1.5. Abstract
The abstract introduces Med-PaLM 2, a significant advancement in medical question answering (MQA) using Large Language Models (LLMs). Building on its predecessor Med-PaLM, which was the first to "pass" USMLE-style questions with a 67.2% score on the MedQA dataset, Med-PaLM 2 leverages several improvements: a more powerful base LLM (PaLM 2), medical domain-specific finetuning, and novel prompting strategies, including an ensemble refinement approach.
The model achieved a new state-of-the-art score of 86.5% on the MedQA dataset, an improvement of over 19% from Med-PaLM. It also demonstrated performance approaching or exceeding state-of-the-art on other benchmarks such as MedMCQA, PubMedQA, and MMLU clinical topics.
Crucially, the paper presents detailed human evaluations on long-form questions. In a pairwise comparison of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers over physician-generated answers on eight of nine axes related to clinical utility (e.g., factuality, reasoning, low harm likelihood) (p < 0.001). Significant improvements were also observed compared to Med-PaLM on "adversarial" questions designed to probe LLM limitations. While real-world validation is still necessary, these results indicate rapid progress towards physician-level performance in MQA.
1.6. Original Source Link
The official abstract page is https://arxiv.org/abs/2305.09617v1, and the PDF is available at https://arxiv.org/pdf/2305.09617v1.pdf. This is a preprint publication on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the long-standing "grand challenge" of developing artificial intelligence (AI) systems capable of retrieving, reasoning over, and answering medical questions at a level comparable to human physicians. This problem is of immense importance due to its potential to revolutionize healthcare, improve patient education, and assist clinicians.
Prior research, particularly with Large Language Models (LLMs), had made significant strides. Med-PaLM, for instance, was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE) style questions using the MedQA dataset, achieving 67.2%. However, despite this milestone, previous work, including Med-PaLM itself, indicated substantial room for improvement, especially when comparing the quality of model-generated answers to those provided by human clinicians. Key challenges and gaps identified included:
- Performance on Benchmarks: While Med-PaLM achieved state-of-the-art results on multiple-choice benchmarks, these scores still left room for improvement.
- Long-form Answer Quality: The ability to generate factual, safe, and nuanced long-form responses for open-ended questions, typical in real-world medical scenarios, remained a significant hurdle. Human evaluations revealed that AI outputs, particularly long-form answers, needed further refinement to ensure safety and alignment with human values and expectations in a safety-critical domain like medicine.
- Robustness to Adversarial Questions: LLMs often struggle with complex, tricky, or potentially biased questions, necessitating specific probes into their limitations.

The paper's entry point, or innovative idea, is to bridge these gaps by combining a new, more powerful base LLM (PaLM 2) with targeted medical domain finetuning and novel prompting strategies, including a new method called ensemble refinement. This multi-pronged approach aims to push LLM capabilities towards expert-level performance in medical question answering.
2.2. Main Contributions / Findings
The paper presents Med-PaLM 2 and highlights several primary contributions and key findings:
- State-of-the-Art Performance: Med-PaLM 2 achieved an accuracy of up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. It also approached or exceeded state-of-the-art performance across the MedMCQA, PubMedQA, and MMLU clinical topics datasets.
- Novel Methodology: The development of Med-PaLM 2 involved:
  - Leveraging PaLM 2 as an improved base LLM.
  - Implementing targeted medical domain-specific finetuning using MultiMedQA datasets.
  - Introducing ensemble refinement as a new prompting strategy to enhance LLM reasoning by aggregating multiple reasoning paths.
- Superior Long-Form Answer Quality (Human Evaluation): In detailed human evaluations by physicians on 1066 consumer medical questions:
  - Physicians preferred Med-PaLM 2 answers to physician-generated answers on eight of nine axes pertaining to clinical utility (e.g., alignment with medical consensus, reading comprehension, knowledge recall, reasoning, low likelihood of harm) with high statistical significance (p < 0.001).
  - Med-PaLM 2 answers were judged to better reflect medical consensus 72.9% of the time compared to physician answers.
- Improved Robustness on Adversarial Questions: On newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations, Med-PaLM 2 showed significant improvements over Med-PaLM across every evaluation axis (p < 0.001), including a much lower perceived risk of harm (90.6% of Med-PaLM 2 answers rated as having a low likelihood of harm vs. 79.4% for Med-PaLM).
- Comprehensive Evaluation Framework: The work reinforces the importance of a comprehensive benchmark (MultiMedQA), detailed human evaluation rubrics, and the introduction of adversarial datasets for rigorously assessing LLM performance in safety-critical domains.

These findings collectively demonstrate rapid progress towards achieving physician-level performance in medical question answering, addressing key shortcomings of previous LLM approaches.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper, a reader should be familiar with the following fundamental concepts:
- Artificial Intelligence (AI): A broad field of computer science concerned with building intelligent machines capable of performing tasks that typically require human intelligence, such as learning, problem-solving, perception, and decision-making.
- Machine Learning (ML): A subfield of AI that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention.
- Natural Language Processing (NLP): A subfield of AI focused on enabling computers to understand, interpret, and generate human language.
- Large Language Models (LLMs): These are advanced deep learning models, typically based on the transformer architecture, trained on massive amounts of text data (internet-scale corpora). They are designed to understand and generate human-like text, demonstrating capabilities like text generation, translation, summarization, and question answering. Their "largeness" refers to the number of parameters (billions or trillions) and the scale of their training data. Examples include GPT-3, PaLM, and PaLM 2.
- Medical Question Answering (MQA): A specialized application of NLP and LLMs where the goal is to provide accurate, relevant, and medically sound answers to health-related questions. This can range from answering multiple-choice exam questions to generating long-form explanations for consumer health queries.
- Fine-tuning (or Finetuning): A technique in machine learning where a pre-trained model (like an LLM) is further trained on a smaller, task-specific dataset. This process adapts the model's general knowledge to a particular domain or task, improving its performance for that specific use case. In this paper, Med-PaLM 2 is fine-tuned on medical domain-specific data.
- Instruction Fine-tuning (or Instruction Prompt-tuning): A specific type of fine-tuning where a model is trained on a dataset of instructions (prompts) paired with desired responses. This teaches the model to follow instructions and generate outputs aligned with specific task formats or user expectations.
- Prompting Strategies: The art and science of crafting effective inputs (prompts) to guide an LLM to generate desired outputs. Different strategies involve structuring the prompt in specific ways to elicit better reasoning or specific answer formats.
- Ensemble Methods: In machine learning, an ensemble method combines multiple models or multiple outputs from the same model to improve overall performance, improve robustness, or reduce bias. The core idea is that combining diverse perspectives can lead to a more reliable outcome than relying on a single one.
3.2. Previous Works
The paper explicitly builds upon and contrasts itself with several key prior studies:
- Med-PaLM: The direct predecessor to Med-PaLM 2. Introduced in prior work by Singhal et al. [1], Med-PaLM was the first model to exceed a "passing" score (67.2%) on USMLE-style questions in the MedQA dataset. It utilized Flan-PaLM as its base model and leveraged instruction prompt-tuning for medical domain alignment.
  - Limitation: While a significant achievement, human evaluations revealed that Med-PaLM's long-form answers still had "key shortfalls" compared to physician answers, and its multiple-choice scores, though state-of-the-art, left "room for improvement."
- MultiMedQA: Introduced in the Med-PaLM paper [1], MultiMedQA is a comprehensive benchmark for medical question answering. It is diverse, spanning medical exams, consumer health, and medical research, and includes a human evaluation rubric.
- Flan-PaLM: A powerful general-purpose LLM developed by Google [20, 21]. It is an instruction-finetuned version of PaLM; Med-PaLM was built upon Flan-PaLM.
- GPT Family of Models (GPT-3, GPT-3.5, GPT-4): These models from OpenAI have also shown rapid progress in medical QA. The paper specifically cites:
  - GPT-3.5 achieving 60.2% on MedQA [3].
  - GPT-4-base achieving 86.1% on MedQA [2, 45].
  - Studies evaluating their clinical knowledge without specific medical alignment, such as diagnostic and triage accuracies [22], and performance in genetics, surgery, and ophthalmology [23-25].
  - Ayers et al. [26], who found ChatGPT responses rated higher in quality and empathy than physician responses to social media patient questions.
- Domain-Specific Smaller Language Models: Earlier approaches to medical QA often involved smaller language models trained specifically on domain data, such as BioLinkBERT [11], DRAGON [12], PubMedGPT [13], PubMedBERT [14], and BioGPT [15]. These models steadily improved performance on benchmarks like MedQA, MedMCQA, and PubMedQA, but were generally surpassed by larger general-purpose LLMs.
- Chain-of-Thought (CoT) Prompting: Introduced by Wei et al. [42], this strategy involves augmenting few-shot examples with step-by-step explanations. It enables LLMs to perform multi-step reasoning by conditioning on their own intermediate outputs.
- Self-consistency (SC): Introduced by Wang et al. [43], this method improves performance by sampling multiple reasoning paths and answers from the model and then taking a majority (or plurality) vote for the final answer. This is particularly useful for complex problems with multiple reasoning routes.
- How CoT and SC are related: CoT is often used within SC to generate the diverse reasoning paths.
- Other Related Prompting/Refinement Techniques: The paper also mentions recitation-augmentation [28], self-refine [29], and dialogue-enabled reasoning [30] as related to their ensemble refinement approach; all of these involve conditioning an LLM on its own generations.
3.3. Technological Evolution
The field of medical question answering has evolved significantly:
- Early AI Systems: Long viewed as a "grand challenge" [8-10], early AI in medicine focused on rule-based expert systems or smaller, specialized models.
- Domain-Specific LMs: The advent of transformers [5] spurred the development of language models (LMs) specifically trained on biomedical text (e.g., PubMedBERT, BioGPT). These showed steady improvements on medical benchmarks.
- General-Purpose LLMs: With massive compute and internet-scale corpora, general-purpose LLMs like GPT-3 and Flan-PaLM emerged, demonstrating "leapfrog improvements" on medical benchmarks even without specific medical alignment. This highlighted the power of scale.
- Specialized LLMs for Medicine: Recognizing the critical need for safety and alignment in healthcare, the next phase involved taking these powerful general-purpose LLMs and adapting them to the medical domain. Med-PaLM was a pioneer here, using instruction fine-tuning to align Flan-PaLM to medical requirements.
- Advanced LLMs with Enhanced Reasoning and Evaluation: Med-PaLM 2 represents the current pinnacle of this evolution. It integrates an even stronger base model (PaLM 2), deeper domain-specific fine-tuning, and sophisticated prompting strategies like ensemble refinement to improve reasoning. Crucially, it emphasizes rigorous human evaluation, including pairwise comparisons and adversarial testing, to move beyond simple benchmark scores and assess real-world utility and safety.
3.4. Differentiation Analysis
Compared to the main methods in related work, Med-PaLM 2 introduces several core differences and innovations:
- Improved Base LLM: Unlike Med-PaLM, which was built on Flan-PaLM (an instruction-finetuned PaLM), Med-PaLM 2 leverages PaLM 2 [4], described as a "new iteration of Google's large language model with substantial performance improvements on multiple LLM benchmark tasks." This stronger foundation inherently provides a boost in capabilities.
- Novel Prompting Strategy: Ensemble Refinement (ER): While prior work utilized Chain-of-Thought (CoT) and Self-consistency (SC), Med-PaLM 2 introduces ensemble refinement. This approach generalizes SC by not just voting on answers but conditioning the LLM on multiple possible reasoning paths it generated in a prior step to produce a refined explanation and answer. This allows the model to "aggregate over answers" and potentially take into account the "strengths and weaknesses of the explanations it generated," leading to more robust reasoning.
- Comprehensive Human Evaluation Focus: The paper extends the human evaluation framework established by Med-PaLM. It introduces:
  - Pairwise Comparative Ranking: Directly comparing model answers against physician answers (and Med-PaLM answers) across nine clinically relevant axes for a large set of consumer medical questions. This provides a more nuanced understanding of relative quality than independent ratings.
  - Adversarial Question Datasets: Specifically curated datasets designed to probe the limitations, safety, and potential biases of LLMs in challenging scenarios (e.g., health equity, misinformation). This is crucial for assessing robustness in safety-critical applications.
- "Best of Both Worlds" Approach: Med-PaLM 2 explicitly combines the power of the "latest general-purpose LLMs" with targeted "medical question-answering data and physician-written responses to align the model to the safety-critical requirements of the medical domain." This contrasts with models like vanilla GPT-4 which, while powerful, are not specifically aligned to the medical domain's safety needs out of the box. The paper notes a performance drop between GPT-4-base and the aligned (production) GPT-4 model on multiple-choice benchmarks, whereas Med-PaLM 2 maintains strong benchmark performance while being explicitly aligned for long-form medical QA.
4. Methodology
4.1. Principles
The core idea behind Med-PaLM 2 is to achieve expert-level medical question answering by integrating three main components: a powerful, updated base Large Language Model (LLM), domain-specific adaptation through instruction fine-tuning, and advanced prompting strategies to enhance reasoning and answer quality. The underlying principle is that while general-purpose LLMs possess vast knowledge, tailoring them to the nuanced, safety-critical medical domain requires explicit alignment and sophisticated methods to elicit their best reasoning capabilities, especially for complex, open-ended questions. The novel ensemble refinement prompting strategy specifically aims to leverage the model's ability to explore multiple reasoning paths and then refine its output based on these diverse perspectives, leading to more accurate and robust answers.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Datasets
Med-PaLM 2 was evaluated on multiple types of datasets:
- Multiple-choice questions:
  - MedQA (USMLE) [16]: Contains questions representing general medical knowledge from the US medical licensing exam.
  - MedMCQA [17]: Focuses on general medical knowledge from Indian medical entrance exams.
  - PubMedQA [18]: A closed-domain question-answering dataset where answers are derived from PubMed abstracts.
  - MMLU clinical topics [31]: A collection of multiple-choice questions covering various clinical knowledge areas, including MMLU Medical genetics, MMLU Anatomy, MMLU Professional medicine, MMLU College biology, and MMLU College medicine.
- Long-form questions:
  - MultiMedQA 140: A curated sample of 140 questions from the HealthSearchQA, LiveQA [32], and MedicationQA [33] datasets, as used in previous Med-PaLM work.
  - MultiMedQA 1066: An expanded sample of 1066 questions from the same sources as MultiMedQA 140.
- Adversarial questions: Two new datasets specifically curated to probe LLM limitations and potential for harm/bias.
  - Adversarial (General): Covers issues like health equity, drug use, alcohol, mental health, COVID-19, obesity, suicide, and medical misinformation, including health disparities and racial bias in clinical calculators.
  - Adversarial (Health equity): Prioritizes use cases and sensitive characteristics relevant to health equity considerations in healthcare access, quality, and social/environmental factors.

No concrete examples of data samples from these datasets are provided in the paper's main body.
The following table summarizes the multiple-choice question evaluation datasets:
The following are the results from Table 1 of the original paper:
| Name | Count | Description |
| MedQA (USMLE) | 1273 | General medical knowledge in US medical licensing exam |
| PubMedQA | 500 | Closed-domain question answering given PubMed abstract |
| MedMCQA | 4183 | General medical knowledge in Indian medical entrance exams |
| MMLU-Clinical knowledge | 265 | Clinical knowledge multiple-choice questions |
| MMLU Medical genetics | 100 | Medical genetics multiple-choice questions |
| MMLU-Anatomy | 135 | Anatomy multiple-choice questions |
| MMLU-Professional medicine | 272 | Professional medicine multiple-choice questions |
| MMLU-College biology | 144 | College biology multiple-choice questions |
| MMLU-College medicine | 173 | College medicine multiple-choice questions |
The following table summarizes the long-form question evaluation datasets:
The following are the results from Table 2 of the original paper:
| Name | Count | Description |
| MultiMedQA 140 | 140 | Sample from HealthSearchQA, LiveQA, MedicationQA [1] |
| MultiMedQA 1066 | 1066 | Sample from HealthSearchQA, LiveQA, MedicationQA (Extended from [1]) |
| Adversarial (General) | 58 | General adversarial dataset |
| Adversarial (Health equity) | 182 | Health equity adversarial dataset |
4.2.2. Modeling
Base LLM
Med-PaLM 2 is built upon PaLM 2 [4], which is described as a new, more powerful iteration of Google's large language model, offering substantial performance improvements compared to its predecessor PaLM (used for Med-PaLM).
Instruction Finetuning
The PaLM 2 base LLM undergoes instruction finetuning (also known as instruction prompt-tuning) following the protocol used by Chung et al. [21]. This process adapts the general-purpose LLM to respond effectively to medical-specific instructions and questions.
The instruction finetuning utilized a mixture of training splits from MultiMedQA, specifically:
- MedQA
- MedMCQA
- HealthSearchQA
- LiveQA
- MedicationQA

A "unified" model is trained, optimized for performance across all of these datasets. The specific dataset mixture ratios (proportions of each dataset) were empirically determined, as shown in the table below (a minimal sketch of mixture-weighted sampling follows Table 3). Unless otherwise specified, Med-PaLM 2 refers to this unified model. A variant of Med-PaLM 2 was also created by finetuning exclusively on multiple-choice questions for comparison.
The following are the results from Table 3 of the original paper:
| Dataset | Count | Mixture ratio |
| MedQA | 10,178 | 37.5% |
| MedMCQA | 182,822 | 37.5% |
| LiveQA | 10 | 3.9% |
| MedicationQA | 9 | 3.5% |
| HealthSearchQA | 45 | 17.6% |
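To illustrate how such mixture ratios could drive example sampling during finetuning, here is a minimal sketch under assumed data structures; it is not the authors' training pipeline, and the placeholder examples are invented. Each training example is drawn from a dataset chosen according to the Table 3 weights:

```python
import random

# Mixture ratios from Table 3 (proportion of each dataset in the finetuning mix).
MIXTURE = {
    "MedQA": 0.375,
    "MedMCQA": 0.375,
    "LiveQA": 0.039,
    "MedicationQA": 0.035,
    "HealthSearchQA": 0.176,
}

def sample_batch(datasets: dict[str, list[str]], batch_size: int, seed: int = 0) -> list[str]:
    """Draw a finetuning batch by first picking a dataset according to the
    mixture weights, then picking a random example from that dataset."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(datasets[name]))
    return batch

# Toy example with placeholder training examples.
toy = {name: [f"{name} example {i}" for i in range(3)] for name in MIXTURE}
print(sample_batch(toy, batch_size=5))
```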
4.2.3. Multiple-choice evaluation (Prompting Strategies)
Several prompting strategies are employed to evaluate Med-PaLM 2 on multiple-choice benchmarks:
- Few-shot prompting:
  - Concept: This strategy involves providing the LLM with a few examples of input-output pairs before the actual question it needs to answer. This helps the model understand the desired task format and reasoning style without explicit instruction.
  - Implementation: The paper uses the same few-shot prompts as in Singhal et al. [1].
- Chain-of-Thought (CoT):
  - Concept: CoT [42] enhances few-shot prompting by including a step-by-step explanation or reasoning process for each example within the prompt, leading to the final answer. This allows the LLM to condition on its own intermediate reasoning steps, which is particularly beneficial for complex multi-step problems common in medical questions.
  - Implementation: The authors crafted CoT prompts to provide clear demonstrations of appropriate medical question answering (examples are provided in Section A.3.1 of the Appendix).
- Self-consistency (SC):
  - Concept: SC [43] improves performance by generating multiple diverse reasoning paths and corresponding answers from the LLM. Instead of relying on a single reasoning path, SC samples multiple explanations stochastically, and the final answer is determined by a majority (or plurality) vote among the sampled answers. For complex domains like medicine, where multiple valid reasoning routes can lead to a correct answer, marginalizing over these paths can yield a more accurate and robust result.
  - Implementation: In this work, SC is performed with 11 samplings using CoT prompting, consistent with Singhal et al. [1].
- Ensemble refinement (ER):
  - Concept: This is a novel prompting strategy developed in this work, building upon CoT and SC and related to techniques like self-refine [29]. ER operates in a two-stage process to improve LLM reasoning: it allows the LLM to aggregate information from its own multiple generated responses before producing a final, refined answer (a brief illustrative sketch follows the figure below).
  - Implementation:
    - First stage: Given a few-shot CoT prompt and a question, the model stochastically produces multiple possible generations (e.g., 11 samplings in this work) via temperature sampling. Each generation consists of an explanation and an answer for a multiple-choice question.
    - Second stage: The model is then conditioned on the original prompt, the question, and all the concatenated generations from the first stage, and is prompted to produce a refined explanation and a refined answer. This can be interpreted as a generalization of SC, where the LLM aggregates over the answers and explanations from the first stage rather than just taking a simple vote, allowing it to leverage the strengths and weaknesses observed in its initial diverse generations. The second stage is performed multiple times (e.g., 33 samplings in this work), and a plurality vote over these refined answers determines the final answer.
  - Generality: While applied here for multiple-choice evaluation, ER can in principle be used to produce improved long-form generations by having an LLM condition on multiple possible responses to generate a refined final answer.
  - Resource Cost: Due to the computational cost of repeated samplings, ER is applied only for multiple-choice evaluation in this work.

The following figure (Figure 2 from the original paper) illustrates the Ensemble Refinement process:

VLM Description: The image is a diagram illustrating the reasoning process of the Med-PaLM 2 model. Input data is processed through multiple reasoning paths (Reasoning Path 1, K, N) and then fed back into Med-PaLM 2 to generate the final answer.
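To make the two-stage procedure concrete, here is a minimal, illustrative sketch in Python. The `generate` function is a hypothetical stand-in for an LLM sampling call (stubbed so the example runs); the sampling counts follow the numbers reported in the paper, but the prompt formats are simplified assumptions rather than the authors' actual templates.

```python
import random
from collections import Counter

def generate(prompt: str, temperature: float) -> tuple[str, str]:
    """Hypothetical stand-in for an LLM call, returning (explanation, answer).

    A real implementation would sample from the medical LLM; here we return a
    toy explanation and a random option so the sketch is runnable."""
    answer = random.choice(["A", "B", "C", "D"]) if temperature > 0 else "A"
    return f"Reasoning sketch for: {prompt[:40]}...", answer

def ensemble_refinement(cot_prompt: str, question: str,
                        n_first: int = 11, n_second: int = 33) -> str:
    base = f"{cot_prompt}\n\nQuestion: {question}\n"

    # Stage 1: sample multiple (explanation, answer) generations via
    # temperature sampling, as in chain-of-thought + self-consistency.
    first_stage = [generate(base, temperature=0.7) for _ in range(n_first)]

    # Condition the model on its own concatenated first-stage generations.
    context = base + "\nCandidate reasoning paths:\n" + "\n".join(
        f"[{i + 1}] {expl} Answer: {ans}"
        for i, (expl, ans) in enumerate(first_stage)
    ) + "\nConsidering the paths above, give a refined explanation and answer."

    # Stage 2: sample refined answers several times, then plurality-vote.
    refined_answers = [generate(context, temperature=0.7)[1] for _ in range(n_second)]
    return Counter(refined_answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(ensemble_refinement("(few-shot CoT examples would go here)",
                              "Which vitamin deficiency causes scurvy?"))
```

Plain self-consistency corresponds to skipping the second stage and voting directly over the first-stage answers.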
4.2.4. Overlap analysis
To address concerns about test set contamination (where evaluation benchmarks might overlap with the LLM's training data), the authors perform an overlap analysis:
- Methodology: A question is defined as overlapping if either the entire question or at least 512 contiguous characters of the question text overlap with any document in the training corpus used for the base LLM underlying Med-PaLM 2. Multiple-choice options or answers are not included in this check, to prevent underestimation of overlap due to formatting variations. This approach is considered conservative, as it also treats questions where only the question text, but not the answer, appears in the training data as overlapping. A simplified sketch of such a check is shown below.
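As a rough illustration of this kind of contamination check (not the authors' actual tooling, which operates over the full pretraining corpus with more efficient indexing), the sketch below flags a question as overlapping when the whole question or any 512-character contiguous segment of it appears verbatim in a set of training documents; the brute-force scan and toy corpus are simplifying assumptions.

```python
def is_overlapping(question: str, training_docs: list[str], seg_len: int = 512) -> bool:
    """Flag a test question as contaminated if its full text, or any contiguous
    segment of at least `seg_len` characters, occurs verbatim in any training
    document. Answer options are deliberately ignored, as in the paper."""
    if any(question in doc for doc in training_docs):
        return True
    if len(question) < seg_len:
        return False
    for start in range(len(question) - seg_len + 1):
        segment = question[start:start + seg_len]
        if any(segment in doc for doc in training_docs):
            return True
    return False

# Example with a short segment length purely for demonstration.
docs = ["... A 34-year-old woman presents with fatigue and joint pain ..."]
print(is_overlapping("A 34-year-old woman presents with fatigue and joint pain",
                     docs, seg_len=40))
```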
4.2.5. Long-form evaluation
A series of human evaluations are conducted to assess Med-PaLM 2's performance on long-form consumer medical question-answering.
Model answers
- Elicitation: Prompts provided in Section A.3.4 of the Appendix are used consistently for both Med-PaLM and Med-PaLM 2.
- Sampling: Answers are sampled from the models with a temperature of 0.0, as in Singhal et al. [1]. A temperature of 0.0 makes the model's output deterministic and generally yields the most probable token sequence, which is often preferred for factual tasks to ensure consistency.
Physician answers
- Generation: Physicians generate answers without time limits and with full access to reference materials.
- Audience: They are instructed to target their answers to a lay-person of average reading comprehension.
- Context: Tasks are not anchored to a specific environmental or clinical scenario.
Physician and lay-person raters
- Physician Raters: A pool of 15 physicians (6 US, 4 UK, 5 India) with diverse specialties (family medicine, internal medicine, cardiology, respiratory, pediatrics, surgery).
- Blinding: Raters are blinded to the source of the answer (model or human).
- Independence: Ratings are performed in isolation without conferring.
- Conflict of Interest: Physicians who previously generated answers for MultiMedQA questions [1] do not evaluate their own answers, and 8-10 weeks elapsed between answer generation and evaluation.
- Lay-person Raters: A pool of 6 raters (4 female, 2 male, 18-44 years old) from India, all without a medical background. Educational backgrounds vary (high school diploma to postgraduate).
Individual evaluation of long-form answers
- Rubrics: Answers (from physicians, Med-PaLM, and Med-PaLM 2) are independently rated by physician and lay-person raters using the rubrics introduced in Singhal et al. [1].
- Datasets: Conducted on the MultiMedQA 140, Adversarial (General), and Adversarial (Health equity) datasets.
- Data Source: MultiMedQA 140 ratings for Med-PaLM are taken from Singhal et al. [1].
- Rating Quantity: Each response in the new rating experiments is evaluated by multiple independent raters, with MultiMedQA 140 answers triple-rated and Adversarial questions quadruple-rated.
- Inter-rater Reliability (IRR): Analysis on MultiMedQA 140 answers showed very good agreement for 10 out of 12 alignment questions and good agreement for the remaining two (e.g., missing important content, unnecessary additional information).
Pairwise ranking evaluation of long-form answers
- Methodology: Raters are presented with a pair of answers (e.g., physician vs. Med-PaLM 2) for a given question and asked to select the preferred response along nine specified axes. This method is intended to reduce inter-rater variability compared to independent ratings.
- Evaluation Axes:
- Alignment with medical consensus: "Which answer better reflects the current consensus of the scientific and clinical community?"
- Reading comprehension: "Which answer demonstrates better reading comprehension? (indicating the question has been understood)"
- Knowledge recall: "Which answer demonstrates better recall of knowledge? (mention of a relevant and/or correct fact for answering the question)"
- Reasoning: "Which answer demonstrates better reasoning step(s)? (correct rationale or manipulation of knowledge for answering the question)"
- Inclusion of irrelevant content: "Which answer contains more content that it shouldn't? (either because it is inaccurate or irrelevant)"
- Omission of important information: "Which answer omits more important information?"
- Potential for demographic bias: "Which answer provides information that is biased for any demographic groups? For example, is the answer applicable only to patients of a particular sex where patients of another sex might require different information?"
- Possible harm extent: "Which answer has a greater severity/extent of possible harm? (which answer could cause more severe harm)"
- Possible harm likelihood: "Which answer has a greater likelihood of possible harm? (more likely to cause harm)"
- Comparison to Individual Evaluation: For reading comprehension, knowledge recall, and reasoning, pairwise evaluation consolidates assessment of correct and incorrect aspects into a single judgment of overall quality, unlike individual evaluations that assess them separately.
- Datasets: Performed on the MultiMedQA 1066 and Adversarial datasets.
- Blinding & Randomization: Raters are blinded to answer sources, and the display order of answers is randomized.
- Exclusions: A small number of answers (8/1066 for Med-PaLM 2 vs. Physician; 11/1066 for Med-PaLM 2 vs. Med-PaLM) were excluded due to technical display issues.
Statistical analyses
- Confidence Intervals: Computed via bootstrapping (10,000 iterations). Bootstrapping is a resampling technique used to estimate the distribution of a statistic (e.g., a mean or a confidence interval) by repeatedly drawing samples with replacement from the observed data.
- Hypothesis Testing: Two-tailed permutation tests are used for hypothesis testing (10,000 iterations). For multiple-rated answers, permutations are blocked by answer to account for dependencies within ratings of the same answer. Permutation tests are non-parametric tests that determine statistical significance by permuting the labels of the observed data, creating a null distribution against which the observed statistic is compared.
- MultiMedQA Dataset Specifics: For statistical analysis on the MultiMedQA dataset, where Med-PaLM and physician answers were single-rated, Med-PaLM 2 ratings are randomly sub-sampled to one rating per answer during bootstrapping and permutation testing to ensure a fair comparison.

A minimal sketch of these two procedures appears below.
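The following sketch is an assumption-laden illustration (not the authors' analysis code): a percentile bootstrap confidence interval for the proportion of questions on which one answer source is preferred, and a simple two-tailed permutation test comparing two rating groups. Blocking by answer and the rating sub-sampling step described above are omitted for brevity, and the toy data are invented.

```python
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (e.g., 1 = preferred, 0 = not)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, (lo, hi)

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-tailed permutation test for a difference in means between two groups."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_perm

# Toy example: 1 = "Med-PaLM 2 answer preferred" on each question.
prefs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
print(bootstrap_ci(prefs, n_boot=2000))
print(permutation_test([1, 1, 1, 0, 1, 1], [0, 1, 0, 0, 1, 0], n_perm=2000))
```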
5. Experimental Setup
5.1. Datasets
The experiments utilized a range of multiple-choice and long-form medical question-answering datasets, including those from the MultiMedQA benchmark [1], and two newly curated adversarial datasets.
- Multiple-choice question evaluation datasets:
  - MedQA (USMLE): A dataset of 1273 questions designed to test general medical knowledge, mirroring the style of the US Medical Licensing Examination.
  - PubMedQA: Consists of 500 questions requiring closed-domain question answering based on provided PubMed abstracts.
  - MedMCQA: A large-scale dataset with 4183 questions covering general medical knowledge from Indian medical entrance exams.
  - MMLU clinical topics: This category comprises several sub-datasets from the Massive Multitask Language Understanding (MMLU) benchmark, focusing on clinical and medical knowledge:
    - MMLU Clinical knowledge (265 questions)
    - MMLU Medical genetics (100 questions)
    - MMLU Anatomy (135 questions)
    - MMLU Professional medicine (272 questions)
    - MMLU College biology (144 questions)
    - MMLU College medicine (173 questions)

The following are the results from Table 1 of the original paper:
| Name | Count | Description |
| MedQA (USMLE) | 1273 | General medical knowledge in US medical licensing exam |
| PubMedQA | 500 | Closed-domain question answering given PubMed abstract |
| MedMCQA | 4183 | General medical knowledge in Indian medical entrance exams |
| MMLU-Clinical knowledge | 265 | Clinical knowledge multiple-choice questions |
| MMLU Medical genetics | 100 | Medical genetics multiple-choice questions |
| MMLU-Anatomy | 135 | Anatomy multiple-choice questions |
| MMLU-Professional medicine | 272 | Professional medicine multiple-choice questions |
| MMLU-College biology | 144 | College biology multiple-choice questions |
| MMLU-College medicine | 173 | College medicine multiple-choice questions |

- Long-form question evaluation datasets:
  - MultiMedQA 140: A sample of 140 questions curated from the HealthSearchQA, LiveQA [32], and MedicationQA [33] datasets, consistent with prior Med-PaLM work [1]. These questions are typically consumer health queries.
  - MultiMedQA 1066: An expanded sample of 1066 questions drawn from the same sources as MultiMedQA 140.
  - Adversarial (General): A new dataset of 58 questions specifically designed to elicit model answers with potential for harm and bias. It broadly covers issues related to health equity, drug use, alcohol, mental health, COVID-19, obesity, suicide, and medical misinformation, including health disparities and racial bias in clinical calculators.
  - Adversarial (Health equity): Another new dataset of 182 questions, prioritizing use cases and sensitive characteristics relevant to health equity considerations in areas like healthcare access, quality, and social/environmental factors.

The following are the results from Table 2 of the original paper:
| Name | Count | Description |
| MultiMedQA 140 | 140 | Sample from HealthSearchQA, LiveQA, MedicationQA [1] |
| MultiMedQA 1066 | 1066 | Sample from HealthSearchQA, LiveQA, MedicationQA (Extended from [1]) |
| Adversarial (General) | 58 | General adversarial dataset |
| Adversarial (Health equity) | 182 | Health equity adversarial dataset |

These datasets were chosen to comprehensively evaluate Med-PaLM 2's capabilities across various medical question-answering tasks, from standardized exams to nuanced consumer health inquiries and challenging adversarial scenarios. They are effective for validating the method's performance because they cover different question formats, knowledge domains, and levels of complexity, which is crucial for assessing real-world applicability in a safety-critical field. No concrete examples of data samples (e.g., a specific question-answer pair from a dataset) are provided in the main body of the paper.
5.2. Evaluation Metrics
The paper employs a combination of quantitative and qualitative evaluation metrics to thoroughly assess Med-PaLM 2's performance.
5.2.1. Multiple-choice Evaluation Metrics
- Accuracy: This is the primary metric for multiple-choice questions, indicating the proportion of questions for which the model selected the correct answer.
  - Conceptual Definition: Accuracy measures the overall correctness of a model's predictions. In classification tasks like multiple-choice question answering, it represents the ratio of correctly predicted instances to the total number of instances.
  - Mathematical Formula: $\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$
  - Symbol Explanation:
    - $N_{\text{correct}}$: The count of questions where the model's chosen answer matches the ground-truth correct answer.
    - $N_{\text{total}}$: The total number of questions in the evaluation set.
  (A tiny code illustration follows.)
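A one-function illustration of this metric, with hypothetical answer labels purely to show the computation:

```python
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of questions where the predicted option matches the answer key."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# e.g., 3 of 4 multiple-choice answers correct -> 0.75
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```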
5.2.2. Long-form Evaluation Metrics (Human Evaluation Axes)
For human evaluations, both individual and pairwise, physicians and lay-persons rated answers along several clinically relevant axes. These are qualitative metrics, rated either individually or as a preference choice between two answers.
- Individual Evaluation Axes (Physician & Lay-person Raters):
  - Answer supported by consensus: Measures whether the answer aligns with current medical consensus.
  - Possible harm extent: Assesses the severity of potential harm, with "No harm" being the highest quality.
  - Low likelihood of harm: Evaluates the probability of an answer causing harm.
  - Shows evidence of question comprehension: Indicates whether the model understood the query.
  - Shows evidence of knowledge recall: Assesses the retrieval of relevant and correct facts.
  - Shows evidence of reasoning: Evaluates the correctness of the logical steps in deriving the answer.
  - No sign of incorrect comprehension: Absence of misunderstanding of the question.
  - No sign of incorrect knowledge recall: Absence of factually incorrect information.
  - No sign of incorrect reasoning: Absence of flawed logic.
  - No inaccurate or irrelevant information: Absence of extraneous or wrong details.
  - No missing important content: Completeness of the answer.
  - No sign of bias towards specific subgroups: Assesses fairness and equity in the answer.
  - Directly addresses query intent (lay-person only): Measures relevance to the user's question.
  - Answer is extremely helpful (lay-person only): Measures perceived utility for the user.
- Pairwise Ranking Evaluation Axes (Physician Raters): Raters choose which of two answers is better along these axes.
  - Better reflects consensus
  - Better reading comprehension
  - Better knowledge recall
  - Better reasoning
  - More inaccurate or irrelevant information (preference for the answer with less)
  - Omits more information (preference for the answer with less)
  - More evidence of demographic bias (preference for the answer with less)
  - Greater extent of harm (preference for the answer with less)
  - Greater likelihood of harm (preference for the answer with less)
5.2.3. Inter-rater Reliability Metric
- Randolph's Kappa (κ): Used to measure agreement between multiple raters, and especially suitable for situations with a low baseline positive rate for certain categories.
  - Conceptual Definition: Randolph's kappa (κ) is a statistical measure of inter-rater agreement for categorical items. It is considered more robust than simple percent agreement because it accounts for the possibility of agreement occurring by chance. A value of 1 indicates perfect agreement, 0 indicates agreement equivalent to chance, and negative values indicate agreement worse than chance. The paper distinguishes "very good" from "good" agreement using thresholds on κ.
  - Mathematical Formula: The paper refers to Randolph's κ [1] but does not provide its formula. A common formulation for Cohen's kappa, which Randolph's κ generalizes to multiple raters, is:
$
\kappa = \frac{P_o - P_e}{1 - P_e}
$
For multiple raters and categories the specific formulation is more involved, but the underlying principle remains the same:
$
\kappa = \frac{A_o - A_e}{1 - A_e}
$
where $A_o$ is the observed proportional agreement and $A_e$ is the hypothetical proportional agreement expected by chance.
  - Symbol Explanation:
    - $P_o$ (or $A_o$): The observed proportion of agreement among raters.
    - $P_e$ (or $A_e$): The proportion of agreement expected by chance; for Randolph's free-marginal κ this is fixed at $1/k$ for $k$ categories.
    - A higher κ value indicates better agreement beyond what would be expected by chance.

A small sketch of the computation follows.
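As an illustration, the sketch below computes Randolph's free-marginal multirater kappa for binary judgments (the kind of triple-rated alignment questions described earlier). It assumes the standard free-marginal formulation with chance agreement fixed at 1/k for k categories; the toy ratings are made up for demonstration.

```python
def randolphs_kappa(ratings: list[list[int]], n_categories: int = 2) -> float:
    """Free-marginal multirater kappa.

    `ratings` holds, per item, the category assigned by each rater,
    e.g. [[1, 1, 1], [1, 0, 1]] for two items rated by three raters."""
    n_raters = len(ratings[0])
    # Observed agreement: per item, the proportion of rater pairs that agree.
    agreements = []
    for item in ratings:
        counts = [item.count(c) for c in range(n_categories)]
        pairs_agreeing = sum(c * (c - 1) for c in counts)
        agreements.append(pairs_agreeing / (n_raters * (n_raters - 1)))
    a_o = sum(agreements) / len(agreements)
    a_e = 1.0 / n_categories  # free-marginal chance agreement
    return (a_o - a_e) / (1 - a_e)

# Toy example: three raters, binary "meets rubric" judgments on five answers.
print(randolphs_kappa([[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]))
```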
5.3. Baselines
The Med-PaLM 2 model's performance is compared against several strong baselines:
- Med-PaLM: The direct predecessor, which was the first model to exceed a "passing" score on USMLE-style questions, built on Flan-PaLM [1]. This is a crucial baseline for demonstrating the advancements made by Med-PaLM 2.
- Flan-PaLM: An instruction-finetuned version of PaLM, which served as the base model for Med-PaLM. Its performance is included to show the progression from a general-purpose instruction-tuned LLM to a medically specialized one.
- GPT-4 (5-shot) and GPT-4-base (5-shot): Powerful general-purpose LLMs from OpenAI. Their inclusion provides a comparison against contemporary state-of-the-art models that are not necessarily specialized for the medical domain (though GPT-4 has strong general capabilities). "5-shot" indicates that these models were prompted with five examples.
- Other Domain-Specific Models: Although Med-PaLM 2 largely surpasses them, the paper mentions BioGPT-Large [15] for PubMedQA as a previous state-of-the-art model. This highlights the shift from smaller, specialized models to adapted large general-purpose models.

These baselines are representative because they cover the immediate predecessor (Med-PaLM), the underlying general LLM (Flan-PaLM), and competing state-of-the-art general LLMs (GPT-4 variants). This allows for a comprehensive assessment of Med-PaLM 2's incremental and absolute improvements.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Multiple-choice Evaluation
Med-PaLM 2 demonstrates significant improvements across multiple-choice medical benchmarks.
The following are the results from Table 4 of the original paper:
| Dataset | Flan-PaLM (best) | Med-PaLM 2 (ER) | Med-PaLM 2 (best) | GPT-4 (5-shot) | GPT-4-base (5-shot) |
| MedQA (USMLE) | 67.6 | 85.4 | 86.5 | 81.4 | 86.1 |
| PubMedQA | 79.0 | 75.0 | 81.8 | 75.2 | 80.4 |
| MedMCQA | 57.6 | 72.3 | 72.3 | 72.4 | 73.7 |
| MMLU Clinical knowledge | 80.4 | 88.7 | 88.7 | 86.4 | 88.7 |
| MMLU Medical genetics | 75.0 | 92.0 | 92.0 | 92.0 | 97.0 |
| MMLU Anatomy | 63.7 | 84.4 | 84.4 | 80.0 | 85.2 |
| MMLU Professional medicine | 83.8 | 92.3 | 95.2 | 93.8 | 93.8 |
| MMLU College biology | 88.9 | 95.8 | 95.8 | 95.1 | 97.2 |
| MMLU College medicine | 76.3 | 83.2 | 83.2 | 76.9 | 80.9 |
- MedQA (USMLE): The unified Med-PaLM 2 model achieved 85.4% accuracy using ensemble refinement (ER). A specialized version, instruction-finetuned only on MedQA, reached 86.5%, setting a new state-of-the-art and improving upon Med-PaLM's 67.2% by over 19%. This also surpasses GPT-4-base (86.1%) and GPT-4 (81.4%).
- MedMCQA: Med-PaLM 2 scored 72.3%, exceeding Flan-PaLM by over 14% but slightly behind GPT-4-base (73.7%).
- PubMedQA: The unified Med-PaLM 2 achieved 75.0%. However, with further exploration of prompting strategies (using self-consistency with 11 samplings), it reached 81.8%, which is state-of-the-art. The paper notes the small test set size (500 examples) and intrinsic label noise (human performance is 78.0%) as caveats.
- MMLU clinical topics: Med-PaLM 2 significantly improved over Med-PaLM and achieved state-of-the-art on 3 out of 6 topics, with GPT-4-base performing better on the remaining three.

Interestingly, the paper highlights that GPT-4-base often showed better performance than the aligned (production) GPT-4 model on these benchmarks. In contrast, Med-PaLM 2 maintained strong performance while being specifically aligned for long-form medical question answering, underscoring the value of its approach.
The following are the results from Table 5 of the original paper:
| Dataset | Med-PaLM 2 (5-shot) | Med-PaLM 2 (COT+SC) | Med-PaLM 2 (ER) |
| MedQA (USMLE) | 79.7 | 83.7 | 85.4 |
| PubMedQA | 79.2 | 74.0 | 75.0 |
| MedMCQA | 71.3 | 71.5 | 72.3 |
| MMLU Clinical knowledge | 88.3 | 88.3 | 88.7 |
| MMLU Medical genetics | 90.0 | 89.0 | 92.0 |
| MMLU Anatomy | 77.8 | 80.0 | 84.4 |
| MMLU Professional medicine | 95.2 | 93.4 | 92.3 |
| MMLU College biology | 94.4 | 95.1 | 95.8 |
| MMLU College medicine | 80.9 | 81.5 | 83.2 |
Table 5 further demonstrates that ensemble refinement (ER) consistently improves performance over few-shot and Chain-of-Thought (CoT) + Self-consistency (SC) prompting on most benchmarks, supporting its effectiveness. For example, on MedQA, ER boosted accuracy from 83.7% (CoT + SC) to 85.4%.
6.1.2. Overlap Analysis
The overlap analysis, conducted to assess potential test set contamination, revealed varying degrees of overlap.
The following are the results from Table 6 of the original paper:
| Dataset | Overlap Fraction | Performance (without Overlap) | Performance (with Overlap) | Delta |
| MedQA (USMLE) | 12/1273 (0.9%) | 85.3 [83.4, 87.3] | 91.7 [76.0, 100.0] | -6.3 [-13.5, 20.8] |
| PubMedQA | 6/500 (1.2%) | 74.1 [70.2, 78.0] | 66.7 [28.9, 100.0] | 7.4 [-16.6, 44.3] |
| MedMCQA | 893/4183 (21.4%) | 70.5 [68.9, 72.0] | 75.0 [72.2, 77.9] | -4.6 [-7.7, -1.3] |
| MMLU Clinical knowledge | 55/265 (20.8%) | 88.6 [84.3, 92.9] | 87.3 [78.5, 96.1] | 1.3 [-6.8, 13.2] |
| MMLU Medical genetics | 48/100 (48.0%) | 92.3 [85.1, 99.6] | 91.7 [83.8, 99.5] | 0.6 [-11.0, 12.8] |
| MMLU Anatomy | 37/135 (27.4%) | 82.7 [75.2, 90.1] | 89.2 [79.2, 99.2] | -6.5 [-17.4, 8.7] |
| MMLU Professional medicine | 79/272 (29.0%) | 89.1 [84.7, 93.5] | 92.4 [86.6, 98.2] | -3.3 [-9.9, 5.5] |
| MMLU College biology | 60/144 (41.7%) | 95.2 [90.7, 99.8] | 96.7 [92.1, 100.0] | -1.4 [-8.7, 7.1] |
| MMLU College medicine | 47/173 (27.2%) | 78.6 [71.4, 85.7] | 91.5 [83.5, 99.5] | -12.9 [-22.4, 0.1] |
Overlap percentages ranged from 0.9% for MedQA to 48.0% for MMLU Medical Genetics (Table 6). Med-PaLM 2's performance was slightly higher on questions with overlap for 6 out of 9 datasets. However, the difference was statistically significant only for MedMCQA (accuracy difference 4.6%, 95% CI [1.3, 7.7]), due to the small number of overlapping questions in most datasets. Even when the overlap segment length was reduced from 512 to 120 characters, increasing overlap percentages (e.g., 11.2% for MedQA, 56.0% for MMLU Medical Genetics), the performance differences remained minimal and statistically significant for only one dataset (Table A.1). This suggests that test set contamination had a minimal impact on the reported performance, similar to observations in other large models [20].
6.1.3. Long-form Evaluation
Independent Evaluation
- On MultiMedQA 140 (Physician Raters): Med-PaLM 2 answers were generally comparable to physician-generated and Med-PaLM-generated answers (Figure 3, Table A.2). Significant differences in favor of Med-PaLM 2 over Med-PaLM were observed for only three axes: evidence of reasoning, incorrect knowledge recall, and incorrect reasoning. The analysis was somewhat underpowered for the subtle differences observed.

The following figure (Figure 3 from the original paper) illustrates physician evaluation on MultiMedQA:

VLM Description: The image is a horizontal bar chart showing the proportion of high-quality answers from Med-PaLM 2, Med-PaLM, and physicians across various evaluation criteria, including question comprehension, knowledge recall, and reasoning. The data indicate that Med-PaLM 2 outperforms the other sources on multiple dimensions.

- On Adversarial Datasets (Physician Raters): Med-PaLM 2 answers were rated as significantly higher quality than Med-PaLM answers across all axes (p < 0.001) (Figure 4, Table A.3). This superior performance held for both general and health equity-focused adversarial questions. For instance, answers were rated as having a low risk of harm for 90.6% of Med-PaLM 2 answers, compared to 79.4% for Med-PaLM.

The following figure (Figure 4 from the original paper) illustrates physician evaluation on adversarial questions:

VLM Description: The image is a horizontal bar chart showing the proportion of answers in high-quality rating bins for the different models. The green bars represent the answer proportions for Med-PaLM 2, while the yellow bars correspond to the other model; error ranges are displayed for each bar, reflecting the superior performance of Med-PaLM 2 in medical question answering.

- On MultiMedQA 140 (Lay-person Raters): Lay-persons rated Med-PaLM 2 answers as more helpful and relevant than Med-PaLM answers, for both dimensions (Figure 5, Table A.4).

The following figure (Figure 5 from the original paper) illustrates lay-person evaluation on MultiMedQA 140:

VLM Description: The image is a bar chart comparing three answer sources (Med-PaLM 2, Med-PaLM, and physicians) on how well they address the question's intent and how helpful they are. Med-PaLM 2 outperforms the other two in directly addressing the question intent (89%) and in providing help to the user (64%), particularly in guiding users to a conclusion or clarifying next steps; the Med-PaLM and physician results are shown in different colors.
Answer Lengths
Med-PaLM 2 answers were consistently longer than Med-PaLM and physician answers. For MultiMedQA 140, the median answer length for Med-PaLM 2 was 794 characters, compared to 565.5 for Med-PaLM and 337.5 for physicians. For adversarial questions, Med-PaLM 2 had a median length of 964 characters versus 518 for Med-PaLM (Table A.9). This increased length might contribute to perceived completeness and quality.
Pairwise Ranking Evaluation
The pairwise ranking evaluation provided a more explicit assessment of relative performance, especially on expanded datasets.
The following are the results from Figure 1 (Right) in the original paper (reproduced here from the abstract description):
- In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001).
- For example, Med-PaLM 2 answers were judged to better reflect medical consensus 72.9% of the time compared to physician answers.

The following are the results from Table A.5 of the original paper:
| Rating type | Med-PaLM 2 Answer Selected | Physician Answer Selected | Tie | p value |
| Better reflects consensus | 0.729 [0.702, 0.755] | 0.118 [0.099, 0.137] | 0.153 [0.131, 0.175] | <0.001 |
| Better reading comprehension | 0.569 [0.540, 0.599] | 0.096 [0.079, 0.114] | 0.335 [0.305, 0.363] | <0.001 |
| Better knowledge recall | 0.801 [0.776, 0.824] | 0.088 [0.072, 0.105] | 0.112 [0.093, 0.130] | <0.001 |
| Better reasoning | 0.730 [0.702, 0.756] | 0.084 [0.068, 0.101] | 0.186 [0.163, 0.210] | <0.001 |
| More inaccurate or irrelevant information | 0.266 [0.240, 0.292] | 0.141 [0.120, 0.162] | 0.594 [0.564, 0.624] | <0.001 |
| Omits more information | 0.063 [0.049, 0.078] | 0.640 [0.611, 0.669] | 0.297 [0.269, 0.324] | <0.001 |
| More evidence of demographic bias | 0.013 [0.007, 0.020] | 0.043 [0.031, 0.057] | 0.943 [0.929, 0.957] | <0.001 |
| Greater extent of harm | 0.064 [0.050, 0.079] | 0.418 [0.388, 0.448] | 0.518 [0.488, 0.548] | <0.001 |
| Greater likelihood of harm | 0.067 [0.053, 0.082] | 0.445 [0.415, 0.474] | 0.488 [0.457, 0.518] | <0.001 |

- Med-PaLM 2 vs. Physician Answers (on MultiMedQA 1066): Physicians preferred Med-PaLM 2 answers over physician answers on eight of nine axes (p < 0.001). This includes better reflection of medical consensus, reading comprehension, knowledge recall, reasoning, and lower perceived harm likelihood and extent. The only axis on which Med-PaLM 2 was not more favorable was "more inaccurate or irrelevant information," where physician answers were preferred. This suggests that while Med-PaLM 2 is generally better, it may still occasionally include superfluous details.

The following figure (Figure 6 from the original paper) illustrates the ranking comparison of long-form answers, focusing on Med-PaLM 2 vs. Med-PaLM:

VLM Description: The image is a chart comparing Med-PaLM 2 and Med-PaLM in terms of high-quality answer traits and potential answer risks, including evaluations of consensus, reading comprehension, and knowledge recall. The data suggest that Med-PaLM 2 performs better overall across multiple dimensions.

- Med-PaLM 2 vs. Med-PaLM Answers (on MultiMedQA 1066): Med-PaLM 2 answers were rated as higher quality than Med-PaLM answers on the same eight axes (p < 0.001) (Figure 6, Table A.6). The difference for "more inaccurate or irrelevant information" was not significant.

The following are the results from Table A.6 of the original paper:
| Rating type | Metric, Med-PaLM 2 | Metric, Med-PaLM | Metric, Tie | p value |
| Better reflects consensus | 0.573 [0.543, 0.602] | 0.215 [0.191, 0.241] | 0.212 [0.189, 0.238] | <0.001 |
| Better reading comprehension | 0.432 [0.402, 0.462] | 0.181 [0.158, 0.205] | 0.387 [0.357, 0.416] | <0.001 |
| Better knowledge recall | 0.579 [0.550, 0.609] | 0.210 [0.187, 0.236] | 0.210 [0.187, 0.235] | <0.001 |
| Better reasoning | 0.566 [0.536, 0.595] | 0.218 [0.194, 0.244] | 0.216 [0.191, 0.241] | <0.001 |
| More inaccurate or irrelevant information | 0.184 [0.161, 0.208] | 0.215 [0.191, 0.240] | 0.601 [0.572, 0.631] | 0.122 |
| Omits more information | 0.140 [0.119, 0.162] | 0.427 [0.398, 0.457] | 0.432 [0.403, 0.462] | <0.001 |
| More evidence of demographic bias | 0.019 [0.011, 0.027] | 0.036 [0.026, 0.047] | 0.945 [0.931, 0.958] | 0.027 |
| Greater extent of harm | 0.137 [0.118, 0.158] | 0.347 [0.318, 0.375] | 0.516 [0.485, 0.545] | <0.001 |
| Greater likelihood of harm | 0.148 [0.127, 0.170] | 0.351 [0.321, 0.379] | 0.501 [0.471, 0.531] | <0.001 |

- On Adversarial Questions (Med-PaLM 2 vs. Med-PaLM): Med-PaLM 2 was ranked more favorably than Med-PaLM across every axis, often by substantial margins, further reinforcing its robustness in challenging scenarios.
6.2. Data Presentation (Tables)
The following are the results from Table A.1 of the original paper:
| Dataset | Overlap Fraction | Performance (without Overlap) | Performance (with Overlap) | Delta |
| MedQA (USMLE) | 142/1273 (11.2%) | 85.3 [83.3, 87.4] | 85.9 [80.2, 91.6] | -0.6 [-5.8, 6.4] |
| PubMedQA | 67/500 (13.4%) | 74.1 [70.0, 78.3] | 73.1 [62.5, 83.7] | 1.0 [-9.1, 13.3] |
| MedMCQA | 1021/4183 (24.4%) | 70.5 [68.9, 72.1] | 74.4 [71.8, 77.1] | -4.0 [-7.0, -0.8] |
| MMLU Clinical knowledge | 56/265 (21.1%) | 88.5 [84.2, 92.8] | 87.5 [78.8, 96.2] | 1.0 [-7.1, 12.7] |
| MMLU Medical genetics | 56/100 (56.0%) | 93.2 [85.7, 100.0] | 91.1 [83.6, 98.5] | 2.1 [-10.4, 13.4] |
| MMLU Anatomy | 39/135 (28.9%) | 82.3 [74.7, 89.9] | 89.7 [80.2, 99.3] | -7.5 [-18.2, 7.3] |
| MMLU-Professional medicine | 149/272 (54.8%) | 84.6 [78.2, 90.9] | 94.6 [91.0, 98.3] | -10.1 [-18.0, -2.9] |
| MMLU-College biology | 69/144 (47.9%) | 94.7 [89.6, 99.8] | 97.1 [93.1, 100.0] | -2.4 [-10.3, 5.3] |
| MMLU-College medicine | 70/173 (40.5%) | 79.6 [71.8, 87.4] | 85.7 [77.5, 93.9] | -6.1 [-16.7, 0.4] |
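As a rough illustration of how the overlap fractions in Table A.1 might be obtained, the sketch below flags a test question as overlapping when a long contiguous, normalized substring of its text also appears in a training document. The `has_overlap` helper, the 200-character threshold, and the naive linear scan are assumptions for illustration only; the paper's actual matching rule and corpus tooling are not reproduced here.

```python
def has_overlap(question: str, corpus_docs, min_chars: int = 200) -> bool:
    """Flag a test question as overlapping with the training corpus if any
    contiguous substring of at least `min_chars` characters appears verbatim
    in a training document. `min_chars` is an illustrative threshold.
    """
    q = " ".join(question.split()).lower()  # normalize whitespace and case
    if len(q) <= min_chars:
        windows = [q]
    else:
        windows = [q[i:i + min_chars] for i in range(len(q) - min_chars + 1)]

    # Naive O(questions x docs) scan; real pipelines would use hashing or
    # suffix-array indexes over the corpus instead.
    for doc in corpus_docs:
        d = " ".join(doc.split()).lower()
        if any(w in d for w in windows):
            return True
    return False

# Hypothetical usage: split a benchmark into overlapping / non-overlapping
# subsets and compare accuracy on each, as in Table A.1.
# overlapping = [q for q in test_questions if has_overlap(q, training_corpus)]
```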
The following are the results from Table A.2 of the original paper:
| Rating type | Metric, Med-PaLM 2 [CI] | Metric, Med-PaLM [CI] | Metric, Physician [CI] | p Med-PaLM 2 vs. Med-PaLM | p Med-PaLM 2 vs. Physician |
| Answer supported by consensus | 0.917 [0.890, 0.943] | 0.929 [0.879, 0.971] | 0.921 [0.879, 0.964] | 0.725 | 0.890 |
| Possible harm extent = No harm | 0.933 [0.910, 0.955] | 0.943 [0.900, 0.979] | 0.929 [0.886, 0.971] | 0.687 | 0.950 |
| Low likelihood of harm | 0.955 [0.936, 0.974] | 0.979 [0.950, 1.000] | 0.971 [0.943, 0.993] | 0.287 | 0.439 |
| Shows evidence of question comprehension | 0.983 [0.969, 0.995] | 0.936 [0.886, 0.971] | 0.971 [0.943, 0.993] | 0.056 | 0.655 |
| Shows evidence of knowledge recall | 0.971 [0.957, 0.988] | 0.936 [0.893, 0.971] | 0.971 [0.943, 0.993] | 0.313 | 1.000 |
| Shows evidence of reasoning | 0.974 [0.957, 0.988] | 0.914 [0.864, 0.964] | 0.971 [0.943, 0.993] | 0.030 | 0.858 |
| No sign of incorrect comprehension | 0.986 [0.974, 0.995] | 0.943 [0.900, 0.979] | 0.971 [0.943, 0.993] | 0.108 | 0.713 |
| No sign of incorrect knowledge recall | 0.933 [0.912, 0.955] | 0.829 [0.764, 0.886] | 0.950 [0.914, 0.986] | 0.022 | 0.523 |
| No sign of incorrect reasoning | 0.962 [0.943, 0.979] | 0.886 [0.829, 0.936] | 0.964 [0.929, 0.993] | 0.032 | 0.820 |
| No inaccurate or irrelevant information | 0.900 [0.871, 0.926] | 0.814 [0.750, 0.879] | 0.971 [0.943, 0.993] | 0.066 | 0.076 |
| No missing important content | 0.881 [0.848, 0.914] | 0.850 [0.786, 0.907] | 0.871 [0.814, 0.921] | 0.427 | 0.784 |
| No sign of bias towards specific subgroups | 0.971 [0.955, 0.986] | 0.993 [0.979, 1.000] | 0.971 [0.943, 0.993] | 0.429 | 1.000 |
The following are the results from Table A.3 of the original paper:
| Rating | Question set | Metric, Med-PaLM 2 | Metric, Med-PaLM | p value |
| Answer supported by consensus | All Adversarial questions | 0.769 [0.733, 0.803] | 0.585 [0.544, 0.626] | 0.000 |
|  | Health equity questions | 0.784 [0.742, 0.826] | 0.590 [0.539, 0.640] | 0.000 |
| Possible harm extent = No harm | General questions | 0.746 [0.690, 0.802] | 0.578 [0.513, 0.642] | 0.001 |
|  | All Adversarial questions | 0.786 [0.752, 0.820] | 0.619 [0.580, 0.658] | 0.000 |
|  | Health equity questions | 0.764 [0.719, 0.809] | 0.576 [0.525, 0.626] | 0.000 |
| Low likelihood of harm | General questions | 0.819 [0.767, 0.866] | 0.685 [0.625, 0.746] | 0.005 |
|  | All Adversarial questions | 0.906 [0.883, 0.929] | 0.794 [0.762, 0.827] | 0.000 |
|  | Health equity questions | 0.913 [0.882, 0.941] | 0.784 [0.739, 0.826] | 0.000 |
| Shows evidence of question comprehension | General questions | 0.897 [0.853, 0.935] | 0.810 [0.759, 0.858] | 0.019 |
|  | All Adversarial questions | 0.949 [0.930, 0.966] | 0.871 [0.844, 0.896] | 0.000 |
|  | Health equity questions | 0.949 [0.924, 0.972] | 0.868 [0.831, 0.902] | 0.000 |
| Shows evidence of knowledge recall | General questions | 0.948 [0.918, 0.974] | 0.875 [0.832, 0.918] | 0.002 |
|  | All Adversarial questions | 0.969 [0.956, 0.983] | 0.827 [0.796, 0.857] | <0.001 |
|  | Health equity questions | 0.969 [0.949, 0.986] | 0.823 [0.781, 0.862] | <0.001 |
| Shows evidence of reasoning | General questions | 0.970 [0.944, 0.991] | 0.832 [0.780, 0.879] | <0.001 |
|  | All Adversarial questions | 0.959 [0.942, 0.974] | 0.811 [0.779, 0.842] | <0.001 |
|  | Health equity questions | 0.955 [0.933, 0.975] | 0.806 [0.764, 0.846] | <0.001 |
| No sign of incorrect comprehension | General questions | 0.966 [0.940, 0.987] | 0.819 [0.767, 0.866] | <0.001 |
|  | All Adversarial questions | 0.947 [0.929, 0.964] | 0.855 [0.827, 0.883] | <0.001 |
|  | Health equity questions | 0.947 [0.921, 0.969] | 0.854 [0.817, 0.890] | <0.001 |
| No sign of incorrect knowledge recall | General questions | 0.948 [0.918, 0.974] | 0.858 [0.810, 0.901] | 0.001 |
|  | All Adversarial questions | 0.857 [0.828, 0.884] | 0.709 [0.672, 0.745] | <0.001 |
|  | Health equity questions | 0.868 [0.831, 0.902] | 0.722 [0.674, 0.770] | <0.001 |
| No sign of incorrect reasoning | General questions | 0.841 [0.793, 0.884] | 0.690 [0.629, 0.750] | 0.001 |
|  | All Adversarial questions | 0.961 [0.944, 0.976] | 0.798 [0.765, 0.830] | <0.001 |
|  | Health equity questions | 0.955 [0.933, 0.975] | 0.795 [0.753, 0.837] | <0.001 |
| No inaccurate or irrelevant information | General questions | 0.970 [0.944, 0.991] | 0.802 [0.750, 0.853] | <0.001 |
|  | All Adversarial questions | 0.847 [0.816, 0.874] | 0.651 [0.612, 0.690] | <0.001 |
|  | Health equity questions | 0.848 [0.812, 0.882] | 0.638 [0.587, 0.685] | <0.001 |
| No missing important content | General questions | 0.845 [0.797, 0.888] | 0.672 [0.612, 0.733] | 0.002 |
|  | All Adversarial questions | 0.808 [0.776, 0.838] | 0.614 [0.575, 0.653] | <0.001 |
|  | Health equity questions | 0.806 [0.764, 0.846] | 0.587 [0.534, 0.638] | <0.001 |
| No sign of bias towards specific subgroups | General questions | 0.810 [0.759, 0.862] | 0.655 [0.595, 0.716] | 0.002 |
|  | All Adversarial questions | 0.964 [0.949, 0.978] | 0.871 [0.844, 0.898] | <0.001 |
|  | Health equity questions | 0.958 [0.935, 0.978] | 0.860 [0.823, 0.896] | <0.001 |
The following are the results from Table A.4 of the original paper:
| Rating type | Metric, Med-PaLM 2 | Metric, Med-PaLM | p value |
| Directly addresses query intent | 0.893 [0.836, 0.943] | 0.736 [0.664, 0.807] | 0.002 |
| Answer is extremely helpful | 0.643 [0.564, 0.721] | 0.171 [0.107, 0.236] | 0.000 |
The following are the results from Table A.9 of the original paper (summary statistics of long-form answer lengths):
| Dataset | Answerer | mean | std | min | 25% | 50% | 75% | max |
| MultiMedQA 140 | Med-PaLM 2 | 851.29 | 378.46 | 198 | 576.5 | 794 | 1085 | 2226 |
|  | Med-PaLM | 597.24 | 298.76 | 105 | 347 | 565.5 | 753.25 | 1280 |
|  | Physician | 343.14 | 113.72 | 90 | 258.75 | 337.5 | 419.5 | 615 |
| Adversarial | Med-PaLM 2 | 1,014.18 | 392.23 | 231 | 733.25 | 964 | 1242.25 | 2499 |
|  | Med-PaLM | 582.91 | 353.50 | 34 | 300 | 518 | 840.25 | 1530 |
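Summary statistics like those in Table A.9 are straightforward to reproduce for any set of answers. The sketch below shows one way to do it with pandas, assuming lengths are measured in characters (not stated explicitly in this summary); the records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical long-form answers keyed by (dataset, answerer); in practice these
# would be the model and physician answers evaluated in the study.
records = [
    {"dataset": "MultiMedQA 140", "answerer": "Med-PaLM 2", "answer": "..."},
    {"dataset": "MultiMedQA 140", "answerer": "Physician", "answer": "..."},
    # ...
]

df = pd.DataFrame(records)
df["length"] = df["answer"].str.len()  # character count per answer (an assumption)

# describe() yields count, mean, std, min, 25%, 50%, 75%, max per group,
# matching the columns of Table A.9 (plus a count column).
summary = df.groupby(["dataset", "answerer"])["length"].describe()
print(summary)
```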
6.3. Ablation Studies / Parameter Analysis
While the paper does not present traditional ablation studies where specific components of Med-PaLM 2 are removed, the comparison of different prompting strategies (few-shot prompting, chain-of-thought with self-consistency, and ensemble refinement) serves a similar purpose by demonstrating the incremental value of these advanced techniques.
- Impact of Prompting Strategies: As seen in Table 5, Ensemble Refinement (ER) consistently outperforms few-shot prompting and Chain-of-Thought (CoT) + Self-Consistency (SC) across most multiple-choice benchmarks. This highlights ER as a critical component contributing to Med-PaLM 2's superior performance. For instance, on MedQA, moving from CoT + SC (83.7%) to ER (85.4%) provides a tangible performance boost. This analysis indicates that the method of eliciting responses and reasoning from the LLM is a key design choice affecting the results; a minimal sketch of the ER procedure appears after this list.
- Overlap Analysis as a Robustness Check: The overlap analysis (Table 6 and Table A.1) acts as a form of robustness check rather than a direct ablation. It investigates whether Med-PaLM 2's performance is inflated due to test set contamination. The finding that performance differences on overlapping versus non-overlapping questions were minimal (and rarely statistically significant) suggests that Med-PaLM 2's strong results are robust and not merely a result of memorizing training data present in the test sets.

These analyses effectively demonstrate the efficacy of the novel ensemble refinement approach and provide confidence in the generalizability of Med-PaLM 2's performance.
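Based on the description of ensemble refinement in this analysis (stochastic first-stage sampling, a second-stage refinement conditioned on the model's own candidates, and a plurality vote for multiple-choice answers), a minimal sketch might look as follows. The `generate` callable, prompt wording, and temperature are placeholders, and the default stage sizes of 11 and 33 follow the figures quoted later in this summary.

```python
from collections import Counter
from typing import Callable

def ensemble_refinement(
    question: str,
    generate: Callable[[str, float], str],  # user-supplied LLM call: (prompt, temperature) -> text
    n_stage1: int = 11,
    n_stage2: int = 33,
) -> str:
    """Two-stage ensemble refinement for a multiple-choice question (sketch)."""
    # Stage 1: sample several chain-of-thought explanations and answers.
    stage1 = [
        generate(f"{question}\nExplain your reasoning, then give a final answer.", 0.7)
        for _ in range(n_stage1)
    ]

    # Stage 2: condition the model on its own candidates and ask for a refined answer.
    context = "\n\n".join(f"Candidate {i + 1}: {s}" for i, s in enumerate(stage1))
    refine_prompt = (
        f"{question}\n\nHere are several candidate explanations and answers:\n"
        f"{context}\n\nConsidering these, give your single best final answer."
    )
    stage2 = [generate(refine_prompt, 0.7) for _ in range(n_stage2)]

    # Plurality vote over the refined answers (assumes short answer labels, e.g. "B").
    return Counter(a.strip() for a in stage2).most_common(1)[0][0]
```

The repeated sampling in both stages is what makes ER expensive relative to a single few-shot call, which is consistent with the resource-cost caveat discussed in the critique below.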
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents Med-PaLM 2, a significant leap forward in achieving expert-level medical question answering with Large Language Models. By combining an enhanced base LLM (PaLM 2), targeted medical domain finetuning, and innovative prompting strategies—most notably ensemble refinement—Med-PaLM 2 has set new benchmarks. It achieved a state-of-the-art 86.5% on the MedQA dataset, substantially outperforming its predecessor Med-PaLM. Furthermore, rigorous human evaluations by physicians demonstrated a preference for Med-PaLM 2 answers over even physician-generated responses across eight of nine critical clinical utility axes, and showed marked improvements over Med-PaLM on challenging adversarial questions. These results underscore the rapid progress towards LLMs attaining physician-level capabilities in understanding and responding to medical inquiries.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and propose future research directions:
- Evaluation Framework Refinement: The space of medical information needs is vast and complex. The current evaluation rubric, while robust, may need additional dimensions (e.g., empathy [26]). Further research is required to enhance the rigor of rubrics for human evaluation of LLM performance in medical QA.
- Generalizability of Physician Answers: Physicians generating answers were given general instructions (produce answers useful for lay-persons) but lacked specific clinical scenarios or nuanced communication requirements. This might not reflect all real-world settings. Future work should ground evaluations in highly specific workflows and clinical scenarios.
- Answer Length: Med-PaLM 2 answers were often longer than physician answers, which might contribute to improved ratings. The optimal length of an answer is context-dependent.
- Multi-turn Dialogue & Information Acquisition: The current evaluation did not consider multi-turn dialogue [46] or frameworks for active information acquisition [47]. In real clinical settings, requesting more information (e.g., about patient history) might be more appropriate than providing a comprehensive list of possibilities.
- Inter-rater Variation in Preference: The study did not explicitly assess inter-rater variation in preference rankings or explore how this variation relates to raters' lived experiences or expectations.
- Limited Physician Answer Assessment: Only one answer per question was produced by physicians, offering a limited view of the range of possible high-quality human responses. Future work could assess multiple physician answers, inter-physician variation, and explicitly consider the medical expertise and background of the evaluating physicians.
- Safety, Bias, and Equity Coverage: While adversarial datasets were introduced, the evaluation is not a comprehensive assessment of all safety, bias, and equity considerations. Future work should systematically expand adversarial data, increase coverage of health equity topics, and facilitate disaggregated evaluation over sensitive characteristics [48-50].
- Real-world Validation: Despite the strong results, further studies are necessary to validate the efficacy and safety of these models in real-world clinical settings and workflows before broader uptake.
7.3. Personal Insights & Critique
This paper showcases an impressive step towards LLMs achieving clinical utility in medical question answering. The commitment to rigorous human evaluation, including pairwise comparisons and the creation of adversarial datasets, is particularly commendable. It goes beyond simple accuracy metrics to probe nuanced aspects of clinical utility, safety, and potential biases, which is crucial for a high-stakes domain like healthcare.
Innovations and Strengths:
- The ensemble refinement technique is a clever way to leverage the stochastic nature of LLMs to improve reasoning. By having the model self-reflect and refine based on multiple initial thoughts, it mimics a more deliberative human thought process. This could be transferable to other complex reasoning tasks beyond medicine.
- The demonstration of physician preference for Med-PaLM 2 answers over human-generated ones on most clinical utility axes is a landmark achievement, suggesting that LLMs, when properly aligned, can synthesize information and present it in a way that is perceived as superior by experts.
- The proactive approach to test set contamination through overlap analysis adds to the trustworthiness of the benchmark results.
Potential Issues and Areas for Improvement:
- Dependence on Base Model: The success of Med-PaLM 2 is heavily reliant on the advancements in PaLM 2. While effective, this creates a dependency on proprietary, constantly evolving base models. Understanding whether these improvements generalize to other LLM architectures would be valuable.
- Answer Length vs. Quality: While longer answers might be perceived as more comprehensive, they also risk including inaccurate or irrelevant information, as indicated by physicians still favoring human answers on that specific axis. The optimal "verbosity" is a critical design choice in real-world applications and might require further tuning based on user roles (e.g., patient vs. clinician).
- Cost of Ensemble Refinement: The paper notes the resource cost of ER (11 samplings in stage one, 33 in stage two), which limited its application to multiple-choice questions. This suggests it might not be practical for all real-time or resource-constrained scenarios, especially for long-form answers, where it could theoretically offer benefits.
- Nuance of "Preference": While physicians preferred Med-PaLM 2 answers, the paper highlights that the human answers were generated without specific clinical scenarios. The comparison, while valuable, is therefore not a perfect simulation of a physician's real-world interaction with a patient or colleague. The "preferred" answer might simply be more exhaustive or better structured, which an unconstrained physician could also produce given more time and specific instructions.
- Unverified Assumptions: The paper touches upon bias and harm but also notes the evaluation is not comprehensive. The perception of "low likelihood of harm" is a crucial metric, and its real-world implications need extensive, long-term validation.

Overall, Med-PaLM 2 represents a compelling case for LLMs as powerful tools in medicine. The paper thoughtfully tackles the complexities of evaluation in a safety-critical domain, paving the way for future research to address the remaining gaps and move towards responsible real-world deployment.