ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
TL;DR Summary
This study introduces ImageReward, a general-purpose human preference reward model for text-to-image generation, trained on a systematic annotation pipeline that collected 137,000 expert comparisons. It outperforms existing scoring models and metrics, and the paper further proposes Reward Feedback Learning (ReFL) for directly optimizing diffusion models against the learned reward.
Abstract
We present a comprehensive solution to learn and improve text-to-image models from human preference feedback. To begin with, we build ImageReward -- the first general-purpose text-to-image human preference reward model -- to effectively encode human preferences. Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. On top of it, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm to optimize diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over compared methods. All code and datasets are provided at \url{https://github.com/THUDM/ImageReward}.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
1.2. Authors
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong. The authors are primarily affiliated with Tsinghua University, with additional affiliations from Zhipu AI and Beijing University of Posts and Telecommunications.
1.3. Journal/Conference
The paper was first posted to arXiv, a preprint server, on 12 April 2023. arXiv is a well-respected platform for disseminating research quickly, particularly in machine learning and artificial intelligence. While an arXiv posting is not itself peer reviewed, papers posted there are often subsequently submitted to and published in top-tier venues; its reputation rests on fast and open scientific communication.
1.4. Publication Year
2023
1.5. Abstract
This paper introduces a comprehensive solution for learning from and improving text-to-image (T2I) models based on human preference feedback. The core of their approach is ImageReward, presented as the first general-purpose T2I human preference reward model. This model is trained using a systematic annotation pipeline, encompassing rating and ranking, which has accumulated 137,000 expert comparisons. ImageReward is shown to outperform existing scoring models and metrics in human evaluation, establishing it as a promising automatic metric for evaluating T2I synthesis. Building upon ImageReward, the authors propose Reward Feedback Learning (ReFL), a direct tuning algorithm designed to optimize diffusion models using feedback from a scorer. Both automatic and human evaluations demonstrate ReFL's advantages over alternative methods. The code and datasets associated with this work are publicly available on GitHub.
1.6. Original Source Link
https://arxiv.org/abs/2304.05977 The paper is available as a preprint on arXiv. PDF Link: https://arxiv.org/pdf/2304.05977v4.pdf
2. Executive Summary
2.1. Background & Motivation
The rapid advancements in text-to-image (T2I) generative models, particularly diffusion-based approaches, have enabled the creation of high-fidelity and semantically-related images from text prompts. Despite this progress, these models still face significant challenges in aligning with nuanced human preferences. The primary issues stem from the fact that existing models are often self-supervised pre-trained on noisy data distributions that do not perfectly match actual user-prompt distributions. This discrepancy leads to several well-documented problems in generated images:
- Text-image Alignment: Models often fail to accurately depict all specified numbers, attributes, properties, and relationships described in text prompts.
- Body Problem: Generated images frequently feature distorted, incomplete, duplicated, or abnormal body parts of humans or animals.
- Human Aesthetic: The outputs can deviate from mainstream human aesthetic preferences.
- Toxicity and Biases: Models may generate harmful, violent, sexual, discriminatory, illegal, or psychologically disturbing content.
These challenges are difficult to address solely through architectural improvements or more extensive pre-training data. The paper draws inspiration from the success of reinforcement learning from human feedback (RLHF) in natural language processing (NLP), which has been instrumental in aligning large language models with human preferences and values. RLHF relies on training a reward model (RM) to capture human preferences from expert-annotated comparisons of model outputs. However, extending RLHF to T2I generation presents unique challenges, particularly the costly and complex annotation process for images and the difficulty of applying traditional RL algorithms to diffusion models, which do not provide likelihoods for their generations.
The paper's entry point is to tackle these problems by systematically creating a human preference reward model for T2I generation and then developing a method to directly optimize generative models using this feedback.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of text-to-image generation:
- ImageReward Model: The authors introduce ImageReward, the first general-purpose text-to-image human preference reward model, designed specifically to encode human preferences for T2I outputs.
- Systematic Annotation Pipeline and Dataset: A novel, systematic annotation pipeline was developed, incorporating both rating and ranking stages. This rigorous process led to the collection of a dataset comprising 137,000 expert comparisons, which serves as the foundation for training ImageReward.
- Superior Evaluation Metric: ImageReward is demonstrated to significantly outperform existing scoring models and metrics (such as CLIP, Aesthetic, and BLIP scores) in understanding and predicting human preferences for T2I synthesis. It also exhibits better distinguishability across models and individual samples and aligns strongly with human rankings.
- Reward Feedback Learning (ReFL): The paper proposes ReFL, a direct tuning algorithm tailored for optimizing diffusion models using feedback from a scorer (like ImageReward). This addresses the limitation of traditional RLHF approaches, which are not directly applicable to diffusion models due to their lack of explicit likelihoods. A key insight for ReFL is that ImageReward scores become reliably identifiable at later denoising steps of a diffusion process, enabling direct gradient-based optimization.
- Improved Generative Models: Both automatic and human evaluations validate ReFL's advantages over alternative methods, including data augmentation and loss re-weighting, in improving the quality and human alignment of generated images. The ReFL-trained models demonstrate enhanced text coherence and better adherence to human preferences.
- Open-sourced Resources: All code and datasets developed in this research are made publicly available, fostering reproducibility and further research in the community.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Text-to-Image (T2I) Generation Models: These are artificial intelligence models that create visual images from textual descriptions (prompts). They have revolutionized digital art and content creation. Early T2I models were based on generative adversarial networks (GANs), while more recent and higher-performing models leverage diffusion models or auto-regressive models.
  - Auto-regressive Models: These models generate images pixel by pixel or token by token, conditioning each new generation on the previously generated elements. Examples include DALL-E and CogView.
  - Diffusion Models: These models work by iteratively denoising an image that starts as pure random noise. They learn to reverse a diffusion process that gradually adds noise to an image. Latent Diffusion Models (LDMs), like Stable Diffusion, perform this process in a compressed latent space, making them more computationally efficient.
- Reinforcement Learning from Human Feedback (RLHF): A training paradigm where a model (often a large language model, or LLM) is fine-tuned using feedback provided by humans. This feedback is typically used to train a reward model (RM), which then assigns a score to different model outputs. The LLM is subsequently optimized (e.g., using Proximal Policy Optimization (PPO)) to generate outputs that maximize this reward score, thus aligning the model's behavior with human preferences.
- Reward Model (RM): In the context of RLHF, a reward model is a neural network trained to predict human preferences or scores for a given model output (e.g., a generated text response or an image). It learns from human comparisons (e.g., "output A is better than output B"). Once trained, the RM can provide a continuous feedback signal for optimizing the generative model, circumventing the need for constant human supervision during the RL phase.
- CLIP (Contrastive Language–Image Pre-training): A neural network trained on a massive dataset of image-text pairs to understand the relationship between images and text. CLIP can compute a similarity score between any given image and text prompt without being specifically trained on that task. It consists of an image encoder and a text encoder, both of which project inputs into a shared embedding space where semantic similarity can be measured (e.g., using cosine similarity); a short code sketch follows this list.
- BLIP (Bootstrapping Language-Image Pre-training): Another vision-language model that improves upon CLIP by bootstrapping its training data and using a more unified architecture for both understanding and generation tasks. It generally achieves better performance on various vision-language benchmarks due to its multi-modal mixture of encoder-decoder architecture.
- FID (Fréchet Inception Distance): A widely used metric for evaluating the quality of images generated by generative adversarial networks (GANs) and diffusion models. FID measures the similarity between the feature distributions of generated images and real images by calculating the Fréchet distance between two multivariate Gaussian distributions fitted to features extracted from an Inception-v3 model for the real and generated image sets. A lower FID score indicates higher quality and diversity of generated images, closer to the real data distribution.
- Aesthetic Score: A score provided by a specialized predictor (often built on top of CLIP embeddings) that aims to quantify the aesthetic quality or appeal of an image according to human judgment. These predictors are typically trained on datasets where images have been rated by humans for aesthetics.
- Denoising Diffusion Probabilistic Models (DDPMs): The foundational theory for many diffusion models. These models learn to reverse a Markov chain of gradually adding Gaussian noise to data. During generation (sampling), the model starts from pure noise and iteratively predicts and removes the noise to produce a clear image.
- Latent Diffusion Models (LDMs): An advancement over DDPMs that performs the diffusion process in a compressed latent space rather than directly on pixel space. This makes LDMs significantly more efficient for generating high-resolution images while maintaining quality. Stable Diffusion is a prominent example of an LDM.
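To make the shared-embedding idea behind the CLIP score concrete, here is a small hedged sketch using the Hugging Face `transformers` implementation of CLIP. The checkpoint name and the image path are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: CLIP-style text-image similarity via a shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")                                     # hypothetical generated image
prompt = "a painting of an ocean with clouds and birds"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity of the two embeddings is the "CLIP score" of this text-image pair.
score = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
print(f"CLIP score: {score:.4f}")
```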
3.2. Previous Works
- Early T2I Models and Evolution: The paper references GANs [17] as early pioneers, followed by auto-regressive models like DALL-E [43], CogView [11], and Make-A-Scene [16], and more recently, diffusion models such as GLIDE [37], DALL-E 2 [42], Stable Diffusion [45], and Imagen [46]. These models have continually improved image fidelity and text alignment but often struggle with subtle human preferences.
- RLHF in NLP: The paper highlights RLHF's success in NLP with models like InstructGPT [39], ChatGPT, and WebGPT [36]. These works demonstrated that training a reward model from human comparisons and then optimizing language models with Proximal Policy Optimization (PPO) [51] could align LLMs with human values (e.g., helpfulness, harmlessness, truthfulness).
- Evaluation Metrics for T2I:
  - FID [18] and Inception Score (IS) [4]: These are commonly used objective metrics. However, the paper points out their limitations: they measure statistical similarity to real images but do not capture human preference, struggle with single-image evaluation, and zero-shot FID can be problematic due to potential data leakage.
  - CLIP Score [41]: Often used to evaluate text-image coherence by measuring the cosine similarity of embeddings. While useful for alignment, it does not fully capture human aesthetic preference or other aspects of "goodness" from a human perspective.
  - Aesthetic Predictors [50]: Models trained to predict image aesthetics, often using CLIP features. These contribute to assessing one aspect of human preference but lack comprehensive evaluation of text-image alignment and other factors.
- Prior T2I Preference Alignment Efforts: The paper discusses concurrent works aimed at aligning T2I models with human feedback, categorizing them as indirect:
  - Dataset Filtering [61; 13]: Uses a reward model to filter datasets, selecting high-quality samples or identifying low-quality ones for specific handling during fine-tuning.
  - Loss Re-weighting [23]: Adjusts the weights of training losses based on a reward model's scores, giving more emphasis to preferred samples.
  - RAFT [13]: Proposes Reward Ranked Fine-Tuning, which constructs datasets of generated images with higher rewards for fine-tuning.

  The paper argues that these approaches are indirect, as they primarily manipulate data or loss coefficients rather than directly optimizing the generative process based on RM gradients.
3.3. Technological Evolution
The evolution of generative models for images has moved from GANs (e.g., StyleGAN) which could generate realistic images but were often hard to train and control, to auto-regressive models (e.g., DALL-E) that offered better text conditioning but were computationally intensive for high resolutions. The breakthrough of diffusion models (e.g., DDPMs, LDMs like Stable Diffusion) provided a more stable and high-quality generation process. Concurrently, vision-language models (CLIP, BLIP) emerged, demonstrating powerful text-image understanding.
The success of RLHF in NLP, particularly in aligning LLMs with complex human instructions and safety standards, highlighted a pathway for improving generative models beyond purely objective metrics. However, applying RLHF to image generation is challenging due to the continuous and high-dimensional nature of image data, the difficulty of obtaining human feedback at scale, and the architectural differences of diffusion models compared to LLMs (e.g., lack of explicit likelihoods). This paper's work represents a significant step in adapting RLHF principles for T2I models, bridging the gap between objective quality and subjective human preferences by creating a specialized reward model and a direct optimization strategy.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach offers several core differences and innovations:
- First General-Purpose T2I Human Preference RM: While CLIP, Aesthetic, and BLIP provide scores, they are not explicitly trained as general-purpose human preference reward models for T2I generation, encompassing alignment, fidelity, and harmlessness, and tuned specifically from human comparisons in this domain. ImageReward is built for this comprehensive task based on a large, systematically annotated dataset.
- Rigorous Annotation Pipeline: Unlike some concurrent works that rely on noisy online user clicking data [21, 61, 13], ImageReward's training dataset is built on a systematic annotation pipeline with expert comparisons, specific labeling criteria, annotator training, and quality validation. This ensures higher quality and consistency of the human feedback.
- Direct Tuning of Diffusion Models (ReFL): Most prior and concurrent works aiming to use human feedback to improve T2I models (Dataset Filtering [61], Reward Weighted [23], RAFT [13]) adopt indirect methods like data augmentation or loss re-weighting. These approaches are often limited by the quality of the filtered data or the effectiveness of the re-weighting scheme. ReFL, in contrast, proposes a direct tuning algorithm that leverages the reward model's gradients to optimize the diffusion model's parameters. This is a more direct and potentially powerful approach, especially given the identified insight that ImageReward scores become reliable at later denoising steps.
- Addressing LDM Likelihood Limitations: ReFL specifically addresses the challenge that LDMs do not provide explicit likelihoods (a requirement for traditional RLHF methods like the PPO approach used in NLP). By using ImageReward to provide feedback on intermediate denoised latents, ReFL finds a novel way to apply preference-based optimization to diffusion models.
4. Methodology
4.1. Principles
The core idea behind this work is to leverage human preference feedback to systematically improve text-to-image (T2I) generative models. This involves two main principles:
- Learning Human Preferences: Create a reward model (RM) that can accurately quantify human preferences for generated images given a text prompt. This RM (ImageReward) is trained on a large dataset of expert human comparisons, encoding complex criteria like text-image alignment, image fidelity, and harmlessness.
- Optimizing Generative Models with RM Feedback: Develop a method to use the learned RM as a scorer to directly guide the training of diffusion models. This direct tuning approach (Reward Feedback Learning, or ReFL) allows the generative model to produce outputs that are inherently more aligned with human preferences, bypassing the inefficiencies of filtering or re-weighting existing data.
4.2. Core Methodology In-depth (Layer by Layer)
The overall architecture for ImageReward and ReFL is depicted in Figure 2.
The image (Figure 2 from the original paper) illustrates the overall workflow of ImageReward and ReFL. The upper section details the ImageReward process, starting with Data Collection (real-world user prompts from DiffusionDB), followed by Annotation (human experts rating and ranking images), and finally Preference Learning to train the ImageReward model. The lower section shows ReFL, where the ImageReward model provides feedback to directly optimize the diffusion model at a random later denoising step through backpropagation.
4.2.1. ImageReward: Learning to Score and Evaluate Human Preferences
ImageReward is constructed through a systematic pipeline focusing on data collection, human annotation from experts, and reward model (RM) training.
4.2.1.1. Annotation Pipeline Design
The annotation pipeline is crucial for collecting high-quality human preference data.
- Prompt Selection and Image Collection:
  - Source: Real user prompts are sourced from DiffusionDB [58], a large open-source dataset.
  - Diversity: To ensure a diverse and representative prompt distribution, a graph-based algorithm is employed. It uses Sentence-BERT [44] to calculate prompt similarity; Sentence-BERT transforms sentences into dense vector embeddings, allowing for efficient similarity comparison.
  - Process: Prompts are grouped into sets (e.g., 100 sets of 20k prompts). Within each set, the graph-based algorithm selects 100 prompts. Each prompt is a vertex connected to its nearest neighbors by cosine similarity of Sentence-BERT embeddings; a score is calculated for each vertex based on its unselected neighbors, and vertices with high scores are iteratively selected.
  - Result: 10,000 candidate prompts are selected, each paired with 4 to 9 sampled images from DiffusionDB, totaling 177,304 candidate pairs for annotation.
- Human Annotation Design: The pipeline involves three stages to ensure comprehensive and consistent feedback.
  - Prompt Annotation:
    - Categorization: Annotators classify prompts based on a schema adapted from Parti [63] (e.g., Abstract, Animals, People, Outdoor Scenes). This helps in understanding model performance across different domains.
    - Problem Identification: Annotators identify problematic prompts, such as ambiguous or unclear ones, or those containing toxic content (pornographic, violent, discriminatory).
  - Text-Image Rating: For each generated image corresponding to a prompt, annotators rate it on three key measurements using seven-level Likert scales (1 = worst, 7 = best):
    - Alignment: How faithfully the image depicts objects, attributes, and relationships described in the prompt.
    - Fidelity: The quality of the image itself, including realism, aesthetic appeal, and absence of visual errors (e.g., distorted body parts).
    - Harmlessness: Absence of toxic, illegal, biased, or psychologically discomforting content.
    - Problem Checkboxes: Annotators also check specific issues such as "Obvious 'repeated generation'," "Existence of body problem," "Too blurry to see objects," "Causes psychological discomfort," "Output contains sexual/violent content," or "Output contains content that defames certain groups." Screenshots of the annotation system are provided in Figure 8 of the paper's appendix.

    The image (Figure 8a from the original paper) shows the user interface for Text-Image Rating. Annotators rate Overall Rating, Image-Text Alignment, and Fidelity using a 7-point Likert scale. They also select checkboxes to identify specific issues like repeated generation, body problems, blurriness, psychological discomfort, sexual content, violent content, or defamation.
  - Image Ranking: After individual image ratings, annotators rank all generated images for a given prompt from best to worst.
    - Trade-offs: To handle contradictions (e.g., high alignment but high toxicity), explicit trade-off guidelines are provided (detailed in Appendix B). For instance, harmlessness generally outweighs alignment.
    - Slot-based Ranking: Images are dragged into 5 slots representing different preference levels, allowing for ties (at most two images per slot).

    The image (Figure 8b from the original paper) displays the Image Ranking interface. Annotators drag generated images into 5 slots, from "1 = best" to "5 = worst," to express their preference among the images generated for a given prompt.
- Annotation Management:
  - Annotator Recruitment & Training: The authors collaborated with a professional data annotation company. Annotators are trained using detailed documents (Appendix B) describing labeling processes, criteria, and examples. Most annotators have at least college-level education.
  - Quality Control: Quality inspectors double-check annotations, reassigning invalid ones.
- Human Annotation Analysis:
  - Scale: After 2 months, 8,878 valid prompts and 136,892 compared pairs were collected.
  - Prompt Categories (Figure 9 in Appendix A.4): The distribution is diverse, with People (3,360), Arts (2,407), Outdoor Scenes (1,189), and Artifacts (831) being most common.
  - Average Score Distribution by Category (Figure 10 in Appendix A.4): Abstract prompts yield the lowest scores, possibly due to Stable Diffusion's difficulty with vague descriptions. Plants, Outdoor Scenes, and Indoor Scenes generally receive higher scores.
  - Problem Distribution by Category (Figure 11 in Appendix A.4): Body problem is the most severe issue (average 21.14%), most frequent in the People and Arts categories. Repeated generation (average 3.16%) appears in Word Knowledge, People, Arts, and Vehicles. Fuzzy images are common in Abstract, Animal, and Arts. Toxic, pornographic, or violent content is also observed due to problematic prompts, highlighting the models' inability to filter such content.
  - "Function" Words Analysis (Figures 12, 13 in Appendix A.4): Function words (e.g., "8k", "highly detailed") in prompts influence generated quality.
    - As the proportion of function words increases from low to moderate, average scores (alignment, fidelity, overall) generally improve.
    - The frequency of problems like repeated generation and body/fuzzy problems shows a complex trend, initially increasing, then decreasing, and potentially increasing again when function words dominate.
4.2.1.2. RM Training
The goal is to train a reward model (RM) that can function as a virtual evaluator.
- Preference Formulation: Human annotations are formulated as rankings. For a prompt $T$, the $k$ generated images $x_1, \dots, x_k$ (with $4 \le k \le 9$) are ranked from best to worst. This generates up to $k(k-1)/2$ comparison pairs (if there are no ties). For each comparison where $x_i$ is preferred over $x_j$, the model learns to assign a higher score to $x_i$.
- Loss Function: The reward model $f_\theta$ is a neural network that takes a text prompt and an image as input and outputs a scalar preference score. It is trained with a pairwise comparison loss, similar to those used in RLHF for language models (a minimal code sketch of this loss is given at the end of this subsection):
  $ \mathrm{loss}(\theta) = - \mathbb{E}_{(T, x_i, x_j) \sim \mathcal{D}} \left[ \log \sigma \big( f_\theta (T, x_i) - f_\theta (T, x_j) \big) \right] $
  - Symbol Explanation:
    - $\theta$: The parameters of the reward model.
    - $\mathbb{E}_{(T, x_i, x_j) \sim \mathcal{D}}$: Expectation over the dataset $\mathcal{D}$.
    - $(T, x_i, x_j)$: A sampled triplet from the dataset, consisting of a text prompt $T$, an image $x_i$ preferred by humans, and an image $x_j$ less preferred than $x_i$.
    - $\log$: The natural logarithm.
    - $\sigma$: The sigmoid function, $\sigma(z) = 1/(1 + e^{-z})$. It squashes the difference in scores to a probability-like value between 0 and 1.
    - $f_\theta(T, x_i)$: The preference score predicted by the reward model for prompt $T$ and image $x_i$.
    - $f_\theta(T, x_j)$: The preference score predicted by the reward model for prompt $T$ and image $x_j$.
  - Purpose: This loss encourages the reward model to assign a higher score to the preferred image $x_i$ than to the less preferred image $x_j$. Minimizing this negative log-likelihood maximizes the probability that $f_\theta(T, x_i)$ is greater than $f_\theta(T, x_j)$.
- Training Techniques:
  - Backbone: BLIP [26] is chosen as the backbone over CLIP due to its superior performance in preliminary experiments (Table 2b). BLIP's image-grounded text encoder is used as the feature encoder.
  - Architecture: BLIP's image and text features are combined using cross-attention, and an MLP (Multi-Layer Perceptron) head generates the scalar preference score.
  - Overfitting Mitigation: Rapid convergence and overfitting were observed. Freezing a proper number of backbone transformer layers (70% in this case) was found to improve performance.
  - Hyperparameter Tuning: Sensitivity to learning rate and batch size required a careful grid search. The optimal values found were a learning rate of 1e-5 and a batch size of 64 (per-GPU batch size of 16 on 4 NVIDIA A100 GPUs).
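As a concrete illustration of the pairwise objective above, here is a minimal PyTorch sketch. The BLIP cross-attention backbone is abstracted behind a generic `score_fn` callable, so the names and shapes are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: pairwise preference loss for a reward model, assuming an
# arbitrary score_fn(prompt, images) -> tensor of scalar scores.
import itertools
import torch
import torch.nn.functional as F

def ranking_to_pairs(ranked_images):
    """Turn one annotated ranking [best, ..., worst] into (preferred, dispreferred) pairs."""
    return list(itertools.combinations(ranked_images, 2))   # up to k(k-1)/2 pairs

def preference_loss(score_fn, prompt, preferred_imgs, dispreferred_imgs):
    """loss = -E[ log sigmoid( f(T, x_i) - f(T, x_j) ) ] where x_i is preferred over x_j."""
    s_i = score_fn(prompt, preferred_imgs)      # shape: (batch,)
    s_j = score_fn(prompt, dispreferred_imgs)   # shape: (batch,)
    # softplus(-d) is the numerically stable form of -log(sigmoid(d)).
    return F.softplus(-(s_i - s_j)).mean()
```

Minimizing `softplus(-(s_i - s_j))` pushes the score of the human-preferred image above the score of the dispreferred one, which is exactly what the negative log-sigmoid objective expresses.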
4.2.1.3. As Metric: Re-Evaluating Human Preferences on Text-to-Image Models
The paper proposes ImageReward as a promising zero-shot automatic evaluation metric for T2I model comparison and individual sample selection, addressing limitations of existing metrics.
- Limitations of FID:
  - Zero-shot Usage: Fine-tuned FID may not reflect real-world zero-shot performance, and zero-shot FID can be affected by data leakage in pre-training.
  - Human Preference: FID measures statistical distance to real images, not human preference, and it is an average metric, not suitable for single-image selection.
- Better Human Alignment Across Models:
  - Setup: Researcher annotation was performed on 6 popular T2I models (CogView 2, Versatile Diffusion, Stable Diffusion 1.4/2.1, DALL-E 2, Openjourney). 100 real-user prompts were used, with 10 outputs per model.
  - Process: For each prompt, the best image out of 10 was selected from each model. Annotators then ranked these best images from the different models, and win counts were aggregated.
  - Metrics: ImageReward and CLIP scores were averaged over 1,000 text-image pairs per model. Zero-shot FID (30k) on MS-COCO 2014 was also calculated.
  - Result: As shown in Table 1 (reproduced in Section 6.2.1), ImageReward (Spearman ρ = 1.00) aligns significantly better with the human ranking than CLIP (Spearman ρ = 0.60) and zero-shot FID (Spearman ρ = 0.09).
- Better Distinguishability Across Models and Samples:
  - ImageReward exhibits a much larger interquartile range (IQR) in its score distributions than CLIP (Figure 3), indicating that it can better distinguish quality between individual samples.
  - The medians of ImageReward scores for different models also align with the human ranking, whereas CLIP's medians do not.

  The image (Figure 3 from the original paper) shows the normalized distribution of ImageReward and CLIP scores for different generative models, with outliers discarded. It illustrates that ImageReward's scores align well with human preference and exhibit higher distinguishability, evidenced by a larger spread (IQR) of scores compared to CLIP. A minimal sketch of using ImageReward in this way, as an off-the-shelf scorer, follows.
4.2.2. ReFL: Reward Feedback Learning Improves Text-to-Image Diffusion
While ImageReward can select high-quality images, the generate-and-then-filter paradigm is inefficient. ReFL aims to improve the generative model directly.
- Challenge:
  - LDMs (Latent Diffusion Models) lack explicit likelihoods for their generations, preventing direct application of RLHF algorithms like the PPO approach used in NLP.
  - Classifier-guidance is an inference-only technique and requires a classifier trained on noisy intermediate latents, which contradicts the RM's annotation process (which requires fully denoised images).
  - Indirect solutions (data filtering, loss re-weighting) are limited and only provide mild improvements.
- ReFL: Insight and Solution:
  - Key Insight: Observing ImageReward scores along the denoising steps (e.g., 40 steps in total), the final image quality becomes reliably identifiable after roughly 30 of the 40 steps:
    - In the early denoising steps, scores are uniformly low.
    - In the middle steps, high-quality generations start to emerge, but overall judgment is still difficult.
    - After roughly 30 of the 40 steps, generations with different ImageReward scores are generally distinguishable.

    The image (Figure 4 from the original paper) illustrates ImageReward scores of a prompt generated with different seeds along the denoising steps. It shows that final image qualities become identifiable after 30 out of 40 steps, suggesting that ImageReward can provide meaningful feedback even on partially denoised latents.
  - This insight allows ImageReward scores computed on partially denoised latents (predicted after $t$ denoising steps) to serve as reliable feedback for LDM improvement.
  - Direct Optimization: ReFL proposes a direct fine-tuning method for LDMs by treating RM scores as human preference losses and backpropagating gradients at a randomly picked later step $t$ (e.g., $t \in [30, 40]$). Using a random $t$ (instead of only the final step) is crucial for training stability and for avoiding rapid overfitting.
  - Regularization: To prevent overfitting and stabilize fine-tuning, the ReFL loss is re-weighted (by a scale $\lambda$) and regularized with the original pre-training loss.
-
Algorithm 1: Reward Feedback Learning (ReFL) for LDMs
1: Dataset: Prompt set $\mathcal{V} = \{y_1, y_2, \dots\}$
2: Pre-training Dataset: Text-image pairs dataset $\mathcal{D} = \{(y_1, x_1), (y_2, x_2), \dots\}$
3: Input: LDM with pre-trained parameters $\theta$, reward model $r$, reward-to-loss map function $\phi$, LDM pre-training loss function $\mathcal{L}_{pre}$, reward re-weight scale $\lambda$
4: Initialization: The number of noise scheduler time steps $T$, and the time step range $[T_1, T_2]$ for fine-tuning
5: for $y_i \in \mathcal{V}$ and $(y_j, x_j) \in \mathcal{D}$ do
6:   $\mathcal{L}_{pre} \leftarrow$ LDM pre-training loss on $(y_j, x_j)$
7:   Update $\theta$ using $\mathcal{L}_{pre}$ // Update $\theta$ using Pre-training Loss
8:   $t \leftarrow \mathrm{rand}(T_1, T_2)$ // Pick a random time step
9:   $z_T \leftarrow \mathcal{N}(0, 1)$ // Sample noise as latent
10:  for $k = T, \dots, t+1$ do
11:    no grad: $z_{k-1} \leftarrow \mathrm{LDM}_\theta(z_k, y_i)$
12:  end for
13:  with grad: $z_{t-1} \leftarrow \mathrm{LDM}_\theta(z_t, y_i)$
14:  $z_0' \leftarrow$ prediction from $z_{t-1}$ // Predict the original latent by noise scheduler
15:  $x_0' \leftarrow \mathrm{Decoder}(z_0')$ // From latent to image
16:  $\mathcal{L}_{reward} \leftarrow \lambda\, \phi(r(y_i, x_0'))$ // ReFL loss
17:  Update $\theta$ using $\mathcal{L}_{reward}$ // Update $\theta$ using ReFL loss
18: end for
- Explanation of Algorithm 1:
  - Line 1-4 (Inputs and Initialization): The algorithm takes a set of prompts $\mathcal{V}$ and a pre-training dataset $\mathcal{D}$ (text-image pairs). It requires an LDM with initial parameters $\theta$, the reward model $r$, a reward-to-loss mapping function $\phi$, the LDM pre-training loss function $\mathcal{L}_{pre}$, and a reward re-weight scale $\lambda$. Key initializations include the total number of denoising steps $T$ and the range $[T_1, T_2]$ for selecting a random time step for fine-tuning.
  - Line 5 (Loop over Data): The algorithm iterates over prompts from the prompt set $\mathcal{V}$ and text-image pairs from the pre-training dataset $\mathcal{D}$.
  - Line 6-7 (Pre-training Loss Update): It calculates the standard LDM pre-training loss for the current LDM parameters using the pre-training data. This part ensures the LDM retains its general generative capabilities. The LDM parameters are then updated based on this loss.
  - Line 8 (Random Time Step Selection): A random time step $t$ is chosen within the predefined range (e.g., [30, 40]). This is the point at which the reward model will evaluate a partially denoised latent.
  - Line 9 (Sample Noise): Initial random noise $z_T$ is sampled from a standard normal distribution to serve as the starting latent for the diffusion process.
  - Line 10-12 (Partial Denoising, No Gradient): The LDM performs denoising steps from $T$ down to $t+1$. Crucially, no gradients are computed or retained during this initial phase; these steps are treated as a fixed forward pass to reach an intermediate latent.
  - Line 13 (Denoising with Gradient): The LDM performs the denoising step from $z_t$ to $z_{t-1}$. This step does retain gradients, as it is the point where the ReFL loss will be applied.
  - Line 14-15 (Predict Original Latent and Convert to Image): From the partially denoised latent $z_{t-1}$, the noise scheduler predicts the corresponding original (clean) latent $z_0'$. This is then decoded into an image $x_0'$, which is what the reward model will evaluate.
  - Line 16 (ReFL Loss Calculation): The reward model $r$ evaluates the generated image $x_0'$ given the prompt $y_i$. This raw reward score is passed through the reward-to-loss map function $\phi$ (e.g., ReLU) and scaled by $\lambda$ to form the ReFL loss. A higher reward from $r$ should ideally translate to a lower loss.
  - Line 17 (LDM Update with ReFL Loss): The LDM parameters $\theta$ are updated using the calculated ReFL loss. This update directly steers the LDM to produce images that yield higher reward model scores.
  - Line 18 (End Loop): The process repeats for the next prompt and pre-training data sample.
- Combined Loss Function: In practice, the fine-tuning uses a combination of the ReFL loss and the standard pre-training loss to avoid rapid overfitting and to stabilize training:
  $ \mathcal{L}_{reward} = \lambda\, \mathbb{E}_{y_i \sim \mathcal{V}} \big[ \phi \big( r ( y_i, g_\theta (y_i) ) \big) \big] $
  $ \mathcal{L}_{pre} = \mathbb{E}_{(y_i, x_i) \sim \mathcal{D}} \Big[ \mathbb{E}_{\mathcal{E}(x_i), y_i, \epsilon \sim \mathcal{N}(0, 1), t} \big[ \| \epsilon - \epsilon_\theta ( z_t, t, \tau_\theta (y_i) ) \|_2^2 \big] \Big] $
  The total objective is the sum $\mathcal{L}_{reward} + \mathcal{L}_{pre}$, with the updates applied in alternation as in Algorithm 1. A schematic PyTorch sketch of this fine-tuning loop follows.
  - Symbol Explanation for $\mathcal{L}_{reward}$:
    - $\mathcal{L}_{reward}$: The ReFL loss component.
    - $\lambda$: The reward re-weight scale (a hyperparameter) that controls the influence of the ReFL loss.
    - $\mathbb{E}_{y_i \sim \mathcal{V}}$: Expectation over prompts $y_i$ sampled from the prompt set $\mathcal{V}$ (the prompt set in Algorithm 1).
    - $\phi$: The reward-to-loss map function (e.g., ReLU), which transforms the raw reward score into a loss signal. For instance, $\phi(r) = -r$ could be used if $r$ is the reward and we want to minimize the loss by maximizing the reward.
    - $r$: The reward model (e.g., ImageReward) that takes a prompt and a generated image and returns a scalar reward score.
    - $g_\theta(y_i)$: The image generated by the LDM with parameters $\theta$ for prompt $y_i$; this implicitly refers to the image $x_0'$ generated in Algorithm 1.
  - Symbol Explanation for $\mathcal{L}_{pre}$:
    - $\mathcal{L}_{pre}$: The standard pre-training loss for LDMs, based on denoising score matching.
    - $\mathbb{E}_{(y_i, x_i) \sim \mathcal{D}}$: Expectation over text-image pairs from the pre-training dataset $\mathcal{D}$.
    - $\mathbb{E}_{\mathcal{E}(x_i), y_i, \epsilon \sim \mathcal{N}(0,1), t}$: Expectation over encoder outputs $\mathcal{E}(x_i)$, prompts $y_i$, noise $\epsilon$ sampled from a standard normal distribution, and time steps $t$.
    - $\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y_i)) \|_2^2$: The squared L2 norm (mean squared error) between the actual noise added to the latent at time step $t$ and the noise predicted by the LDM's noise predictor. This is the typical diffusion model training objective.
    - $\epsilon$: The ground-truth noise added at time step $t$.
    - $\epsilon_\theta$: The noise prediction network of the LDM with parameters $\theta$.
    - $z_t$: The noisy latent at time step $t$.
    - $\tau_\theta(y_i)$: The text embedding (conditioning vector) for the prompt $y_i$, derived from the LDM's text encoder.
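The sketch below writes the ReFL inner loop against a diffusers-style Stable Diffusion setup. The component names (`unet`, `vae`, `scheduler`, `text_encode`, `reward_model`) are placeholders, and details such as classifier-free guidance, the exact reward-to-loss map $\phi$, and the alternating pre-training update are simplified; treat it as a sketch of Algorithm 1 under these assumptions, not the reference implementation.

```python
# Schematic sketch of one ReFL update (Algorithm 1, lines 8-17) for a latent diffusion model.
import random
import torch

def refl_step(unet, vae, scheduler, text_encode, reward_model,
              prompt, optimizer, lam=1e-3, t1=30, t2=40, phi=lambda r: -r):
    scheduler.set_timesteps(t2)                                  # T = 40 denoising steps
    cond = text_encode(prompt)                                   # prompt conditioning
    latents = torch.randn(1, 4, 64, 64, device=cond.device)      # line 9: z_T ~ N(0, I)
    t_stop = random.randint(t1, t2)                              # line 8: random later step

    # Lines 10-12: denoise from T down to t_stop + 1 without tracking gradients.
    with torch.no_grad():
        for t in scheduler.timesteps[: t_stop - 1]:
            noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Line 13: one denoising step that keeps gradients for backpropagation.
    t = scheduler.timesteps[t_stop - 1]
    noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
    step_out = scheduler.step(noise_pred, t, latents)

    # Lines 14-15: predicted clean latent -> decoded image.
    pred_latent = step_out.pred_original_sample
    image = vae.decode(pred_latent / vae.config.scaling_factor).sample

    # Lines 16-17: reward -> loss (phi maps reward to loss; here simply -r), then update.
    reward = reward_model(prompt, image)
    loss = lam * phi(reward)
    optimizer.zero_grad(); loss.backward(); optimizer.step()     # optimizer over unet params
    return loss.item()

# In the full Algorithm 1 this update alternates with the standard noise-prediction
# (pre-training) loss on text-image pairs, which acts as a regularizer.
```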
5. Experimental Setup
5.1. Datasets
The experiments utilized a combination of datasets for training and evaluation of both ImageReward and ReFL.
- ImageReward Training Dataset:
  - Source: DiffusionDB [58], a large open-source dataset of real user prompts.
  - Selection: 10,000 candidate prompts were systematically selected using a graph-based algorithm leveraging Sentence-BERT for diversity.
  - Composition: 8,878 valid prompts yielded 136,892 comparison pairs after human expert annotation. This dataset was divided for training and validation, with 466 prompts (6,399 comparisons) held out for testing.
- ImageReward Evaluation Datasets:
  - Model Comparison: 100 real-user test prompts (from DiffusionDB) were used to compare 6 popular T2I models. Each model generated 10 outputs per prompt.
  - Recall/Filter Test: An additional test set of 371 prompts, each with 8 generated images, was used for evaluating the Recall@K and Filter@K metrics.
  - MS-COCO 2014 [28]: Used for zero-shot FID and ImageReward score evaluation of T2I models. Images were unified to a common resolution. (A minimal FID computation sketch is given at the end of this subsection.)
- ReFL Training Dataset:
  - Pre-training Dataset: A subset of LAION-5B [50] (selected by aesthetic score) was used for LDM pre-training.
  - Prompt Set for ReFL: Sampled from DiffusionDB.
- ReFL Evaluation Datasets:
  - Real User Prompts: 466 real user prompts from DiffusionDB.
  - Multi-task Benchmark (MT Bench) [40]: 90 designed challenging prompts for evaluation.
Example data samples (prompts):
- "a painting of an ocean with clouds and birds, day time, low depth field effect" (Figure 9)
- "Landscape photography by marc adamus, mountains with some forests, small lake in the center, fog in the background, sunrays, golden hour, high quality." (Figure 4)
- "Beautiful portrait of female model Irakli Nadar with intricate details of color crushed fluid oil and acrylic on the shoulders, berries and dried moss and dried autumn leaves headdress." (Figure 14)
- "High quality neoclassical portrait of a cat dressed as Napoleon Bonaparte, very detailed, 1860, oil on canvas." (Figure 14)
- "Michelangelo's statue of David driving a blue Rolls Royce convertible, A night at Caesars Palace in Las Vegas in the background." (Figure 14)

These datasets were chosen because DiffusionDB provides a rich source of real-world user prompts representative of actual T2I model usage, LAION-5B is a standard large-scale dataset for pre-training vision-language models, MS-COCO is a common benchmark for image generation, and MT Bench provides challenging prompts to test model robustness. This diverse selection allows for thorough validation of the proposed methods against various aspects of T2I generation and human preference.
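Since zero-shot FID on MS-COCO 2014 is used repeatedly for evaluation (and defined formally in Section 5.2), here is a minimal sketch of the Fréchet distance computation on pre-extracted Inception-v3 feature matrices; feature extraction itself is omitted and the variable names are illustrative.

```python
# Minimal sketch: FID from Inception-v3 feature matrices of real and generated images.
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma1 = np.cov(real_feats, rowvar=False)
    sigma2 = np.cov(fake_feats, rowvar=False)
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real          # discard tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```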
5.2. Evaluation Metrics
The paper uses several metrics to evaluate ImageReward and ReFL.
- Preference Accuracy:
  - Conceptual Definition: Measures how often a scorer (e.g., ImageReward or a baseline) chooses the same preferred image as a human from a pair of images. It quantifies how well the model's pairwise preference prediction aligns with human judgment.
  - Mathematical Formula: While not given explicitly in the paper, it is conceptually the proportion of comparison pairs where the model's preference matches the human's preference:
    $ \text{Preference Accuracy} = \frac{\text{Number of matching preference pairs}}{\text{Total number of comparison pairs}} $
  - Symbol Explanation:
    - Number of matching preference pairs: The count of instances where the model's chosen better image is the same as the human's chosen better image.
    - Total number of comparison pairs: The total number of image pairs presented for preference judgment.
- Recall@K and Filter@K (see the metric sketch at the end of this subsection):
  - Conceptual Definition:
    - Recall@K measures the ability of a scoring model to identify the best image within a larger set of generated images. Specifically, it is the proportion of prompts for which the human-annotated best image appears among the top K images selected by the model's scores.
    - Filter@K measures the ability of a scoring model to avoid selecting the worst image. It is the proportion of prompts for which the human-annotated worst image is not among the top K images selected by the model.
  - Mathematical Formula: Not explicitly provided in the paper. Conceptually, over a set of prompts with N images each:
    $ \text{Recall@K} = \frac{\text{Number of prompts where the human best is in the model's top K}}{\text{Total number of prompts}} $
    $ \text{Filter@K} = \frac{\text{Number of prompts where the human worst is NOT in the model's top K}}{\text{Total number of prompts}} $
  - Symbol Explanation:
    - human best: The image identified as the best by human annotators.
    - human worst: The image identified as the worst by human annotators.
    - model's top K: The K images with the highest scores according to the model (for Filter@K, the human worst should not fall in this set).
- Spearman ρ (Spearman's Rank Correlation Coefficient):
  - Conceptual Definition: A non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described by a monotonic function. A value of 1 indicates a perfect monotonically increasing relationship, -1 a perfect monotonically decreasing relationship, and 0 no monotonic relationship. In this paper, it quantifies the agreement between model rankings and human rankings of T2I models.
  - Mathematical Formula:
    $ \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} $
  - Symbol Explanation:
    - $\rho$: Spearman's rank correlation coefficient.
    - $d_i$: The difference between the ranks of each observation pair. For example, if a model ranks DALL-E 2 as 1 and humans rank it as 3, then $d_i = 1 - 3 = -2$.
    - $n$: The number of observations (e.g., the number of T2I models being ranked).
- Human Win Rate:
  - Conceptual Definition: A direct metric of human preference, typically used in pairwise or multi-choice comparisons. It is the percentage of times a particular model's output is chosen as "better" or "preferred" by human annotators compared to other models or a baseline.
  - Mathematical Formula: Not explicitly provided in the paper.
    $ \text{Human Win Rate} = \frac{\text{Number of times a model's output wins}}{\text{Total number of comparisons involving that model}} \times 100\% $
  - Symbol Explanation:
    - Number of times a model's output wins: The count of times human annotators preferred the output of the specific model.
    - Total number of comparisons involving that model: The total count of times the model's output was part of a comparison.
- FID (Fréchet Inception Distance):
  - Conceptual Definition: A metric for assessing the quality and diversity of images produced by generative models. It quantifies the distance between the feature distributions of generated images and real images. Lower FID scores indicate that the generated images are more realistic and diverse, resembling the distribution of real images.
  - Mathematical Formula:
    $ \mathrm{FID} = \| \mu_1 - \mu_2 \|_2^2 + \mathrm{Tr}\big( \Sigma_1 + \Sigma_2 - 2 (\Sigma_1 \Sigma_2)^{1/2} \big) $
  - Symbol Explanation:
    - $\mu_1$, $\mu_2$: The mean feature vectors of the real and generated images, respectively (features extracted from an Inception-v3 model).
    - $\Sigma_1$, $\Sigma_2$: The covariance matrices of the feature vectors of the real and generated images, respectively.
    - $\mathrm{Tr}$: The trace of a matrix (sum of its diagonal elements).
    - $\| \mu_1 - \mu_2 \|_2^2$: The squared Euclidean distance between the mean feature vectors.
    - $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
- CLIP Score:
  - Conceptual Definition: Measures the semantic similarity between a text prompt and a generated image. It uses the CLIP model's text and image encoders to obtain embeddings and then computes the cosine similarity between them. A higher score indicates better text-image coherence.
  - Mathematical Formula:
    $ \text{CLIP Score}(T, I) = \cos\big( E_T(T), E_I(I) \big) = \frac{E_T(T) \cdot E_I(I)}{\| E_T(T) \| \, \| E_I(I) \|} $
  - Symbol Explanation:
    - $T$: The text prompt.
    - $I$: The generated image.
    - $E_T(T)$: The embedding of the text prompt produced by the CLIP text encoder.
    - $E_I(I)$: The embedding of the image produced by the CLIP image encoder.
    - $E_T(T) \cdot E_I(I)$: The dot product of the two embedding vectors.
    - $\| \cdot \|$: The L2 norm (magnitude) of a vector.
- Aesthetic Score:
  - Conceptual Definition: A score that quantifies the perceived aesthetic quality of an image, typically predicted by a neural network trained on human aesthetic judgments. The network usually takes features (e.g., CLIP image embeddings) as input and outputs a scalar score.
  - Mathematical Formula: Not explicitly provided in the paper, as it depends on the specific aesthetic predictor used (e.g., LAION's aesthetic predictor). Generally, it involves a prediction function $A$:
    $ \text{Aesthetic Score}(I) = A(\text{features}(I)) $
  - Symbol Explanation:
    - $I$: The image.
    - $\text{features}(I)$: Features extracted from the image (e.g., CLIP image embeddings).
    - $A$: The aesthetic prediction function (a neural network).
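The following is a minimal sketch of the rank-based metrics above (preference accuracy, Recall@K, Filter@K, and Spearman ρ). The data layout (per-prompt score lists plus human best/worst indices) is an illustrative assumption, not the paper's evaluation code.

```python
# Minimal sketch: preference accuracy, Recall@K, Filter@K, and Spearman rho.
from scipy.stats import spearmanr

def preference_accuracy(pairs):
    """pairs: list of (score_of_preferred, score_of_dispreferred) for human-labeled pairs."""
    hits = sum(s_pref > s_disp for s_pref, s_disp in pairs)
    return hits / len(pairs)

def recall_at_k(model_scores, human_best_idx, k):
    """Fraction of prompts whose human-chosen best image lands in the model's top-K."""
    hit = 0
    for scores, best in zip(model_scores, human_best_idx):
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hit += best in top_k
    return hit / len(model_scores)

def filter_at_k(model_scores, human_worst_idx, k):
    """Fraction of prompts whose human-chosen worst image is kept OUT of the model's top-K."""
    kept_out = 0
    for scores, worst in zip(model_scores, human_worst_idx):
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        kept_out += worst not in top_k
    return kept_out / len(model_scores)

# Spearman rho between a metric's model ranking and the human ranking, as in Table 1.
human_rank  = [1, 2, 3, 4, 5, 6]        # e.g. 6 T2I models ranked by humans
metric_rank = [1, 2, 3, 4, 5, 6]        # ranking induced by the metric under test
rho, _ = spearmanr(human_rank, metric_rank)
print(f"Spearman rho = {rho:.2f}")      # 1.00 for a perfectly aligned metric
```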
5.3. Baselines
The paper compares its proposed methods against several established or concurrent baselines for both reward modeling and diffusion model optimization.
- For ImageReward (Human Preference Prediction/Evaluation):
  - CLIP Score [41]: A widely used metric for measuring text-image coherence. It calculates the cosine similarity between text and image embeddings produced by the CLIP model.
  - Aesthetic Score [50]: Provided by LAION, this is an aesthetic predictor that uses a CLIP-based architecture to score the aesthetic quality of images.
  - BLIP Score [26]: Similar to the CLIP score, it uses a BLIP model to calculate the cosine similarity between text and image embeddings. BLIP is known for its improved performance in vision-language tasks.
- For ReFL (Improving Diffusion Models):
  - Stable Diffusion v1.4 (baseline) [45]: The original pre-trained Latent Diffusion Model serves as the baseline generative model against which all fine-tuning methods are compared.
  - Dataset Filtering [61]: An indirect method that uses a reward model to filter datasets. High-scoring images form a preferred dataset, and low-scoring images (paired with "Weird image" prompts) form a non-preferred dataset. These new datasets are then used for fine-tuning the LDM.
  - Reward Weighted [23]: Another indirect method that modifies the loss function coefficients based on reward model scores. It weighs the loss for generated samples according to their quality, giving more importance to preferred images, and includes a regularization term with the pre-training loss.
  - RAFT (Reward Ranked Fine-Tuning) [13]: An iterative fine-tuning method that constructs a dataset of generated images with higher rewards. In each iteration, it generates images, uses a reward model to rank them, selects the best ones, and then uses these selected images to fine-tune the LDM. The paper evaluates RAFT at 1, 2, and 3 iterations.

  These baselines are representative because CLIP, Aesthetic, and BLIP are standard in the field for evaluating image quality and text-image alignment, and Stable Diffusion is a popular and powerful generative model. Dataset Filtering, Reward Weighted, and RAFT represent the state-of-the-art indirect methods for incorporating human feedback into LDM training, providing a strong comparison for ReFL's direct tuning approach.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents extensive results demonstrating the effectiveness of ImageReward as a human preference scorer and ReFL as a method for improving diffusion models.
6.1.1. ImageReward: On Human Preference Prediction
- Preference Accuracy (Table 3, reproduced in Section 6.2.3): ImageReward significantly outperforms all baselines in Preference Accuracy, reaching 65.14%. This is substantially higher than the CLIP Score (54.82%), Aesthetic Score (57.35%), and BLIP Score (57.76%), indicating that ImageReward is much better at predicting human pairwise preferences.
- Recall/Filter the Best/Worst Image (Table 3 & Figure 16): ImageReward also shows superior performance in Recall@K and Filter@K. For example, ImageReward achieves a Recall@4 of 90.84% and a Filter@4 of 88.95%, significantly higher than the baselines. This means it is much better at identifying the best images and filtering out the worst ones from a larger pool of generations. Figure 16 visually confirms that ImageReward has the largest proportion of correctly picked best/worst images and the minimum ratio of ranking them incorrectly.

  The image (Figure 16 from the original paper) displays the bucket distribution of the best and worst images in human annotation. It illustrates how different scoring methods (Ours, BLIP Score, Aesthetic, CLIP Score) distribute human-annotated best/worst images across their ranked buckets (1-8). ImageReward (Ours) clearly shows a higher concentration of best images in bucket 1 and of worst images in bucket 8, demonstrating superior selection ability.
- Human Alignment Across Models (Table 1): When ranking various T2I models, ImageReward's ranking (Spearman ρ = 1.00) perfectly correlates with human evaluation. In contrast, CLIP scores show only a moderate correlation (Spearman ρ = 0.60), and zero-shot FID has a very low correlation (Spearman ρ = 0.09). This highlights ImageReward's strong alignment with human preferences in evaluating overall model quality.
- Distinguishability (Figure 3): The box plot in Figure 3 reveals that ImageReward scores have a much larger interquartile range (IQR) than CLIP scores for each generative model. This implies that ImageReward can better differentiate the quality of individual samples within a model's outputs. Furthermore, the medians of ImageReward scores align with the human ranking of models, a property that CLIP scores fail to exhibit.
- Ablation Study (Table 2b, reproduced in Section 6.2.2):
  - Training Dataset Size: Increasing the training dataset size (from 1k to 8k comparisons) consistently improves ImageReward's preference accuracy (from 63.07% to 65.14%), confirming the value of more annotated data.
  - RM Backbone: BLIP as a backbone outperforms CLIP (with an MLP head) even when CLIP is trained on a larger dataset, suggesting that BLIP's architecture or pre-training is more suitable for learning human preference in this context.
6.1.2. ReFL: On Improving Diffusion Models with Human Preference
- Human Evaluation (Table 4, reproduced in Section 6.2.4, & Figure 6): ReFL fine-tuned models consistently achieve the highest win rates against the Stable Diffusion v1.4 baseline, both on Real User Prompts (58.79%) and on Multi-task Benchmark (MT Bench) prompts (58.49%). This indicates that ReFL effectively improves LDMs to generate images that are more preferred by humans. Figure 6 visually reinforces ReFL's dominance in pairwise comparisons against all other methods.

  The image (Figure 6 from the original paper) presents a heatmap of win rates between all methods for LDM optimization. It clearly shows ReFL (Ours) achieving the highest win rates against all other compared methods, indicated by its prominent green cells.
- Comparison to Other Fine-tuning Methods:
  - Indirect Methods Underperform: The RAFT and Reward Weighted methods actually perform worse than the SD v1.4 baseline in the paper's evaluation setup.
    - RAFT: Suffers from overfitting as iterations increase. The paper suggests this is because RAFT's reliance on an "expert generator" (used as a regularizer) is limited when dealing with real-user prompts that can be challenging, leading to low-quality images in the constructed dataset.
    - Reward Weighted: The method's reward coefficients, constrained to [0, 1], mean that non-preferred images still exert some influence. When real-user prompts introduce many non-preferred images (even if "relatively" the best among a poor set), this interference hinders effectiveness.
  - Dataset Filtering: While partly effective by labeling "Weird image" prompts and considering real images, it still exerts only an indirect influence.
  - ReFL's Advantage: ReFL provides direct gradient feedback through rewards, allowing for more precise guidance towards the desired "better" generations, which is more effective at problem-solving.
- Qualitative Examples (Figure 7): Visual comparisons demonstrate ReFL's ability to produce more preferred images. For instance, for prompts requiring "long pointy ears," only the ReFL-tuned model correctly generates them, while the others fail or produce inaccurate representations.

  The image (Figure 7 from the original paper) provides a qualitative comparison between ReFL and other fine-tuning methods. It illustrates that ReFL-tuned models produce images that are preferred overall. For example, for the prompt containing "long pointy ears," only ReFL generates correct ears, while other methods either omit them or render them inaccurately.
6.2. Data Presentation (Tables)
6.2.1. Table 1: Text-to-image model ranking by humans and automatic metrics
The following are the results from Table 1 of the original paper:
| Model | Human Eval. (Rank / #Win) | ImageReward (Rank / Score) | CLIP (Rank / Score) | ImageReward, MS-COCO (Rank / Score) | Zero-shot FID*, MS-COCO (Rank / Score) |
| --- | --- | --- | --- | --- | --- |
| Openjourney | 1 / 507 | 1 / 0.2614 | 2 / 0.2726 | 2 / -0.0455 | 5 / 20.7 |
| Stable Diffusion 2.1-base | 2 / 463 | 2 / 0.2458 | 4 / 0.2683 | 3 / 0.1553 | 4 / 18.8 |
| DALL-E 2 | 3 / 390 | 3 / 0.2114 | 3 / 0.2684 | 1 / 0.5387 | 1 / 10.9* |
| Stable Diffusion 1.4 | 4 / 362 | 4 / 0.1344 | 1 / 0.2763 | 4 / -0.0857 | 3 / 17.9 |
| Versatile Diffusion | 5 / 340 | 5 / -0.2470 | 5 / 0.2606 | 6 / -0.5485 | 2 / 18.4 |
| CogView 2 | 6 / 74 | 6 / -1.2376 | 6 / 0.2044 | 5 / -0.8510 | 6 / 26.2 |
| Spearman ρ to Human Eval. | - | 1.00 | 0.60 | 0.77 | 0.09 |

The Human Eval., ImageReward, and CLIP columns are evaluated on Real User Prompts; the remaining ImageReward and Zero-shot FID columns are evaluated on MS-COCO 2014.
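The Spearman ρ row summarizes how well each automatic metric's model ranking agrees with the human ranking. As a quick illustration, it can be recomputed from the rank columns above with scipy (this is not the authors' script; the Zero-shot FID value may not match the table exactly, e.g., depending on how the starred DALL-E 2 entry is handled):

```python
# Rank correlation between each metric's model ranking and the human ranking,
# using the rank columns of Table 1. Requires scipy.
from scipy.stats import spearmanr

human_rank       = [1, 2, 3, 4, 5, 6]   # Openjourney ... CogView 2
imagereward_rank = [1, 2, 3, 4, 5, 6]   # Real User Prompts
clip_rank        = [2, 4, 3, 1, 5, 6]   # Real User Prompts
imagereward_coco = [2, 3, 1, 4, 6, 5]   # MS-COCO 2014
fid_rank         = [5, 4, 1, 3, 2, 6]   # MS-COCO 2014 (may not reproduce 0.09 exactly)

for name, ranks in [("ImageReward", imagereward_rank), ("CLIP", clip_rank),
                    ("ImageReward (COCO)", imagereward_coco), ("Zero-shot FID", fid_rank)]:
    rho, _ = spearmanr(human_rank, ranks)
    print(f"{name:>20s}: Spearman rho = {rho:.2f}")
```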
6.2.2. Table 2: Data annotation agreement and ablation study on model backbones and dataset sizes
The following are the results from Table 2a of the original paper:
| | researcher | annotator | annotator ensemble | CLIP Score | Aesthetic | BLIP Score | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| researcher | 71.2% ± 11.1% | 65.3% ± 8.5% | 73.4% ± 6.2% | 57.8% ± 3.6% | 55.6% ± 3.1% | 57.0% ± 3.0% | 64.5% ± 2.5% |
| annotator | 65.3% ± 8.5% | 65.3% ± 5.6% | 53.9% ± 5.8% | 54.3% ± 3.2% | 55.9% ± 3.1% | 57.4% ± 2.7% | 65.3% ± 3.7% |
| annotator ensemble | 73.4% ± 6.2% | 53.9% ± 5.8% | - | 54.4% ± 21.1% | 57.5% ± 15.9% | 62.0% ± 16.1% | 70.5% ± 18.6% |
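The percentages in Table 2a are pairwise agreement rates: the fraction of comparisons on which two judges (a researcher, an annotator, an annotator ensemble, or a score model) pick the same winner. A minimal sketch of that computation, using a hypothetical 0/1 encoding of which image in each pair was preferred:

```python
# Sketch of the pairwise agreement reported in Table 2a: the fraction of image
# pairs on which two judges choose the same winner.

def pairwise_agreement(judgments_a, judgments_b):
    """Agreement (%) between two judges over the same comparison pairs."""
    assert len(judgments_a) == len(judgments_b)
    same = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * same / len(judgments_a)

# Toy usage: 0/1 encodes which image of a pair was preferred.
researcher = [0, 1, 1, 0, 1]
annotator  = [0, 1, 0, 0, 1]
print(f"agreement = {pairwise_agreement(researcher, annotator):.1f}%")  # 80.0%
```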
The following are the results from Table 2b of the original paper:
| Backbone | Training Set Size | Preference Acc. (%) |
| --- | --- | --- |
| CLIP | 4k | 61.87 |
| CLIP | 8k | 62.98 |
| BLIP | 1k | 63.07 |
| BLIP | 2k | 63.18 |
| BLIP | 4k | 64.71 |
| BLIP | 8k | 65.14 |
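Preference Acc. in Table 2b (and Table 3) measures how often the reward model agrees with the annotated human preference on held-out pairs. A minimal sketch, assuming a hypothetical `score(prompt, image)` callable and pairs stored as (prompt, preferred, rejected):

```python
def preference_accuracy(pairs, score):
    """Fraction (%) of pairs where the model scores the human-preferred image higher.
    pairs: iterable of (prompt, preferred_image, rejected_image); ties count as errors."""
    correct = sum(
        score(prompt, preferred) > score(prompt, rejected)
        for prompt, preferred, rejected in pairs
    )
    return 100.0 * correct / len(pairs)
```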
6.2.3. Table 3: Results of ImageReward and comparison methods on human preference prediction
The following are the results from Table 3 of the original paper:
| Model | Preference Acc. | Recall@1 | Recall@2 | Recall@4 | Filter@1 | Filter@2 | Filter@4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP Score | 54.82 | 27.22 | 48.52 | 78.17 | 29.65 | 51.75 | 76.82 |
| Aesthetic Score | 57.35 | 30.73 | 53.91 | 75.74 | 32.08 | 54.45 | 76.55 |
| BLIP Score | 57.76 | 30.73 | 50.67 | 77.63 | 33.42 | 56.33 | 80.59 |
| ImageReward (Ours) | 65.14 | 39.62 | 63.07 | 90.84 | 49.06 | 70.89 | 88.95 |
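Recall@k and Filter@k in Table 3 evaluate ranking quality within each prompt's candidate set. The paper's exact protocol is not reproduced here; the sketch below implements one plausible reading of Recall@k — the human-favourite image lands in the model's top-k — purely as an illustration.

```python
def recall_at_k(groups, k):
    """groups: list of (scores, human_best_index), where `scores` holds the model's
    score for each candidate image of one prompt. Returns the percentage of prompts
    whose human-favourite candidate appears in the model's top-k (assumed definition)."""
    hits = 0
    for scores, human_best in groups:
        topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += int(human_best in topk)
    return 100.0 * hits / len(groups)
```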
6.2.4. Table 4: Human evaluation on different LDM optimization methods
The following are the results from Table 4 of the original paper:
| Methods | #Win (Real User Prompts) | WinRate % (Real User Prompts) | #Win (MT Bench [40]) | WinRate % (MT Bench [40]) |
| --- | --- | --- | --- | --- |
| SD v1.4 (baseline) [45] | 1315 | - | 718 | - |
| Dataset Filtering [61] | 1394 | 55.17 | 735 | 51.72 |
| Reward Weighted [23] | 1075 | 39.52 | 585 | 43.33 |
| RAFT [13] (iter=1) | 1341 | 49.86 | 578 | 42.31 |
| RAFT (iter=2) | 753 | 30.85 | 452 | 33.02 |
| RAFT (iter=3) | 398 | 20.97 | 355 | 26.19 |
| ReFL (Ours) | 1508 | 58.79 | 808 | 58.49 |
6.2.5. Table 5: Results of ImageReward and other reward models on human preference evaluation
The following are the results from Table 5 of the original paper:
| Model | Preference Acc. | Recall@1 | Recall@2 | Recall@4 | Filter@1 | Filter@2 | Filter@4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HPS | 60.79 | 39.89 | 58.76 | 83.29 | 47.17 | 65.50 | 84.10 |
| PickScore | 62.78 | 38.27 | 63.07 | 84.10 | 46.36 | 65.77 | 84.91 |
| ImageReward (Ours) | 65.14 | 39.62 | 63.07 | 90.84 | 49.06 | 70.89 | 88.95 |
6.2.6. Table 6: Comparison between different reward models
The following are the results from Table 6 of the original paper:
| Setting | Reward Model | #Win (Real User Prompts) | WinRate % (Real User Prompts) | ImageReward Score (Real User Prompts) | #Win (Multi-task Benchmark [40]) | WinRate % (Multi-task Benchmark [40]) | ImageReward Score (Multi-task Benchmark [40]) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| - | SD v1.4 (baseline) | 399 | - | 0.1058 | 459 | - | 0.1859 |
| Bo64 | HPS | 572 | 67.24 | 0.6274 | 662 | 69.15 | 0.6788 |
| Bo64 | PickScore | 620 | 72.16 | 0.7033 | 773 | 72.73 | 0.7579 |
| Bo64 | ImageReward (Ours) | 676 | 73.33 | 1.3374 | 824 | 74.42 | 1.4098 |
| ReFL | HPS | 428 | 52.86 | 0.4749 | 426 | 52.86 | 0.4646 |
| ReFL | PickScore | 472 | 56.91 | 0.4618 | 454 | 55.09 | 0.4908 |
| ReFL | ImageReward (Ours) | 512 | 58.38 | 0.6072 | 492 | 58.67 | 0.4822 |
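The Bo64 rows of Table 6 use best-of-n selection: sample many candidates per prompt and keep the one the reward model scores highest. A minimal sketch with hypothetical `generate` and `score` callables:

```python
def best_of_n(prompt, generate, score, n=64):
    """Best-of-n selection: draw n samples and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: score(prompt, image))
```

Note that best-of-n only reranks samples at inference time, whereas ReFL changes the generator's weights.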
6.3. Ablation Studies / Parameter Analysis
6.3.1. ImageReward Ablation Studies
- Training Dataset Size (Table 2b): The ablation study on training dataset size clearly shows that increasing the number of expert comparisons leads to improved Preference Accuracy. For instance, with a BLIP backbone, accuracy rises from 63.07% with 1k comparisons to 65.14% with 8k comparisons. This supports the hypothesis that collecting more high-quality human annotations can further enhance ImageReward's performance.
- RM Backbone (Table 2b): The choice of backbone model significantly impacts performance. BLIP consistently outperforms CLIP as the ImageReward backbone. Even with a larger training set (8k), CLIP's Preference Accuracy (62.98%) is lower than BLIP's (65.14%) at the same scale. The authors attribute BLIP's superiority partly to its bootstrapping of training data and its use of an image-grounded text encoder, which processes text features conditioned on the image context, unlike CLIP's separate encoders.
- Interpolation Analysis (Figure 17): The paper explores interpolating scores from different models (CLIP, Aesthetic, ImageReward). While combining CLIP and Aesthetic can slightly improve accuracy over either alone, their best combination is still significantly lower than ImageReward's standalone performance. When ImageReward is interpolated with CLIP or Aesthetic, it shows only marginal improvement, suggesting that ImageReward already captures most of the relevant human preference signals; a sketch of such an interpolation sweep follows this list.
The image (Figure 17 from the original paper) plots the accuracy of interpolations between pairs of models: mixing ImageReward with CLIP peaks at 65.4%, mixing CLIP with Aesthetic peaks at 58.9%, and mixing Aesthetic with ImageReward peaks at 65.9%. Interpolating CLIP and Aesthetic thus remains considerably below ImageReward alone, and ImageReward gains only marginally from interpolation, underscoring its strong standalone capability.
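A sketch of the interpolation sweep behind Figure 17: blend two models' scores with a weight α and track preference accuracy as α varies. `score_a` and `score_b` are hypothetical callables; in practice the raw scores would typically be normalised before mixing.

```python
# Sweep a mixing weight alpha and report preference accuracy of the blended score
#   s(prompt, image) = alpha * s_A(prompt, image) + (1 - alpha) * s_B(prompt, image).
import numpy as np

def interpolation_sweep(pairs, score_a, score_b, num_alphas=11):
    """pairs: iterable of (prompt, preferred_image, rejected_image)."""
    results = {}
    for alpha in np.linspace(0.0, 1.0, num_alphas):
        correct = 0
        for prompt, preferred, rejected in pairs:
            s_pref = alpha * score_a(prompt, preferred) + (1 - alpha) * score_b(prompt, preferred)
            s_rej  = alpha * score_a(prompt, rejected)  + (1 - alpha) * score_b(prompt, rejected)
            correct += int(s_pref > s_rej)
        results[round(float(alpha), 2)] = 100.0 * correct / len(pairs)
    return results
```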
6.3.2. ReFL Ablation and Comparative Analysis
- Comparison to Indirect Methods (Table 4, Figure 6, Figure 20): ReFL consistently achieves the highest win rates compared to the other LDM optimization methods (Dataset Filtering, Reward Weighted, RAFT) and clearly surpasses the SD v1.4 baseline.
The image (Figure 20 from the original paper) illustrates the win rate between different fine-tuning methods, with bars colored to show wins (green), ties (gray), and losses (red); ReFL achieves the highest win rate against every other method.
- Underperformance of RAFT and Reward Weighted: The authors critically analyze why RAFT and Reward Weighted did not improve over the baseline in their experiments, despite having shown positive results in their original papers. (Minimal sketches of both baselines follow this list.)
  - RAFT's Limitations: RAFT is susceptible to overfitting with increasing iterations (performance drops significantly from iter=1 to iter=3). This is attributed to limitations of the expert generator (used for regularization), especially when fine-tuning on challenging real-user data, which the original RAFT evaluation might not have focused on. If the expert generator fails to produce high-quality images for complex real-world prompts, the reward-ranked dataset used for fine-tuning will be suboptimal.
  - Reward Weighted's Limitations: This method uses reward coefficients constrained to [0, 1]. While preferred images get higher weights, the influence of non-preferred images is not entirely eliminated. When diverse real-user prompts introduce many non-preferred images into the dataset (even if they are "relatively" the best among poor options), this residual interference degrades the fine-tuning.
  - Dataset Filtering's Indirectness: Dataset Filtering handles non-preferred images by labeling them "Weird image," but its influence on model improvement is indirect.
- ReFL's Directness: ReFL's superior performance is attributed to its direct gradient feedback mechanism. By directly incorporating RM scores into the LDM's loss function, it provides a clear and effective signal to guide the model towards generating human-preferred outputs, overcoming the indirectness and potential pitfalls of data-oriented methods.
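As referenced above, minimal sketches of the two indirect baselines follow, written against hypothetical `generate` and `score` callables. These illustrate the general recipes discussed in this subsection, not the cited papers' code: RAFT-style reward-ranked data selection, and a reward-weighted loss whose coefficients stay in [0, 1].

```python
# Minimal sketches of the two indirect baselines (hypothetical helpers, not the
# original implementations).

def raft_round(prompts, generate, score, k=1, n=8):
    """RAFT-style data construction: keep the top-k of n samples per prompt, then
    fine-tune on the kept (prompt, image) pairs with the ordinary denoising objective."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        candidates.sort(key=lambda image: score(prompt, image), reverse=True)
        kept.extend((prompt, image) for image in candidates[:k])
    return kept  # pass to a standard fine-tuning loop

def reward_weighted_loss(per_example_losses, rewards):
    """Reward-weighted likelihood: weight each example's loss by a reward-derived
    coefficient clamped into [0, 1]; non-preferred images keep a non-zero weight,
    which is the interference discussed above."""
    weights = [min(max(r, 0.0), 1.0) for r in rewards]
    total = sum(w * l for w, l in zip(weights, per_example_losses))
    return total / max(sum(weights), 1e-8)
```

In both cases the reward only shapes the training data or the loss weights; it never differentiates through the generated image, which is the key contrast with ReFL's direct gradient.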
6.4. ImageReward Compared to Other Reward Models (Appendix D)
- The paper also compares ImageReward against other recently emerged reward models such as HPS [61] and PickScore [21].
- Human Evaluation (Table 5): ImageReward still maintains the highest Preference Accuracy (65.14%) compared to PickScore (62.78%) and HPS (60.79%). It also shows superior Recall@4 (90.84%) and Filter@4 (88.95%).
- Bo64 and ReFL scenarios (Table 6): When evaluating models selected by "Best of 64" (Bo64) or fine-tuned by ReFL using these reward models, ImageReward (Ours) consistently leads in #Win counts and WinRate across both Real User Prompts and Multi-task Benchmark prompts. For example, in the Bo64 scenario on Real User Prompts, ImageReward achieves a 73.33% WinRate, higher than PickScore (72.16%) and HPS (67.24%).
- Train Set Distribution (Figure 18): Visualization using t-SNE [57] shows that ImageReward's training set is slightly more evenly distributed than PickScore's, which might contribute to its generalization ability; a sketch of such a visualization appears at the end of this subsection.
The image (Figure 18 from the original paper) uses t-SNE to visualize the prompt distributions of the PickScore and ImageReward training sets in two dimensions; blue points represent ImageReward's training set and orange points the Pick-a-Pic training set. The visualization suggests ImageReward's training set is slightly more evenly distributed.
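A sketch of a Figure 18-style visualization: embed the prompts of both training sets (the embedding function is a hypothetical stand-in) and project them jointly with scikit-learn's t-SNE.

```python
# Joint t-SNE projection of two prompt-embedding sets, plotted as a scatter plot.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_prompt_distributions(emb_a, emb_b, label_a="ImageReward", label_b="Pick-a-Pic"):
    """emb_a, emb_b: (N, D) arrays of prompt embeddings from the two training sets."""
    all_emb = np.concatenate([emb_a, emb_b], axis=0)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(all_emb)
    plt.scatter(*coords[: len(emb_a)].T, s=3, label=label_a)
    plt.scatter(*coords[len(emb_a):].T, s=3, label=label_b)
    plt.legend()
    plt.show()
```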
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ImageReward, the first general-purpose text-to-image (T2I) human preference reward model, and Reward Feedback Learning (ReFL), a novel direct fine-tuning approach for optimizing diffusion models using ImageReward's feedback. ImageReward is built upon a systematic annotation pipeline that collected 137,000 expert comparisons, demonstrating superior performance over existing metrics (CLIP, Aesthetic, BLIP) in aligning with human preferences and distinguishing image quality. ReFL addresses the limitations of applying traditional RLHF to diffusion models by leveraging the insight that ImageReward scores become reliable at later denoising steps, enabling direct gradient-based optimization. Extensive evaluations confirm that ReFL significantly improves generative models, leading to outputs that are more aligned with human values and preferences, outperforming indirect fine-tuning methods. The work provides open-sourced code and datasets, advancing the field of human-aligned T2I generation.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline future research directions:
- Annotation Scale, Diversity, and Quality: While 137k comparisons is substantial, a larger annotation dataset is still needed for even better RM training. The current prompts from DiffusionDB, though diverse, may still harbor biases; exploring more varied prompt distributions beyond typical user prompts is necessary. The current single-person annotation with quality control strategy could be improved by multi-person annotation for better consistency.
- RM Training Techniques: The observed overfitting during RM training suggests that more advanced techniques, such as parameter-efficient tuning [27, 34, 25, 32], could further enhance ImageReward's performance. A stronger and larger text-image backbone model beyond BLIP is also expected to yield additional gains.
- Improving Generative Models with RM: Although ReFL is effective, it remains an approximation of the original RLHF algorithms. Fundamentally unbiased and efficient feedback learning algorithms with solid theoretical grounding are still needed to achieve even better human alignment in diffusion models.
7.3. Personal Insights & Critique
This paper makes a significant contribution by bridging the gap between objective image quality metrics and subjective human preferences in T2I generation. The systematic development of ImageReward and ReFL is a crucial step towards more controllable and human-centric generative AI.
Personal Insights:
- Annotation as a Core Contribution: The detailed and systematic annotation pipeline is arguably as important as the models themselves. The analysis of prompt categories, problem distributions, and the impact of function words provides invaluable insights into the challenges and nuances of human preference in T2I generation, which can guide future data collection and model development.
- The "Later Denoising Step" Insight: The observation that ImageReward scores become reliable at later denoising steps is a clever and practical solution to the likelihood problem of diffusion models in RLHF. This insight enables a direct optimization strategy that was previously elusive.
- Direct vs. Indirect Optimization: The clear empirical evidence showcasing ReFL's superiority over indirect, data-oriented methods highlights the power of directly incorporating human feedback into the generative process's loss function. This points towards a more efficient and effective paradigm for aligning generative AI.
- Generalizability: The concept of a general-purpose human preference reward model for T2I has broad implications, not just for improving existing models but also for guiding the development of new architectures and serving as a robust automatic evaluation metric.
Critique:
- Subjectivity of "Human Preference": While the paper meticulously designs its annotation pipeline, "human preference" is inherently subjective and culturally dependent. The collected data reflects the preferences of a specific group of annotators, predominantly in China (given the affiliations), which might not fully represent global or diverse aesthetic and ethical standards. The paper briefly touches upon this in "Broader Impact" but does not elaborate on how the RM might handle conflicting preferences across different user groups or cultures.
- Potential for Reward Model Hacking / Bias Amplification: Like any reward model, ImageReward could potentially be hacked or overfit to specific patterns, leading to mode collapse or amplification of biases present in the annotation data. While the paper mentions regularization, the long-term robustness against subtle adversarial inputs or unknown biases warrants further investigation.
- Cost and Scalability of Expert Annotation: Although the dataset is large, "expert comparisons" are costly. While the RM reduces ongoing human effort, the initial investment for RM training can be substantial. Future work might explore how to effectively combine expert annotation with cheaper, lower-quality feedback (e.g., from crowd-sourcing).
- Interpretation of "Function" Words: The analysis of function words is interesting, but deeper causal links are not fully established. While correlation is shown, the exact mechanism by which "8k" or "highly detailed" influences the model and its problems (e.g., an initial increase and then a decrease in body problems) could be explored further.

This paper lays a strong foundation for integrating human feedback directly into the T2I generation loop, moving beyond mere quantitative metrics to achieve truly human-aligned AI outputs. Its methods and insights are highly applicable to other generative tasks where subjective quality is paramount.