
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces ImageReward, a general-purpose human preference reward model for text-to-image generation, trained on 137,000 expert comparisons collected through a systematic annotation process. It outperforms existing scoring models and metrics, and the paper further proposes Reward Feedback Learning (ReFL) for directly optimizing diffusion models against the reward model.

Abstract

We present a comprehensive solution to learn and improve text-to-image models from human preference feedback. To begin with, we build ImageReward -- the first general-purpose text-to-image human preference reward model -- to effectively encode human preferences. Its training is based on our systematic annotation pipeline including rating and ranking, which collects 137k expert comparisons to date. In human evaluation, ImageReward outperforms existing scoring models and metrics, making it a promising automatic metric for evaluating text-to-image synthesis. On top of it, we propose Reward Feedback Learning (ReFL), a direct tuning algorithm to optimize diffusion models against a scorer. Both automatic and human evaluation support ReFL's advantages over compared methods. All code and datasets are provided at \url{https://github.com/THUDM/ImageReward}.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

1.2. Authors

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Zheng, Ming Ding, Jie Tang, Yuxiao Dong. The authors are primarily affiliated with Tsinghua University, with additional affiliations from Zhipu AI and Beijing University of Posts and Telecommunications.

1.3. Journal/Conference

The paper was posted on arXiv, a preprint server, on April 12, 2023. arXiv is a well-respected platform for disseminating research quickly, particularly in machine learning and artificial intelligence. Although an initial arXiv posting is not peer-reviewed, papers posted there are often subsequently submitted to and published in top-tier venues. Its reputation lies in its role in fast and open scientific communication.

1.4. Publication Year

2023

1.5. Abstract

This paper introduces a comprehensive solution for learning from and improving text-to-image (T2I) models based on human preference feedback. The core of their approach is ImageReward, presented as the first general-purpose T2I human preference reward model. This model is trained using a systematic annotation pipeline, encompassing rating and ranking, which has accumulated 137,000 expert comparisons. ImageReward is shown to outperform existing scoring models and metrics in human evaluation, establishing it as a promising automatic metric for evaluating T2I synthesis. Building upon ImageReward, the authors propose Reward Feedback Learning (ReFL), a direct tuning algorithm designed to optimize diffusion models using feedback from a scorer. Both automatic and human evaluations demonstrate ReFL's advantages over alternative methods. The code and datasets associated with this work are publicly available on GitHub.

The paper is available as a preprint on arXiv: https://arxiv.org/abs/2304.05977 (PDF: https://arxiv.org/pdf/2304.05977v4.pdf).

2. Executive Summary

2.1. Background & Motivation

The rapid advancements in text-to-image (T2I) generative models, particularly diffusion-based approaches, have enabled the creation of high-fidelity and semantically-related images from text prompts. Despite this progress, these models still face significant challenges in aligning with nuanced human preferences. The primary issues stem from the fact that existing models are often self-supervised pre-trained on noisy data distributions that do not perfectly match actual user-prompt distributions. This discrepancy leads to several well-documented problems in generated images:

  • Text-image Alignment: Models often fail to accurately depict all specified numbers, attributes, properties, and relationships described in text prompts.

  • Body Problem: Generated images frequently feature distorted, incomplete, duplicated, or abnormal body parts of humans or animals.

  • Human Aesthetic: The outputs can deviate from mainstream human aesthetic preferences.

  • Toxicity and Biases: Models may generate harmful, violent, sexual, discriminatory, illegal, or psychologically disturbing content.

    These challenges are difficult to address solely through architectural improvements or more extensive pre-training data. The paper draws inspiration from the success of reinforcement learning from human feedback (RLHF) in natural language processing (NLP), which has been instrumental in aligning large language models with human preferences and values. RLHF relies on training a reward model (RM) to capture human preferences from expert-annotated comparisons of model outputs. However, extending RLHF to T2I generation presents unique challenges, particularly regarding the costly and complex annotation process for images and the difficulty of applying traditional RL algorithms to diffusion models which do not provide likelihoods for their generations.

The paper's entry point is to tackle these problems by systematically creating a human preference reward model for T2I generation and then developing a method to directly optimize generative models using this feedback.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of text-to-image generation:

  • ImageReward Model: The authors introduce ImageReward, the first general-purpose text-to-image human preference reward model. This model is specifically designed to effectively encode human preferences for T2I outputs.
  • Systematic Annotation Pipeline and Dataset: A novel, systematic annotation pipeline was developed, incorporating both rating and ranking stages. This rigorous process led to the collection of a dataset comprising 137,000 expert comparisons, which serves as the foundation for training ImageReward.
  • Superior Evaluation Metric: ImageReward is demonstrated to significantly outperform existing scoring models and metrics (such as CLIP, Aesthetic, and BLIP scores) in understanding and predicting human preferences for T2I synthesis. It also exhibits better distinguishability across models and individual samples and aligns strongly with human rankings.
  • Reward Feedback Learning (ReFL): The paper proposes ReFL, a direct tuning algorithm tailored for optimizing diffusion models using feedback from a scorer (like ImageReward). This addresses the limitation of traditional RLHF approaches that are not directly applicable to diffusion models due to their lack of explicit likelihoods. A key insight for ReFL is that ImageReward scores become reliably identifiable at later denoising steps of a diffusion process, enabling direct gradient-based optimization.
  • Improved Generative Models: Both automatic and human evaluations validate ReFL's advantages over alternative methods, including data augmentation and loss re-weighting, in improving the quality and human alignment of generated images. The ReFL-trained models demonstrate enhanced text coherence and better adherence to human preferences.
  • Open-sourced Resources: All code and datasets developed in this research are made publicly available, fostering reproducibility and further research in the community.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Text-to-Image (T2I) Generation Models: These are artificial intelligence models that create visual images from textual descriptions (prompts). They have revolutionized digital art and content creation. Early T2I models were based on generative adversarial networks (GANs), while more recent and higher-performing models leverage diffusion models or auto-regressive models.
    • Auto-regressive Models: These models generate images pixel by pixel or token by token, conditioning each new generation on the previously generated elements. Examples include DALL-E and CogView.
    • Diffusion Models: These models work by iteratively denoising an image that starts as pure random noise. They learn to reverse a diffusion process that gradually adds noise to an image. Latent Diffusion Models (LDMs), like Stable Diffusion, perform this process in a compressed latent space, making them more computationally efficient.
  • Reinforcement Learning from Human Feedback (RLHF): A training paradigm where a model (often a large language model or LLM) is fine-tuned using feedback provided by humans. This feedback is typically used to train a reward model (RM), which then assigns a score to different model outputs. The LLM is subsequently optimized (e.g., using Proximal Policy Optimization (PPO)) to generate outputs that maximize this reward score, thus aligning the model's behavior with human preferences.
  • Reward Model (RM): In the context of RLHF, a reward model is a neural network trained to predict human preferences or scores for a given model output (e.g., a generated text response or an image). It learns from human comparisons (e.g., "output A is better than output B"). Once trained, the RM can provide a continuous feedback signal for optimizing the generative model, circumventing the need for constant human supervision during the RL phase.
  • CLIP (Contrastive Language–Image Pre-training): A neural network trained on a massive dataset of image-text pairs to understand the relationship between images and text. CLIP can compute a similarity score between any given image and text prompt without being specifically trained on that task. It consists of an image encoder and a text encoder, both of which project inputs into a shared embedding space where semantic similarity can be measured (e.g., using cosine similarity); a minimal scoring sketch appears after this list.
  • BLIP (Bootstrapping Language-Image Pre-training): Another vision-language model that improves upon CLIP by bootstrapping its training data and using a more unified architecture for both understanding and generation tasks. It generally achieves better performance in various vision-language benchmarks due to its multi-modal mixture of encoder-decoder architecture.
  • FID (Fréchet Inception Distance): A widely used metric for evaluating the quality of images generated by generative adversarial networks (GANs) and diffusion models. FID measures the similarity between the feature distributions of generated images and real images. It does this by calculating the Fréchet distance between two multivariate Gaussian distributions fitted to the features extracted from an Inception-v3 model for both real and generated image sets. A lower FID score indicates higher quality and diversity of generated images, closer to the real data distribution.
  • Aesthetic Score: This refers to a score provided by a specialized predictor (often built on top of CLIP embeddings) that aims to quantify the aesthetic quality or appeal of an image according to human judgment. These predictors are typically trained on datasets where images have been rated by humans for aesthetics.
  • Denoising Diffusion Probabilistic Models (DDPMs): The foundational theory for many diffusion models. These models learn to reverse a Markov chain of gradually adding Gaussian noise to data. During generation (sampling), the model starts from pure noise and iteratively predicts and removes the noise to produce a clear image.
  • Latent Diffusion Models (LDMs): An advancement over DDPMs that performs the diffusion process in a compressed latent space rather than directly in pixel space. This makes LDMs significantly more efficient for generating high-resolution images while maintaining quality. Stable Diffusion is a prominent example of an LDM.
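
As a concrete illustration of how the CLIP score described above can be computed, the following is a minimal sketch (not from the paper) using the Hugging Face `transformers` CLIP implementation; the checkpoint name and the plain cosine-similarity formulation are illustrative assumptions.

```python
# Minimal sketch: cosine-similarity CLIP score between a prompt and an image.
# Assumes `transformers` and `Pillow` are installed; the checkpoint is illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Project both modalities into the shared embedding space and compare.
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((text_emb * image_emb).sum())
```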

3.2. Previous Works

  • Early T2I Models and Evolution: The paper references GANs [17] as early pioneers, followed by auto-regressive models like DALL-E [43], CogView [11], and Make-A-Scene [16], and more recently, diffusion models such as GLIDE [37], DALL-E 2 [42], Stable Diffusion [45], and Imagen [46]. These models have continually improved image fidelity and text alignment but often struggle with subtle human preferences.
  • RLHF in NLP: The paper highlights RLHF's success in NLP with models like InstructGPT [39], ChatGPT, and WebGPT [36]. These works demonstrated that training a reward model from human comparisons and then optimizing language models with Proximal Policy Optimization (PPO) [51] could align LLMs with human values (e.g., helpfulness, harmlessness, truthfulness).
  • Evaluation Metrics for T2I:
    • FID [18] and Inception Score (IS) [4]: These are commonly used objective metrics. However, the paper points out their limitations: they measure statistical similarity to real images but don't capture human preference, struggle with single-image evaluation, and zero-shot FID can be problematic due to potential data leakage.
    • CLIP Score [41]: Often used to evaluate text-image coherence by measuring the cosine similarity of embeddings. While useful for alignment, it doesn't fully capture human aesthetic preference or other aspects of "goodness" from a human perspective.
    • Aesthetic Predictors [50]: Models trained to predict image aesthetics, often using CLIP features. These contribute to assessing one aspect of human preference but lack comprehensive evaluation of text-image alignment and other factors.
  • Prior T2I Preference Alignment Efforts: The paper discusses concurrent works aimed at aligning T2I models with human feedback, categorizing them as indirect:
    • Dataset Filtering [61; 13]: Uses a reward model to filter datasets, selecting high-quality samples or identifying low-quality ones for specific handling during fine-tuning.
    • Loss Re-weighting [23]: Adjusts the weights of training losses based on a reward model's scores, giving more emphasis to preferred samples.
    • RAFT [13]: Proposes Reward Ranked Fine-Tuning, which constructs datasets of generated images with higher rewards for fine-tuning. The paper argues that these approaches are indirect as they primarily manipulate data or loss coefficients rather than directly optimizing the generative process based on RM gradients.

3.3. Technological Evolution

The evolution of generative models for images has moved from GANs (e.g., StyleGAN) which could generate realistic images but were often hard to train and control, to auto-regressive models (e.g., DALL-E) that offered better text conditioning but were computationally intensive for high resolutions. The breakthrough of diffusion models (e.g., DDPMs, LDMs like Stable Diffusion) provided a more stable and high-quality generation process. Concurrently, vision-language models (CLIP, BLIP) emerged, demonstrating powerful text-image understanding.

The success of RLHF in NLP, particularly in aligning LLMs with complex human instructions and safety standards, highlighted a pathway for improving generative models beyond purely objective metrics. However, applying RLHF to image generation is challenging due to the continuous and high-dimensional nature of image data, the difficulty of obtaining human feedback at scale, and the architectural differences of diffusion models compared to LLMs (e.g., lack of explicit likelihoods). This paper's work represents a significant step in adapting RLHF principles for T2I models, bridging the gap between objective quality and subjective human preferences by creating a specialized reward model and a direct optimization strategy.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach offers several core differences and innovations:

  • First General-Purpose T2I Human Preference RM: While CLIP, Aesthetic, and BLIP provide scores, they are not explicitly trained as general-purpose human preference reward models for T2I generation, encompassing alignment, fidelity, and harmlessness, and tuned specifically from human comparisons in this domain. ImageReward is specifically built for this comprehensive task based on a large, systematically annotated dataset.
  • Rigorous Annotation Pipeline: Unlike some concurrent works that rely on noisy online user clicking data [21, 61, 13], ImageReward's training dataset is built on a systematic annotation pipeline with expert comparisons, specific labeling criteria, annotator training, and quality validation. This ensures higher quality and consistency of the human feedback.
  • Direct Tuning of Diffusion Models (ReFL): Most prior and concurrent works aiming to use human feedback to improve T2I models (Dataset Filtering [61], Reward Weighted [23], RAFT [13]) adopt indirect methods like data augmentation or loss re-weighting. These approaches are often limited by the quality of the filtered data or the effectiveness of the re-weighting scheme. ReFL, in contrast, proposes a direct tuning algorithm that leverages the reward model's gradients to optimize the diffusion model's parameters. This is a more direct and potentially powerful approach, especially given the identified insight about ImageReward scores becoming reliable at later denoising steps.
  • Addressing LDM Likelihood Limitations: ReFL specifically addresses the challenge that LDMs do not provide explicit likelihoods (a requirement for traditional RLHF methods like PPO used in NLP). By using ImageReward to provide feedback on intermediate denoised latents, ReFL finds a novel way to apply preference-based optimization to diffusion models.

4. Methodology

4.1. Principles

The core idea behind this work is to leverage human preference feedback to systematically improve text-to-image (T2I) generative models. This involves two main principles:

  1. Learning Human Preferences: Create a reward model (RM) that can accurately quantify human preferences for generated images given a text prompt. This RM (ImageReward) is trained on a large dataset of expert human comparisons, encoding complex criteria like text-image alignment, image fidelity, and harmlessness.
  2. Optimizing Generative Models with RM Feedback: Develop a method to use the learned RM as a scorer to directly guide the training of diffusion models. This direct tuning approach (Reward Feedback Learning, or ReFL) allows the generative model to produce outputs that are inherently more aligned with human preferences, bypassing the inefficiencies of filtering or re-weighting existing data.

4.2. Core Methodology In-depth (Layer by Layer)

The overall architecture for ImageReward and ReFL is depicted in Figure 2.

Figure 2: An overview of ImageReward and ReFL. (Upper) ImageReward's annotation and training, consisting of data collection, annotation, and preference learning. (Lower) ReFL leverages ImageReward's feedback to directly optimize diffusion models at a random latter denoising step.

The image (Figure 2 from the original paper) illustrates the overall workflow of ImageReward and ReFL. The upper section details the ImageReward process, starting with Data Collection (real-world user prompts from DiffusionDB), followed by Annotation (human experts rating and ranking images), and finally Preference Learning to train the ImageReward model. The lower section shows ReFL, where the ImageReward model provides feedback to directly optimize the diffusion model at a random later denoising step through backpropagation.

4.2.1. ImageReward: Learning to Score and Evaluate Human Preferences

ImageReward is constructed through a systematic pipeline focusing on data collection, human annotation from experts, and reward model (RM) training.

4.2.1.1. Annotation Pipeline Design

The annotation pipeline is crucial for collecting high-quality human preference data.

  • Prompt Selection and Image Collection:

    • Source: Real user prompts are sourced from DiffusionDB [58], a large open-source dataset.
    • Diversity: To ensure a diverse and representative prompt distribution, a graph-based algorithm is employed. This algorithm uses Sentence-BERT [44] to calculate prompt similarity. Sentence-BERT transforms sentences into dense vector embeddings, allowing for efficient similarity comparison.
    • Process: Prompts are grouped into sets (e.g., 100 sets of 20k prompts). Within each set, a graph-based algorithm selects 100 prompts. Each prompt is a vertex, connected to its $k$ nearest neighbors (where $k = 150$) based on cosine similarity of Sentence-BERT embeddings. A score is calculated for each vertex based on its unselected neighbors, and vertices with high scores are iteratively selected. (A hypothetical sketch of this selection step appears at the end of this subsection.)
    • Result: 10,000 candidate prompts are selected, each paired with 4 to 9 sampled images from DiffusionDB, totaling 177,304 candidate pairs for annotation.
  • Human Annotation Design: The pipeline involves three stages to ensure comprehensive and consistent feedback.

    • Prompt Annotation:

      • Categorization: Annotators classify prompts based on a schema adapted from Parti [63] (e.g., Abstract, Animals, People, Outdoor Scenes). This helps in understanding model performance across different domains.
      • Problem Identification: Annotators identify problematic prompts such as ambiguous, unclear, or those containing toxic content (pornographic, violent, discriminatory).
    • Text-Image Rating: For each generated image corresponding to a prompt, annotators rate it based on three key measurements using seven-level Likert scales (1=worst, 7=best):

      • Alignment: How faithfully the image depicts objects, attributes, and relationships described in the prompt.

      • Fidelity: The quality of the image itself, including realism, aesthetic appeal, and absence of visual errors (e.g., distorted body parts).

      • Harmlessness: Absence of toxic, illegal, biased, or psychologically discomforting content.

      • Problem Checkboxes: Annotators also check specific issues like "Obvious 'repeated generation'," "Existence of body problem," "Too blurry to see objects," "Causes psychological discomfort," "Output contains sexual/violent content," or "Output contains content that defames certain groups." The annotation system screenshots are provided in Figure 8 of the paper's appendix.


      The image (Figure 8a from the original paper) shows the user interface for Text-Image Rating. Annotators rate Overall Rating, Image-Text Alignment, and Fidelity using a 7-point Likert scale. They also select checkboxes to identify specific issues like repeated generation, body problems, blurriness, psychological discomfort, sexual content, violent content, or defamation.

    • Image Ranking: After individual image ratings, annotators rank all generated images for a given prompt from best to worst.

      • Trade-offs: To handle contradictions (e.g., high alignment but high toxicity), explicit trade-off guidelines are provided (detailed in Appendix B). For instance, harmlessness generally outweighs alignment.

      • Slot-based Ranking: Images are dragged into 5 slots representing different preference levels, allowing for ties (max two images per slot).


      The image (Figure 8b from the original paper) displays the Image Ranking interface. Annotators drag generated images into 5 slots, from "1=best" to "5=worst," to express their preference for images corresponding to a given prompt.

  • Annotation Management:

    • Annotator Recruitment & Training: Annotators were recruited through a professional data annotation company and trained using detailed documents (Appendix B) describing the labeling processes, criteria, and examples. Most annotators have at least college-level education.
    • Quality Control: Quality inspectors double-check annotations, reassigning invalid ones.
  • Human Annotation Analysis:

    • Scale: After 2 months, 8,878 valid prompts and 136,892 compared pairs were collected.
    • Prompt Categories (Figure 9 in Appendix A.4): Distribution is diverse, with People (3,360), Arts (2,407), Outdoor Scenes (1,189), and Artifacts (831) being most common.
    • Average Score Distribution by Category (Figure 10 in Appendix A.4): Abstract prompts yield the lowest scores, possibly due to Stable Diffusion's difficulty with vague descriptions. Plants, Outdoor Scenes, Indoor Scenes generally receive higher scores.
    • Problem Distribution by Category (Figure 11 in Appendix A.4):
      • Body problem is the most severe (average 21.14%), most frequent in People and Arts categories.
      • Repeated generation (average 3.16%) appears in Word Knowledge, People, Arts, Vehicles.
      • Fuzzy images are common in Abstract, Animal, and Arts.
      • Toxic, pornographic, violent content is observed due to problematic prompts, highlighting models' inability to filter such content.
    • "Function" Words Analysis (Figure 12, 13 in Appendix A.4):
      • Function words (e.g., "8k", "highly detailed") in prompts influence generated quality.
      • As the proportion of function words increases from low to moderate, average scores (alignment, fidelity, overall) generally improve.
      • The frequency of problems like repeated generation and body/fuzzy problems shows a complex trend, initially increasing, then decreasing, and potentially increasing again when function words dominate.
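
Returning to the graph-based prompt selection described at the start of this pipeline, the following is a hypothetical sketch of how such a diversity-driven selection could be implemented with Sentence-BERT embeddings. The encoder name, the exact vertex-scoring heuristic, and the coverage update are assumptions for illustration, not the authors' released code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def select_diverse_prompts(prompts, n_select=100, k=150):
    """Greedy graph-based selection: each prompt is a vertex linked to its k nearest
    neighbors by Sentence-BERT cosine similarity; repeatedly pick the vertex whose
    neighborhood covers the most not-yet-covered prompts (heuristic assumed)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # any Sentence-BERT encoder
    emb = model.encode(prompts, normalize_embeddings=True)
    sim = emb @ emb.T                                 # cosine similarities
    np.fill_diagonal(sim, -np.inf)
    knn = np.argsort(-sim, axis=1)[:, :k]             # k nearest neighbors per vertex

    selected, uncovered = [], set(range(len(prompts)))
    while uncovered and len(selected) < n_select:
        # Score each uncovered vertex by how many uncovered neighbors it has.
        best = max(uncovered, key=lambda v: sum(u in uncovered for u in knn[v]))
        selected.append(best)
        uncovered.discard(best)
        uncovered.difference_update(knn[best].tolist())   # neighbors are now covered
    return [prompts[i] for i in selected]
```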

4.2.1.2. RM Training

The goal is to train a reward model (RM) that can function as a virtual evaluator.

  • Preference Formulation: Human annotations are formulated as rankings. For a prompt $T$, $k$ images $(x_1, x_2, ..., x_k)$ are ranked from best to worst. This generates up to $C_k^2$ comparison pairs (if no ties). For each comparison where $x_i$ is preferred over $x_j$, the model learns to assign a higher score to $x_i$.
  • Loss Function: The reward model $f_\theta(T, x)$ is a neural network that takes a text prompt $T$ and an image $x$ as input and outputs a scalar preference score. The model is trained with a pairwise comparison loss, similar to those used in RLHF for language models (a minimal PyTorch sketch follows this list): $\mathrm{loss}(\theta) = - \mathbb{E}_{(T, x_i, x_j) \sim \mathcal{D}} \left[ \log \sigma \left( f_\theta(T, x_i) - f_\theta(T, x_j) \right) \right]$
    • Symbol Explanation:
      • $\theta$: The parameters of the reward model $f$.
      • $\mathbb{E}$: Expectation over the dataset $\mathcal{D}$.
      • $(T, x_i, x_j) \sim \mathcal{D}$: A sampled triplet from the dataset, consisting of a text prompt $T$, an image $x_i$ preferred by humans, and an image $x_j$ less preferred than $x_i$.
      • $\log$: The natural logarithm function.
      • $\sigma$: The sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$. It squashes the difference in scores to a probability-like value between 0 and 1.
      • $f_\theta(T, x_i)$: The preference score predicted by the reward model $f$ for prompt $T$ and image $x_i$.
      • $f_\theta(T, x_j)$: The preference score predicted by the reward model $f$ for prompt $T$ and image $x_j$.
    • Purpose: This loss function encourages the reward model to assign a higher score to the preferred image $x_i$ than to the less preferred image $x_j$. Minimizing this negative log-likelihood maximizes the probability that $f_\theta(T, x_i)$ is greater than $f_\theta(T, x_j)$.
  • Training Techniques:
    • Backbone: BLIP [26] is chosen as the backbone over CLIP due to its superior performance in preliminary experiments (Table 2b). BLIP's image-grounded text encoder is used as a feature encoder.
    • Architecture: BLIP's image and text features are combined using cross-attention, and an MLP (Multi-Layer Perceptron) head generates the scalar preference score.
    • Overfitting Mitigation: Rapid convergence and overfitting were observed. Freezing a proper number of backbone transformer layers (70% in this case) was found to improve performance.
    • Hyperparameter Tuning: Sensitivity to learning rate and batch size required careful grid search. Optimal values found were a learning rate of 1e-5 and a batch size of 64 (per-GPU batch size of 16 on 4 NVIDIA A100 GPUs).
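
To make the training objective above concrete, here is a minimal PyTorch sketch of a scalar reward head and the pairwise comparison loss referenced in the Loss Function item. The feature dimension, head architecture, and the random placeholder features standing in for BLIP cross-attention outputs are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """MLP head mapping fused text-image features to a scalar preference score."""
    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.GELU(), nn.Linear(256, 1)
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (batch, feature_dim), e.g. a BLIP cross-attention output
        return self.mlp(fused_features).squeeze(-1)

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """loss = -E[log sigmoid(f(T, x_i) - f(T, x_j))] over comparison pairs."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Usage sketch with random placeholder features in place of real BLIP features:
head = RewardHead()
feat_i, feat_j = torch.randn(8, 768), torch.randn(8, 768)  # preferred / rejected
loss = pairwise_preference_loss(head(feat_i), head(feat_j))
loss.backward()
```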

4.2.1.3. As Metric: Re-Evaluating Human Preferences on Text-to-Image Models

The paper proposes ImageReward as a promising zero-shot automatic evaluation metric for T2I model comparison and individual sample selection, addressing limitations of existing metrics.

  • Limitations of FID:

    • Zero-shot Usage: Fine-tuned FID may not reflect real-world zero-shot performance. Zero-shot FID can be affected by data leakage in pre-training.
    • Human Preference: FID measures statistical distance to real images, not human preference, and is an average metric, not suitable for single-image selection.
  • Better Human Alignment Across Models:

    • Setup: Researcher annotation was performed on 6 popular T2I models (CogView 2, Versatile Diffusion, Stable Diffusion 1.4/2.1, DALL-E 2, Openjourney). 100 real-user prompts were used, with 10 outputs per model.

    • Process: For each prompt, the best image out of 10 was selected from each model. Annotators then ranked these best images from different models. Win counts were aggregated.

    • Metrics: ImageReward and CLIP scores were averaged over 1,000 text-image pairs per model. Zero-shot FID (30k) on MS-COCO 2014 was also calculated.

    • Result: As shown in Table 1, ImageReward (Spearman $\rho = 1.00$) aligns significantly better with human ranking than CLIP (Spearman $\rho = 0.60$) and zero-shot FID (Spearman $\rho = 0.09$). The following are the results from Table 1 of the original paper:

(The Human Eval., ImageReward, and CLIP columns are computed on real user prompts; the last two columns are computed on MS-COCO 2014.)

| Model | Human Eval. (Rank / #Win) | ImageReward (Rank / Score) | CLIP (Rank / Score) | ImageReward on MS-COCO (Rank / Score) | Zero-shot FID* on MS-COCO (Rank / Score) |
|---|---|---|---|---|---|
| Openjourney | 1 / 507 | 1 / 0.2614 | 2 / 0.2726 | 2 / -0.0455 | 5 / 20.7 |
| Stable Diffusion 2.1-base | 2 / 463 | 2 / 0.2458 | 4 / 0.2683 | 3 / 0.1553 | 4 / 18.8 |
| DALL-E 2 | 3 / 390 | 3 / 0.2114 | 3 / 0.2684 | 1 / 0.5387 | 1 / 10.9* |
| Stable Diffusion 1.4 | 4 / 362 | 4 / 0.1344 | 1 / 0.2763 | 4 / -0.0857 | 3 / 17.9 |
| Versatile Diffusion | 5 / 340 | 5 / -0.2470 | 5 / 0.2606 | 6 / -0.5485 | 2 / 18.4 |
| CogView 2 | 6 / 74 | 6 / -1.2376 | 6 / 0.2044 | 5 / -0.8510 | 6 / 26.2 |
| Spearman ρ to Human Eval. | - | 1.00 | 0.60 | 0.77 | 0.09 |
  • Better Distinguishability Across Models and Samples:

    • ImageReward demonstrates a much larger interquartile range (IQR) in its score distributions compared to CLIP (Figure 3), indicating its ability to better distinguish quality between individual samples.

    • The medians of ImageReward scores for different models also align with human ranking, while CLIP's medians do not.

      Figure 3: Normalized distribution of ImageReward and CLIP scores of different generative models (outliers are discarded). ImageReward's scores align well with human preference and present higher distinguishability.

    The image (Figure 3 from the original paper) shows the normalized distribution of ImageReward and CLIP scores for different generative models, with outliers discarded. It illustrates that ImageReward's scores align well with human preference and exhibit higher distinguishability, evidenced by a larger spread (IQR) of scores compared to CLIP.

4.2.2. ReFL: Reward Feedback Learning Improves Text-to-Image Diffusion

While ImageReward can select high-quality images, the generate-and-then-filter paradigm is inefficient. ReFL aims to improve the generative model directly.

  • Challenge:

    • LDMs (Latent Diffusion Models) lack explicit likelihoods for their generations, preventing direct application of RLHF algorithms like PPO used in NLP.
    • Classifier-guidance is an inference-only technique and requires a classifier trained on noisy intermediate latents, which contradicts the RM's annotation process (requiring fully denoised images).
    • Indirect solutions (data filtering, loss re-weighting) are limited and only provide mild improvements.
  • ReFL: Insight and Solution:

    • Key Insight: Observing ImageReward scores along the denoising steps (e.g., 40 steps in total), it was found that the final image quality becomes reliably identifiable after a certain number of steps (e.g., $t \geq 30$ out of 40).
      • When $t \leq 15$: scores are uniformly low.

      • When $15 \leq t \leq 30$: high-quality generations start to emerge, but overall judgment is difficult.

      • When $t \geq 30$: generations with different ImageReward scores are generally distinguishable.

        (Figure 4: ImageReward values at different denoising steps with the corresponding generated images; e.g., the score is 0.35 at step 30 and 0.59 at step 40, showing the trend toward higher human preference.)

    The image (Figure 4 from the original paper) illustrates ImageReward scores of a prompt generated with different seeds along denoising steps. It shows that final image qualities become identifiable after 30 out of 40 steps, suggesting that ImageReward can provide meaningful feedback even on partially denoised latents.

    • This insight allows ImageReward scores from partially denoised latents (the predicted $x_0'$ after $t \geq 30$ steps) to serve as reliable feedback for LDM improvement.
    • Direct Optimization: ReFL proposes a direct fine-tuning method for LDMs by treating RM scores as human preference losses and backpropagating gradients to a randomly picked latter step $t$ (e.g., $t \in [30, 40]$). Using a random $t$ (instead of only the final step) is crucial for training stability and avoiding rapid overfitting.
    • Regularization: To prevent overfitting and stabilize fine-tuning, the ReFL loss is re-weighted (by $\lambda$) and regularized with the original pre-training loss.
  • Algorithm 1: Reward Feedback Learning (ReFL) for LDMs

    1: Dataset: Prompt set $\mathcal{Y} = \{ y_1, y_2, ..., y_n \}$
    2: Pre-training Dataset: Text-image pairs dataset $\mathcal{D} = \{ (\mathrm{txt}_1, \mathrm{img}_1), ..., (\mathrm{txt}_n, \mathrm{img}_n) \}$
    3: Input: LDM with pre-trained parameters $w_0$, reward model $r$, reward-to-loss map function $\phi$, LDM pre-training loss function $\psi$, reward re-weight scale $\lambda$
    4: Initialization: The number of noise scheduler time steps $T$, and time step range for fine-tuning $[T_1, T_2]$
    5: for $y_i \in \mathcal{Y}$ and $(\mathrm{txt}_i, \mathrm{img}_i) \in \mathcal{D}$ do
    6: $\mathcal{L}_{pre} \leftarrow \psi_{w_i}(\mathrm{txt}_i, \mathrm{img}_i)$
    7: $w_i \leftarrow w_i$ // Update $\mathrm{LDM}_{w_i}$ using Pre-training Loss
    8: $t \leftarrow \mathrm{rand}(T_1, T_2)$ // Pick a random time step $t \in [T_1, T_2]$
    9: $x_T \sim \mathcal{N}(0, I)$ // Sample noise as latent
    10: for $j = T, ..., t + 1$ do
    11: no grad: $x_{j-1} \gets \mathrm{LDM}_{w_i}\{ x_j \}$
    12: end for
    13: with grad: $x_{t-1} \gets \mathrm{LDM}_{w_i}\{ x_t \}$
    14: $x_0 \gets x_{t-1}$ // Predict the original latent by noise scheduler
    15: $z_i \gets x_0$ // From latent to image
    16: $\mathcal{L}_{reward} \leftarrow \lambda \phi(r(y_i, z_i))$ // ReFL loss
    17: $w_{i+1} \leftarrow w_i$ // Update $\mathrm{LDM}_{w_i}$ using ReFL loss
    18: end for

    • Explanation of Algorithm 1 (a hedged code sketch of one ReFL step appears at the end of this section):
      • Line 1-4 (Inputs and Initialization): The algorithm takes a set of prompts $\mathcal{Y}$ and a pre-training dataset $\mathcal{D}$ (text-image pairs). It requires an LDM with initial parameters $w_0$, the reward model $r$, a reward-to-loss mapping function $\phi$, the LDM pre-training loss function $\psi$, and a reward re-weight scale $\lambda$. Key initializations include the total number of denoising steps $T$ and the range $[T_1, T_2]$ for selecting a random time step for fine-tuning.
      • Line 5 (Loop over Data): The algorithm iterates over prompts $y_i$ from the prompt set $\mathcal{Y}$ and text-image pairs $(\mathrm{txt}_i, \mathrm{img}_i)$ from the pre-training dataset $\mathcal{D}$.
      • Line 6-7 (Pre-training Loss Update): It calculates the standard LDM pre-training loss $\mathcal{L}_{pre}$ for the current LDM parameters $w_i$ using the pre-training data. This part ensures the LDM retains its general generative capabilities. The LDM parameters $w_i$ are then updated based on this loss.
      • Line 8 (Random Time Step Selection): A random time step $t$ is chosen within the predefined range $[T_1, T_2]$ (e.g., [30, 40]). This is the point at which the reward model will evaluate a partially denoised latent.
      • Line 9 (Sample Noise): Initial random noise $x_T$ is sampled from a standard normal distribution $\mathcal{N}(0, I)$ to serve as the starting latent for the diffusion process.
      • Line 10-12 (Partial Denoising, No Gradient): The LDM performs denoising steps from $T$ down to $t+1$. Crucially, no gradients are computed or retained during this initial phase; these steps are treated as a fixed forward pass to reach an intermediate latent.
      • Line 13 (Denoising with Gradient): The LDM performs the denoising step from $x_t$ to $x_{t-1}$. This step does retain gradients, as it is the point where the ReFL loss will be applied.
      • Line 14-15 (Predict Original Latent and Convert to Image): From the partially denoised latent $x_{t-1}$, the noise scheduler predicts the corresponding original (clean) latent $x_0$. This $x_0$ is then converted into an image representation $z_i$, which is what the reward model will evaluate.
      • Line 16 (ReFL Loss Calculation): The reward model $r$ evaluates the generated image $z_i$ given the prompt $y_i$. The raw reward score is passed through the reward-to-loss map function $\phi$ (e.g., ReLU) and scaled by $\lambda$ to form the ReFL loss $\mathcal{L}_{reward}$. A higher reward from $r$ should translate to a lower loss.
      • Line 17 (LDM Update with ReFL Loss): The LDM parameters $w_i$ are updated using the calculated ReFL loss. This update directly steers the LDM to produce images that yield higher reward model scores.
      • Line 18 (End Loop): The process repeats for the next prompt and pre-training data sample.
  • Combined Loss Function: In practice, fine-tuning uses a combination of the ReFL loss and the standard pre-training loss to avoid rapid overfitting and stabilize training:
    $\mathcal{L}_{reward} = \lambda \, \mathbb{E}_{y_i \sim \mathcal{V}} \big[ \phi \big( r( y_i, g_\theta (y_i) ) \big) \big]$
    $\mathcal{L}_{pre} = \mathbb{E}_{(y_i, x_i) \sim \mathcal{D}} \, \mathbb{E}_{\mathcal{E}(x_i), y_i, \epsilon \sim \mathcal{N}(0, 1), t} \big[ \| \epsilon - \epsilon_\theta ( z_t, t, \tau_\theta (y_i) ) \|_2^2 \big]$
    The total loss optimized is $\mathcal{L}_{total} = \mathcal{L}_{pre} + \mathcal{L}_{reward}$.

    • Symbol Explanation for $\mathcal{L}_{reward}$:
      • $\mathcal{L}_{reward}$: The ReFL loss component.
      • $\lambda$: A reward re-weight scale (hyperparameter) that controls the influence of the ReFL loss.
      • $\mathbb{E}_{y_i \sim \mathcal{V}}$: Expectation over prompts $y_i$ sampled from the prompt set $\mathcal{V}$ (equivalent to $\mathcal{Y}$ in Algorithm 1).
      • $\phi$: The reward-to-loss map function (e.g., ReLU), which transforms the raw reward score into a loss signal. For instance, $\phi(s) = \max(0, -s)$ could be used if $s$ is the reward, so that minimizing the loss maximizes the reward.
      • $r$: The reward model (e.g., ImageReward) that takes a prompt $y_i$ and a generated image $g_\theta(y_i)$ and returns a scalar reward score.
      • $g_\theta(y_i)$: The generated image from the LDM with parameters $\theta$ corresponding to prompt $y_i$. This implicitly refers to the image $z_i$ generated in Algorithm 1.
    • Symbol Explanation for $\mathcal{L}_{pre}$:
      • $\mathcal{L}_{pre}$: The standard pre-training loss for LDMs, based on denoising score matching.
      • $\mathbb{E}_{(y_i, x_i) \sim \mathcal{D}}$: Expectation over text-image pairs $(y_i, x_i)$ from the pre-training dataset $\mathcal{D}$.
      • $\mathbb{E}_{\mathcal{E}(x_i), y_i, \epsilon \sim \mathcal{N}(0, 1), t}$: Expectation over encoder outputs $\mathcal{E}(x_i)$, prompts $y_i$, noise $\epsilon$ sampled from a standard normal distribution, and time steps $t$.
      • $\| \epsilon - \epsilon_\theta ( z_t, t, \tau_\theta (y_i) ) \|_2^2$: The squared L2 norm (mean squared error) between the actual noise $\epsilon$ added to the latent $z_0$ at time step $t$ and the noise predicted by the LDM's noise predictor $\epsilon_\theta$. This is the typical diffusion model training objective.
      • $\epsilon$: The ground truth noise added at time step $t$.
      • $\epsilon_\theta$: The noise prediction network of the LDM with parameters $\theta$.
      • $z_t$: The noisy latent at time step $t$.
      • $\tau_\theta(y_i)$: The text embedding (or conditioning vector) for the prompt $y_i$, derived from the LDM's text encoder.
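
As referenced in the Algorithm 1 explanation above, the following is a hedged sketch of a single ReFL reward-loss computation written against Hugging Face diffusers-style components. The component names (`unet`, `vae`, `scheduler`), the `reward_model(prompt, image)` interface, the latent shape, the ReLU-based reward-to-loss map $\phi$, and the default $\lambda$ are assumptions for illustration; in the full loop of Algorithm 1, this loss would be combined with the pre-training loss $\mathcal{L}_{pre}$ before the optimizer step.

```python
import torch

def refl_reward_loss(unet, vae, scheduler, reward_model, text_embeds, prompt,
                     T=40, T1=30, T2=40, lam=1e-3, device="cuda"):
    """Sketch of the ReFL loss for one prompt: denoise without gradients up to a
    random later step t in [T1, T2), take one gradient-carrying step, score the
    predicted clean latent with the reward model, and map the reward to a loss.
    Assumes a scheduler (e.g., DDIM) whose step output exposes pred_original_sample."""
    scheduler.set_timesteps(T, device=device)
    t_stop = int(torch.randint(T1, T2, (1,)))            # random later step index

    latents = torch.randn(1, 4, 64, 64, device=device)   # x_T ~ N(0, I)

    # Steps T .. t+1: plain forward pass, no gradients retained.
    with torch.no_grad():
        for t in scheduler.timesteps[:t_stop]:
            eps = unet(latents, t, encoder_hidden_states=text_embeds).sample
            latents = scheduler.step(eps, t, latents).prev_sample

    # Step t: keep gradients and take the scheduler's prediction of the clean latent x0.
    t = scheduler.timesteps[t_stop]
    eps = unet(latents, t, encoder_hidden_states=text_embeds).sample
    pred_x0 = scheduler.step(eps, t, latents).pred_original_sample

    # Decode the latent to image space and score it with the reward model (ImageReward).
    image = vae.decode(pred_x0 / vae.config.scaling_factor).sample
    reward = reward_model(prompt, image)                  # scalar preference score
    return lam * torch.relu(-reward)                      # assumed reward-to-loss map phi
```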

5. Experimental Setup

5.1. Datasets

The experiments utilized a combination of datasets for training and evaluation of both ImageReward and ReFL.

  • ImageReward Training Dataset:
    • Source: DiffusionDB [58], a large open-source dataset of real user prompts.
    • Selection: 10,000 candidate prompts were systematically selected using a graph-based algorithm leveraging Sentence-BERT for diversity.
    • Composition: 8,878 valid prompts yielded 136,892 comparison pairs after human expert annotation. This dataset was divided for training and validation, with 466 prompts used for testing (6,399 comparisons).
  • ImageReward Evaluation Datasets:
    • Model Comparison: 100 real-user test prompts (from DiffusionDB) were used to compare 6 popular T2I models. Each model generated 10 outputs per prompt.
    • Recall/Filter Test: An additional test set of 371 prompts, each with 8 generated images, was used for evaluating Recall@K and Filter@K metrics.
    • MS-COCO 2014 [28]: Used for zero-shot FID and ImageReward score evaluation of T2I models. Images were unified to $256 \times 256$ resolution.
  • ReFL Training Dataset:
    • Pre-training Dataset: A 625k subset of LAION-5B [50] (selected by aesthetic score) was used for LDM pre-training.
    • Prompt Set for ReFL: Sampled from DiffusionDB.
  • ReFL Evaluation Datasets:
    • Real User Prompts: 466 real user prompts from DiffusionDB.
    • Multi-task Benchmark (MT Bench) [40]: 90 designed challenging prompts for evaluation.

Example of a data sample (prompt):

  • "a painting of an ocean with clouds and birds, day time, low depth field effect" (Figure 9)

  • "Landscape photography by marc adamus, mountains with some forests, small lake in the center, fog in the background, sunrays, golden hour, high quality." (Figure 4)

  • "Beautiful portrait of female model Irakli Nadar with intricate details of color crushed fluid oil and acrylic on the shoulders, berries and dried moss and dried autumn leaves headdress." (Figure 14)

  • "High quality neoclassical portrait of a cat dressed as Napoleon Bonaparte, very detailed, 1860, oil on canvas." (Figure 14)

  • "Michelangelo's statue of David driving a blue Rolls Royce convertible, A night at Caesars Palace in Las Vegas in the background." (Figure 14)

    These datasets were chosen because DiffusionDB provides a rich source of real-world user prompts, which are representative of actual T2I model usage. LAION-5B is a standard large-scale dataset for pre-training vision-language models. MS-COCO is a common benchmark for image generation. MT Bench provides challenging prompts to test model robustness. This diverse selection allows for thorough validation of the proposed methods against various aspects of T2I generation and human preference.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate ImageReward and ReFL.

  • Preference Accuracy:

    • Conceptual Definition: Measures the correctness of a scorer (e.g., ImageReward or a baseline) in choosing the same preferred image as a human from a pair of images. It quantifies how well the model's pairwise preference prediction aligns with human judgment.
    • Mathematical Formula: While not explicitly given as a standard formula in the paper, it is conceptually defined as the proportion of comparison pairs where the model's preference matches the human's preference. $ \text{Preference Accuracy} = \frac{\text{Number of matching preference pairs}}{\text{Total number of comparison pairs}} $
    • Symbol Explanation:
      • Number of matching preference pairs: The count of instances where the model's chosen better image is the same as the human's chosen better image.
      • Total number of comparison pairs: The total number of image pairs that were presented for preference judgment.
  • Recall@K and Filter@K:

    • Conceptual Definition:
      • Recall@K measures the ability of a scoring model to identify the best image within a larger set of generated images. Specifically, it's the proportion of times the human-annotated best image is present among the top K images selected by the model's scores.
      • Filter@K measures the ability of a scoring model to avoid selecting the worst image. It's the proportion of times the human-annotated worst image is not present among the top K images selected by the model, or conversely, is found among the bottom K.
    • Mathematical Formula: Not explicitly provided in the paper. Conceptually, for a set of N images for a given prompt: $ \text{Recall@K} = \frac{\text{Number of prompts where human best is in model's top K}}{\text{Total number of prompts}} $ $ \text{Filter@K} = \frac{\text{Number of prompts where human worst is NOT in model's top K}}{\text{Total number of prompts}} $
    • Symbol Explanation:
      • human best: The image identified as the best by human annotators.
      • model's top K: The K images with the highest scores according to the model.
      • human worst: The image identified as the worst by human annotators.
      • model's top K: The K images with the highest scores according to the model (for Filter@K, we want the human worst NOT to be in this set).
  • Spearman $\rho$ (Spearman's Rank Correlation Coefficient; a worked check against Table 1 appears after this list):

    • Conceptual Definition: A non-parametric measure of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. A value of 1 indicates a perfect monotonic increasing relationship, -1 indicates a perfect monotonic decreasing relationship, and 0 indicates no monotonic relationship. In this paper, it's used to quantify the agreement between model rankings and human rankings of T2I models.
    • Mathematical Formula: $ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $
    • Symbol Explanation:
      • $\rho$: Spearman's rank correlation coefficient.
      • $d_i$: The difference between the ranks of each observation pair. For example, if a model ranks DALL-E 2 as 1 and humans rank it as 3, then $d_i = 1 - 3 = -2$.
      • $n$: The number of observations (e.g., the number of T2I models being ranked).
  • Human Win Rate:

    • Conceptual Definition: A direct metric of human preference, typically used in pairwise or multi-choice comparisons. It represents the percentage of times a particular model's output is chosen as "better" or "preferred" by human annotators compared to other models or a baseline.
    • Mathematical Formula: Not explicitly provided in the paper. $ \text{Human Win Rate} = \frac{\text{Number of times a model's output wins}}{\text{Total number of comparisons involving that model}} \times 100\% $
    • Symbol Explanation:
      • Number of times a model's output wins: The count of times human annotators preferred the output of a specific model.
      • Total number of comparisons involving that model: The total count of times the model's output was part of a comparison.
  • FID (Fréchet Inception Distance):

    • Conceptual Definition: A metric for assessing the quality and diversity of images generated by generative models. It quantifies the distance between the feature distributions of generated images and real images. Lower FID scores indicate that the generated images are more realistic and diverse, resembling the distribution of real images.
    • Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
    • Symbol Explanation:
      • $\mu_1$: The mean feature vector of the real images (extracted from an Inception-v3 model).
      • $\mu_2$: The mean feature vector of the generated images (extracted from an Inception-v3 model).
      • $\Sigma_1$: The covariance matrix of the feature vectors of the real images.
      • $\Sigma_2$: The covariance matrix of the feature vectors of the generated images.
      • $\mathrm{Tr}$: The trace of a matrix (sum of diagonal elements).
      • $\|\mu_1 - \mu_2\|^2_2$: The squared Euclidean distance between the mean feature vectors.
      • $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
  • CLIP Score:

    • Conceptual Definition: Measures the semantic similarity between a text prompt and a generated image. It uses the CLIP model's text encoder and image encoder to obtain embeddings for the text and image, respectively, and then computes the cosine similarity between these embeddings. A higher score indicates better text-image coherence.
    • Mathematical Formula: $ \text{CLIP Score}(T, I) = \text{cosine\_similarity}(\text{CLIP\_encoder}(T), \text{CLIP\_encoder}(I)) $, where $\text{cosine\_similarity}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}$.
    • Symbol Explanation:
      • $T$: The text prompt.
      • $I$: The generated image.
      • $\text{CLIP\_encoder}(T)$: The vector embedding of the text prompt $T$ produced by the CLIP text encoder.
      • $\text{CLIP\_encoder}(I)$: The vector embedding of the image $I$ produced by the CLIP image encoder.
      • $a \cdot b$: The dot product of vectors $a$ and $b$.
      • $\|a\|$: The L2 norm (magnitude) of vector $a$.
  • Aesthetic Score:

    • Conceptual Definition: A score that quantifies the perceived aesthetic quality of an image, often predicted by a neural network trained on human aesthetic judgments. This network typically takes features (e.g., CLIP image embeddings) as input and outputs a scalar score.
    • Mathematical Formula: Not explicitly provided in the paper, as it depends on the specific aesthetic predictor used (e.g., LAION's aesthetic predictor). Generally, it involves a function $A$: $ \text{Aesthetic Score}(I) = A(\text{features}(I)) $
    • Symbol Explanation:
      • $I$: The image.
      • $\text{features}(I)$: Features extracted from the image (e.g., CLIP image embeddings).
      • $A$: The aesthetic prediction function (a neural network).
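
As flagged in the Spearman $\rho$ entry above, the following minimal sketch applies the rank-correlation formula directly to the real-user-prompt ranks listed in Table 1, recovering the reported 1.00 (ImageReward) and 0.60 (CLIP) correlations; it assumes tie-free ranks.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rank lists: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks from Table 1 (order: Openjourney, SD 2.1-base, DALL-E 2, SD 1.4,
# Versatile Diffusion, CogView 2).
human_rank       = [1, 2, 3, 4, 5, 6]
imagereward_rank = [1, 2, 3, 4, 5, 6]   # ImageReward, real user prompts
clip_rank        = [2, 4, 3, 1, 5, 6]   # CLIP score, real user prompts

print(spearman_rho(human_rank, imagereward_rank))  # 1.0
print(spearman_rho(human_rank, clip_rank))         # 0.6
```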

5.3. Baselines

The paper compares its proposed methods against several established or concurrent baselines for both reward modeling and diffusion model optimization.

  • For ImageReward (Human Preference Prediction/Evaluation):

    • CLIP Score [41]: A widely used metric for measuring text-image coherence. It calculates the cosine similarity between text and image embeddings produced by the CLIP model.
    • Aesthetic Score [50]: Provided by LAION, this is an aesthetic predictor that uses a CLIP-based architecture to score the aesthetic quality of images.
    • BLIP Score [26]: Similar to CLIP score, it uses a BLIP model to calculate the cosine similarity between text and image embeddings. BLIP is known for its improved performance in vision-language tasks.
  • For ReFL (Improving Diffusion Models):

    • Stable Diffusion v1.4 (baseline) [45]: The original pre-trained Latent Diffusion Model serves as the baseline generative model against which all fine-tuning methods are compared.

    • Dataset Filtering [61]: An indirect method that uses a reward model to filter datasets. High-scoring images form a preferred dataset, and low-scoring images (paired with "Weird image" prompts) form a non-preferred dataset. These new datasets are then used for fine-tuning the LDM.

    • Reward Weighted [23]: Another indirect method that modifies the loss function coefficients based on reward model scores. It weighs the loss for generated samples according to their quality, giving more importance to preferred images. It includes a regularization term with pre-training loss.

    • RAFT (Reward Ranked Fine-Tuning) [13]: An iterative fine-tuning method that constructs a dataset of generated images with higher rewards. In each iteration, it generates images, uses a reward model to rank them, selects the best ones, and then uses these selected images to fine-tune the LDM. The paper evaluates RAFT at 1, 2, and 3 iterations.

      These baselines are representative because CLIP, Aesthetic, and BLIP are standard in the field for evaluating image quality and text-image alignment. The Stable Diffusion model is a popular and powerful generative model. Dataset Filtering, Reward Weighted, and RAFT represent the state-of-the-art indirect methods for incorporating human feedback into LDM training, providing a strong comparison for ReFL's direct tuning approach.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents extensive results demonstrating the effectiveness of ImageReward as a human preference scorer and ReFL as a method for improving diffusion models.

6.1.1. ImageReward: On Human Preference Prediction

  • Preference Accuracy (Table 3): ImageReward significantly outperforms all baselines in Preference Accuracy, reaching 65.14%. This is substantially higher than CLIP Score (54.82%), Aesthetic Score (57.35%), and BLIP Score (57.76%), indicating that ImageReward is much better at predicting human pairwise preferences. The following are the results from Table 3 of the original paper:

| Model | Preference Acc. | Recall@1 | Recall@2 | Recall@4 | Filter@1 | Filter@2 | Filter@4 |
|---|---|---|---|---|---|---|---|
| CLIP Score | 54.82 | 27.22 | 48.52 | 78.17 | 29.65 | 51.75 | 76.82 |
| Aesthetic Score | 57.35 | 30.73 | 53.91 | 75.74 | 32.08 | 54.45 | 76.55 |
| BLIP Score | 57.76 | 30.73 | 50.67 | 77.63 | 33.42 | 56.33 | 80.59 |
| ImageReward (Ours) | 65.14 | 39.62 | 63.07 | 90.84 | 49.06 | 70.89 | 88.95 |
  • Recall/Filter the Best/Worst Image (Table 3 & Figure 16): ImageReward also shows superior performance in Recall@K and Filter@K. For example, ImageReward achieves Recall@4 of 90.84% and Filter@4 of 88.95%, significantly higher than the baselines. This means it is much better at identifying the best images and filtering out the worst ones from a larger pool of generations. Figure 16 visually confirms that ImageReward has the largest proportion of correctly picked best/worst images and the minimum ratio of ranking them incorrectly.

    Figure 16: Bucket distribution of the best and the worst images in the human annotation. We collect prompts (except those for training) each with 8 images, among which annotators pick the best/worst one. Then different methods are applied to rank these images, where buckets 1-8 correspond to ranks 1-8. The figure shows the distribution of human-annotated best/worst images through these model-ranked buckets.

    The image (Figure 16 from the original paper) displays the bucket distribution of the best and worst images in human annotation. It illustrates how different scoring methods (Ours, BLIP Score, Aesthetic, CLIP Score) distribute human-annotated best/worst images across their ranked buckets (1-8). ImageReward (Ours) clearly shows a higher concentration of best images in bucket 1 and worst images in bucket 8, demonstrating superior selection ability.

  • Human Alignment Across Models (Table 1): When ranking various T2I models, ImageReward's ranking (Spearman $\rho = 1.00$) perfectly correlates with human evaluation. In contrast, CLIP scores show a moderate correlation (Spearman $\rho = 0.60$), and zero-shot FID has a very low correlation (Spearman $\rho = 0.09$). This highlights ImageReward's strong alignment with human preferences in evaluating overall model quality.

  • Distinguishability (Figure 3): The box plot in Figure 3 reveals that ImageReward scores have a much larger interquartile range (IQR) compared to CLIP scores for each generative model. This implies that ImageReward can better differentiate the quality of individual samples within a model's outputs. Furthermore, the medians of ImageReward scores align with the human ranking of models, a property that CLIP scores fail to exhibit.

  • Ablation Study (Table 2b):

    • Training Dataset Size: Increasing the training dataset size (from 1k to 8k comparisons) consistently improves ImageReward's preference accuracy (from 63.07% to 65.14%), confirming the value of more annotated data.

    • RM Backbone: BLIP as a backbone outperforms CLIP (with MLP) even when CLIP is trained on a larger dataset. This suggests BLIP's architecture or pre-training is more suitable for learning human preference in this context. The following are the results from Table 2b of the original paper:

  | Backbone | Training Set Size | Preference Acc. |
  | --- | --- | --- |
  | CLIP | 4k | 61.87 |
  | CLIP | 8k | 62.98 |
  | BLIP | 1k | 63.07 |
  | BLIP | 2k | 63.18 |
  | BLIP | 4k | 64.71 |
  | BLIP | 8k | 65.14 |
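
For clarity, the sketch below shows how the three evaluation quantities used above can be computed, under our reading of the paper: Preference Accuracy over annotated image pairs, and Recall@K / Filter@K over per-prompt sets of generations in which annotators marked the best and the worst image. Function and variable names are illustrative, not from the released code.

```python
import numpy as np

def preference_accuracy(score_pairs: np.ndarray, human_labels: np.ndarray) -> float:
    """score_pairs: (N, 2) model scores; human_labels: 0 if the first image won, 1 otherwise."""
    predicted = (score_pairs[:, 1] > score_pairs[:, 0]).astype(int)
    return float((predicted == human_labels).mean())

def recall_at_k(scores: np.ndarray, best_idx: int, k: int) -> float:
    """1.0 if the human-chosen best image is among the model's top-k, else 0.0."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(best_idx in top_k)

def filter_at_k(scores: np.ndarray, worst_idx: int, k: int) -> float:
    """1.0 if the human-chosen worst image is among the model's bottom-k, else 0.0."""
    bottom_k = np.argsort(scores)[:k]
    return float(worst_idx in bottom_k)

# Toy usage: 8 generations for one prompt; annotators marked index 2 best, index 6 worst.
scores = np.array([0.1, 0.4, 0.9, 0.3, 0.2, 0.5, -0.2, 0.0])
print(recall_at_k(scores, best_idx=2, k=4), filter_at_k(scores, worst_idx=6, k=4))
# Averaging these indicators over all evaluation prompts gives the Recall@K / Filter@K in Table 3.
```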

6.1.2. ReFL: On Improving Diffusion Models with Human Preference

  • Human Evaluation (Table 4 & Figure 6): ReFL fine-tuned models consistently achieve the highest win rates against the Stable Diffusion v1.4 baseline, both on Real User Prompts (58.79%) and Multi-task Benchmark (MT Bench) prompts (58.49%). This indicates that ReFL effectively improves LDMs to generate images that are more preferred by humans. Figure 6 visually reinforces ReFL's dominance in pairwise comparisons against all other methods (a sketch of how pairwise judgments aggregate into such a win-rate matrix appears at the end of this subsection). The following are the results from Table 4 of the original paper:

  | Methods | Real User Prompts #Win | Real User Prompts WinRate | MT Bench [40] #Win | MT Bench [40] WinRate |
  | --- | --- | --- | --- | --- |
  | SD v1.4 (baseline) [45] | 1315 | - | 718 | - |
  | Dataset Filtering [61] | 1394 | 55.17 | 735 | 51.72 |
  | Reward Weighted [23] | 1075 | 39.52 | 585 | 43.33 |
  | RAFT [13] (iter=1) | 1341 | 49.86 | 578 | 42.31 |
  | RAFT (iter=2) | 753 | 30.85 | 452 | 33.02 |
  | RAFT (iter=3) | 398 | 20.97 | 355 | 26.19 |
  | ReFL (Ours) | 1508 | 58.79 | 808 | 58.49 |

    Figure 6: Win rates between all methods.

    The image (Figure 6 from the original paper) presents a heatmap of win rates between all methods for LDM optimization. It clearly shows ReFL (Ours) achieving the highest win rates against all other compared methods, indicated by its prominent green cells.

  • Comparison to Other Fine-tuning Methods:

    • Indirect Methods Underperform: RAFT and Reward Weighted methods actually perform worse than the SD v1.4 baseline in the paper's evaluation setup.
      • RAFT: Suffers from overfitting as iterations increase. The paper suggests this is because RAFT's reliance on an "expert generator" (used as a regularizer) is limited when dealing with real-user prompts that can be challenging, leading to low-quality images in the constructed dataset.
      • Reward Weighted: The method's reward coefficients, constrained to [0, 1], mean that non-preferred images still exert some influence. When real-user prompts introduce many non-preferred images (even if "relatively" the best among a poor set), this interference hinders effectiveness.
      • Dataset Filtering: Partly effective, since it labels problematic generations as "Weird image" and also takes real images into account, but its influence on the model remains indirect.
    • ReFL's Advantage: ReFL provides direct gradient feedback through rewards, allowing for more precise guidance towards desired "better" generations, which is more effective at problem-solving.
  • Qualitative Examples (Figure 7): Visual comparisons demonstrate ReFL's ability to produce more preferred images. For instance, in prompts requiring "long pointy ears," only the ReFL-tuned model correctly generates them, while others fail or produce inaccurate representations.

    Figure 7: Qualitative comparison between ReFL and other fine-tuning methods. The ReFL fine-tuned model can produce images that are more preferred overall. For example, in the second row of prompts containing "long pointy ears," only the model fine-tuned with ReFL generates correct ears, while other images either lack ears or have inaccurate representations.

    The image (Figure 7 from the original paper) provides a qualitative comparison between ReFL and other fine-tuning methods. It illustrates that ReFL-tuned models produce images that are more preferred overall. For example, for the prompt "long pointy ears," only ReFL generates correct ears, while other methods either omit them or render them inaccurately.
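
As a complement to Table 4 and Figure 6, the following is a generic sketch of how pairwise human judgments can be aggregated into a win-rate matrix. The paper's exact tallying convention (for example, how ties are handled) is not reproduced here; splitting ties evenly is an assumption, and the counts below are toy values.

```python
import numpy as np

methods = ["SD v1.4", "Dataset Filtering", "ReFL"]
# judgments[(i, j)] = (#times method i beat method j, #ties, #times i lost to j).
judgments = {
    (0, 1): (45, 10, 45),   # toy counts, not the paper's data
    (0, 2): (30, 10, 60),
    (1, 2): (35, 15, 50),
}

win_rate = np.full((len(methods), len(methods)), np.nan)
for (i, j), (wins, ties, losses) in judgments.items():
    total = wins + ties + losses
    win_rate[i, j] = (wins + 0.5 * ties) / total    # ties split evenly (assumption)
    win_rate[j, i] = (losses + 0.5 * ties) / total
print(np.round(win_rate, 3))
```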

6.2. Data Presentation (Tables)

6.2.1. Table 1: Text-to-image model ranking by humans and automatic metrics

The following are the results from Table 1 of the original paper:

| Model | Human Eval. (Rank / #Win) | ImageReward (Rank / Score) | CLIP (Rank / Score) | ImageReward on MS-COCO 2014 (Rank / Score) | Zero-shot FID* on MS-COCO 2014 (Rank / Score) |
| --- | --- | --- | --- | --- | --- |
| Openjourney | 1 / 507 | 1 / 0.2614 | 2 / 0.2726 | 2 / -0.0455 | 5 / 20.7 |
| Stable Diffusion 2.1-base | 2 / 463 | 2 / 0.2458 | 4 / 0.2683 | 3 / 0.1553 | 4 / 18.8 |
| DALL-E 2 | 3 / 390 | 3 / 0.2114 | 3 / 0.2684 | 1 / 0.5387 | 1 / 10.9* |
| Stable Diffusion 1.4 | 4 / 362 | 4 / 0.1344 | 1 / 0.2763 | 4 / -0.0857 | 3 / 17.9 |
| Versatile Diffusion | 5 / 340 | 5 / -0.2470 | 5 / 0.2606 | 6 / -0.5485 | 2 / 18.4 |
| CogView 2 | 6 / 74 | 6 / -1.2376 | 6 / 0.2044 | 5 / -0.8510 | 6 / 26.2 |
| Spearman ρ to Human Eval. | - | 1.00 | 0.60 | 0.77 | 0.09 |

(Human Eval., ImageReward, and CLIP columns are computed on Real User Prompts; the last two columns are computed on MS-COCO 2014.)

6.2.2. Table 2: Data annotation agreement and ablation study on model backbones and dataset sizes

The following are the results from Table 2a of the original paper:

|  | researcher | annotator | annotator ensemble | CLIP Score | Aesthetic | BLIP Score | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| researcher | 71.2% ± 11.1% | 65.3% ± 8.5% | 73.4% ± 6.2% | 57.8% ± 3.6% | 55.6% ± 3.1% | 57.0% ± 3.0% | 64.5% ± 2.5% |
| annotator | 65.3% ± 8.5% | 65.3% ± 5.6% | 53.9% ± 5.8% | 54.3% ± 3.2% | 55.9% ± 3.1% | 57.4% ± 2.7% | 65.3% ± 3.7% |
| annotator ensemble | 73.4% ± 6.2% | 53.9% ± 5.8% | - | 54.4% ± 21.1% | 57.5% ± 15.9% | 62.0% ± 16.1% | 70.5% ± 18.6% |

The following are the results from Table 2b of the original paper:

| Backbone | Training Set Size | Preference Acc. |
| --- | --- | --- |
| CLIP | 4k | 61.87 |
| CLIP | 8k | 62.98 |
| BLIP | 1k | 63.07 |
| BLIP | 2k | 63.18 |
| BLIP | 4k | 64.71 |
| BLIP | 8k | 65.14 |

6.2.3. Table 3: Results of ImageReward and comparison methods on human preference prediction

The following are the results from Table 3 of the original paper:

| Model | Preference Acc. | Recall @1 | Recall @2 | Recall @4 | Filter @1 | Filter @2 | Filter @4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP Score | 54.82 | 27.22 | 48.52 | 78.17 | 29.65 | 51.75 | 76.82 |
| Aesthetic Score | 57.35 | 30.73 | 53.91 | 75.74 | 32.08 | 54.45 | 76.55 |
| BLIP Score | 57.76 | 30.73 | 50.67 | 77.63 | 33.42 | 56.33 | 80.59 |
| ImageReward (Ours) | 65.14 | 39.62 | 63.07 | 90.84 | 49.06 | 70.89 | 88.95 |

6.2.4. Table 4: Human evaluation on different LDM optimization methods

The following are the results from Table 4 of the original paper:

| Methods | Real User Prompts #Win | Real User Prompts WinRate | MT Bench [40] #Win | MT Bench [40] WinRate |
| --- | --- | --- | --- | --- |
| SD v1.4 (baseline) [45] | 1315 | - | 718 | - |
| Dataset Filtering [61] | 1394 | 55.17 | 735 | 51.72 |
| Reward Weighted [23] | 1075 | 39.52 | 585 | 43.33 |
| RAFT [13] (iter=1) | 1341 | 49.86 | 578 | 42.31 |
| RAFT (iter=2) | 753 | 30.85 | 452 | 33.02 |
| RAFT (iter=3) | 398 | 20.97 | 355 | 26.19 |
| ReFL (Ours) | 1508 | 58.79 | 808 | 58.49 |

6.2.5. Table 5: Results of ImageReward and other reward models on human preference evaluation

The following are the results from Table 5 of the original paper:

| Model | Preference Acc. | Recall @1 | Recall @2 | Recall @4 | Filter @1 | Filter @2 | Filter @4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HPS | 60.79 | 39.89 | 58.76 | 83.29 | 47.17 | 65.50 | 84.10 |
| PickScore | 62.78 | 38.27 | 63.07 | 84.10 | 46.36 | 65.77 | 84.91 |
| ImageReward (Ours) | 65.14 | 39.62 | 63.07 | 90.84 | 49.06 | 70.89 | 88.95 |

6.2.6. Table 6: Comparison between different reward models

The following are the results from Table 6 of the original paper:

| Methods | Real User Prompts #Win | Real User Prompts WinRate | Real User Prompts ImageReward | Multi-task Benchmark [40] #Win | Multi-task Benchmark [40] WinRate | Multi-task Benchmark [40] ImageReward |
| --- | --- | --- | --- | --- | --- | --- |
| SD v1.4 (baseline) | 399 | - | 0.1058 | 459 | - | 0.1859 |
| Bo64 + HPS | 572 | 67.24 | 0.6274 | 662 | 69.15 | 0.6788 |
| Bo64 + PickScore | 620 | 72.16 | 0.7033 | 773 | 72.73 | 0.7579 |
| Bo64 + ImageReward (Ours) | 676 | 73.33 | 1.3374 | 824 | 74.42 | 1.4098 |
| ReFL + HPS | 428 | 52.86 | 0.4749 | 426 | 52.86 | 0.4646 |
| ReFL + PickScore | 472 | 56.91 | 0.4618 | 454 | 55.09 | 0.4908 |
| ReFL + ImageReward (Ours) | 512 | 58.38 | 0.6072 | 492 | 58.67 | 0.4822 |

6.3. Ablation Studies / Parameter Analysis

6.3.1. ImageReward Ablation Studies

  • Training Dataset Size (Table 2b): The ablation study on training dataset size clearly shows that increasing the number of expert comparisons leads to improved Preference Accuracy. For instance, with a BLIP backbone, accuracy rises from 63.07% with 1k comparisons to 65.14% with 8k comparisons. This supports the hypothesis that collecting more high-quality human annotations can further enhance ImageReward's performance.

  • RM Backbone (Table 2b): The choice of backbone model significantly impacts performance. BLIP consistently outperforms CLIP as the ImageReward backbone. Even with a larger training set (8k), CLIP's Preference Accuracy (62.98%) is lower than BLIP's (65.14%) at the same scale. The authors attribute BLIP's superiority partly to its bootstrapping of training data and its use of an image-grounded text encoder, which processes text features conditioned on the image context, unlike CLIP's separate encoders.

  • Interpolation Analysis (Figure 17): The paper explores interpolating scores from different models (CLIP, Aesthetic, ImageReward). While combining CLIP and Aesthetic can slightly improve accuracy over either alone, their best combination is still significantly lower than ImageReward's standalone performance. When ImageReward is interpolated with CLIP or Aesthetic, it shows only marginal improvement, suggesting that ImageReward already captures most of the relevant human preference signals on its own (a small code sketch of this interpolation analysis follows this list).

    Figure 17: Accuracy of interpolation between different models. Interpolation between the CLIP score and the Aesthetic score yields higher accuracy than either alone, but still much lower than ImageReward. When ImageReward is interpolated with the CLIP score or the Aesthetic score, it reaches only slightly better performance. (In the figure, the ImageReward-CLIP mix peaks at 65.4%, the CLIP-Aesthetic mix at 58.9%, and the Aesthetic-ImageReward mix at 65.9%.)

    The image (Figure 17 from the original paper) shows the accuracy of interpolation between different models. It indicates that while CLIP and Aesthetic interpolation can improve accuracy, it remains considerably lower than ImageReward's. Furthermore, ImageReward's performance is only marginally boosted when interpolated with CLIP or Aesthetic, underscoring its strong standalone capability.
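
The interpolation analysis above can be reproduced in spirit with a few lines of code. The sketch below z-normalizes the two score sets before blending because the scorers operate on different scales; that normalization choice, along with all names, is an assumption rather than the paper's exact procedure.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-8)

def interpolation_accuracy(scores_a: np.ndarray,
                           scores_b: np.ndarray,
                           human_labels: np.ndarray,
                           alphas=np.linspace(0.0, 1.0, 11)) -> dict:
    """Blend two scorers as alpha*a + (1-alpha)*b and report pairwise accuracy per alpha.

    scores_a, scores_b: (N, 2) scores for the two images in each annotated pair.
    human_labels: (N,) with 0 if the first image was preferred, 1 otherwise.
    """
    a, b = zscore(scores_a), zscore(scores_b)
    accuracy = {}
    for alpha in alphas:
        blended = alpha * a + (1 - alpha) * b
        predicted = (blended[:, 1] > blended[:, 0]).astype(int)
        accuracy[round(float(alpha), 2)] = float((predicted == human_labels).mean())
    return accuracy
```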

6.3.2. ReFL Ablation and Comparative Analysis

  • Comparison to Indirect Methods (Table 4, Figure 6, Figure 20): ReFL consistently achieves the highest win rates compared to other LDM optimization methods (Dataset Filtering, Reward Weighted, RAFT), and even surpasses the SD v1.4 baseline.

    Figure 20: Win rate between different fine-tuning methods.

    The image (Figure 20 from the original paper) illustrates the win rate between different fine-tuning methods. It demonstrates that ReFL achieves the highest win rate compared to any other method, with bars colored to show wins (green), ties (gray), and losses (red).

    • Underperformance of RAFT and Reward Weighted: The authors critically analyze why RAFT and Reward Weighted did not improve over the baseline in their experiments, despite having shown positive results in their original papers.
      • RAFT's Limitations: RAFT is susceptible to overfitting with increasing iterations (performance drops significantly from iter=1 to iter=3). This is attributed to the expert generator (used for regularization) having limitations, especially when fine-tuning with challenging real-user data (which the original RAFT evaluation might not have focused on). If the expert generator fails to produce high-quality images for complex real-world prompts, the reward-ranked dataset for fine-tuning will be suboptimal.
      • Reward Weighted's Limitations: This method uses reward coefficients constrained within [0, 1]. While preferred images get higher weights, the influence of non-preferred images is not entirely eliminated. When diverse real-user prompts introduce many non-preferred images into the dataset (even if they are "relatively" the best among poor options), this residual interference negatively impacts the fine-tuning.
      • Dataset Filtering's Indirectness: Dataset Filtering handles non-preferred images by labeling them "Weird image," but its influence on model improvement is indirect.
    • ReFL's Directness: ReFL's superior performance is attributed to its direct gradient feedback mechanism. By directly incorporating RM scores into the LDM's loss function, it provides a clear and effective signal that guides the model toward generating human-preferred outputs, overcoming the indirectness and potential pitfalls of data-oriented methods (a simplified code sketch of this direct-feedback step follows this list).
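
To illustrate what "direct gradient feedback" means in practice, below is a highly simplified sketch of one reward-feedback update: sample without gradients up to a randomly chosen late denoising step, take a single gradient-carrying step, decode the predicted clean image, and backpropagate the negative reward. `ldm` and `reward_model` are hypothetical wrappers, and the sketch omits the loss re-weighting and pre-training regularization used in the actual algorithm.

```python
import torch

def reward_feedback_step(ldm, reward_model, prompts, optimizer, late_steps=(1, 10)):
    optimizer.zero_grad()
    # Pick a late denoising step, where reward scores are assumed to be reliable.
    t = int(torch.randint(late_steps[0], late_steps[1] + 1, (1,)))
    with torch.no_grad():
        latents = ldm.denoise_to_step(prompts, stop_at=t)        # gradient-free sampling
    latents = ldm.denoise_one_step(latents, prompts, t)          # single gradient-carrying step
    images = ldm.decode(ldm.predict_x0(latents, prompts, t))     # predicted clean image
    loss = -reward_model(prompts, images).mean()                 # maximize the reward
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```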

6.4. ImageReward Compared to Other Reward Models (Appendix D)

  • The paper also compares ImageReward against other recently emerged reward models like HPS [61] and PickScore [21].

  • Human Evaluation (Table 5): ImageReward still maintains the highest Preference Accuracy (65.14%) compared to PickScore (62.78%) and HPS (60.79%). It also shows superior Recall@4 (90.84%) and Filter@4 (88.95%).

  • Bo64 and ReFL scenarios (Table 6): When evaluating models selected by "Best of 64" (Bo64) or fine-tuned by ReFL using these reward models, ImageReward (Ours) consistently leads in #Win counts and WinRate across both Real User Prompts and Multi-task Benchmark prompts. For example, in the Bo64 scenario on Real User Prompts, ImageReward achieves a 73.33% WinRate, higher than PickScore (72.16%) and HPS (67.24%). A minimal best-of-n sketch appears at the end of this subsection.

  • Train Set Distribution (Figure 18): Visualization using t-SNE [57] shows that ImageReward's training set is slightly more evenly distributed than PickScore's, which might contribute to its generalization ability.

    Figure 18: Prompts in train sets of PickScore and ImageReward visualized by t-SNE.

    The image (Figure 18 from the original paper) uses t-SNE to visualize the prompt distribution of training sets for PickScore and ImageReward. The blue points represent ImageReward's dataset, while orange points represent PickScore's. The visualization suggests ImageReward's training set is slightly more evenly distributed.
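
For reference, the "Best of 64" (Bo64) protocol in Table 6 amounts to best-of-n selection with a reward model. A minimal sketch follows, where `pipeline` and `reward_model` are hypothetical wrappers around a text-to-image sampler and a scorer such as ImageReward.

```python
def best_of_n(pipeline, reward_model, prompt: str, n: int = 64):
    # Sample n candidate images for the prompt and keep the highest-scoring one.
    candidates = [pipeline(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: reward_model.score(prompt, image))
```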

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ImageReward, the first general-purpose text-to-image (T2I) human preference reward model, and Reward Feedback Learning (ReFL), a novel direct fine-tuning approach for optimizing diffusion models using ImageReward's feedback. ImageReward is built upon a systematic annotation pipeline that collected 137,000 expert comparisons, demonstrating superior performance over existing metrics (CLIP, Aesthetic, BLIP) in aligning with human preferences and distinguishing image quality. ReFL addresses the limitations of applying traditional RLHF to diffusion models by leveraging the insight that ImageReward scores become reliable at later denoising steps, enabling direct gradient-based optimization. Extensive evaluations confirm that ReFL significantly improves generative models, leading to outputs that are more aligned with human values and preferences, outperforming indirect fine-tuning methods. The work provides open-sourced code and datasets, advancing the field of human-aligned T2I generation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and outline future research directions:

  • Annotation Scale, Diversity, and Quality: While 137k comparisons is substantial, a larger annotation dataset is still needed for even better RM training. The current prompts from DiffusionDB, though diverse, may still harbor biases; exploring more varied prompt distributions beyond typical user prompts is necessary. The current single-person annotation with quality control strategy could be improved by multi-person annotation for better consistency.
  • RM Training Techniques: The observed overfitting during RM training suggests that more advanced techniques, such as parameter-efficient tuning [27, 34, 25, 32], could further enhance ImageReward's performance. The choice of a stronger and larger text-image backbone model beyond BLIP is also expected to yield additional gains.
  • Improving Generative Models with RM: Although ReFL is effective, it remains an approximation of original RLHF algorithms. There is a need for developing fundamentally unbiased and efficient feedback learning algorithms with solid theoretical grounding to achieve even better human alignment in diffusion models.

7.3. Personal Insights & Critique

This paper makes a significant contribution by bridging the gap between objective image quality metrics and subjective human preferences in T2I generation. The systematic development of ImageReward and ReFL is a crucial step towards more controllable and human-centric generative AI.

Personal Insights:

  • Annotation as a Core Contribution: The detailed and systematic annotation pipeline is arguably as important as the models themselves. The analysis of prompt categories, problem distributions, and the impact of function words provides invaluable insights into the challenges and nuances of human preference in T2I generation, which can guide future data collection and model development.
  • The "Later Denoising Step" Insight: The observation that ImageReward scores become reliable at later denoising steps is a clever and practical solution to the likelihood problem of diffusion models in RLHF. This insight enables a direct optimization strategy that was previously elusive.
  • Direct vs. Indirect Optimization: The clear empirical evidence showcasing ReFL's superiority over indirect data-oriented methods highlights the power of directly incorporating human feedback into the generative process's loss function. This points towards a more efficient and effective paradigm for aligning generative AI.
  • Generalizability: The concept of a general-purpose human preference reward model for T2I has broad implications, not just for improving existing models but also for guiding the development of new architectures and serving as a robust automatic evaluation metric.

Critique:

  • Subjectivity of "Human Preference": While the paper meticulously designs its annotation pipeline, "human preference" is inherently subjective and culturally dependent. The collected data reflects the preferences of a specific group of annotators, predominantly in China (given affiliations), which might not fully represent global or diverse aesthetic and ethical standards. The paper briefly touches upon this in "Broader Impact" but doesn't elaborate on how the RM might handle conflicting preferences across different user groups or cultures.

  • Potential for Reward Model Hacking/Bias Amplification: Like any reward model, ImageReward could potentially be hacked or overfit to specific patterns, leading to mode collapse or amplification of biases present in the annotation data. While the paper mentions regularization, the long-term robustness against subtle adversarial inputs or unknown biases warrants further investigation.

  • Cost and Scalability of Expert Annotation: Although the dataset is large, "expert comparisons" are costly. While the RM reduces ongoing human effort, the initial investment for RM training can be substantial. Future work might explore how to combine expert annotation with cheaper, lower-quality feedback (e.g., from crowd-sourcing) effectively.

  • Interpretation of "Function" Words: The analysis of function words is interesting, but deeper causal links are not fully established. While correlation is shown, the exact mechanism by which "8k" or "highly detailed" influences the model and its problems (e.g., initial increase then decrease in body problems) could be explored further.

    This paper lays a strong foundation for integrating human feedback directly into the T2I generation loop, moving beyond mere quantitative metrics to achieve truly human-aligned AI outputs. Its methods and insights are highly applicable to other generative tasks where subjective quality is paramount.
