FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Published:06/19/2026

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

The central topic is style‑content dual‑reference image generation—given a content reference image, a separate style reference image, and a text instruction, the model must generate a new image that preserves the structural and semantic information of the content reference while adopting the visual style of the style reference, without letting the semantic content of the style image leak into the output.

1.2. Authors

  • Jinghong Lan (equal contribution) – Fudan University, StepFun

  • Wei Cheng (equal contribution) – StepFun

  • Yunuo Chen – StepFun

  • Ziqi Ye – Fudan University

  • Peng Xing – StepFun

  • Yixiao Fang – StepFun

  • Rui Wang – StepFun

  • Yufeng Yang – StepFun

  • Xuanyang Zhang – StepFun

  • Xianfang Zeng – StepFun

  • Difan Zou – University of Hong Kong

  • Gang Yu (corresponding) – StepFun

  • Chi Zhang (corresponding) – Westlake University

    The authors come from a mix of academic institutions (Fudan, Westlake, HKU) and an industrial lab (StepFun), combining expertise in diffusion models, stylization, and large‑scale data engineering.

1.3. Journal / Conference

The paper is currently available only as an arXiv preprint (arXiv:2606.20506). It has not yet been accepted at a peer‑reviewed conference or journal. The arXiv posting date is 18 June 2026.

1.4. Publication Year

2026

1.5. Abstract

The paper tackles style‑content dual‑reference generation, where a model must preserve the structure and semantics of a content image while adopting the style of a separate style image. The key bottleneck is the lack of large‑scale triplet data with clean content‑style separation and broad long‑tail style coverage.
To address this, the authors propose FreeStyle, a scalable framework that mines community LoRAs as compositional anchors for style and content and builds a rigorous filtering pipeline to construct large‑scale triplets across multiple base models.
To combat content leakage (stray semantic elements from the style reference appearing in the output), they adopt a two‑stage curriculum with disentanglement mechanisms:

  • an attention‑level enrichment constraint that suppresses style‑reference dominance in the style‑transfer phase,

  • a frequency‑aware RoPE modulation that targets positional‑correspondence‑based leakage in the harder dual‑reference phase.

    They also release a benchmark that evaluates style‑reference and dual‑reference generation using a style‑invariant Content Alignment Score (CAS) and a calibrated VLM‑based Verification Score. Extensive experiments show that FreeStyle achieves a strong balance among style alignment, content preservation, and leakage suppression.

2. Executive Summary

2.1. Background & Motivation

Generative models can be conditioned on external reference images to control style or content, but simultaneously satisfying a content reference, a separate style reference, and a text instruction is extremely challenging. Previous methods either:

  • Cover only a narrow set of styles (often manually curated),

  • Produce triplets with imperfect content‑style separation,

  • Fail to effectively suppress semantic leakage from the style image (e.g., a style painting’s objects inadvertently appearing in the output).

    The core problem is the lack of large‑scale, diverse, and cleanly separated triplet data. The authors argue that the open‑source community harbors a vast collection of LoRA models (lightweight adapters that capture specific visual concepts, such as a particular artistic style or character) that can serve as “compositional anchors”. By systematically mining and combining these LoRAs, one can construct a rich dataset for dual‑reference generation.

2.2. Main Contributions / Findings

  1. FreeStyle Framework – A scalable data‑construction pipeline that treats community LoRAs as modular building blocks for style and content, enabling the automatic creation of hundreds of thousands of diverse, high‑quality triplets.
  2. Two‑Stage Training with Stage‑Specific Disentanglement
    • Stage 1 (style‑reference generation): An attention enrichment constraint that penalises over‑allocation of attention to the style‑reference tokens, especially in late denoising steps, thereby preventing content leakage while preserving style richness.
    • Stage 2 (dual‑reference generation): A frequency‑aware RoPE modulation that suppresses high‑frequency rotary positional components in the style branch (which encourage fine‑grained local copying) and amplifies low‑frequency components (which carry global stylistic structure).
  3. Open Benchmark – A comprehensive test suite covering both style‑reference and dual‑reference settings, using a mix of feature‑based metrics (DINOv2, CSD, CAS, etc.) and a novel VLM‑based Verification Score that separately quantifies style‑transfer reliability and content‑preservation reliability under potential cross‑reference leakage.
  4. State‑of‑the‑Art Balance – Experiments show that FreeStyle achieves the best style‑transfer performance among open‑source methods (highest VLM‑Style and Ver‑S scores) while maintaining competitive content preservation and aesthetics. In the ablations, each stage‑specific mechanism substantially reduces its targeted leakage pathway.

3. Prerequisite Knowledge & Related Work

This section introduces the essential concepts and prior works needed to understand FreeStyle fully. Every technical term is explained from the ground up.

3.1. Foundational Concepts

3.1.1. Diffusion Models (Denoising Diffusion Probabilistic Models, DDPMs)

A diffusion model learns to generate data by reversing a gradual noising process. In the forward process, a clean image $x_0$ is repeatedly corrupted with Gaussian noise over $T$ time‑steps to produce a sequence $x_1, x_2, \dots, x_T$, where $x_T$ is almost pure noise. In the reverse (denoising) process, a neural network $\epsilon_\theta$ is trained to predict the added noise given $x_t$ and the current time‑step $t$.
During generation, one starts from random noise xTx_T and iteratively removes the predicted noise, eventually obtaining a clean image.
A common improvement is classifier‑free guidance, where the network's conditional prediction (e.g., given a text prompt) is extrapolated away from its unconditional prediction, steering the denoising direction more strongly toward the condition.

3.1.2. Flow Matching

Flow matching is an alternative to diffusion that designs a continuous flow from a simple prior distribution (e.g., Gaussian noise) to the data distribution. Instead of predicting noise, the model learns a vector field that moves samples along a probability path. The training objective (often called the flow‑matching loss) minimises the difference between the predicted vector field and the true target.
FreeStyle uses Qwen‑Image‑Edit as its backbone, which is trained with a flow‑matching objective.
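As a rough illustration of the flow‑matching objective described above, the sketch below estimates the loss on a toy batch. It assumes a straight‑line path between data and noise with $t = 1$ at pure noise (a common choice, not confirmed here as the exact parameterisation of Qwen‑Image‑Edit); `predict_velocity` and `zero_model` are hypothetical stand‑ins for the network.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x0, rng):
    """Monte-Carlo estimate of a flow-matching loss on a batch x0 of shape (B, D)."""
    batch = x0.shape[0]
    noise = rng.standard_normal(x0.shape)        # sample from the Gaussian prior
    t = rng.uniform(0.0, 1.0, size=(batch, 1))   # one time value per sample
    x_t = (1.0 - t) * x0 + t * noise             # assumed straight-line path (t=1 is noise)
    target = noise - x0                          # velocity of that path: d x_t / d t
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))                 # toy "data" batch
zero_model = lambda x_t, t: np.zeros_like(x_t)   # hypothetical untrained predictor
loss = flow_matching_loss(zero_model, x0, rng)
```

Under this convention, small $t$ corresponds to nearly clean samples, i.e. the late denoising steps.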

3.1.3. LoRA (Low‑Rank Adaptation)

Originally developed for fine‑tuning large language models, LoRA injects a small number of trainable parameters into a pretrained model while freezing the original weights. For a weight matrix $W$, LoRA learns a low‑rank update $\Delta W = A \cdot B$ (where $A$ and $B$ are low‑rank factor matrices) and uses $W^{\mathrm{new}} = W + \Delta W$.
In the image‑generation community, community LoRAs are released by enthusiasts; each LoRA typically encapsulates a specific visual concept (e.g., “oil painting style”, “Studio Ghibli character”, “cyberpunk city”). Because these adapters are modular and composable, they can be combined on top of a base model to obtain mixtures of concepts.
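A minimal sketch of how two such adapters might be composed on one weight matrix, assuming simple weighted addition of the low‑rank updates. The adapter names, shapes, and mixing weights are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 5, 2

W = rng.standard_normal((d_out, d_in))           # frozen base weight

def make_lora(rank):
    # A: (d_out, rank), B: (rank, d_in), so A @ B has the same shape as W
    return rng.standard_normal((d_out, rank)), rng.standard_normal((rank, d_in))

style_lora = make_lora(rank)                     # hypothetical "style" adapter
content_lora = make_lora(rank)                   # hypothetical "content" adapter

def apply_loras(W, loras, weights):
    """Return W plus the weighted sum of the low-rank updates A @ B."""
    W_new = W.copy()
    for (A, B), w in zip(loras, weights):
        W_new += w * (A @ B)
    return W_new

W_combined = apply_loras(W, [style_lora, content_lora], [0.8, 1.0])
delta_rank = np.linalg.matrix_rank(W_combined - W)   # at most 2 * rank
```

Because each update has rank at most `rank`, the combined change to `W` has rank at most `2 * rank`, which is what keeps the adapters lightweight.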

3.1.4. Attention Mechanism in Transformers

The core operation in a transformer is scaled dot‑product attention:

$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V, $

where $Q$ (query), $K$ (key), and $V$ (value) are matrices derived from the input tokens, and $d_k$ is the dimensionality of the keys. In self‑attention, $Q$, $K$, $V$ come from the same sequence; in cross‑attention, $Q$ comes from one source (e.g., a latent image token) while $K$ and $V$ come from another (e.g., a text or image‑reference encoding).
The attention map $A = \mathrm{softmax}(QK^T/\sqrt{d_k})$ indicates how much each query token “pays attention” to each key token.
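The formula above can be written as a short NumPy sketch for a single head with no masking. The shapes are arbitrary toy values.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention map A."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))    # 2 query tokens
K = rng.standard_normal((3, 4))    # 3 key tokens
V = rng.standard_normal((3, 5))
out, A = attention(Q, K, V)
```

Each row of `A` sums to 1, so a row can be read as that query's attention budget spread over the keys.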

3.1.5. RoPE (Rotary Position Embedding)

Transformers are order‑agnostic; RoPE adds positional information by rotating the query and key vectors according to their positions. For a token at position $p$, RoPE applies a rotation matrix to pairs of dimensions, with rotation frequencies given by $\Theta = \{\theta_d = 10000^{-2d/D}\}$.
The frequency of a RoPE component determines whether it encodes fine‑grained local relationships (high frequency) or coarse global patterns (low frequency). High‑frequency components are particularly adept at establishing local correspondence between patches, which can inadvertently facilitate patch‑level copying from a style reference.
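A small 1‑D sketch of RoPE with the frequencies defined above, assuming consecutive dimension pairs are rotated together (implementations differ in how they pair dimensions, and image models typically use multi‑axis positions). It also checks the property that makes RoPE a relative encoding: the query‑key dot product depends only on the position offset.

```python
import numpy as np

def rope_frequencies(D, base=10000.0):
    d = np.arange(D // 2)
    return base ** (-2.0 * d / D)          # theta_d: d=0 is the highest frequency

def apply_rope(x, position, theta):
    """Rotate each pair (x[2d], x[2d+1]) by the angle position * theta[d]."""
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
D = 8
theta = rope_frequencies(D)
q, k = rng.standard_normal(D), rng.standard_normal(D)

# Same offset (2) at two different absolute positions
score_a = apply_rope(q, 5, theta) @ apply_rope(k, 3, theta)
score_b = apply_rope(q, 12, theta) @ apply_rope(k, 10, theta)
```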

3.2. Previous Works

3.2.1. Reference‑Based Generation

  • ControlNet [84], T2I‑Adapter [50] inject spatial conditions (edges, depth, pose) into diffusion models.
  • IP‑Adapter [82] uses a decoupled cross‑attention mechanism to incorporate image prompts.
  • Personalization methods such as DreamBooth [63], Textual Inversion [21], and StyleDrop [68] adapt models to specific concepts, often via fine‑tuning or LoRA.

3.2.2. Style Transfer and Stylization

  • Classic neural style transfer (Gatys et al. 2016) matches feature statistics between a content image and a style image.
  • AdaIN [31] and its descendants perform real‑time style transfer by aligning channel‑wise mean and variance.
  • In diffusion models, training‑free methods like StyleAlign [27] and InstantStyle [76] use shared attention or injection of style features, while DEADiff [58] and CleanStyle [18] attempt to disentangle style from content.

3.2.3. Dual‑Reference (Content + Style) Generation

  • USO [79] constructs content‑style triplets by subject‑driven generation and de‑stylisation, but the data scale and diversity remain limited.
  • CSGO [81] proposes end‑to‑end content‑style composition, which inspires FreeStyle’s use of the Content Alignment Score (CAS).
  • General‑purpose editors (GPT‑Image 1.5, FLUX.2 [klein], Qwen‑Image‑Edit) can accept multiple images, but they are not specifically optimised for content‑style disentanglement and often suffer from leakage.

3.3. Technological Evolution

The field has progressed from manual feature‑matching for style transfer, to feed‑forward real‑time methods (AdaIN), to diffusion‑based generative stylisation with shared attention, and finally to data‑driven dual‑reference models. FreeStyle sits at the intersection of scalable data mining and targeted disentanglement training, addressing the missing large‑scale triplet dataset and the specific leakage mechanisms that arise when two separate visual conditions are given.

3.4. Differentiation Analysis

  • Data: Instead of synthetic generation or manual curation, FreeStyle mines the community LoRA ecosystem, obtaining an extremely broad long‑tail style coverage and automatic content‑style separation.
  • Training: It is the first to explicitly analyse and combat two distinct leakage pathways with two dedicated mechanisms: (1) attention enrichment constraint for the style‑reference stage, (2) frequency‑aware RoPE modulation for the dual‑reference stage.
  • Evaluation: FreeStyle provides a unified benchmark that separately measures style fidelity, content fidelity, instruction following, and leakage via a VLM‑based Verification Score, exposing trade‑offs that single‑metric evaluations hide.

4. Methodology

FreeStyle consists of three tightly coupled components: a data pipeline that constructs large‑scale triplets, a two‑stage training curriculum with disentanglement mechanisms, and a benchmark for evaluation. Below we dissect each part in detail, integrating formulas and procedural steps as they appear in the original paper.

4.1. Overview of the Framework

The model accepts three inputs:

  • A content reference image $I_c$,

  • A style reference image $I_s$,

  • A text instruction (prompt) $P$.

    The goal is to generate an image $I_g$ that preserves the structure of $I_c$, adopts the visual style of $I_s$, and follows the instruction $P$, while not copying semantic elements from $I_s$.

FreeStyle is built on the Qwen‑Image‑Edit backbone, which uses MMDiT (Multi‑Modal Diffusion Transformer) blocks. The text branch is encoded by a vision‑language model (VLM), and the generator is trained with a flow‑matching objective (a variant of diffusion that learns a continuous transformation between noise and data). The architecture follows the multi‑image input paradigm, where content‑reference and style‑reference tokens are injected into the transformer layers alongside text tokens.

4.2. Data Pipeline

Two complementary datasets are constructed to support the two‑stage training.

4.2.1. Style‑Transfer Data (for Stage 1)

This dataset provides clean style‑reference triplets $(I_{\text{content}}, I_{\text{style}}, I_{\text{target}}, \text{prompt})$.

Construction steps (visualised in Figure 2):

  1. Content collection: A large corpus of raw images (landscapes, humans, objects) is crawled from the web.

  2. Trigger‑word stylisation: A validated list of ~600 stable style trigger words (e.g., “watercolor”, “oil painting”) is maintained. Each content image is stylised using a fixed prompt template: “transfer into [trigger] style”, with the state‑of‑the‑art model Nano Banana Pro.

  3. Bilateral consistency filtering:

    • Content fidelity is checked by comparing DINOv2 feature similarity between the stylised output and the source image.
    • Style similarity is measured against anchor style images using the ONEIG style encoder. Only samples passing both checks are kept.
  4. Triplet assembly: The style reference is sampled from a different content source to ensure content‑style independence. Text prompts are drawn from a GPT‑generated pool of style‑transfer instructions.

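    img-3.jpeg: Schematic of the style‑transfer data construction pipeline in four steps: source image collection, style transfer, triplet construction, and feature filtering. Diverse content is collected from websites, predefined style trigger words are applied for style transfer, valid style‑content triplets are constructed, and feature filtering ensures content consistency and style similarity.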
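The bilateral consistency filter in step 3 can be sketched as two cosine‑similarity checks on precomputed embeddings. This is an illustrative sketch: the thresholds are hypothetical placeholders (the analysis does not state the values used), and the feature vectors stand in for DINOv2 and ONEIG encoder outputs.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_bilateral_filter(content_out, content_src, style_out, style_anchor,
                            content_thresh=0.6, style_thresh=0.5):
    """Keep a sample only if BOTH checks pass (thresholds are hypothetical)."""
    content_ok = cosine(content_out, content_src) >= content_thresh   # e.g. DINOv2 features
    style_ok = cosine(style_out, style_anchor) >= style_thresh        # e.g. ONEIG style features
    return content_ok and style_ok

# Toy features: content matches perfectly, style is orthogonal to the anchor
kept = passes_bilateral_filter(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                               np.array([1.0, 0.0]), np.array([0.0, 1.0]))
kept_both = passes_bilateral_filter(np.array([1.0, 0.0]), np.array([1.0, 0.1]),
                                    np.array([0.0, 1.0]), np.array([0.1, 1.0]))
```

Requiring both checks rejects outputs that keep the content but miss the style, as well as outputs that adopt the style but distort the content.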

4.2.2. Community LoRA Mining (for Stage 2)

This is the central data contribution. Community LoRAs are treated as compositional anchors for style and content, allowing large‑scale construction of dual‑reference triplets.

Pipeline overview (see Figure 3):

  1. Collection and filtering of stable LoRA weights
    LoRA models are crawled from platforms like Civitai, TensorArt, and Liblib. They are pre‑filtered for stability: for each LoRA, a $3 \times 3$ grid of 9 preview images is generated; a LoRA is considered stably triggered if at least 7/9 images exhibit consistent quality. Human experts also label each LoRA as content‑oriented or style‑oriented.
    To curb combinatorial explosion, content LoRAs are sub‑sampled via aesthetic‑score‑weighted non‑uniform sampling (higher‑scoring LoRAs are retained with higher probability). The final distribution (Sankey diagram, Figure 4) shows the theme categories of retained LoRAs.

  2. Generation of rich reference images
    For each stable LoRA, at least 20 images are generated by combining the LoRA’s trigger word with a large prompt pool (~40k distinct combinations). Quality is verified by Qwen3‑VL using binary VLM judgments with majority voting (the same protocol used in the benchmark).

  3. Filtering of valid LoRA combinations
    A Cartesian product of style and content LoRAs is tested. One image is generated per combination and undergoes bilateral content and style verification against the single‑LoRA images. The bilateral pass rate is about 40%, yielding a massive set of stable, compatible style‑content pairs.

  4. Batch generation of target triplet data
    Using the validated LoRA combinations, target images are generated in batch and again subjected to bilateral verification against the corresponding single‑LoRA reference images. Thanks to pre‑filtering, the batch verification success rate rises to ~80%.
    The final dataset contains 273k triplets for FLUX, 33k for Qwen, and 172k for Illustrious.
    Prompts for the final triplets are generated by Qwen3‑VL.

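    img-4.jpeg: Schematic of the single‑LoRA filtering process. Various community resources are passed through web crawlers and workflow pipelines, and content‑LoRA images and style‑LoRA images are preprocessed and screened to produce a list of valid LoRAs.

    img-7.jpeg: Schematic of the community LoRA mining process. Content and style references from multiple models are collected and categorised into different LoRA types, then sampled along different paths to produce the final style and content references.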
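Two of the screening rules from step 1 of the pipeline above can be sketched as follows: the 7‑of‑9 stability check and the aesthetic‑score‑weighted sub‑sampling. The sampling rule here (probability proportional to the score) is an assumption, since the text only says that higher‑scoring LoRAs are retained with higher probability. The LoRA names and scores are made up.

```python
import numpy as np

def is_stably_triggered(preview_ok, min_pass=7):
    """preview_ok: 9 booleans, one per image in the 3x3 preview grid."""
    if len(preview_ok) != 9:
        raise ValueError("expected 9 preview judgments")
    return sum(preview_ok) >= min_pass

def subsample_by_aesthetics(lora_ids, aesthetic_scores, n_keep, rng):
    """Sample n_keep LoRAs without replacement, weighted by aesthetic score (assumed rule)."""
    scores = np.asarray(aesthetic_scores, dtype=float)
    probs = scores / scores.sum()
    return list(rng.choice(lora_ids, size=n_keep, replace=False, p=probs))

rng = np.random.default_rng(0)
lora_ids = ["lora_a", "lora_b", "lora_c", "lora_d"]          # hypothetical names
kept = subsample_by_aesthetics(lora_ids, [6.1, 5.2, 7.0, 4.8], n_keep=2, rng=rng)
```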

4.3. Two‑Stage Disentanglement Training

The model is first trained on the style‑transfer data to acquire robust style‑reference generation, and then on a mixture of LoRA‑mined dual‑reference triplets and style‑transfer data for the harder dual‑reference task. Each stage exhibits a distinct content‑leakage pathway, which is countered by a dedicated mechanism.

4.3.1. Stage 1: Attention‑Level Enrichment Constraint

Observation: In style‑reference generation, content leakage correlates with disproportionate attention allocation to style‑reference tokens during late denoising steps. As shown in Figure 5 (left), leakage cases exhibit broader and more persistent high‑response bands over the style‑reference region, especially in the first transformer block (where the global layout is determined).

To quantify this, the authors introduce the group‑wise enrichment score.

Let $A^{(t,\ell)}$ be the attention map at denoising step $t$ and transformer block $\ell$. Denote by $\mathcal{G}_g$ the set of key tokens belonging to semantic group $g$ (text, content‑reference cref, style‑reference sref), and by $\mathcal{K}$ the set of all key tokens. Define:

$ p^{(t,\ell)}_g = \frac{\sum_{q} \sum_{k \in \mathcal{G}_g} A^{(t,\ell)}_{qk}}{\sum_{q} \sum_{k \in \mathcal{K}} A^{(t,\ell)}_{qk}}, \qquad E^{(t,\ell)}_g = \frac{p^{(t,\ell)}_g}{|\mathcal{G}_g| / |\mathcal{K}|}. $

  • $p_g$ is the fraction of total attention that group $g$ receives.

  • $E_g$ is the enrichment score: $E_g > 1$ means the group draws more attention than its size warrants (enrichment), $E_g < 1$ indicates suppression.

    The enrichment metric factors out group size, allowing comparison across groups of different cardinality.
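As a sanity check on the definition, the following sketch computes the group‑wise enrichment score for a toy attention map. The group layout (2 text keys, 4 content‑reference keys, 4 style‑reference keys) and the attention values are invented for illustration.

```python
import numpy as np

def enrichment_scores(A, groups):
    """A: (num_queries, num_keys) attention map; groups: name -> list of key indices."""
    total = A.sum()
    num_keys = A.shape[1]
    scores = {}
    for name, idx in groups.items():
        p_g = A[:, idx].sum() / total                 # share of attention received
        scores[name] = p_g / (len(idx) / num_keys)    # divide by the group's size share
    return scores

groups = {"text": [0, 1], "cref": [2, 3, 4, 5], "sref": [6, 7, 8, 9]}

# Every query puts 10% on text, 30% on cref, and 60% on sref
row = np.array([0.05] * 2 + [0.075] * 4 + [0.15] * 4)
A = np.tile(row, (3, 1))
E = enrichment_scores(A, groups)
```

Here the style reference holds 40% of the keys but receives 60% of the attention, so its enrichment is 0.6 / 0.4 = 1.5; under uniform attention every group would score exactly 1.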

Regularisation losses are applied only at the first transformer block ($\ell = 0$), because early blocks determine global layout. A two‑sided squared hinge is used:

$ \psi_{[a,b]}(x) = \max(0,\, a - x)^2 + \max(0,\, x - b)^2. $

Enrichment loss targets the style‑reference enrichment for each noisy‑latent query $q$:

$ E_{\mathrm{sref}}(t,q) = \frac{\sum_{k \in \mathcal{G}_{\mathrm{sref}}} A^{(t,0)}_{qk}}{|\mathcal{G}_{\mathrm{sref}}| / |\mathcal{K}|}, $

$ \mathcal{L}_{\mathrm{enrich}} = \mathbb{E}_{t,\,q}\Big[ (1-t) \, \psi_{[\alpha_{\mathrm{lo}},\,\alpha_{\mathrm{hi}}]}\big( E_{\mathrm{sref}}(t,q) \big) \Big]. $

  • $\alpha_{\mathrm{lo}} = 0$, $\alpha_{\mathrm{hi}} = 0.6$: the constraint only penalises when the style‑reference enrichment exceeds 0.6 times the size‑matched baseline. There is no penalty for under‑attention (the lower bound is 0).

  • The $(1-t)$ weight concentrates the penalty on late denoising steps (small $t$), where reference influence actually materialises.
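The enrichment loss can be sketched for a single time‑step as below, using the two‑sided hinge and the stated bounds. The toy attention map is uniform over 10 keys with a hypothetical 4‑key style‑reference group; in training the expectation would also run over sampled $t$.

```python
import numpy as np

def two_sided_hinge(x, lo, hi):
    return np.maximum(0.0, lo - x) ** 2 + np.maximum(0.0, x - hi) ** 2

def enrichment_loss(A, sref_idx, t, alpha_lo=0.0, alpha_hi=0.6):
    """A: (Q, K) first-block attention with rows summing to 1; t: scalar in [0, 1]."""
    num_keys = A.shape[1]
    E_sref = A[:, sref_idx].sum(axis=1) / (len(sref_idx) / num_keys)   # one value per query
    return float(np.mean((1.0 - t) * two_sided_hinge(E_sref, alpha_lo, alpha_hi)))

sref_idx = [6, 7, 8, 9]
A_uniform = np.full((3, 10), 0.1)            # E_sref = 1.0 for every query
loss_late = enrichment_loss(A_uniform, sref_idx, t=0.25)
loss_early = enrichment_loss(A_uniform, sref_idx, t=0.9)
```

With uniform attention the enrichment is 1.0, which exceeds the upper bound 0.6, so the hinge gives (1.0 - 0.6)^2 = 0.16; at t = 0.25 the weighted loss is 0.75 × 0.16 = 0.12, and it shrinks toward zero for noisier (larger t) steps.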

    Entropy loss prevents the style‑reference attention from collapsing onto a few tokens. For each query $q$, define the renormalised attention distribution over $\mathcal{G}_{\mathrm{sref}}$:

$ \tilde{A}_{qk} = \frac{A^{(t,0)}_{qk}}{\sum_{k' \in \mathcal{G}_{\mathrm{sref}}} A^{(t,0)}_{qk'}}, $

and its normalised entropy:

$ \hat{\mathcal{H}}_q = -\frac{1}{\log|\mathcal{G}_{\mathrm{sref}}|} \sum_{k \in \mathcal{G}_{\mathrm{sref}}} \tilde{A}_{qk} \log \tilde{A}_{qk} \;\in [0,1]. $

Then:

$ \mathcal{L}_{\mathrm{ent}} = \mathbb{E}_{t,\,q}\Big[ \psi_{[\beta_{\mathrm{lo}},\,\beta_{\mathrm{hi}}]}\big( \hat{\mathcal{H}}_q \big) \Big], $

with $\beta_{\mathrm{lo}} = 0.06$, $\beta_{\mathrm{hi}} = 0.14$. Entropy below the band means attention collapse; above the band means attention is too uniform and loses focus.
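A corresponding sketch of the entropy term, evaluated at the two extremes of the normalised entropy. The small `eps` inside the logarithm is an implementation assumption to avoid `log(0)`, and the 4‑key style group is illustrative.

```python
import numpy as np

def two_sided_hinge(x, lo, hi):
    return np.maximum(0.0, lo - x) ** 2 + np.maximum(0.0, x - hi) ** 2

def normalised_entropy(A, sref_idx, eps=1e-12):
    """Per-query entropy of attention renormalised over the style-reference keys."""
    A_sref = A[:, sref_idx]
    A_tilde = A_sref / A_sref.sum(axis=1, keepdims=True)
    H = -(A_tilde * np.log(A_tilde + eps)).sum(axis=1)
    return H / np.log(len(sref_idx))                 # scaled into [0, 1]

def entropy_loss(A, sref_idx, beta_lo=0.06, beta_hi=0.14):
    return float(np.mean(two_sided_hinge(normalised_entropy(A, sref_idx), beta_lo, beta_hi)))

sref_idx = [0, 1, 2, 3]
A_uniform = np.full((2, 4), 0.25)                    # maximally spread: entropy 1
A_onehot = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 1))   # collapsed: entropy 0
loss_uniform = entropy_loss(A_uniform, sref_idx)
loss_onehot = entropy_loss(A_onehot, sref_idx)
```

Both extremes are penalised: fully uniform attention costs (1 - 0.14)^2 = 0.7396, while a collapse onto one token costs 0.06^2 = 0.0036, so the band pushes attention toward a focused but not degenerate distribution.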

The full Stage 1 objective augments the flow‑matching diffusion loss $\mathcal{L}_{\mathrm{diff}}$:

$ \mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{e}} \mathcal{L}_{\mathrm{enrich}} + \lambda_{\mathrm{h}} \mathcal{L}_{\mathrm{ent}}, \qquad \lambda_{\mathrm{e}} = \lambda_{\mathrm{h}} = 0.1. $

This two‑pronged regularisation gently shapes attention without overpowering the main generation objective.

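img-9.jpeg: Schematic of the style‑content entanglement analysis. The upper part shows style and content images in a case with content leakage; the lower part shows a case without leakage. The step plots on the right show how content and style evolve, with the leakage‑free example being cleaner.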

4.3.2. Stage 2: Frequency‑Aware RoPE Modulation

In the dual‑reference setting, the content reference now absorbs a large share of the model’s attention budget, and the pronounced late‑stage style‑reference enrichment spike seen in Stage 1 disappears. However, content leakage persists through a different mechanism: local positional correspondence encoded by the high‑frequency components of RoPE.

RoPE recap: In the MMDiT blocks, positional information is injected into the query and key vectors by multiplying them with a rotation matrix that depends on the position and a set of frequencies $\theta_d = 10000^{-2d/D}$, where $d$ indexes the frequency dimension (from 0 to $D/2-1$). High‑frequency components capture local, fine‑grained relationships; low‑frequency components encode global, coarse information.

The idea is to modulate the RoPE applied to the style‑reference keys such that:

  • High‑frequency components are suppressed ($s_d < 1$), reducing the model’s ability to establish patch‑level correspondence and thus preventing local copying from the style image.

  • Low‑frequency components are amplified ($s_d > 1$), preserving global stylistic structure.

    The scaling factor for frequency index $d$ is:

$ s_d = s_{\mathrm{hf}} + (s_{\mathrm{lf}} - s_{\mathrm{hf}}) \left(\frac{d}{D/2-1}\right)^{\beta}, $

where:

  • $s_{\mathrm{hf}} = 0.9$ (suppression for high frequencies),

  • $s_{\mathrm{lf}} = 1.2$ (amplification for low frequencies),

  • $\beta = 2$ controls the smoothness of interpolation.

    This modulation is fixed throughout the denoising process and applied only to the style‑reference branch; the content‑reference branch is left unchanged. It is introduced at the start of Stage 2.
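The scaling schedule can be tabulated with a few lines of NumPy. The sketch only computes $s_d$; exactly how the factor enters the rotary computation for the style‑reference keys is not spelled out in this analysis, so that step is deliberately left out. The head dimension `D = 8` is a toy value.

```python
import numpy as np

def rope_scale(D, s_hf=0.9, s_lf=1.2, beta=2.0):
    """Per-frequency scale s_d for d = 0 .. D/2 - 1 (requires D >= 4)."""
    d = np.arange(D // 2)
    return s_hf + (s_lf - s_hf) * (d / (D // 2 - 1)) ** beta

D = 8
s = rope_scale(D)                                   # index 0 = highest frequency
theta = 10000.0 ** (-2.0 * np.arange(D // 2) / D)   # matching RoPE frequencies
suppressed = s < 1.0                                # which components are damped
```

With these constants, $s_d = 1$ when $(d/(D/2-1))^2 = 1/3$, i.e. at about 58% of the index range, so the higher‑frequency majority of components is suppressed and only the lowest‑frequency ones are amplified.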

The resulting loss is again the flow‑matching diffusion loss (no additional regularisation terms are needed, because the modulation is embedded in the forward pass).

4.3.3. Training Configuration

  • Stage 1: trained on style‑transfer data for 120k steps, batch size 32, learning rate 6e‑6.
  • Stage 2: mixes community‑LoRA dual‑reference triplets and style‑transfer data in a 3:1 ratio, trained for an additional 24k steps.
  • Both stages use 8 × H100 GPUs.

5. Experimental Setup

5.1. Datasets

5.1.1. Style‑Transfer Data

  • Source: Web‑crawled content images covering landscapes, humans, everyday objects.
  • Style triggers: ~645 validated style trigger words.
  • Stylisation: Nano Banana Pro with prompt “transfer into [trigger] style”.
  • Filtering: Bilateral consistency filtering (DINOv2 for content, ONEIG for style). Triplets are constructed so that style and content references come from different sources.
  • This dataset provides supervision for Stage 1.

5.1.2. Community‑LoRA‑Mined Triplet Data

  • Sources: Civitai, TensorArt, Liblib.
  • Base models: Illustrious, FLUX‑dev, Qwen‑Image.
  • Filtering: LoRA stability screening, aesthetic‑score‑weighted sub‑sampling, reference‑image verification with Qwen3‑VL, and combination‑level bilateral verification.
  • Scale:
    • FLUX: 273k triplets
    • Qwen: 33k triplets
    • Illustrious: 172k triplets
  • Prompts: Generated by Qwen3‑VL to describe the target image while preserving content and style cues.

5.1.3. Evaluation Benchmark

The authors curated a fixed test set of:

  • 200 content reference images

  • 200 style reference images

    These images span a wide range of object categories, scenes, and artistic domains (oil painting, watercolor, 3D rendering, pixel art, abstract art, etc.). Pairings are sampled without replacement to prevent bias. Prompt variants for each style category are drawn from a GPT‑generated pool. The benchmark covers two evaluation modes:

  • Style‑Reference (SRef): only a style reference is provided alongside a source image.

  • Content‑Style Dual‑Reference (CRef + SRef): both content and style references are given, plus a text instruction.

5.2. Evaluation Metrics

Each evaluation axis includes both feature‑based and VLM‑based metrics. All VLM‑based scores use Qwen3‑VL as the judge.

5.2.1. Style‑Fidelity Axis

  • ONEIG [7]: A perceptual estimate of stylistic consistency.
  • CSD (Contrastive Style Descriptor) [69]: Measures style similarity using a contrastive network trained to separate style from content.
  • VLM‑S (VLM‑Style): The VLM rates how well the generated image adopts the visual style of the reference (0–10 scale).

5.2.2. Content‑Fidelity Axis

  • DINOv2 [53]: Cosine similarity between generated image and content reference in DINOv2 embedding space; sensitive to object layout and semantic structure.

  • CAS (Content Alignment Score) [81]: Measures structural agreement after removing style‑carrying channel statistics. It is defined as follows.

    Let $\phi(I) \in \mathbb{R}^{L \times C}$ be the DINOv2 patch‑token features of image $I$ (resized to $512 \times 512$), with $L$ tokens and $C$ channels. For each channel $c$, compute the mean and standard deviation over the $L$ tokens:

    $ \mu_c(I) = \frac{1}{L} \sum_{l=1}^{L} \phi(I)_{l,c},\quad \sigma_c(I) = \sqrt{\frac{1}{L} \sum_{l=1}^{L} \big( \phi(I)_{l,c} - \mu_c(I) \big)^2 + \epsilon}. $

    Instance‑normalised features are then:

    $ \hat{\phi}(I)_{l,c} = \frac{\phi(I)_{l,c} - \mu_c(I)}{\sigma_c(I)}. $

    Given a generated image $I_g$ and its content reference $I_c$, the Content Alignment Score is the mean squared error between the style‑normalised features:

    $ \mathrm{CAS} = \frac{1}{LC} \sum_{l=1}^{L} \sum_{c=1}^{C} \left( \hat{\phi}(I_g)_{l,c} - \hat{\phi}(I_c)_{l,c} \right)^2. $

    Lower CAS indicates better content preservation, because the metric isolates structural agreement after factoring out style differences.

  • VLM‑C (VLM‑Content): The VLM rates how well the generated image preserves the structure and semantics of the content reference (0–10 scale).
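The CAS definition given above translates directly into NumPy once patch features are available. In this sketch the features are random arrays standing in for DINOv2 patch tokens, and `eps = 1e-5` is an assumed value for $\epsilon$.

```python
import numpy as np

def instance_normalise(phi, eps=1e-5):
    """phi: (L, C) patch features; normalise each channel over the L tokens."""
    mu = phi.mean(axis=0, keepdims=True)
    sigma = np.sqrt(phi.var(axis=0, keepdims=True) + eps)   # var uses the 1/L convention
    return (phi - mu) / sigma

def content_alignment_score(phi_g, phi_c, eps=1e-5):
    diff = instance_normalise(phi_g, eps) - instance_normalise(phi_c, eps)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
phi_c = rng.standard_normal((16, 4))       # stand-in for content-reference features
phi_restyled = 3.0 * phi_c + 2.0           # same structure, shifted channel statistics
phi_other = rng.standard_normal((16, 4))   # unrelated structure

cas_restyled = content_alignment_score(phi_restyled, phi_c)
cas_other = content_alignment_score(phi_other, phi_c)
```

Because each channel is normalised separately, a per‑channel shift and positive rescaling (a crude proxy for a style change) leaves CAS near zero, whereas structurally unrelated features give a clearly larger value.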

5.2.3. Instruction‑Following Axis (CRef + SRef only)

  • CLIP‑T [60]: Image‑text cosine similarity in CLIP space.
  • VLM‑F (VLM‑Follow): The VLM judges whether the generated image obeys the text instruction, yielding a scalar score.

5.2.4. VLM Verification

Two verification scores measure the proportion of generated images that are judged as successful:

  • Ver‑S (Style Verification Score): successful style transfer.

  • Ver‑C (Content Verification Score): successful content preservation.

    The verification protocol:

  1. For each generated–reference pair, the VLM outputs log‑probabilities $\ell_0$ and $\ell_1$ for labels “0” (failure) and “1” (success).

  2. Normalised probabilities:

    $ p_c = \frac{\exp(\ell_c)}{\exp(\ell_0) + \exp(\ell_1)},\qquad c \in \{0,1\}, \qquad \hat{y} = \operatorname*{arg\,max}_{c} p_c. $

  3. The VLM is queried three times. A pair is counted as a success if at least two out of three judgments give $\hat{y} = 1$.

  4. Ver‑S and Ver‑C are the fractions of pairs that are successful in the style and content dimensions respectively. Higher values indicate more reliable generation.
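The four protocol steps can be sketched in plain Python. The log‑probabilities below are invented for illustration; in practice they would come from the VLM's output for the tokens “0” and “1”. Ties are resolved toward failure here, which is an assumption.

```python
import math

def success_probability(logp0, logp1):
    """Two-way softmax over the log-probabilities of labels '0' and '1'."""
    m = max(logp0, logp1)                      # subtract the max for stability
    e0, e1 = math.exp(logp0 - m), math.exp(logp1 - m)
    return e1 / (e0 + e1)

def judge_once(logp0, logp1):
    return 1 if success_probability(logp0, logp1) > 0.5 else 0

def majority_vote(three_queries):
    """three_queries: list of three (logp0, logp1) tuples for one pair."""
    votes = [judge_once(l0, l1) for l0, l1 in three_queries]
    return sum(votes) >= 2

def verification_score(all_pairs):
    return sum(majority_vote(p) for p in all_pairs) / len(all_pairs)

pairs = [
    [(-2.0, -0.2), (-0.3, -1.5), (-1.9, -0.2)],   # votes 1, 0, 1 -> success
    [(-0.2, -2.0), (-0.3, -1.5), (-1.9, -0.2)],   # votes 0, 0, 1 -> failure
]
score = verification_score(pairs)
```

Running the same function over style judgments gives Ver‑S, and over content judgments gives Ver‑C.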

5.2.5. Aesthetics

  • LAION‑Aesthetic [40]: Predictor trained on LAION aesthetics ratings.
  • V2.5‑Aesthetic [75]: SigLIP‑based aesthetic score.

5.3. Baselines

The paper compares against a mix of closed‑source commercial models (Gemini, Seedream 4.5) and open‑source methods:

  • USO [79]: Disentangled style‑and‑subject generation with reward learning.

  • CSGO [81]: Content‑style composition.

  • EasyRef [89]: Omni‑generalised group‑image reference via multimodal LLMs.

  • FLUX.2 [klein] [5]: A general‑purpose multi‑image editor from Black Forest Labs.

  • TeleStyle [85]: Content‑preserving style transfer.

  • Qwen‑Image‑Edit [59]: The base multi‑image editing model used as FreeStyle’s backbone.

  • OmniStyle [77]: A filtered style‑transfer dataset.

    These baselines represent the state of the art in reference‑based generation and stylisation.

6. Results & Analysis

6.1. Core Results Analysis

The authors evaluate FreeStyle on their open benchmark in both SRef and CRef + SRef settings. The full tables are presented below.

6.1.1. Style‑Reference (SRef) Benchmark

The following are the results from Table 1 of the original paper. The top section lists closed‑source models for reference only; the main comparison is among open‑source methods. In the original table, three shades of purple mark the top three values per metric (not reproduced here), and FreeStyle appears as the last row (“Ours”):

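Column groups: Style (ONEIG, CSD, VLM-S), Content (DINO, CAS, VLM-C), VLM Verification (Ver-S, Ver-C), Aesthetics (LAION, V2.5).

| Method | ONEIG↑ | CSD↑ | VLM-S↑ | DINO↑ | CAS↓ | VLM-C↑ | Ver-S↑ | Ver-C↑ | LAION↑ | V2.5↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source commercial models | | | | | | | | | | |
| Gemini | 0.416 | 0.619 | 7.848 | 0.840 | 1.452 | 8.798 | 0.554 | 0.852 | 6.157 | 5.749 |
| Seedream | 0.441 | 0.636 | 7.023 | 0.836 | 1.077 | 9.154 | 0.334 | 0.905 | 6.495 | 5.923 |
| Open-source methods | | | | | | | | | | |
| USO | 0.542 | 0.531 | 3.744 | 0.808 | 1.282 | 9.237 | 0.382 | 0.928 | 5.971 | 5.575 |
| CSGO | 0.520 | 0.665 | 6.193 | 0.652 | 1.656 | 1.516 | 0.038 | 0.715 | 5.430 | 4.676 |
| EasyRef | 0.270 | 0.578 | 2.073 | 0.612 | 1.904 | 0.428 | 0.000 | 0.129 | 5.403 | 4.415 |
| FLUX.2 [klein] | 0.490 | 0.655 | 6.068 | 0.811 | 1.084 | 8.335 | 0.285 | 0.532 | 6.682 | 6.018 |
| TeleStyle | 0.459 | 0.613 | 5.555 | 0.859 | 0.924 | 9.566 | 0.449 | 0.961 | 6.173 | 5.470 |
| Qwen-Image-Edit | 0.290 | 0.589 | 3.487 | 0.865 | 0.918 | 9.155 | 0.261 | 0.910 | 6.131 | 5.657 |
| OmniStyle | 0.389 | 0.603 | 6.247 | 0.757 | 1.759 | 6.247 | 0.329 | 0.483 | 4.901 | 4.542 |
| Ours | 0.468 | 0.639 | 7.142 | 0.809 | 1.175 | 8.919 | 0.482 | 0.928 | 6.302 | 5.709 |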

Key observations from the SRef benchmark:

  • FreeStyle achieves the highest VLM‑Style (7.142) and highest Ver‑S (0.482) among open‑source methods, indicating the strongest style‑transfer capability verified by the VLM.
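  • Content preservation is competitive: Ver‑C (0.928) ties with USO for second place behind TeleStyle (0.961), while CAS (1.175) is mid‑range (the best open‑source value is Qwen‑Image‑Edit's 0.918). FreeStyle is not the best on any single content metric, but it combines strong style scores with solid content scores.
  • Some baselines (e.g., Qwen‑Image‑Edit and USO) have excellent content metrics but low VLM‑S (3.487 and 3.744); TeleStyle has the strongest content scores but a lower VLM‑S (5.555) than FreeStyle. Others (e.g., CSGO) have the highest CSD (0.665) but very low VLM‑C (1.516) and a weaker Ver‑C (0.715), revealing a strong style‑content trade‑off. FreeStyle avoids this extreme polarisation.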

6.1.2. Dual‑Reference (CRef + SRef) Benchmark

The following are the results from Table 2 of the original paper:

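Column groups: Style (ONEIG, CSD, VLM-S), Content (DINO, CAS, VLM-C), Instruction (CLIP-T, VLM-F), VLM Verification (Ver-S, Ver-C), Aesthetics (LAION, V2.5).

| Method | ONEIG↑ | CSD↑ | VLM-S↑ | DINO↑ | CAS↓ | VLM-C↑ | CLIP-T↑ | VLM-F↑ | Ver-S↑ | Ver-C↑ | LAION↑ | V2.5↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source commercial models | | | | | | | | | | | | |
| Gemini | 0.369 | 0.590 | 5.113 | 0.749 | 1.723 | 7.447 | 0.314 | 9.492 | 0.479 | 0.411 | 6.740 | 6.135 |
| Seedream | 0.372 | 0.601 | 5.621 | 0.759 | 1.607 | 7.723 | 0.319 | 9.565 | 0.522 | 0.477 | 6.969 | 5.593 |
| Open-source methods | | | | | | | | | | | | |
| USO | 0.450 | 0.246 | 3.753 | 0.809 | 1.283 | 9.139 | 0.245 | 2.825 | 0.390 | 0.916 | 5.943 | 5.566 |
| Qwen-Image-Edit | 0.232 | 0.489 | 2.386 | 0.723 | 1.731 | 6.141 | 0.320 | 9.220 | 0.118 | 0.320 | 6.587 | 5.699 |
| FLUX.2 [klein] | 0.268 | 0.602 | 2.388 | 0.758 | 1.702 | 7.034 | 0.327 | 9.432 | 0.124 | 0.361 | 6.833 | 5.734 |
| TeleStyle | 0.362 | 0.585 | 4.251 | 0.760 | 1.590 | 7.338 | 0.302 | 7.994 | 0.315 | 0.550 | 6.800 | 5.774 |
| Ours | 0.387 | 0.575 | 5.467 | 0.739 | 1.639 | 7.038 | 0.308 | 8.909 | 0.409 | 0.462 | 6.747 | 5.643 |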

Key observations from the dual‑reference benchmark:

  • Again, FreeStyle achieves the best VLM‑Style (5.467) and best Ver‑S (0.409) among open‑source methods, demonstrating that even under the more difficult dual‑conditioning, it transfers style reliably.

  • Content metrics are moderate (DINO 0.739, VLM‑C 7.038, Ver‑C 0.462); USO dominates content (DINO 0.809, Ver‑C 0.916) but has the lowest CSD (0.246) and by far the weakest instruction following (VLM‑F 2.825, CLIP‑T 0.245).

  • The Ver‑S/Ver‑C pair reveals that many baselines either style‑transfer well or content‑preserve well, but not both; FreeStyle achieves the most harmonious trade‑off.

Qualitative evidence (Figures 6 and 7 in the paper) supports these quantitative findings: FreeStyle’s outputs faithfully capture brushwork, texture, and colour while preserving the content layout, and show substantially less semantic leakage (e.g., no unintended objects from the style image) than the baselines.

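img-12.jpeg A comparison figure showing stylized images generated by different methods, including our model and several baseline models. Each column shows a different stylized result, so content preservation and style preservation can be compared. The figure covers various content and style combinations and highlights our model's advantages in style alignment and content preservation.

img-13.jpeg A schematic comparing dual‑reference images generated by different methods. The content reference image is on the left, the style reference image is in the middle, and the image generated by FreeStyle is on the right. It shows style attributes being transferred while the content layout is preserved and semantic leakage is reduced.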
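The per-metric claims about the open-source methods can be checked directly against the numbers in Table 2. The following is a minimal sketch that only re-reads the table; the helper name `best_per_metric` and the data layout are illustrative choices, not part of the paper.

```python
# Open-source rows copied from Table 2; column order matches the table header.
COLUMNS = ["ONEIG", "CSD", "VLM-S", "DINO", "CAS", "VLM-C",
           "CLIP-T", "VLM-F", "Ver-S", "Ver-C", "LAION", "V2.5"]
LOWER_IS_BETTER = {"CAS"}  # marked with a down arrow in the table

ROWS = {
    "USO":             [0.450, 0.246, 3.753, 0.809, 1.283, 9.139, 0.245, 2.825, 0.390, 0.916, 5.943, 5.566],
    "Qwen-Image-Edit": [0.232, 0.489, 2.386, 0.723, 1.731, 6.141, 0.320, 9.220, 0.118, 0.320, 6.587, 5.699],
    "FLUX.2 [klein]":  [0.268, 0.602, 2.388, 0.758, 1.702, 7.034, 0.327, 9.432, 0.124, 0.361, 6.833, 5.734],
    "TeleStyle":       [0.362, 0.585, 4.251, 0.760, 1.590, 7.338, 0.302, 7.994, 0.315, 0.550, 6.800, 5.774],
    "Ours":            [0.387, 0.575, 5.467, 0.739, 1.639, 7.038, 0.308, 8.909, 0.409, 0.462, 6.747, 5.643],
}


def best_per_metric(rows):
    """Return {metric: best method}, honouring the arrow direction of each column."""
    best = {}
    for i, col in enumerate(COLUMNS):
        pick = min if col in LOWER_IS_BETTER else max
        best[col] = pick(rows, key=lambda method: rows[method][i])
    return best


best = best_per_metric(ROWS)
for col in COLUMNS:
    print(f"{col:7s} best open-source method: {best[col]}")
```

Under this reading, FreeStyle leads the open-source group on VLM-S and Ver-S, while USO leads on DINO, CAS, VLM-C, and Ver-C, which matches the observations listed above.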

6.2. Ablation Studies

6.2.1. Attention‑Enrichment Constraint (Stage 1)

Table 3 reports VLM‑based leakage scores (0–10, lower is better).

| Component | w/o | w/ |
|---|---|---|
| Enrichment loss (SRef) | 2.674 | 0.522 |

Removing the enrichment loss sharply increases leakage (0.522 → 2.674). Visually (see Figure 8 in the paper), without the constraint, objects from the style reference are hallucinated in the output; with it, the style is transferred cleanly. This confirms that the attention‑level regularisation effectively suppresses cross‑reference semantic contamination.

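img-14.jpeg A chart showing generation results for different content and style inputs, with and without the attention loss. The left side shows outputs generated without the constraint; the right side shows the clean separation obtained after introducing the attention loss, illustrating the decoupling of style and content.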

6.2.2. Frequency‑Aware RoPE Modulation (Stage 2)

| Component | w/o | w/ |
|---|---|---|
| RoPE modulation (CRef+SRef) | 1.047 | 0.453 |

In the dual‑reference setting, disabling the RoPE modulation raises the leakage score from 0.453 to 1.047. Figure 9 shows that without modulation, content‑specific features (silhouettes, texture patterns) from the style image bleed into the output. Enabling the modulation suppresses this leakage while preserving strong style transfer.

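img-15.jpeg A schematic showing examples generated from different style and content references. It contains several images with different content and style transformations, highlighting how the style reference influences generation and reflecting the balance between content preservation and style fidelity.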
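To put the two ablations on a common footing, the leakage scores can be expressed as relative reductions. This is a small arithmetic sketch using only the four numbers reported above; the relative-reduction framing and the function name are additions for illustration, not a metric used in the paper.

```python
def relative_reduction(without: float, with_: float) -> float:
    """Fraction of the leakage score removed by enabling the component."""
    return (without - with_) / without


# (score without component, score with component), taken from the ablation tables.
ablations = {
    "Enrichment loss (SRef)": (2.674, 0.522),
    "RoPE modulation (CRef+SRef)": (1.047, 0.453),
}

reductions = {
    name: relative_reduction(without, with_)
    for name, (without, with_) in ablations.items()
}

for name, (without, with_) in ablations.items():
    print(f"{name}: {without:.3f} -> {with_:.3f} "
          f"({reductions[name]:.1%} lower leakage)")
```

By this measure the enrichment loss removes roughly 80% of the measured leakage in the SRef setting and the RoPE modulation roughly 57% in the dual-reference setting. The two figures come from different settings, so they are not directly comparable as a ranking of the components.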

6.2.3. Data Pipeline (OmniStyle vs. FreeStyle Data)

Table 4 compares the SRef performance when training with the same architecture but using OmniStyle data versus FreeStyle’s LoRA‑mined data.

The following are the results from Table 4 of the original paper:

| Metric | OmniStyle | Ours |
|---|---|---|
| DINOv2 ↑ | 0.897 | 0.809 |
| CAS ↓ | 0.762 | 1.175 |
| ONEIG ↑ | 0.277 | 0.468 |
| CSD ↑ | 0.577 | 0.639 |
| LAION-Aes ↑ | 5.957 | 6.302 |
| V2.5-Aes ↑ | 5.527 | 5.709 |
| VLM-Style ↑ | 3.038 | 7.142 |
| VLM-Content ↑ | 9.669 | 8.919 |
| Ver-S ↑ | 0.186 | 0.482 |
| Ver-C ↑ | 0.972 | 0.928 |

FreeStyle’s data yields:

  • Much higher style‑oriented scores: VLM‑Style from 3.038 to 7.142, Ver‑S from 0.186 to 0.482.

  • Lower content preservation (DINOv2 0.897 → 0.809, CAS 0.762 → 1.175, VLM‑Content 9.669 → 8.919, Ver‑C 0.972 → 0.928), though it remains strong in absolute terms.

This demonstrates that the community LoRA mining pipeline produces data that teaches the model richer and more accurate style transfer, especially on complex and long‑tail styles, while maintaining acceptable content fidelity. The qualitative comparison in Figure 10 underscores this: OmniStyle‑trained models often capture only parts of the reference appearance, whereas FreeStyle consistently reproduces the overall visual style.

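img-16.jpeg A comparison diagram showing the results of different stylization approaches. The left side shows four columns (content, style, OmniStyle, and FreeStyle), and the right side shows the corresponding comparison for other content and style pairs. FreeStyle conveys style details noticeably more vividly, especially for complex and long‑tail styles.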
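The trade-off in Table 4 is easier to see when each metric is classified by whether FreeStyle's data improves or degrades it, taking the arrow direction into account. The sketch below only re-reads the table values; the `compare` helper is a hypothetical name introduced for this illustration.

```python
# metric: (OmniStyle, Ours, higher_is_better), values copied from Table 4.
TABLE4 = {
    "DINOv2":      (0.897, 0.809, True),
    "CAS":         (0.762, 1.175, False),
    "ONEIG":       (0.277, 0.468, True),
    "CSD":         (0.577, 0.639, True),
    "LAION-Aes":   (5.957, 6.302, True),
    "V2.5-Aes":    (5.527, 5.709, True),
    "VLM-Style":   (3.038, 7.142, True),
    "VLM-Content": (9.669, 8.919, True),
    "Ver-S":       (0.186, 0.482, True),
    "Ver-C":       (0.972, 0.928, True),
}


def compare(table):
    """Split metrics into those where 'Ours' is better and those where it is worse."""
    improved, regressed = [], []
    for metric, (omnistyle, ours, higher_is_better) in table.items():
        delta = ours - omnistyle
        better = delta > 0 if higher_is_better else delta < 0
        (improved if better else regressed).append(metric)
        print(f"{metric:12s} {omnistyle:6.3f} -> {ours:6.3f} "
              f"({'better' if better else 'worse'})")
    return improved, regressed


improved, regressed = compare(TABLE4)
```

The style and aesthetics metrics all land in `improved`, and the four content-oriented metrics (DINOv2, CAS, VLM-Content, Ver-C) land in `regressed`, which is the style-versus-content trade-off described above.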

7. Conclusion & Reflections

7.1. Conclusion Summary

FreeStyle presents a complete system for style‑content dual‑reference generation. Its main contributions are:

  • A scalable data‑construction pipeline that mines the open‑source LoRA ecosystem to produce large‑scale, diverse triplets with clean content‑style separation.

  • A two‑stage training curriculum that targets two distinct leakage mechanisms: attention over‑allocation (Stage 1) and positional correspondence via RoPE (Stage 2), with dedicated countermeasures (enrichment constraint and frequency modulation).

  • An open benchmark that evaluates both style‑reference and dual‑reference settings with a mixed metric protocol, including a VLM‑based Verification Score that gives a more robust measure of generation reliability.

Experiments show that FreeStyle achieves the best style‑transfer performance among open‑source methods while maintaining competitive content preservation and aesthetics, and it significantly reduces semantic leakage.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Community LoRA quality follows a long‑tailed distribution and evolves rapidly; automated curation is an ongoing challenge.

  • Cross‑model domain shift remains: stylistic semantics across different base models (Illustrious, FLUX, Qwen) are not perfectly aligned, limiting transferability.

  • Existing metrics, including the Verification Score, still fall short of fine‑grained characterisation of style‑content conflict intensity.

The proposed future directions are automated LoRA quality assessment, cross‑model style alignment, and more granular leakage metrics.

7.3. Personal Insights & Critique

What makes this paper particularly strong:

  • The community LoRA mining idea is both clever and practical: it turns the chaotic, scale‑free collection of open‑source adapters into a structured, high‑quality dataset. This is a highly scalable alternative to manual curation and could be extended to other composition tasks (e.g., object‑style, scene‑object).
  • The attention enrichment metric and its conversion into a differentiable loss are elegantly derived from an empirical observation of attention asymmetry. The two‑sided hinge and the layer‑specific application (only the first block) are well justified by feature visualisation (Figure 11 in the appendix). This kind of “attention auditing” could inspire future work on understanding leakage in multi‑modal transformers.
  • The frequency‑aware RoPE modulation is a novel application of position‑encoding manipulation to a stylisation context. Unlike typical uses of RoPE scaling (e.g., to extend context length in LLMs), here it is used to trade off global vs. local copying—a creative twist.
  • The VLM‑based Verification Score with majority voting over multiple queries is a pragmatic way to mitigate VLM noise and provides a more faithful signal than a single average score.
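The majority-voting idea behind the Verification Score can be illustrated with a short sketch. This section does not give the paper's exact protocol (number of queries, prompts, tie handling), so everything below is an assumption for illustration: each sample is judged by several independent pass/fail VLM queries, a sample counts as verified only on a strict majority, and the score is the fraction of verified samples. The function names are hypothetical.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True only if strictly more than half of the repeated queries said 'pass'."""
    return sum(votes) * 2 > len(votes)


def verification_score(per_sample_votes: Sequence[Sequence[bool]]) -> float:
    """Fraction of samples whose repeated judgments pass by majority (assumed protocol)."""
    if not per_sample_votes:
        raise ValueError("need at least one sample")
    passed = sum(majority_vote(votes) for votes in per_sample_votes)
    return passed / len(per_sample_votes)


# Made-up judgments for four samples, three VLM queries each.
votes = [
    [True, True, False],    # 2 of 3 pass  -> verified
    [False, True, False],   # 1 of 3 pass  -> not verified
    [True, True, True],     # 3 of 3 pass  -> verified
    [False, False, True],   # 1 of 3 pass  -> not verified
]

score = verification_score(votes)
raw_mean = sum(sum(v) for v in votes) / sum(len(v) for v in votes)
print(f"majority-vote score: {score:.3f}, raw mean of votes: {raw_mean:.3f}")
```

In this toy input the majority-vote score is 0.5 while the raw mean of all votes is about 0.583: isolated "pass" answers on otherwise failing samples no longer inflate the result, which is the noise-mitigation effect the bullet above refers to.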

Potential weaknesses / unverified assumptions:

  • VLM bias: The VLM‑based metrics rely heavily on Qwen3‑VL. If this VLM has systematic biases (e.g., favouring certain colour palettes, compositions, or cultural styles), the verification scores could be skewed. A multi‑VLM ensemble or human evaluation would strengthen the conclusions.
  • Dataset biases: The community LoRA corpus is dominated by certain art styles (e.g., anime, illustrative) and may under‑represent fine art or photorealistic styles. The Sankey diagram (Figure 4) shows a high proportion of “2D Japanese anime” and “advertising illustration” styles. This could affect the generalisation claims.
  • Scalability of the RoPE modulation: The modulation parameters ($s_{\mathrm{hf}}$, $s_{\mathrm{lf}}$, $\beta$) were fixed and likely tuned on a single backbone (Qwen‑Image‑Edit). Their robustness across different DiT architectures and training regimes is uncertain.
  • Lack of direct comparison to LoRA‑merging approaches: The paper does not compare with simple LoRA merging methods (e.g., ZipLoRA [66]) that could also achieve dual‑reference effects, albeit in an inference‑only manner. A systematic comparison would further validate the need for dedicated training.

Broader impact and applicability:

  • The data pipeline could be adapted for other multimodal generation tasks (e.g., adding a third reference for “camera angle” or “lighting conditions”).

  • The disentanglement strategies are architecture‑agnostic and could be applied to other transformer‑based diffusion models.

  • The benchmark and evaluation methodology set a high standard for future research, moving beyond simple average similarity scores to a more nuanced, multi‑axis protocol.

In summary, FreeStyle demonstrates that co‑optimising data construction, targeted training constraints, and multi‑faceted evaluation leads to a far more robust dual‑reference generator than any single improvement could. It is a substantial step toward reliable, controllable, and artistically versatile image synthesis.