ScoreMix:

Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition

arXiv 2025

EPFL, Idiap
ScoreMix overview illustration highlighting score composition in diffusion models

ScoreMix composes class-conditioned diffusion scores to synthesize challenging augmentations that strengthen discriminators while relying on a single source dataset.

Abstract

Synthetic data generation is increasingly used for training and data augmentation, yet existing strategies often rely on external foundation models or auxiliary datasets that are impractical because of license, privacy, or domain-mismatch constraints. We introduce ScoreMix, a self-contained augmentation method that leverages the score composition phenomenon in diffusion models to produce hard samples for recognition tasks.

ScoreMix mixes class-conditioned diffusion scores during the reverse process, creating domain-specific augmentations without accessing any external data. We systematically study how to select source classes and show that mixing identities that are far apart in the discriminator’s embedding space yields the largest gains—providing up to 3% additional improvement over proximity-based selection. Across eight public face recognition benchmarks, ScoreMix improves accuracy by up to 7 percentage points without hyperparameter search, surpassing both training on real data alone and architectural scaling baselines. Code and synthetic datasets will be released.

ScoreMix in a Nutshell

ScoreMix was born from a simple puzzle: can we strengthen discriminators without calling in external data or proprietary generators? The story below shows how mixing diffusion scores became the answer.

Hypothesis. Convexly mixing class-conditioned diffusion scores can generate synthetic samples that meaningfully boost a downstream discriminator while using only the original training dataset.

We begin by training a diffusion generator and a face-recognition discriminator solely on WebFace160K. With both models anchored to the same dataset, we can explore what happens when the generator’s score functions are blended.

ScoreMix pairs identities that live far apart in the discriminator’s embedding space and composes their class-conditioned diffusion scores with a convex combination—typically an even split. Guided by this mixed score, the generator paints new faces that stay on-manifold yet deviate just enough to challenge the discriminator.
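
At the formula level, the composition is nothing more than a convex blend of the two class-conditioned scores at every denoising step. A minimal sketch, where `score_model`, `cond_a`, and `cond_b` are illustrative placeholders for the class-conditional diffusion network and the two identity conditions rather than the released API:

def mixed_score(score_model, x_t, t, cond_a, cond_b, lam=0.5):
    # Convex blend of the two class-conditioned scores; lam = 0.5 is the even split.
    s_a = score_model(x_t, t, cond_a)  # score for identity A
    s_b = score_model(x_t, t, cond_b)  # score for identity B
    return lam * s_a + (1.0 - lam) * s_b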

Those mixed samples go back into the training loop alongside the original images. Across eight public FR benchmarks, this recipe yields verification gains of up to 7 percentage points, beating both AugGen and a larger IR101 model trained purely on real data. Selecting distant identities is crucial: it grants an extra 3% average boost compared with proximity-based choices.

Finally, we investigate why the strategy works. The generator's condition space and the discriminator's embedding space share only loose alignment, which explains why naive condition-space heuristics underperform. Embedding-aware selection is what turns score mixing into ScoreMix.



Motivation

ScoreMix augmentations composed from two source identities
Before mixing scores, we confronted the reality that most synthetic augmentation pipelines lean on powerful foundation models (e.g., Stable Diffusion, FLUX) or auxiliary datasets harvested at scale. These options often clash with privacy rules, licensing restrictions, or domain-specific deployments. ScoreMix asks a stricter question: can we boost discriminators using only the dataset already in hand? By training both the diffusion generator and the recognition model on the same data, ScoreMix avoids information leakage while still producing diverse, challenging samples.


Method: Generating Synthetic Mixes

Grid showing ScoreMix images produced by sweeping convex score mixing coefficients

Once the motivation was set, the recipe emerged as a series of deliberate steps that keep the pipeline self-contained while getting the most out of the diffusion scores.

  • Train a face recognition discriminator (IR50/IR101) and a class-conditional diffusion model on the same labeled dataset, ensuring a self-contained setup.
  • Measure pairwise distances between class centers in the discriminator’s embedding space and select identities that are far apart to mix (see the sketch after this list).
  • During reverse diffusion, compose the two class-conditioned scores with a convex combination (typically lambda = 0.5), preserving magnitude and on-manifold guidance.
  • Sample with EDM2’s deterministic second-order solver and light autoguidance to obtain high-fidelity ScoreMix augmentations.
  • Merge the synthetic ScoreMix data with the original dataset and continue training the discriminator, achieving sizeable gains without external data.
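
To make these steps concrete, here is a minimal PyTorch-style sketch of the two ScoreMix-specific pieces: selecting distant identity pairs from the discriminator's class centers, and one deterministic second-order (Heun-style, EDM-parameterized) denoising step guided by the convexly mixed class-conditional prediction. The helper names (`embedder`, `denoiser`, the greedy pairing) are illustrative assumptions rather than the released implementation, and autoguidance is omitted for brevity.

import torch
import torch.nn.functional as F

@torch.no_grad()
def class_centers(embedder, images_by_class):
    # L2-normalized mean discriminator embedding per identity (illustrative helper).
    centers = []
    for imgs in images_by_class:                 # imgs: (N_i, 3, H, W) tensor
        feats = F.normalize(embedder(imgs), dim=-1)
        centers.append(F.normalize(feats.mean(dim=0), dim=-1))
    return torch.stack(centers)                  # (num_classes, d)

def distant_pairs(centers, num_pairs):
    # Greedily pick disjoint identity pairs with the lowest cosine similarity.
    sim = centers @ centers.T
    sim.fill_diagonal_(float("inf"))             # never pair an identity with itself
    order = sim.flatten().argsort()              # most dissimilar pairs first
    pairs, used = [], set()
    for idx in order.tolist():
        i, j = divmod(idx, centers.shape[0])
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
        if len(pairs) == num_pairs:
            break
    return pairs

def mixed_denoiser(denoiser, x, sigma, cond_a, cond_b, lam=0.5):
    # Convex combination of the two class-conditioned denoised predictions.
    return lam * denoiser(x, sigma, cond_a) + (1.0 - lam) * denoiser(x, sigma, cond_b)

@torch.no_grad()
def heun_step(denoiser, x, sigma, sigma_next, cond_a, cond_b, lam=0.5):
    # One deterministic second-order (Heun) step under the mixed guidance.
    d = (x - mixed_denoiser(denoiser, x, sigma, cond_a, cond_b, lam)) / sigma
    x_next = x + (sigma_next - sigma) * d        # Euler predictor
    if sigma_next > 0:                           # second-order correction
        d_next = (x_next - mixed_denoiser(denoiser, x_next, sigma_next,
                                          cond_a, cond_b, lam)) / sigma_next
        x_next = x + (sigma_next - sigma) * 0.5 * (d + d_next)
    return x_next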

Key Takeaways

Takeaway 1 — Self-contained augmentation pays off

ScoreMix relies solely on the data used to train the generator and discriminator, yet it outperforms the real-only baselines, including the larger IR101 backbone. This validates that a fully self-contained pipeline can still deliver state-of-the-art recognition gains.

Method | IJB-C @1e-6 | TinyFace Rank-1 | Avg-H
WebFace160K (IR50) | 70.37 | 61.51 | 92.50
WebFace160K (IR101) | 72.56 | 62.59 | 93.32
ScoreMix (ours) | 76.45 | 63.09 | 93.87

Takeaway 2 — Embedding-aware pairing beats condition heuristics

Selecting identities that are distant in the discriminator embedding space gives a clear boost compared with choosing based on condition-space proximity, reinforcing that the pairing strategy should be recognition-driven.

Strategy | IJB-C @1e-6 | Avg | Δ Avg (Dist − Close)
Close Embedding (pairs) | 71.86 | 64.92 |
Dist Embedding (pairs) | 78.62 | 67.44 | +2.52
Close Condition (pairs) | 74.43 | 66.84 |
Dist Condition (pairs) | 76.97 | 66.95 | +0.11

Takeaway 3 — Freezing discriminative features breaks the generator

Replacing learned condition vectors with the discriminator’s class centers causes the diffusion model to collapse. Training diverges quickly and produces unusable samples, so ScoreMix keeps the conditioning module learnable.

Condition Strategy | Outcome
Learned conditions (ScoreMix) | Stable training, high-fidelity mixed samples
Frozen discriminator centers | Training diverges; the generator fails to converge

Takeaway 4 — Pairs outperform triplets (and quads)

Exhaustive m-plet searches revealed that moving beyond two classes does not buy extra recognition accuracy, even with GPU-accelerated mining. ScoreMix sticks with pairs for the best trade-off.

Mixing Setup | IJB-C @1e-6 | Avg
Pairs (Dist Embedding) | 78.62 | 67.44
Triplets (Sum Max) | 74.36 | 65.69
Triplets (Sum Min) | 73.11 | 64.62

Takeaway 5 — Condition space remains weakly aligned

Centered Kernel Alignment (CKA) shows that the diffusion model’s condition vectors never fully align with the geometry shared by multiple recognition backbones. This limited alignment explains why condition-space heuristics trail embedding-aware strategies.

CKA alignment between condition space and recognition backbones

CKA curves: alignment between condition space and FR backbones (solid lines) stays below inter-backbone alignment (dashed), highlighting the geometric gap.
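
For readers unfamiliar with the metric, linear CKA between two row-aligned feature matrices takes only a few lines. The sketch below is a generic implementation of linear CKA (Kornblith et al., 2019), not the paper's exact evaluation code; `X` and `Y` are assumed to hold per-identity vectors from the condition space and from a recognition backbone, with rows aligned by identity.

import torch

def linear_cka(X, Y):
    # Linear Centered Kernel Alignment between row-aligned feature matrices.
    # X: (n, d1), Y: (n, d2) share the same n rows (e.g., one row per identity).
    # Returns a scalar in [0, 1]; higher means more similar geometry.
    X = X - X.mean(dim=0, keepdim=True)          # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = torch.linalg.norm(X.T @ Y) ** 2       # ||X^T Y||_F^2
    denom = torch.linalg.norm(X.T @ X) * torch.linalg.norm(Y.T @ Y)
    return (hsic / denom).item()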

Takeaway 6 — Forcing alignment hurts performance

Pushing generator outputs toward the discriminator’s class centers lowers verification accuracy. Alignment loss decreases, but intra-class similarity spikes, implying over-constrained synthetic samples.

Training Data | IJB-C @1e-6 | Avg-H
ScoreMix Repro (synthetic only) | 54.66 | 92.47
ScoreMix Repro + alignment | 45.79 | 46.55

Alignment loss to discriminator centers before and after regularization

Alignment loss drops with regularization, yet the resulting generator underperforms.

Intra-class similarity after enforcing alignment

Intra-class similarity shoots up, signaling over-constrained synthetic samples.
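
For concreteness, the alignment regularizer discussed here can be thought of as an auxiliary loss pulling the discriminator embedding of each generated sample toward its identity's class center. The sketch below is an assumption about the general form of such a term, not the exact loss used in the paper.

import torch
import torch.nn.functional as F

def alignment_loss(embedder, x_generated, class_ids, class_centers):
    # Cosine distance between discriminator embeddings of generated samples and
    # their identities' class centers. Minimizing it shrinks the alignment gap,
    # but per Takeaway 6 it over-constrains the generator and inflates
    # intra-class similarity.
    feats = F.normalize(embedder(x_generated), dim=-1)        # (B, d)
    centers = F.normalize(class_centers[class_ids], dim=-1)   # (B, d)
    return (1.0 - (feats * centers).sum(dim=-1)).mean()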

Sweeping the convex weights (alpha, beta) for each identity pair produces a grid of samples per pair; each grid shows how the synthesis leans toward one source as the guidance balance shifts.

Qualitative ScoreMix Samples

Each grid shows original identities (outer columns), generator reproductions, and the ScoreMix augmentation in the center column—highlighting how subtle, identity-preserving cues are introduced.



Face Recognition Experiments

To see how this story plays out empirically, we apply ScoreMix to WebFace160K, a 160K-image subset of WebFace4M with roughly ten thousand identities, and evaluate the resulting discriminators on eight public face recognition (FR) benchmarks. All models share the same IR50 backbone (with an IR101 variant shown for comparison) and ArcFace head; synthetic datasets are generated solely from the diffusion model trained on WebFace160K.

Method | Synthetic | Real | IJB-B @1e-6 | IJB-C @1e-6 | TinyFace Rank-1 | Avg-H
WebFace160K (IR50) | 0 | 0.16M | 32.13 | 70.37 | 61.51 | 92.50
WebFace160K (IR101) | 0 | 0.16M | 34.84 | 72.56 | 62.59 | 93.32
AugGen | 0.20M | 0.16M | 34.83 | 75.02 | 61.41 | 93.78
ScoreMix (ours) | 0.20M | 0.16M | 35.95 | 76.45 | 63.09 | 93.87
ScoreMix Repro (synthetic only) | 0.20M | 0 | 28.15 | 54.66 | 56.38 | 92.47

ScoreMix delivers the highest verification accuracy among self-contained approaches, surpassing strong baselines and even the larger IR101 architecture trained on real data alone. Synthetic-only training remains competitive, highlighting the fidelity of the mixed samples, while combining ScoreMix augmentations with real data yields the best overall results across both high-quality and challenging FR benchmarks.


Dataset

To make the experiments reproducible end-to-end, ScoreMix synthetic datasets generated from WebFace160K will be released soon.

Planned packaging includes MXNet `rec` files, image-folder tarballs (compatible with ImageTar data loaders), and an uncompressed folder hierarchy for quick inspection.
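
Assuming the `rec` packaging follows the usual indexed-record layout, it should be readable with MXNet's record reader; the file names below are placeholders until the release is out.

import mxnet as mx

# Placeholder file names; the real ones depend on the released package.
reader = mx.recordio.MXIndexedRecordIO("scoremix.idx", "scoremix.rec", "r")
header, img_bytes = mx.recordio.unpack(reader.read_idx(1))  # read one packed record
image = mx.image.imdecode(img_bytes)                        # decode to an HWC NDArray
print(image.shape, header.label)                            # image size and identity label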

Paper

BibTeX


@article{rahimi2025scoremix,
  title={ScoreMix: Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition},
  author={Parsa Rahimi and Sebastien Marcel},
  year={2025},
  eprint={2506.10226},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.10226},
}