Synthetic data generation is increasingly used for training and data augmentation, yet existing strategies often rely on external foundation models or auxiliary datasets that are impractical because of license, privacy, or domain-mismatch constraints. We introduce ScoreMix, a self-contained augmentation method that leverages the score composition phenomenon in diffusion models to produce hard samples for recognition tasks.
ScoreMix mixes class-conditioned diffusion scores during the reverse process, creating domain-specific augmentations without accessing any external data. We systematically study how to select source classes and show that mixing identities that are far apart in the discriminator's embedding space yields the largest gains, providing up to 3% additional improvement over proximity-based selection. Across eight public face recognition benchmarks, ScoreMix improves accuracy by up to 7 percentage points without hyperparameter search, surpassing both training on real data alone and architectural scaling baselines. Code and synthetic datasets will be released.
ScoreMix was born from a simple puzzle: can we strengthen discriminators without calling in external data or proprietary generators? The story below shows how mixing diffusion scores became the answer.
We begin by training a diffusion generator and a face-recognition discriminator solely on WebFace160K. With both models anchored to the same dataset, we can explore what happens when the generator’s score functions are blended.
ScoreMix pairs identities that live far apart in the discriminator's embedding space and composes their class-conditioned diffusion scores via a convex combination, typically an even split. Guided by this mixed score, the generator paints new faces that stay on-manifold yet deviate just enough to challenge the discriminator.
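In code, the core operation is a single convex combination of two class-conditioned noise predictions at every denoising step. Below is a minimal PyTorch sketch assuming an epsilon-prediction network `eps_model(x_t, t, cond)` and a deterministic DDIM update; the function names, sampler choice, and single mixing weight `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def sample_scoremix(eps_model, cond_a, cond_b, alphas_cumprod, shape, lam=0.5):
    """Generate a mixed-identity sample by composing two class-conditional scores.

    eps_model:      epsilon-prediction network, eps_model(x_t, t, cond) -> noise
    cond_a, cond_b: condition vectors of the two (distant) source identities
    alphas_cumprod: 1-D tensor of cumulative alpha-bar values, length T
    lam:            convex mixing weight (0.5 = the even split used by default)
    """
    x = torch.randn(shape)  # start the reverse process from pure noise
    T = alphas_cumprod.numel()
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        # Convex combination of the two class-conditioned noise predictions.
        eps = lam * eps_model(x, t_batch, cond_a) \
            + (1.0 - lam) * eps_model(x, t_batch, cond_b)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        # Deterministic DDIM update driven by the mixed epsilon.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x
```

Because the two predictions are blended before each update, the sampler follows one composite trajectory instead of averaging two finished images, which is what keeps the output on-manifold.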
Those mixed samples go back into the training loop alongside the original images. Across eight public FR benchmarks, this recipe yields up to a 7-percentage-point verification gain, beating both AugGen and a larger IR101 model trained purely on real data. Selecting distant identities is crucial: it grants an extra 3% average boost compared with proximity-based choices.
Finally, we investigated why the strategy works. The generator’s condition space and the discriminator’s embedding space share only loose alignment, explaining why naive condition-space heuristics underperform. Embedding-aware selection is what turns score mixing into ScoreMix.
Once the motivation was set, the recipe emerged as a series of deliberate steps that keep the pipeline self-contained while extracting the most out of diffusion scores.
Takeaway 1 — Self-contained augmentation pays off
ScoreMix relies solely on the data used to train the generator and discriminator, yet it outperforms both the real-only baselines and a larger IR101 backbone. This validates that a fully self-contained pipeline can still deliver state-of-the-art recognition gains.
| Method | IJB-C @1e-6 | TinyFace Rank-1 | Avg-H |
|---|---|---|---|
| WebFace160K (IR50) | 70.37 | 61.51 | 92.50 |
| WebFace160K (IR101) | 72.56 | 62.59 | 93.32 |
| ScoreMix (ours) | 76.45 | 63.09 | 93.87 |
Takeaway 2 — Embedding-aware pairing beats condition heuristics
Selecting identities that are distant in the discriminator embedding space gives a clear boost over choosing by condition-space proximity, reinforcing that the pairing strategy should be recognition-driven (a pairing sketch follows the table).
| Strategy | IJB-C @1e-6 | Avg | Δ Avg (Dist vs. Close) |
|---|---|---|---|
| Close Embedding (pairs) | 71.86 | 64.92 | |
| Dist Embedding (pairs) | 78.62 | 67.44 | +2.52 |
| Close Condition (pairs) | 74.43 | 66.84 | |
| Dist Condition (pairs) | 76.97 | 66.95 | +0.11 |
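The distant-pairing rule itself fits in a few lines. A hedged sketch, assuming one class center per identity taken from the trained discriminator's embedding space; the helper name and the one-partner-per-class policy are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_distant_pairs(class_centers: torch.Tensor) -> torch.Tensor:
    """Pair every identity with its most distant partner in embedding space.

    class_centers: (C, D) tensor, one discriminator embedding (e.g. the mean
                   of a class's image embeddings) per identity.
    Returns a (C,) tensor whose i-th entry is the index of i's farthest class.
    """
    centers = F.normalize(class_centers, dim=1)  # compare on the unit sphere
    sim = centers @ centers.T                    # (C, C) cosine similarities
    sim.fill_diagonal_(float("inf"))             # never pair a class with itself
    return sim.argmin(dim=1)                     # lowest similarity = most distant
```

Swapping `argmin` for `argmax` recovers the "Close Embedding" baseline in the table above.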
Takeaway 3 — Freezing discriminative features breaks the generator
Replacing learned condition vectors with the discriminator's class centers causes the diffusion model to collapse: training diverges quickly and produces unusable samples, so ScoreMix keeps the conditioning module learnable (see the sketch after the table).
| Condition Strategy | Outcome |
|---|---|
| Learned conditions (ScoreMix) | Stable training, high-fidelity mixed samples |
| Frozen discriminator centers | Training diverges; generator fails to converge |
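The two rows above correspond to two ways of wiring the conditioning module. A minimal sketch, assuming a class-embedding table that feeds the diffusion backbone; the class name and `centers` argument are illustrative.

```python
import torch
import torch.nn as nn

class ClassConditioner(nn.Module):
    """Condition module for the class-conditional diffusion model.

    learned=True  -> trainable embedding table (ScoreMix's stable setup).
    learned=False -> conditions frozen to precomputed discriminator class
                     centers, the variant whose training diverged.
    """
    def __init__(self, num_classes, dim, centers=None, learned=True):
        super().__init__()
        if learned:
            self.embed = nn.Embedding(num_classes, dim)
        else:
            self.embed = nn.Embedding.from_pretrained(centers, freeze=True)

    def forward(self, class_ids):
        return self.embed(class_ids)  # (B,) ids -> (B, dim) condition vectors
```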
Takeaway 4 — Pairs outperform triplets (and quads)
Exhaustive $m$-plet searches revealed that moving beyond two classes does not buy extra recognition accuracy, even with GPU-accelerated mining. ScoreMix sticks with pairs as the best trade-off (a mining sketch follows the table).
| Mixing Setup | IJB-C @1e-6 | Avg |
|---|---|---|
| Pairs (Dist Embedding) | 78.62 | 67.44 |
| Triplets (Sum Max) | 74.36 | 65.69 |
| Triplets (Sum Min) | 73.11 | 64.62 |
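For reference, a hedged sketch of how "Sum Max" triplet mining can run on GPU: score each triplet by the sum of its three pairwise cosine distances over the same normalized class centers as before. The chunking caveat and the greedy choice of the third member are illustrative simplifications, not the paper's exact mining code.

```python
import torch
import torch.nn.functional as F

def mine_triplets_sum_max(class_centers: torch.Tensor, k: int) -> torch.Tensor:
    """Mine k triplets with large summed pairwise distance ('Sum Max' style).

    Exhaustive over pairs (i, j); the third member l is chosen greedily to
    maximize dist[i, l] + dist[j, l]. Materializes a (C, C, C) tensor, so
    chunk the first axis when the number of classes C is large.
    """
    c = F.normalize(class_centers, dim=1)
    dist = 1.0 - c @ c.T                                   # cosine distances
    pair_plus_third = dist[:, None, :] + dist[None, :, :]  # entry [i, j, l]
    third_score, third = pair_plus_third.max(dim=-1)       # best l per (i, j)
    total = (dist + third_score).triu(diagonal=1)          # keep each pair once
    flat = total.flatten().topk(k).indices
    i, j = flat // dist.size(0), flat % dist.size(0)
    return torch.stack([i, j, third[i, j]], dim=1)         # (k, 3) class ids
```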
Takeaway 5 — Condition space remains weakly aligned
Centered Kernel Alignment (CKA) shows that the diffusion model's condition vectors never reach the level of geometric agreement shared by multiple recognition backbones. This limited alignment explains why condition-space heuristics trail embedding-aware strategies.
CKA curves: alignment between condition space and FR backbones (solid lines) stays below inter-backbone alignment (dashed), highlighting the geometric gap.
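Linear CKA, the similarity index behind those curves, takes only a few lines. A sketch assuming paired matrices: n identities represented once as diffusion condition vectors and once as FR embeddings.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear Centered Kernel Alignment between two representation matrices.

    X: (n, d1) condition vectors and Y: (n, d2) FR embeddings for the same
    n identities. Returns a value in [0, 1]; higher = more similar geometry.
    """
    X = X - X.mean(dim=0, keepdim=True)  # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.norm(Y.T @ X) ** 2     # ||Y^T X||_F^2, shared structure
    return (cross / (torch.norm(X.T @ X) * torch.norm(Y.T @ Y))).item()
```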
Takeaway 6 — Forcing alignment hurts performance
Pushing generator outputs toward the discriminator's class centers lowers verification accuracy: the alignment loss decreases, but intra-class similarity spikes, implying over-constrained synthetic samples (a sketch of the tested regularizer appears after the table).
| Training Data | IJB-C @1e-6 | Avg-H |
|---|---|---|
| ScoreMix Repro (synthetic only) | 54.66 | 92.47 |
| ScoreMix Repro + alignment | 45.79 | 46.55 |
Alignment loss drops with regularization, yet the resulting generator underperforms.
Intra-class similarity shoots up, signaling over-constrained synthetic samples.
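A plausible form of the tested alignment regularizer, sketched as a cosine pull of synthetic-sample embeddings toward their class centers; the exact formulation and weighting in the paper are assumptions here.

```python
import torch
import torch.nn.functional as F

def alignment_loss(embeddings, class_centers, labels):
    """Cosine penalty pulling each sample's embedding toward its class center.

    Minimizing this term did lower the alignment loss, but it also drove
    intra-class similarity up and verification accuracy down (Takeaway 6).
    """
    z = F.normalize(embeddings, dim=1)
    c = F.normalize(class_centers[labels], dim=1)  # center of each sample's class
    return (1.0 - (z * c).sum(dim=1)).mean()       # mean (1 - cosine similarity)
```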
Each grid shows original identities (outer columns), generator reproductions, and the ScoreMix augmentation in the center column, highlighting how subtle, identity-preserving cues are introduced.
| Method | Synthetic Images | Real Images | IJB-B @1e-6 | IJB-C @1e-6 | TinyFace Rank-1 | Avg-H |
|---|---|---|---|---|---|---|
| WebFace160K (IR50) | 0 | 0.16M | 32.13 | 70.37 | 61.51 | 92.50 |
| WebFace160K (IR101) | 0 | 0.16M | 34.84 | 72.56 | 62.59 | 93.32 |
| AugGen | 0.20M | 0.16M | 34.83 | 75.02 | 61.41 | 93.78 |
| ScoreMix (ours) | 0.20M | 0.16M | 35.95 | 76.45 | 63.09 | 93.87 |
| ScoreMix Repro (synthetic only) | 0.20M | 0 | 28.15 | 54.66 | 56.38 | 92.47 |
To make the experiments reproducible end-to-end, ScoreMix synthetic datasets generated from WebFace160K will be released soon.
Planned packaging includes MXNet `rec` files, image-folder tarballs (compatible with ImageTar data loaders), and an uncompressed folder hierarchy for quick inspection.
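For the `rec` packaging, reading should work with MXNet's standard recordio API. A sketch assuming an InsightFace-style `train.rec`/`train.idx` pair; the file names and record layout are assumptions until the release lands.

```python
import mxnet as mx

# Open an InsightFace-style indexed record file (paths are placeholders).
record = mx.recordio.MXIndexedRecordIO("train.idx", "train.rec", "r")

header, img_bytes = mx.recordio.unpack(record.read_idx(1))  # first image record
label = header.label if isinstance(header.label, float) else header.label[0]
img = mx.image.imdecode(img_bytes).asnumpy()                # HWC uint8 array
print(int(label), img.shape)
```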
@article{rahimi2025scoremix,
title={ScoreMix: Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition},
author={Parsa Rahimi and Sebastien Marcel},
year={2025},
eprint={2506.10226},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.10226},
}