AugGen:

Synthetic Augmentation Can Improve Discriminative Models

arXiv 2025

EPFL, Idiap, UNIL

AugGen demonstrates that informed sampling of a generator can produce challenging samples for training a discriminator, leading to improved overall performance. This is achieved using only a single source of real data.

Abstract

The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1–12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance.

Key findings

Our main contribution is to validate the following hypothesis.
H1: With appropriate informed sampling, a generative model can boost the performance of a downstream discriminative model when the resulting synthetic data is mixed with the original data used to train both the generative and the discriminative models.
More precisely:
  1. We propose a simple yet effective sampling technique that strategically conditions a generative model to produce beneficial samples, enhancing the discriminator’s training process.
  2. We show that mixing our AugGen data with real samples often surpasses even architectural-level improvements, underscoring that synthetic dataset generation can be as impactful as architectural advances.
  3. We demonstrate that AugGen training can be as effective as adding up to 1.7× real samples, reducing the need for more face images while preserving performance.
  4. We show that current generative metrics (e.g., FD, KD) are poorly correlated with downstream discriminative performance, emphasizing the need for improved proxy metrics.

Motivation

Currently, the most common approach to synthetic datasets relies on pre-trained models such as Stable Diffusion and Flux, or on models trained on large-scale datasets such as WebFace, followed by post-processing with off-the-shelf methods to generate the synthetic data. We diverge from this trend by relying solely on a single source of real data to train both the discriminator and the generator, which prevents information leakage from external datasets.

Method: Generating Synthetic Mixes

  • We start with a dataset of face images containing multiple identities (classes), each of which includes multiple images.
  • We train a discriminator and a class-conditional generator on top of this dataset.
  • Using the discriminator, we aim to find the optimal conditions for the generator to maximize both the inter-class separation of source classes and intra-class similarity.
  • By randomly sampling different identities, we apply the optimal conditions to generate new synthetic identities that retain cues from the original classes while remaining distinct.
  • By combining the generated hard samples with the original dataset, we demonstrate an improvement in the discriminator's performance.
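The search over generator conditions described above can be sketched as a toy NumPy example. Everything here is a stand-in, not the paper's implementation: the "generator" is a noisy copy of the condition vector, the "discriminator" is plain L2 normalization, and the convex mixing of two class conditions with a scanned coefficient `alpha` is one plausible instantiation of informed sampling. The score rewards intra-class compactness of the new identity while penalizing similarity to either source class.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for the trained networks (hypothetical):
def generate(cond, n=8):
    """Class-conditional 'generator': condition vector plus noise."""
    return cond[None, :] + 0.05 * rng.standard_normal((n, cond.size))

def embed(x):
    """'Discriminator' embedding: row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sampling_score(cond, src_a, src_b):
    """Higher is better: compact new class, distinct from both sources."""
    e = embed(generate(cond))
    intra = np.mean([cosine(e[i], e[j])
                     for i in range(len(e)) for j in range(i + 1, len(e))])
    center = e.mean(0)
    inter = max(cosine(center, embed(src_a[None])[0]),
                cosine(center, embed(src_b[None])[0]))
    return intra - inter

# Scan mixing coefficients between two randomly sampled source-class conditions.
c_a, c_b = rng.standard_normal(64), rng.standard_normal(64)
alphas = np.linspace(0.2, 0.8, 7)
best_alpha = max(alphas, key=lambda a: sampling_score(a * c_a + (1 - a) * c_b, c_a, c_b))
```

The selected condition would then drive the generator to produce a new synthetic identity, and the resulting hard samples would be appended to the real training set.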

In each sample, the first and last columns correspond to identities in the original dataset. The second and fourth columns are reconstructions of the identities by the generator trained in a class-conditional way.

Dataset

We will release the synthetic dataset soon.

The dataset will be released in three formats: MXNet `rec` files, a tarred image folder (e.g., for use with an ImageTar dataloader), or an uncompressed image-folder hierarchy.
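For the uncompressed image-folder format, a minimal stdlib sketch of indexing the hierarchy might look as follows. It assumes a `root/<identity>/<image>` layout (the function name and demo folder names are hypothetical); the demo builds a tiny throwaway tree so the snippet is self-contained.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def index_identity_folder(root):
    """Map identity (class) name -> list of image paths, given root/<identity>/<img>."""
    index = defaultdict(list)
    root = Path(root)
    for p in sorted(root.rglob("*")):
        if p.suffix.lower() in {".jpg", ".jpeg", ".png"} and p.parent != root:
            index[p.parent.name].append(p)
    return dict(index)

# Demo: build a tiny throwaway folder tree and index it.
demo_root = Path(tempfile.mkdtemp())
for ident, names in {"id_000": ["a.jpg", "b.png"], "id_001": ["c.jpg"]}.items():
    (demo_root / ident).mkdir()
    for n in names:
        (demo_root / ident / n).touch()
index = index_identity_folder(demo_root)
```

A real training pipeline would hand these path lists to a dataloader; the `rec` format instead packs records for sequential reads.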

Paper

BibTeX

@article{rahimi2025synthetic,
  title={AugGen: Synthetic Augmentation Can Improve Discriminative Models},
  author={Rahimi, Parsa and Teney, Damien and Marcel, Sebastien},
  journal={arXiv preprint arXiv:2503.11544},
  year={2025}
}