When GANs See Race: How Training Data Shapes Who Gets Generated
When people talk about bias in generative models, they usually point at the training data. And they're not wrong. But in this work we wanted to know: is that the whole story? Or do the algorithms themselves make things worse?
In "Studying Bias in GANs through the Lens of Race" (ECCV 2022), we trained StyleGAN2-ADA on datasets with controlled racial compositions and studied what came out the other side. The short answer: the data matters, the algorithm matters, and even how you evaluate the outputs matters. Bias enters at every stage.
We organized our investigation around three questions:
- Does the generator amplify whatever racial imbalance is in the training data?
- What does the truncation trick do to racial representation?
- How does perceived race interact with perceived image quality?
Setup
We trained StyleGAN2-ADA on three controlled subsets of FairFace, each with 12,000 images but different racial splits:
- 50% Black / 50% white (balanced)
- 80% white / 20% Black (80W-20B)
- 80% Black / 20% white (80B-20W)
We also trained on FFHQ. Same architecture, same hyperparameters, same 128×128 resolution, 2 GPUs—the only thing that changed between runs was the data.
A note on measurement: we're studying perceived race throughout. We collected labels via Amazon Mechanical Turk (50k+ annotations from 59 workers) and also trained a ResNet-18 classifier on FairFace that agreed with human labels 84% of the time. We collapsed FairFace's seven race categories down to three—Black, white, and neither—because these had the least annotator confusion.
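As a deliberately minimal sketch of what such a classifier might look like (not the paper's exact training code; the three-way head and fine-tuning details below are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torchvision

# Perceived-race classifier sketch: ResNet-18 backbone with a 3-way head
# for {Black, white, neither}, fine-tuned on FairFace face crops.
# Hyperparameters are illustrative, not taken from the paper.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of face crops and 3-class labels."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```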
Before even getting to the generated images, the FFHQ numbers were worth looking at:
4%. That's the Black representation in the dataset that most GAN research builds on. So anything trained on FFHQ out-of-the-box inherits this skew as a starting point.
A Taxonomy of Bias
To reason about what we were seeing, we found it useful to distinguish three types of bias:
- Data distribution bias: the model faithfully reproduces whatever imbalance exists in its training set.
- Algorithmic bias: the algorithm itself shifts the output distribution away from the training distribution (here, truncation does this symmetrically, toward the majority group).
- Evaluation bias: the metrics and human judgments used to assess outputs carry their own skew.
We found evidence of all three, and each tells a different story about where the problem lies.
Q1: Does the Generator Amplify Imbalance?
No, actually. The generated racial distributions closely matched the training distributions. An 80/20 dataset gave roughly 80/20 outputs; the 95% confidence intervals always contained the training ratio. So StyleGAN2-ADA doesn't spontaneously make things worse—it's exhibiting data distribution bias, not algorithmic bias.
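As a sketch of that check, assuming you've already classified a batch of generated faces (the counts below are made up for illustration):

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical run: 10,000 generated faces from an 80W-20B model,
# 1,985 of them classified as Black.
lo, hi = proportion_ci(1_985, 10_000)
print(f"95% CI for Black-labeled fraction: [{lo:.3f}, {hi:.3f}]")
print("Consistent with the 20% training ratio:", lo <= 0.20 <= hi)
```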
That said, "the model only reproduces your bias, it doesn't add to it" is cold comfort when the standard dataset is 4% Black.
Q2: What Does Truncation Do?
The truncation trick is ubiquitous in GAN work. You interpolate latent codes toward the mean of StyleGAN's W space, trading diversity for sharper, more typical images. Nearly every demo, paper figure, and product built on GANs uses it to some degree.
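In code, the trick is just a linear interpolation in W. A minimal sketch, using StyleGAN's usual ψ convention (ψ = 1 means no truncation; the γ used later in this post may be parameterized differently):

```python
import numpy as np

def truncate(w: np.ndarray, w_avg: np.ndarray, psi: float) -> np.ndarray:
    """StyleGAN-style truncation: pull a latent w toward the mean of W space.

    psi = 1.0 leaves w untouched (no truncation); psi = 0.0 collapses every
    sample onto w_avg, the "average face" of the training distribution.
    """
    return w_avg + psi * (w - w_avg)

# Illustrative usage with random stand-ins for mapped latents.
rng = np.random.default_rng(0)
w_avg = rng.normal(size=512)   # mean latent, estimated by averaging many mapped samples
w = rng.normal(size=512)       # latent for one generated face
w_truncated = truncate(w, w_avg, psi=0.5)
```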
In the FFHQ model, truncation dramatically reduced the fraction of Black-labeled faces, effectively erasing Black representation at moderate truncation levels.
[Interactive figure ("Explore the Truncation Effect"): grids of 100 FFHQ-generated faces at each truncation level. With no truncation, the generator explores its full latent space and faces vary widely in race, age, and appearance; as truncation increases, that diversity collapses.]
We verified this across all four generators at truncation levels from γ=0 to 1, classifying 10,000 generated images at each level (110,000 total). The pattern held everywhere: truncation pushes the output distribution toward whichever group is the majority in the training data. In the 80B-20W model, truncation pushed outputs toward Black faces. The mechanism itself is symmetric—but since most real-world face datasets skew white, the practical effect is that truncation erases minorities.
This is what we call symmetric algorithmic bias. The algorithm doesn't "know" about race, but it amplifies whatever imbalance you give it. And since basically everyone uses truncation, this is happening all the time, quietly.
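A sketch of that measurement loop, where generate_faces and classify_race are hypothetical stand-ins for the trained generator and the perceived-race classifier:

```python
import numpy as np

def truncation_sweep(generate_faces, classify_race, levels, n_per_level=10_000):
    """Fraction of generated faces labeled Black at each truncation level.

    generate_faces(n, truncation) -> images and classify_race(images) ->
    NumPy array of labels in {"black", "white", "neither"} are hypothetical
    hooks for the generator and classifier described above.
    """
    fractions = {}
    for level in levels:
        images = generate_faces(n_per_level, truncation=level)
        labels = classify_race(images)
        fractions[level] = float(np.mean(labels == "black"))
    return fractions

# e.g. truncation_sweep(generator, classifier, levels=np.linspace(0.0, 1.0, 11))
```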
Q3: Race and Perceived Quality
This one surprised us. We used Bradley-Terry pairwise comparisons (54,000 evaluations, 3,000 images) to rank generated faces by perceived quality. Three things stood out:
That last point is worth sitting with. The bias isn't only in the model—it's in the evaluation. Whether that comes from camera/sensor bias, the other-race effect in perception, or annotator demographics, we can't fully disentangle. Probably all of the above.
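For readers unfamiliar with Bradley-Terry, here's a minimal sketch of recovering per-image quality scores from pairwise preferences. It uses the standard MM (Zermelo) update, which may differ from the paper's exact estimation procedure:

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, n_iter: int = 500, tol: float = 1e-9):
    """Fit Bradley-Terry scores s, where P(i preferred over j) = s[i] / (s[i] + s[j]).

    wins[i, j] = number of times image i was preferred over image j.
    """
    n = wins.shape[0]
    comparisons = wins + wins.T                  # total comparisons per pair
    total_wins = wins.sum(axis=1)
    s = np.ones(n)
    for _ in range(n_iter):
        denom = comparisons / (s[:, None] + s[None, :])
        np.fill_diagonal(denom, 0.0)
        s_new = total_wins / np.maximum(denom.sum(axis=1), 1e-12)
        s_new /= s_new.sum()                     # fix the arbitrary overall scale
        if np.max(np.abs(s_new - s)) < tol:
            return s_new
        s = s_new
    return s
```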
Meanwhile, FID scores for our three FairFace models were nearly identical. The standard automatic metric for GAN quality simply does not register these racial disparities; you have to disaggregate the evaluation by race to see them.
Takeaways
The framing "algorithmic bias is a data problem" is tempting but incomplete. Yes, the data matters—GANs reproduce whatever you give them. But truncation amplifies the imbalance, FID can't detect it, and human annotators bring their own biases to quality evaluation. It's the whole pipeline.
If you're using GANs for anything involving faces:
- Know your data. Audit the racial composition. Use model cards.
- Report your truncation level. You're trading diversity for sharpness, and that trade hits underrepresented groups hardest.
- Don't trust FID alone. Break your evaluation down by demographic group; aggregate metrics hide disparities (see the per-group sketch after this list).
- Consider balanced data. It mitigates the symmetric algorithmic bias from truncation.
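Here's a sketch of what per-group FID could look like, assuming you already have Inception features and perceived-race labels for both real and generated images (feature extraction omitted; function and variable names are illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two feature sets (N x D arrays)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = np.real(covmean)   # drop tiny imaginary parts from numerical error
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def per_group_fid(real_feats, real_labels, gen_feats, gen_labels, groups):
    """FID computed separately within each perceived-race group."""
    return {
        g: frechet_distance(real_feats[real_labels == g], gen_feats[gen_labels == g])
        for g in groups
    }

# e.g. per_group_fid(rf, rl, gf, gl, groups=["black", "white", "neither"])
```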
One thing I keep thinking about: classifier-free guidance in diffusion models does something conceptually similar to truncation, trading sample diversity for fidelity by pushing samples toward more prototypical, higher-density regions of the conditional distribution. Whether it has the same demographic effects is an open question and, I think, an important one.
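Side by side, the two update rules look like this (both are standard formulas, not anything specific to the paper):

```python
def truncation(w_latent, w_avg, psi):
    """StyleGAN truncation: shrink latents toward the mean of W space."""
    return w_avg + psi * (w_latent - w_avg)

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """CFG: push the denoising prediction past the conditional estimate,
    away from the unconditional one, trading diversity for fidelity."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```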
Citation
Maluleke, V.H.*, Thakkar, N.*, Brooks, T., Weber, E., Darrell, T., Efros, A.A., Kanazawa, A., and Guillory, D. "Studying Bias in GANs through the Lens of Race." European Conference on Computer Vision (ECCV), 2022.