
When GANs See Race: How Training Data Shapes Who Gets Generated

Vongani H. Maluleke

When people talk about bias in generative models, they usually point at the training data. And they're not wrong. But in this work we wanted to know: is that the whole story? Or do the algorithms themselves make things worse?

In "Studying Bias in GANs through the Lens of Race" (ECCV 2022), we trained StyleGAN2-ADA on datasets with controlled racial compositions and studied what came out the other side. The short answer: the data matters, the algorithm matters, and even how you evaluate the outputs matters. Bias enters at every stage.

We organized our investigation around three questions:

Q1: Does a racially imbalanced training set produce an even more imbalanced set of generated images?

Q2: Does the truncation trick—a standard "quality improvement" at inference time—make racial imbalance worse?

Q3: Do people perceive generated images of different races as different in quality, and if so, why?

Setup

We trained StyleGAN2-ADA on three controlled subsets of FairFace, each with 12,000 images but different racial splits:

80B-20W: 80% Black / 20% white
50B-50W: 50% Black / 50% white
20B-80W: 20% Black / 80% white

We also trained on FFHQ. Same architecture, same hyperparameters, same 128×128 resolution, 2 GPUs—the only thing that changed between runs was the data.
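For concreteness, here is roughly what assembling one of these controlled splits looks like. This is a minimal sketch, not our actual preprocessing code: the CSV path, column names, and the build_split helper are hypothetical stand-ins for however you store FairFace's labels.

```python
import random
import pandas as pd

def build_split(labels_csv, n_total, frac_black, seed=0):
    """Sample a fixed-size subset with a target Black/white ratio.

    Assumes a FairFace-style CSV with 'file' and 'race' columns;
    the column names here are illustrative, not the exact release format.
    """
    df = pd.read_csv(labels_csv)
    black = df[df["race"] == "Black"]["file"].tolist()
    white = df[df["race"] == "White"]["file"].tolist()

    n_black = int(n_total * frac_black)
    n_white = n_total - n_black

    rng = random.Random(seed)
    subset = rng.sample(black, n_black) + rng.sample(white, n_white)
    rng.shuffle(subset)
    return subset

# The three 12,000-image training splits:
splits = {
    "80B-20W": build_split("fairface_labels.csv", 12_000, 0.8),
    "50B-50W": build_split("fairface_labels.csv", 12_000, 0.5),
    "20B-80W": build_split("fairface_labels.csv", 12_000, 0.2),
}
```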

A note on measurement: we're studying perceived race throughout. We collected labels via Amazon Mechanical Turk (50k+ annotations from 59 workers) and also trained a ResNet-18 classifier on FairFace that agreed with human labels 84% of the time. We collapsed FairFace's seven race categories down to three—Black, white, and neither—because these had the least annotator confusion.
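The collapsing step itself is just a fixed label mapping. A minimal sketch, assuming the seven category names from the FairFace release:

```python
# FairFace's seven categories collapsed into the three used in the paper.
# Everything that is not Black or white is grouped as "neither".
COLLAPSE = {
    "Black": "Black",
    "White": "White",
    "East Asian": "neither",
    "Southeast Asian": "neither",
    "Indian": "neither",
    "Middle Eastern": "neither",
    "Latino_Hispanic": "neither",
}

def collapse_label(fairface_race: str) -> str:
    return COLLAPSE[fairface_race]
```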

Before even getting to the generated images, the FFHQ numbers were worth looking at:

69%: white-perceived faces in FFHQ
4%: Black-perceived faces in FFHQ
27%: perceived as neither Black nor white

4%. That's the Black representation in the dataset that most GAN research builds on. So anything trained on FFHQ out-of-the-box inherits this skew as a starting point.

A Taxonomy of Bias

To reason about what we were seeing, we found it useful to distinguish three types of bias:

Data Distribution Bias: The model faithfully reproduces whatever imbalance exists in the training data. Garbage in, garbage out.

Symmetric Algorithmic Bias: The algorithm amplifies the training imbalance, regardless of which group is the majority.

Asymmetric Algorithmic Bias: The algorithm affects different groups unequally in ways that go beyond their representation in the data.

We found evidence of all three, and each tells a different story about where the problem lies.

Q1: Does the Generator Amplify Imbalance?

No, actually. The generated racial distributions closely matched the training distributions. An 80/20 dataset gave roughly 80/20 outputs; the 95% confidence intervals always contained the training ratio. So StyleGAN2-ADA doesn't spontaneously make things worse—it's exhibiting data distribution bias, not algorithmic bias.
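The statistical check behind that claim is just a confidence interval on a proportion of classified samples. A minimal sketch using the normal approximation (the exact interval construction in the paper may differ, and the counts below are invented for illustration):

```python
import math

def proportion_ci(k, n, z=1.96):
    """95% normal-approximation confidence interval for a binomial proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# e.g. if 7,930 of 10,000 generated faces were labeled Black by the classifier,
# does the interval contain the 0.80 training ratio?
lo, hi = proportion_ci(7_930, 10_000)
print(f"[{lo:.3f}, {hi:.3f}]")
```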

That said, "the model only reproduces your bias, it doesn't add to it" is cold comfort when the standard dataset is 4% Black.

Q2: What Does Truncation Do?

The truncation trick is ubiquitous in GAN work. You interpolate latent codes toward the mean of StyleGAN's W space—trade diversity for sharper images. Nearly every demo, paper figure, and product built on GANs uses it to some degree.
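Mechanically, truncation is a one-line linear interpolation toward the mean latent. A sketch of the standard StyleGAN-style operation, with variable names chosen here for illustration:

```python
import numpy as np

def truncate(w, w_avg, gamma):
    """Standard truncation trick: pull latent codes toward the mean of W.

    gamma = 1.0 leaves w unchanged (full diversity);
    gamma = 0.0 collapses every sample onto the mean face.
    """
    return w_avg + gamma * (w - w_avg)

# w:     (batch, 512) latent codes produced by the mapping network
# w_avg: (512,) running mean of W, estimated over many mapped samples
w = np.random.randn(8, 512)
w_avg = np.zeros(512)
w_trunc = truncate(w, w_avg, gamma=0.7)
```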

In the FFHQ model, truncation dramatically reduced the fraction of Black-labeled faces, effectively erasing Black representation at moderate truncation levels.

[Interactive figure: grids of 100 FFHQ-generated faces at truncation levels from γ = 1.0 down to γ = 0.0. At γ = 1.0, no truncation is applied and the generator explores its full latent space, so faces vary widely in race, age, and appearance; sliding toward γ = 0.0 collapses that diversity.]

We verified this across all four generators at truncation levels from γ=0 to 1, classifying 10,000 generated images at each level (110,000 total). The pattern held everywhere: truncation pushes the output distribution toward whichever group is the majority in the training data. In the 80B-20W model, truncation pushed outputs toward Black faces. The mechanism itself is symmetric—but since most real-world face datasets skew white, the practical effect is that truncation erases minorities.

This is what we call symmetric algorithmic bias. The algorithm doesn't "know" about race, but it amplifies whatever imbalance you give it. And since basically everyone uses truncation, this is happening all the time, quietly.
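The sweep behind those numbers is simple once a generator and a race classifier are in hand. A hypothetical sketch of the loop, where generate_images and classify_race stand in for StyleGAN2-ADA sampling and the ResNet-18 classifier:

```python
from collections import Counter

def truncation_sweep(generate_images, classify_race, gammas, n=10_000):
    """Measure the perceived-race distribution of generated faces at each
    truncation level. `generate_images(n, gamma)` yields n images sampled
    at that truncation level; `classify_race(img)` returns a label."""
    results = {}
    for gamma in gammas:
        labels = [classify_race(img) for img in generate_images(n, gamma)]
        counts = Counter(labels)
        results[gamma] = {
            race: counts[race] / n for race in ("Black", "White", "neither")
        }
    return results

# e.g. truncation_sweep(gen, clf, gammas=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```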

Q3: Race and Perceived Quality

This one surprised us. We used Bradley-Terry pairwise comparisons (54,000 evaluations, 3,000 images) to rank generated faces by perceived quality. Three things stood out:

1. More representation → better quality. The best images of a given race tended to come from generators where that race was well-represented. Among the highest-quality white-labeled images, most came from the white-majority model; for Black-labeled images, the Black-majority model contributed the largest share of top-quality samples.

2. White faces were consistently rated higher quality. This held across all training splits—even the 80B-20W model that was majority Black. AUPRC scores for white-labeled images were consistently higher than for Black-labeled images across all three training splits.

3. This wasn't just a GAN artifact. We ran the same pairwise comparison on real FairFace photos—no GAN involved—and annotators still preferred white faces 55.2% ± 2.3% of the time.

That last point is worth sitting with. The bias isn't only in the model—it's in the evaluation. Whether that comes from camera/sensor bias, the other-race effect in perception, or annotator demographics, we can't fully disentangle. Probably all of the above.
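For readers unfamiliar with Bradley-Terry: it assigns each image a latent "strength" p_i such that P(i preferred over j) = p_i / (p_i + p_j), and those strengths can be fit from pairwise outcomes with a standard iterative update. A minimal sketch, not our exact fitting code:

```python
import numpy as np

def bradley_terry(wins, n_iter=100):
    """Fit Bradley-Terry strengths with the standard MM updates.

    wins[i, j] = number of times item i was preferred over item j.
    Assumes every item appears in at least one comparison and wins at
    least once (i.e. the comparison graph is well behaved).
    """
    total = wins + wins.T            # comparisons between each pair
    w = wins.sum(axis=1)             # total wins per item
    p = np.ones(wins.shape[0])
    for _ in range(n_iter):
        denom = (total / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / denom
        p /= p.sum()
    return p

# Each pairwise quality judgment increments wins[winner, loser];
# the fitted p then ranks generated images by perceived quality.
```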

Meanwhile, FID scores for our three FairFace models were nearly identical. The standard automatic metric for GAN quality is completely blind to these racial disparities; you have to actually disaggregate by race to see them.

Takeaways

The framing "algorithmic bias is a data problem" is tempting but incomplete. Yes, the data matters—GANs reproduce whatever you give them. But truncation amplifies the imbalance, FID can't detect it, and human annotators bring their own biases to quality evaluation. It's the whole pipeline.

If you're using GANs for anything involving faces: know (and report) the demographic makeup of your training data, because the generator will reproduce it; treat truncation as a bias amplifier rather than a free quality boost, and report the level you use; and don't rely on FID alone. Disaggregate your evaluation by perceived race, keeping in mind that human quality judgments carry their own biases.

One thing I keep thinking about: classifier-free guidance in diffusion models does something conceptually similar to truncation—it pushes samples toward the conditional distribution's center to improve quality. Whether it has the same demographic effects is an open question and, I think, an important one.
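Schematically, the guidance update has the same shape as truncation: one knob that pulls samples toward a "typical" region at the cost of diversity. A rough sketch, not any particular library's API:

```python
def cfg_noise_prediction(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward (and past) the conditional one.

    guidance_scale = 1.0 recovers plain conditional sampling; larger values
    trade sample diversity for prompt adherence, much as lower truncation
    gamma trades diversity for more 'typical' faces."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```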

Citation

Maluleke, V.H.*, Thakkar, N.*, Brooks, T., Weber, E., Darrell, T., Efros, A.A., Kanazawa, A., and Guillory, D. "Studying Bias in GANs through the Lens of Race." European Conference on Computer Vision (ECCV), 2022.
