Diverse Image Synthesis from Semantic Layouts via Conditional IMLE

Ke Li*, Tianhao Zhang*, Jitendra Malik

[Paper][Code][Related Papers]

Abstract

Most existing methods for conditional image synthesis are only able to generate a single plausible image for any given input, or at best a fixed number of plausible images. In this paper, we focus on the problem of generating images from semantic segmentation maps and present a simple new method that can generate an arbitrary number of images with diverse appearance for the same semantic layout. Unlike most existing approaches which adopt the GAN framework, our method is based on the recently introduced Implicit Maximum Likelihood Estimation (IMLE) framework. Compared to the leading approach, our method is able to generate more diverse images while producing fewer artifacts despite using the same architecture. The learned latent space also has sensible structure despite the lack of supervision that encourages such behaviour.
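To illustrate the core idea of conditional IMLE described above, the following is a minimal, self-contained sketch on a toy linear "generator": for each input, several latent codes are sampled, the sample whose output is closest to the ground truth is selected, and a gradient step is taken only on that sample. All names (`generator`, `imle_step`) and the toy dimensions are hypothetical and for illustration only, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(x, z, W):
    # Hypothetical toy "generator": a linear map from (layout, latent) to output.
    return W @ np.concatenate([x, z])

def imle_step(W, x, y, num_samples=10, lr=0.05):
    # 1. Sample several latent noise vectors.
    zs = rng.standard_normal((num_samples, 2))
    # 2. Keep the latent whose generated output is nearest to the ground truth y.
    dists = [np.sum((generator(x, z, W) - y) ** 2) for z in zs]
    z = zs[int(np.argmin(dists))]
    # 3. Take a gradient step on the squared error of the selected sample only.
    inp = np.concatenate([x, z])
    err = generator(x, z, W) - y
    W = W - lr * np.outer(err, inp)
    return W, min(dists)

# Toy data: 3-dim "layout" x, 2-dim latent z, 3-dim "image" y.
x = np.array([1.0, 0.0, 1.0])
y = np.array([0.5, -0.2, 0.3])
W = rng.standard_normal((3, 5)) * 0.1
for _ in range(200):
    W, d = imle_step(W, x, y)
```

Because every training example is matched by *some* sample rather than the other way around, no mode of the data can be ignored, which is what allows diverse outputs for the same layout.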


Example

Figure 3a

Figure 3a: Different samples for the same input scene layout

Interpolation Results

Figure 4a

Figure 4a: Transition from daytime to nighttime by interpolating between latent noise vectors that correspond to daytime and nighttime
Figure 4b

Figure 4b: Change in car colour by interpolating between latent noise vectors that correspond to different car colours
Figure 7

Figure 7: Style consistency across different scene layouts with the same latent noise vector
Figure 8

Figure 8: Transition between multiple different renderings by interpolating between multiple latent noise vectors
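The interpolation results above come from moving through the learned latent space. A minimal sketch of how such a transition could be produced, assuming simple linear interpolation between two latent noise vectors (the function name and vector sizes here are hypothetical; each interpolated vector would be fed to the generator with a fixed scene layout):

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps=5):
    # Linearly interpolate between two latent noise vectors.
    # Feeding each intermediate vector to the generator with the same
    # layout yields a smooth transition, e.g. daytime -> nighttime.
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts]

z_day = np.zeros(4)     # hypothetical latent for a daytime rendering
z_night = np.ones(4)    # hypothetical latent for a nighttime rendering
frames = interpolate_latents(z_day, z_night, num_steps=5)
```

The endpoints recover the two original latent vectors exactly, so the transition starts and ends at the two renderings being interpolated.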

Evolving Scene Layouts

Figure 9a

Figure 9a: Generated video of moving car with the same latent noise vector across all frames.
Figure 9b

Figure 9b: Generated video of moving car with smooth interpolation from a daytime latent noise vector to a nighttime latent noise vector.
Figure 10

Figure 10: Generated video of a moving car using a method that only predicts a single mode (pix2pix). Because the user cannot select the mode, appearance across frames is inconsistent even though each individual frame has high visual fidelity. This causes flickering when the frames are played as a video sequence.