Today, the American multinational technology company, Nvidia has announced the unveiling of the super stitched body PoE GAN, equipped with an input text sketch semantic map that can generate realistic photos.
The PoE GAN can receive a wide range of modal input, including text descriptions, image segmentation, sketches, and styles, all of which can be turned into images. The definition of PoE is that it can accept any two combinations of the aforementioned numerous input modes at the same time.
Hinton proposed the “product of experts” notion in 2002, which became known as PoE. On the input space, each expert is defined as a probability model.
Each individual input modality represents a constraint condition that the composite image must meet, therefore the intersection of all constraint sets yields a collection of images that satisfy all requirements.
The product of the single conditional probability distribution is used to define the distribution of the intersection, assuming that the joint conditional probability distribution of each constraint obeys the Gaussian distribution.
To satisfy each requirement, each individual distribution must have a high density in the region in order to make the multiplication and integral distribution have a high density in the region. The focus of PoE GAN is on how to combine each input.
To blend the changes of different types of inputs, the PoE GAN generator uses a global PoE-Net. Each modal input is encoded as a feature vector, which is subsequently summarized into the global PoE-Net using PoE. The decoder not only uses the global PoE-output, Net’s but also connects the segmentation and sketch encoders directly to the output images.
The global PoE-Net has the following structure: a possible feature vector z0 is used as a sample to use PoE, and then MLP processes the feature to produce the feature vector w.
The author suggests a multi-modal projection discriminator in the discriminator section and extends the projection discriminator to accommodate multiple conditional inputs.
The inner product of each input mode is calculated and added to obtain the final loss, unlike the normal projection discriminator, which calculates a single inner product between image embedding and conditional embedding.