In the first post, I referred to the paper on context encoders. In part of the paper, the authors tackle the same problem as this project: Inpainting the missing center of an image.
Their approach consists of using an autoencoder-like architecture to generate the center region from the border, similar to my first attempt for this project. The difference is that their loss incorporates both a reconstruction loss and an adversarial component. The reconstruction loss is the basic pixel-wise squared error while the adversarial loss is obtained by using a discriminator network to differentiate generated images from real ones (as in GANs).
The adversarial component reduces the blurriness in the inpainted region and, overall, produces more realistic images than those obtained with only the reconstruction loss.
I adopted the same idea with a slight variation; I used an autoencoder as the discriminator as done in BEGANs. Now, onto the details.
In the context encoder paper, the authors feed the border of the image into a generator and it outputs the center region. The discriminator then looks at the whole image (border + generated center) and decides if it is real or fake.
For my experiments, I tried out three different settings for the discriminator (which is an autoencoder):
1. Reconstruct only the center region.
2. Reconstruct the full image (border + center).
3. Reconstruct the center from the full image, i.e., input the whole image but output only the center.
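The three settings amount to different input/target pairs for the autoencoder discriminator. A small sketch of the data plumbing, under my own assumptions (3x64x64 images with a 3x32x32 center; the helper names are mine, not from the original code):

```python
import numpy as np

def center_mask(h=64, w=64, ch=32, cw=32):
    """Boolean mask that is True on the 32x32 center of a 64x64 image."""
    mask = np.zeros((h, w), dtype=bool)
    top, left = (h - ch) // 2, (w - cw) // 2
    mask[top:top + ch, left:left + cw] = True
    return mask

# Each setting returns (discriminator input, reconstruction target)
# for a full image `x` of shape (3, 64, 64).
def setting_1(x, mask):
    center = x[:, mask].reshape(3, 32, 32)
    return center, center                      # center in, center out

def setting_2(x, mask):
    return x, x                                # full image in, full image out

def setting_3(x, mask):
    return x, x[:, mask].reshape(3, 32, 32)    # full image in, center out
```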
A priori, I thought that options 2 and 3 might perform better than option 1 since they might encourage the whole image to look realistic after inpainting.
After running the models, it seems like option 3 doesn’t work very well with this type of architecture. The variable $k_t$ used to balance the losses on real and fake images just keeps decreasing from the beginning, and this behavior doesn’t stop even after tens of thousands of iterations.
I suspect it is because the border is identical in both real and fake images. This might make it very difficult for the discriminator to autoencode fake and real images differently, as it has to figure out that only the center region differs between them. The effect is that the loss on fake images is very similar to the loss on real images (at least at the beginning of training) but, ideally, we want $\mathcal{L}_{fake} = \gamma \, \mathcal{L}_{real}$, with a lower loss on fake images than on real images. Hence, to compensate, $k_t$ has to decrease a lot to make the discriminator place emphasis on fake images and reduce $\mathcal{L}_{fake}$ (recall the discriminator loss is $\mathcal{L}_D = \mathcal{L}_{real} - k_t \, \mathcal{L}_{fake}$).
With this in mind, it is possible that $k_t$ would eventually stabilize if training proceeded for long enough and the loss on fake images decreased to the equilibrium. Also, perhaps it would be more sensible to set $k_t$ to a negative value from the very beginning so the initial iterations wouldn’t be ‘wasted’ as $k_t$ decreases. In these experiments, I always initialized $k_0$ to 0, as done in the paper.
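The dynamics above can be sketched concretely. This is a minimal, hypothetical implementation of one BEGAN balancing step on scalar per-batch losses; `L_real`, `L_fake`, `gamma`, and `lambda_k` follow the roles they play in the paper, and $k_t$ is deliberately left unclipped to mirror the runaway decrease described here:

```python
def began_step(k, L_real, L_fake, gamma=0.5, lambda_k=0.001):
    """One BEGAN training step on scalar losses (illustrative values).

    Returns the discriminator loss, the generator's adversarial loss,
    the updated balance variable k, and the convergence measure M.
    """
    loss_D = L_real - k * L_fake                        # L_D = L_real - k_t * L_fake
    loss_G = L_fake                                     # generator pushes fake loss down
    k_next = k + lambda_k * (gamma * L_real - L_fake)   # balance update
    M = L_real + abs(gamma * L_real - L_fake)           # convergence measure
    return loss_D, loss_G, k_next, M
```

With `L_fake` stuck near `L_real` (as in setting 3 above) and `gamma < 1`, the update term `gamma * L_real - L_fake` stays negative, so `k` drifts downward step after step.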
Options 1 and 2 were both successfully trained with very similar results, both in terms of the ‘convergence measure’ given in the BEGAN paper and a qualitative inspection of the generated samples.
At first, I used batch normalization on every layer aside from the output layer. I later tried training a network without it and, to my surprise, it improved the results a great deal, reducing the generator loss by about 30%.
Honestly, I am not sure why this happened and wasn’t able to diagnose the issue. I had a hunch it might have to do with the fact that I rescaled the images so the pixel values lie between 0 and 1. Then, the inputs do not have 0 mean, but I am unsure how this could affect the performance to that extent.
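For reference, here is the $[0, 1]$ rescaling next to a zero-centered alternative. This is only a sketch of the hunch above (the function names are mine, and whether zero-centering actually fixes the batch-norm interaction is untested):

```python
import numpy as np

def to_unit_range(img_uint8):
    """The scaling I used: uint8 pixels mapped to [0, 1] (mean is not 0)."""
    return img_uint8.astype(np.float32) / 255.0

def to_centered_range(img_uint8):
    """Alternative: uint8 pixels mapped to [-1, 1], roughly zero-centered."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```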
Reconstruction vs. Adversarial Losses ($\lambda$)
The loss of the generator is a combination of reconstruction and adversarial components. In the context encoder paper, the reconstruction loss is the mean squared pixel-wise error and the adversarial loss is the traditional GAN objective, obtained from a discriminator trained to tell generated images from real ones.
To control whether the emphasis is placed on the reconstruction or the adversarial part, they introduce a parameter $\lambda$ and define the total loss to be $\mathcal{L} = \lambda \, \mathcal{L}_{rec} + (1 - \lambda) \, \mathcal{L}_{adv}$.
In the paper, they use a value of $\lambda = 0.999$.
For the model we will be using, since BEGAN uses an autoencoder discriminator, the loss of the discriminator is also a pixel-wise error. This makes it more natural to balance the reconstruction loss and the adversarial loss seeing as they are both in the same ‘units’ as opposed to the previous approach which compares pixel-wise loss to a cross-entropy loss.
I tried a few settings for $\lambda$. As expected, high values of $\lambda$ emphasize the reconstruction loss and result in much blurrier images. So, for the final model, to get better images, I chose a lower value of $\lambda$.
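As a sketch, the mixing of the two scalar losses looks like this (the function name is mine):

```python
def generator_loss(L_rec, L_adv, lam):
    """Total generator loss: lam * L_rec + (1 - lam) * L_adv.

    High lam emphasizes pixel-wise reconstruction (blurrier output);
    lower lam gives the adversarial term more weight.
    """
    return lam * L_rec + (1.0 - lam) * L_adv
```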
Equilibrium Parameter ($\gamma$)
In BEGAN, we must also set the hyperparameter $\gamma$, which controls where the equilibrium between fake and real images should be. In the original paper, the authors mention that lower values of $\gamma$ lead to less diversity but better quality in generated images, suggesting that there is a tradeoff between these two measures mediated by $\gamma$.
For this model, as in the context encoder paper, I did not condition the generator on a noise vector since it supposedly gives better results. So, there is no actual diversity for any given image, making the previous interpretation of $\gamma$ dubious in this context.
Nevertheless, I thought using lower values of $\gamma$ could lead to higher-quality images. I briefly tried a couple of values of $\gamma$, but it didn’t seem to make much of a difference. Further testing would be needed to see the impact of $\gamma$ in this setting.
In the end, after some experimentation with these and other hyperparameters, the final model I trained was as follows:
- The discriminator only reconstructed the center of the image.
- The tradeoff between reconstruction and adversarial losses was set to the lower value of $\lambda$ chosen above.
- No batch normalization was used.
- The equilibrium parameter $\gamma$ was set as discussed above.
- The model was trained using the ADAM optimizer. The initial learning rate was 0.001, annealed by a factor of 0.7 every 10,000 iterations, with a minibatch size of 64.
- ELU activations were used on every layer except the fully-connected layers and the output layer, which had identity activations.
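The step-wise annealing schedule from the list above can be sketched as (hypothetical helper name):

```python
def learning_rate(iteration, base=1e-3, decay=0.7, every=10_000):
    """Start at 1e-3 and multiply by 0.7 every 10,000 iterations."""
    return base * decay ** (iteration // every)
```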
The model architecture:
Generator - Input (3x64x64 Border) - 5x5 Conv (64 filters) - 5x5 Conv (64) (Stride 2) - 3x3 Conv (64) - 3x3 Conv (64) - 3x3 Conv (128) (Stride 2) - 3x3 Conv (128) - 3x3 Conv (128) - 3x3 Conv (128) (Stride 2) - 3x3 Conv (128) - 3x3 Conv (128) - FC (128 units) - FC (4096 units) - Reshape into 64x8x8 - 3x3 Conv (64) - 3x3 Conv (64) - Upsample (Nearest neighbor) - 3x3 Conv (64) - 3x3 Conv (64) - Upsample - 3x3 Conv (64) - 3x3 Conv (64) - 3x3 Conv (3) (Output 3x32x32)
Discriminator - Input (3x32x32 Center) - 3x3 Conv (64 filters) - 3x3 Conv (64) - 3x3 Conv (128) (Stride 2) - 3x3 Conv (128) - 3x3 Conv (128) - 3x3 Conv (128) (Stride 2) - 3x3 Conv (128) - 3x3 Conv (128) - FC (128 units) - FC (4096 units) - Reshape into 64x8x8 - 3x3 Conv (64) - 3x3 Conv (64) - Upsample (Nearest neighbor) - 3x3 Conv (64) - 3x3 Conv (64) - Upsample - 3x3 Conv (64) - 3x3 Conv (64) - 3x3 Conv (3) (Output 3x32x32)
Trained for 180,000 iterations. Here are some sample images (real on left, generated on right):
There is definitely an improvement over the previous models! The images are much sharper although they still don’t look very realistic.
I would say that the model does a decent job of filling in scenery but a poor job when more complicated parts are missing. In those cases, the model sometimes just omits important pieces, e.g., the dock in the fourth example.
The model also seems to have a lot of difficulty filling in people and animals. Oftentimes, only mangled bits appear, or they are filled in with the background, e.g., the man on the horse.
Finally, the next post will incorporate the caption information, and we will see if we can improve on these images. Stay tuned!