## A Look at Generative Adversarial Networks

I have been on hiatus for a couple of weeks, but I have been running some experiments in the meantime, so I have a couple of posts lined up. Initially, I wanted to explore the idea of using the DRAW model. I was thinking of augmenting it by feeding the caption words at each time step during generation, so the model would hopefully learn to draw each word as it is input. Perhaps unsurprisingly, it turns out that this idea has already been explored and given rise to the AlignDRAW model.

In the paper, the authors show that this can be a successful approach to generating images from captions, although it does lead to slightly blurry images. They resort to a post-processing step that sharpens the generated images using a GAN.

Since GANs currently seem to be the best approach for generative models of images, I hopped onto the GAN bandwagon.

For my first attempt with GANs, I tried implementing DCGAN using these tricks and tips. I trained the model on the center 32×32 regions to see if it might work out of the box.

After a few unsuccessful attempts and seeing many brown squares, I looked for some alternatives. And so it BEGAN…

### Boundary Equilibrium Generative Adversarial Network (BEGAN)

BEGAN is a recently proposed GAN variant which promises easier training, without having to carefully balance the progress of the discriminator and generator networks.

It differs from a regular GAN in several respects:

First, instead of a discriminator which tries to output 0 for fake images and 1 for real images, the discriminator is now an autoencoder. Its goal is to reconstruct real images as well as possible while reconstructing fake images as poorly as possible. As in a regular GAN, the generator's goal is to fool the discriminator, here by making it achieve a low autoencoder loss on generated images.

Second, a balancing variable is introduced to automatically keep the generator's and discriminator's strengths in check. This is achieved by dynamically modifying the discriminator's loss during training to put more emphasis on real images or on fake images as required. Concretely, the objectives for the discriminator and generator are, respectively:

$\mathcal{L}_D = \mathcal{L}(x) - k_t \mathcal{L}(G(z_D))$

$\mathcal{L}_G = \mathcal{L}(G(z_G))$

The model tries to maintain an equilibrium between the autoencoder losses on real and fake images:

$\mathbf{E} [ \mathcal{L}(G(z)) ] = \gamma \mathbf{E} [ \mathcal{L}(x) ]$

This is done by maintaining an equilibrium parameter $k_t$ and updating it as follows:

$k_{t+1} = k_t + \lambda (\gamma \mathcal{L}(x) - \mathcal{L}(G(z_G)))$

The authors claim that these changes allow BEGAN to be trained smoothly without many of the usual tricks, such as batch normalization or choosing different activation functions for the discriminator and generator.
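To make the loss balancing above concrete, here is a minimal sketch in plain Python. Scalar values stand in for the per-batch autoencoder reconstruction losses, `began_step` is a hypothetical helper name, and the $[0, 1]$ clamp on $k_t$ follows the BEGAN paper:

```python
def began_step(loss_real, loss_fake_d, loss_fake_g, k, gamma=0.5, lambda_k=0.001):
    """One bookkeeping step of BEGAN's loss balancing.

    loss_real:   L(x),      autoencoder loss on real images
    loss_fake_d: L(G(z_D)), autoencoder loss on fakes seen by the discriminator
    loss_fake_g: L(G(z_G)), autoencoder loss on fakes seen by the generator
    Returns (L_D, L_G, updated k).
    """
    loss_d = loss_real - k * loss_fake_d                  # L_D = L(x) - k_t * L(G(z_D))
    loss_g = loss_fake_g                                  # L_G = L(G(z_G))
    k = k + lambda_k * (gamma * loss_real - loss_fake_g)  # equilibrium update
    k = min(max(k, 0.0), 1.0)                             # keep k_t in [0, 1] (per the paper)
    return loss_d, loss_g, k
```

When the generator's loss exactly matches $\gamma \mathcal{L}(x)$, the update leaves $k_t$ unchanged, which is the equilibrium the model aims for.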

After trying it myself, I have to agree; it seems to just work. The architecture I used is essentially the same as the one presented in the paper, although a bit smaller due to memory constraints. I used a learning rate of 0.0001, $\gamma = 0.5$, the Adam optimizer, ELU activations (except for the fully-connected layers, which are linear), and batch normalization on all layers other than the output.


Generator
- Input (64 dim noise)
- FC (4096 units)
- Reshape into 256x4x4
- 3x3 Conv (64 filters)
- 3x3 Conv (64)
- Upsample (Nearest neighbor)
- 3x3 Conv (64)
- 3x3 Conv (64)
- Upsample
- 3x3 Conv (64)
- 3x3 Conv (64)
- Upsample
- 3x3 Conv (64)
- 3x3 Conv (64)
- 3x3 Locally-connected (Output 3x32x32)
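As a quick sanity check on the generator's shapes, here is a sketch that traces (channels, height, width) through the layers, assuming 'same'-padded 3x3 convolutions (so spatial size is preserved) and nearest-neighbor upsampling that doubles height and width:

```python
def generator_shapes(noise_dim=64):
    """Trace (channels, height, width) through the generator above."""
    trace = [("noise", (noise_dim,)), ("fc", (4096,))]
    trace.append(("reshape", (256, 4, 4)))   # 4096 = 256 * 4 * 4
    c, h, w = 64, 4, 4                       # the 3x3 convs map to 64 channels
    trace.append(("conv_block", (c, h, w)))
    for _ in range(3):                       # three upsample + conv/conv blocks
        h, w = h * 2, w * 2                  # nearest neighbor doubles h and w
        trace.append(("conv_block", (c, h, w)))
    trace.append(("output", (3, h, w)))      # 3x3 locally-connected output
    return trace
```

Three doublings take the 4x4 feature maps to 32x32, matching the 3x32x32 output.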

Discriminator
- Input (3x32x32)
- 3x3 Conv (64 filters)
- 3x3 Conv (64)
- 3x3 Conv (96) (Stride 2)
- 3x3 Conv (96)
- 3x3 Conv (96)
- 3x3 Conv (128) (Stride 2)
- 3x3 Conv (128)
- 3x3 Conv (128)
- 3x3 Conv (192) (Stride 2)
- 3x3 Conv (192)
- 3x3 Conv (192)
- 3x3 Conv (256) (Stride 2)
- 3x3 Conv (256)
- 3x3 Conv (256)
- FC (4096 units)
- Reshape into 256x4x4
- 3x3 Conv (64)
- 3x3 Conv (64)
- Upsample (Nearest neighbor)
- 3x3 Conv (64)
- 3x3 Conv (64)
- Upsample
- 3x3 Conv (64)
- 3x3 Conv (64)
- Upsample
- 3x3 Conv (64)
- 3x3 Conv (64)
- 3x3 Conv (3) (Output 3x32x32)

After a short training run of 30,000 iterations, here are some results:

Although these samples are definitely not realistic by any stretch of the imagination, we can see that the model is making progress. Interestingly, the model seems to be falling into certain modes, with groups of images looking similar to each other. There are also a number of artifacts (black splotches) on the images. Training the model for longer would likely bring better results: the original paper trains for around 200,000 iterations, far more than this quick run.

For the next post, we will be moving on to the main task: inpainting using BEGAN.

## First Attempt: Convolutional Neural Network

After finally getting this blog up, I will be detailing my experiences for the IFT6266 project.

Special thanks to Philip Paquette, whose comments on the student forum were invaluable to get started on Hades, and to Philip Lacaille, whose repo helped me organize my own code.

Without further ado, let’s proceed to the first experiment.

I used a convolutional neural network to predict the missing region directly from the visible pixels on the border. This is the naive approach and will serve as a baseline for comparison with the other models.

The network architecture was inspired by VGGNet, mainly using 3×3 convolutions followed by pooling layers. I also added a locally-connected output layer to give the model more flexibility, and used batch-norm on every layer except the last. The full architecture is as follows:

- Input (3x64x64)
- 3x3 Conv (32 filters)
- 3x3 Conv (32)
- 2x2 MaxPool
- 3x3 Conv (64)
- 3x3 Conv (64)
- 2x2 MaxPool
- 3x3 Conv (64)
- 3x3 Conv (64)
- FC (3072 units)
- Reshape into 3x32x32
- 3x3 Conv (64)
- 3x3 Conv (64)
- 1x1 Locally-connected (Output 3x32x32)
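The shapes above can be walked through quickly, assuming 'same'-padded 3×3 convolutions (spatial size preserved) and 2×2 max-pools that halve it:

```python
h = w = 64                  # 3x64x64 input
h, w = h // 2, w // 2       # first 2x2 MaxPool  -> 32x32
h, w = h // 2, w // 2       # second 2x2 MaxPool -> 16x16
flat = 64 * h * w           # 64 feature maps flattened into the FC layer
out = 3072                  # FC output, reshaped into 3x32x32
assert flat == 16384 and out == 3 * 32 * 32
```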

I also tried another architecture resembling a classic autoencoder. Instead of max pooling, it uses stride-2 convolution layers so the downsampling can be learned, as recommended for GANs. Upsampling is done using nearest neighbor followed by convolutions. This one is as follows:

- Input (3x64x64)
- 3x3 Conv (64 filters)
- 3x3 Conv (64)
- 3x3 ConvStride2 (64)
- 3x3 Conv (64)
- 3x3 Conv (64)
- 3x3 ConvStride2 (128)
- 3x3 Conv (128)
- 3x3 Conv (128)
- 3x3 ConvStride2 (128)
- 3x3 Conv (128)
- 3x3 Conv (128)
- 3x3 ConvStride2 (256)
- 3x3 Conv (256)
- 3x3 Conv (256)
- FC (3072 units)
- Reshape into 192x4x4
- 3x3 Locally-Connected
- 3x3 Conv (256)
- 3x3 Conv (256)
- NN-Upsample (double size)
- 3x3 Conv (256)
- 3x3 Conv (256)
- NN-Upsample (double size)
- 3x3 Conv (128)
- 3x3 Conv (128)
- NN-Upsample (double size)
- 3x3 Conv (96)
- 3x3 Conv (96)
- NN-Upsample (double size)
- 3x3 Conv (64)
- 3x3 Conv (64)
- 1x1 Locally-Connected (Output 3x32x32)
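The nearest-neighbor upsampling used between the decoder's conv blocks simply repeats each pixel along height and width. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def nn_upsample(x, factor=2):
    """Nearest-neighbor upsampling: (channels, h, w) -> (channels, h*factor, w*factor)."""
    x = np.repeat(x, factor, axis=1)  # repeat each row
    x = np.repeat(x, factor, axis=2)  # repeat each column
    return x
```

Each upsample doubles the spatial size, and the following 3×3 convolutions then smooth out the blocky result.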

Some sample images from the validation set:

As expected, the results aren't great, but they do show that the model has been able to learn something. The blurriness of both models is probably due to the pixel-wise reconstruction loss, as discussed in class.

Interestingly, the output of the CNN method is grainier but also seems more detailed than the autoencoder's. I would guess this is due to the tighter bottleneck of the autoencoder and the fact that the CNN does no upsampling. (The autoencoder-style model has far fewer weights, since its fully-connected layer is much smaller: (256*4*4) * 3072 = 4096*3072 weights, versus (64*16*16) * 3072 = 16384*3072 weights in the CNN's FC layer. The saved-weights file is also about 6 times larger for the CNN than for the autoencoder.)
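The FC weight counts quoted above can be checked directly:

```python
cnn_fc_in = 64 * 16 * 16    # CNN: 64 feature maps at 16x16 before the FC
ae_fc_in = 256 * 4 * 4      # autoencoder: 256 feature maps at 4x4
fc_out = 3072               # both FC layers output 3072 units

cnn_fc_weights = cnn_fc_in * fc_out     # 16384 * 3072
ae_fc_weights = ae_fc_in * fc_out       # 4096 * 3072
ratio = cnn_fc_weights / ae_fc_weights  # the CNN's FC layer is 4x larger
```

The FC layers alone differ by a factor of 4; the remaining layers account for the rest of the roughly 6x difference in file size.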

### Next Steps

I stumbled on an interesting paper that tackles the inpainting task. It uses a combination of the regular reconstruction loss and an adversarial loss to train a model to fill in missing regions. I will try to reimplement their model, but first, as a step towards that, I will turn to GANs.