Final Results

This will be the final post for the project. We will recap our approach and present the results obtained.

Goal 

The goal of the project was to investigate the use of captions for image inpainting. Does providing a caption help with this task? Are there large differences between images inpainted using the captions and those inpainted without them?

Approach

Since the region to be inpainted is a large square in the center of the image and we are given the border of the image, the natural way to utilize this information is to use a convolutional neural network to encode the border information.

After some experimentation, the model I settled on was similar to a context encoder but using BEGAN instead of a regular GAN (see previous post for details).

To incorporate information about the captions, I decided to convert them into sentence embeddings (specifically, skip-thought vectors). I chose this method as opposed to, say, a recurrent neural network fed with word embeddings because I thought it would be simpler to incorporate into the model and probably as effective. Also, using an RNN would be more costly memory-wise, thus constraining the size of the rest of the model.

The model utilizing the caption information is almost identical to the one without captions. The sentence embedding is incorporated into the fully-connected layer of the model (in the middle). First, the embedding is fed into a fully-connected layer to reduce its dimensionality from 4800 to 128 and then concatenated to the existing hidden representation of 128 units. I had to reduce the number of filters for some of the other layers and use a smaller minibatch size of 48 to accommodate this extra memory cost.
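To make the wiring concrete, here is a minimal sketch of that bottleneck, assuming a PyTorch-style implementation (the module and argument names are mine, not from the project code, and the project itself may well have used a different framework):

```python
import torch
import torch.nn as nn

class CaptionBottleneck(nn.Module):
    """Fuse a 4800-d skip-thought caption embedding into the 128-d
    bottleneck of the inpainting generator (sizes follow the text above)."""
    def __init__(self, border_dim=128, caption_dim=4800, caption_proj=128, decoder_dim=4096):
        super().__init__()
        self.caption_fc = nn.Linear(caption_dim, caption_proj)      # 4800 -> 128
        self.expand_fc = nn.Linear(border_dim + caption_proj, decoder_dim)

    def forward(self, border_code, caption_embedding):
        cap = self.caption_fc(caption_embedding)         # (N, 128)
        fused = torch.cat([border_code, cap], dim=1)     # (N, 256)
        return self.expand_fc(fused)                     # fed to the decoder

# Shape check with random tensors.
out = CaptionBottleneck()(torch.randn(4, 128), torch.randn(4, 4800))
print(out.shape)  # torch.Size([4, 4096])
```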

Results

After training for 100,000 iterations, here are some results. For a fair comparison, I have also used the no-caption model at 100,000 iterations even though I ran it for longer.
(left: real image, center: no caption, right: with caption)

A white toilet sitting next to a bathroom sink.

A kitchen table is lined with cooking materials.

Flowers are in a vase of water on a tabletop.

The wildebeest herd walked through the zebra herd.

A person walking down a walkway next to a tree.

People sitting on benches next to each other.

A tennis player in white is in action with the ball.

A herd of zebra standing next to two giraffe on a lush green field.

A young boy doing a trick on a skateboard

A young elephant walks near a herd of other elephants.

Disappointingly, the no-caption model seems to be doing significantly better than the with-caption model, which produces very blurry images similar to those of the initial convnet attempt.

I am not too sure why this is the case. Perhaps the BEGAN model has to incorporate the extra information in another way to be effective. It could be related to the phenomenon I observed in the previous post, where using the discriminator to autoencode the whole image made the model perform poorly. I would note that during training the equilibrium variable k_t initially decreased, as I observed when using the full-image discriminator, although, unlike the previous model, it eventually stabilized at around -1.283 rather than decreasing indefinitely. Given more time, the next thing I would try is removing the discriminator’s access to the captions to see if this solves the problem.

We can also take a look at the loss plots from training. Since the actual curves are quite noisy, these are moving averages of the past 150 iterations at each timestep.
One plot is for the no-caption model and the other for the with-caption model (red: discriminator loss, green: generator loss, blue: convergence measure for BEGAN).
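For reference, the smoothing is just a simple moving average over the raw loss values; a minimal sketch, assuming the per-iteration losses are stored in a NumPy array:

```python
import numpy as np

def smooth(losses, window=150):
    """Moving average of the past `window` iterations at each point."""
    kernel = np.ones(window) / window
    return np.convolve(losses, kernel, mode='valid')
```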

It seems that, right from the start, the no-caption model does better, with its generator loss quickly dropping below 0.04 while the with-caption model stays stuck above that mark.
It is possible that with longer training the with-caption model would eventually catch up and only its early performance suffers. This doesn’t seem likely, though, since the error plots show the losses flattening out.

In conclusion, for this type of model, it seems that conditioning on caption information makes the performance significantly worse. Further investigation would definitely be required. I incorporated the caption information in a simple manner which may not be effective for BEGAN.

Closing Thoughts

As I started off without any prior experience with deep neural networks, this project has definitely taught me a great deal about the nuts and bolts of deep learning and how to make them work in practice. I found the project topic to be very interesting and challenging. I would have liked to try out other models such as the WGAN and DRAW, but, overall, it was a fun project.

Inpainting with BEGAN

In the first post, I referred to the paper on context encoders. In part of the paper, the authors tackle the same problem as this project: Inpainting the missing center of an image.

Their approach consists of using an autoencoder-like architecture to generate the center region from the border, similar to my first attempt for this project. The difference is that their loss incorporates both a reconstruction loss and an adversarial component. The reconstruction loss is the basic pixel-wise squared error while the adversarial loss is obtained by using a discriminator network to differentiate generated images from real ones (as in GANs).
The adversarial component reduces the blurriness in the inpainted region and, overall, produces more realistic images than those obtained with only the reconstruction loss.

I adopted the same idea with a slight variation; I used an autoencoder as the discriminator as done in BEGANs. Now, onto the details.

Discriminator

In the context encoder paper, the authors feed the border of the image into a generator and it outputs the center region. The discriminator then looks at the whole image (border + generated center) and decides if it is real or fake.
For my experiments, I tried out three different settings for the discriminator (which is an autoencoder); a rough code sketch of the three variants follows the list:
1. Reconstruct only the center region.
2. Reconstruct the full image (border + center).
3. Reconstruct the center from the full image. ie. Input the whole image but only output the center.
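To make the three settings concrete, here is an illustrative sketch of the corresponding autoencoder losses (PyTorch-style, with an L1 pixel loss as in the BEGAN paper; the discriminator `D`, the slicing, and the function name are assumptions, not taken from the project code):

```python
import torch.nn.functional as F

def disc_autoencoder_loss(option, D, image, center=slice(16, 48)):
    """Pixel-wise reconstruction loss of the autoencoder discriminator `D`
    on a batch of 3x64x64 images, for the three settings above."""
    ctr = image[:, :, center, center]                 # 3x32x32 center crop
    if option == 1:                                   # center in, center out
        return F.l1_loss(D(ctr), ctr)
    if option == 2:                                   # full image in and out
        return F.l1_loss(D(image), image)
    if option == 3:                                   # full image in, center out
        return F.l1_loss(D(image), ctr)
    raise ValueError("option must be 1, 2 or 3")
```

Note that each option implies a different input/output size for the discriminator, so in practice a separate network would be built for each setting.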

A priori, I thought that options 2 and 3 might perform better than option 1 since they might encourage the whole image to look realistic after inpainting.

After running the models, it seems that option 2 doesn’t work very well with this type of architecture. The k variable used to balance the losses on real and fake images just decreases from the beginning, and this behavior doesn’t stop even after tens of thousands of iterations.
I suspect it is because the border is identical in both real and fake images. This might make it very difficult for the discriminator to autoencode fake and real images differently as it has to figure out that only the center region differs in the images. The effect is that the loss on fake images is very similar to the loss on real images (at least at the beginning of training) but, ideally, we want \mathbb{E} [ \mathcal{L}(G(x_{border})) ] = \gamma \mathbb{E} [ \mathcal{L}(x) ], 0< \gamma < 1, with a lower loss on fake images than real images. Hence, to compensate, k has to decrease a lot to make the discriminator place emphasis on fake images and reduce \mathbb{E} [ \mathcal{L}(G(x_{border})) ] (Recall the discriminator loss is \mathcal{L}_D = \mathcal{L}(x) - k_t \mathcal{L}(G(x_{border}))).
With this in mind, it is possible that k would eventually stabilize if training proceeded for long enough and the loss on fake images decreased to the equilibrium. Also, perhaps it would be more sensible to set k to a negative value from the very beginning so the initial iterations wouldn’t be ‘wasted’ as k decreases. In these experiments, I always initialized k to 0, as done in the paper.

Options 1 and 3 were both successfully trained with very similar results, both in terms of the ‘convergence measure’ given in the BEGAN paper and a qualitative inspection of the generated samples.

Batch Normalization

At first, I used batch normalization on every layer aside from the output layer. I later tried training a network without it and, to my surprise, it improved the results a great deal, reducing the generator loss by about 30%.
Honestly, I am not sure why this happened and wasn’t able to diagnose the issue. I had a hunch it might have to do with the fact that I rescaled the images so the pixel values lie between 0 and 1. Then, the inputs do not have 0 mean, but I am unsure how this could affect the performance to that extent.

Reconstruction vs. Adversarial Losses (\lambda)

The loss of the generator is a combination of reconstruction and adversarial components. In the context encoder paper, the reconstruction loss is the mean squared pixel-wise error and the adversarial loss is the traditional GAN loss \mathcal{L}_{adv} = \max_D \mathbb{E} [\log(D(x)) + \log(1 - D(G(x_{border})))].

To control whether the emphasis is placed on the reconstruction or the adversarial part, they introduce a \lambda parameter and define the total loss to be \mathcal{L}_{total} = \lambda \mathcal{L}_{rec} + (1-\lambda) \mathcal{L}_{adv}.
In the paper, they use a value of 0.999.

For the model we will be using, since BEGAN uses an autoencoder discriminator, the loss of the discriminator is also a pixel-wise error. This makes it more natural to balance the reconstruction loss and the adversarial loss seeing as they are both in the same ‘units’ as opposed to the previous approach which compares pixel-wise loss to a cross-entropy loss.

I tried a few settings, with \lambda \in [0.1, 0.9]. As expected, high values of \lambda emphasize the reconstruction loss and result in much blurrier images. So, for the final model, to get better images, I chose \lambda = 0.1.
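As a concrete sketch, the generator’s objective then looks roughly like the following (PyTorch-style; the squared-error reconstruction term follows the context encoder paper, the adversarial term is the discriminator-autoencoder’s pixel loss on the generated center as in BEGAN, and the function name, arguments, and choice of L1 for the autoencoder loss are my assumptions):

```python
import torch.nn.functional as F

def generator_objective(D, real_center, fake_center, lam=0.1):
    """lam * reconstruction + (1 - lam) * adversarial, where the adversarial
    part is D's autoencoder loss on the generated center (BEGAN-style)."""
    rec = F.mse_loss(fake_center, real_center)        # pixel-wise squared error
    adv = F.l1_loss(D(fake_center), fake_center)      # G wants D to reconstruct fakes well
    return lam * rec + (1.0 - lam) * adv
```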

Equilibrium Parameter (\gamma)

In BEGAN, we must also set the hyperparameter \gamma, which controls where the equilibrium between fake and real images should be. In the original paper, the authors mention that lower values of \gamma lead to less diversity but better quality in generated images, suggesting that there is a tradeoff between these two measures mediated by \gamma.

For this model, as in the context encoder paper, I did not condition the generator on a noise vector since it supposedly gives better results. So, there is no actual diversity for any given image, making the previous interpretation of \gamma dubious in this context.

Nevertheless, I thought using lower values of \gamma could lead to higher quality images. I tried out \gamma = 0.5 and \gamma = 0.3 briefly but it didn’t seem to make much of a difference. Further testing would have to be done to see the impact of \gamma in this setting.

Results

In the end, after some experimentation with these and other hyperparameters, the final model I trained was as follows:

  • The discriminator only reconstructed the center of the image.
  • The tradeoff between reconstruction and adversarial losses was set to \lambda = 0.1.
  • No batch normalization was used.
  • The equilibrium parameter was set to \gamma = 0.3.
  • The model was trained using the ADAM optimizer with \beta_1 = 0.5. The initial learning rate was 0.001 and was annealed by multiplying it by 0.7 every 10,000 iterations. The minibatch size was 64. (A short sketch of this setup follows the list.)
  • ELU activations were used on every layer except the fully-connected layers and the output, which had identity activations.
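A minimal sketch of the optimization setup above (PyTorch-style and illustrative only; \beta_2 is not stated in the post, so the Adam default is assumed):

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def make_optimizer(model: nn.Module):
    """Adam with beta1 = 0.5, initial LR 0.001, multiplied by 0.7 every
    10,000 iterations (call scheduler.step() once per training iteration).
    Note: StepLR's `gamma` is the LR decay factor, unrelated to BEGAN's gamma."""
    optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.999))
    scheduler = StepLR(optimizer, step_size=10_000, gamma=0.7)
    return optimizer, scheduler
```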

The model architecture (a code sketch of the recurring decoder block follows the listings):

 Generator
 - Input (3x64x64 Border)
 - 5x5 Conv (64 filters) 
 - 5x5 Conv (64) (Stride 2)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 Conv (128) (Stride 2)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - 3x3 Conv (128) (Stride 2)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - FC (128 units)
 - FC (4096 units)
 - Reshape into 64x8x8
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample (Nearest neighbor)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 Conv (3) (Output 3x32x32)
 Discriminator
 - Input (3x32x32 Center)
 - 3x3 Conv (64 filters)
 - 3x3 Conv (64)
 - 3x3 Conv (128) (Stride 2)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - 3x3 Conv (128) (Stride 2)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - FC (128 units)
 - FC (4096 units) 
 - Reshape into 64x8x8
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample (Nearest neighbor)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 Conv (3) (Output 3x32x32)
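Both decoders above repeat the same motif of two 3x3 convolutions followed by a nearest-neighbor upsample. A sketch of that block (PyTorch-style, with ELU activations and padding chosen to preserve spatial size; illustrative only, not the project’s actual code):

```python
import torch.nn as nn

def decoder_block(in_ch, out_ch):
    """Two 3x3 convs (ELU) followed by a 2x nearest-neighbor upsample,
    the repeated pattern in the generator and discriminator decoders above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ELU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ELU(),
        nn.Upsample(scale_factor=2, mode='nearest'),
    )
```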

The model was trained for 180,000 iterations. Here are some sample images (real on left, generated on right):

There is definitely an improvement over the previous models! The images are much sharper although they still don’t look very realistic.

I would say that the model does a decent job at filling in scenery but a poor one when more complicated parts are missing. As such, it sometimes just omits important pieces, e.g. the dock in the fourth example.
The model also seems to have a lot of difficulty filling in people and animals. Oftentimes, only mangled bits appear or they are filled in with the background, e.g. the man on the horse.

Finally, the next post will incorporate the caption information and see whether we can improve on these images. Stay tuned!

A Look at Generative Adversarial Networks

I have been on hiatus for a couple of weeks, but I have been running some experiments in the meantime so I have a couple of posts lined up. Initially, I wanted to explore the idea of using the DRAW model. I was thinking of augmenting it by feeding the caption words at each time step during the generation, so the model would hopefully learn to draw each word as it is inputted. Perhaps unsurprisingly, it turns out that this idea has already been explored and given rise to the AlignDRAW model.

In the paper, the authors show that it can be a successful approach to generating images from captions although it does lead to slightly blurry images. They resort to a post-processing step to sharpen the generated images using a GAN.

Since it seemed like GANs are currently the best approach for generative models of images, I hopped onto the GAN bandwagon.

For my first attempt with GANs, I tried implementing the DCGAN using these tricks and tips. I trained the model on the center 32×32 images to see if it might work out-of-the-box.

After a few unsuccessful attempts and seeing many brown squares, I looked for some alternatives. And so it BEGAN…

Boundary Equilibrium Generative Adversarial Networks (BEGAN)

BEGAN is a recently proposed GAN variant which promises easier training without having to carefully balance the discriminator and generator networks’ progresses.

It differs from a regular GAN in several respects:

First, instead of using a discriminator that tries to output 0 for fake images and 1 for real images, the discriminator is now an autoencoder. Its goal is to reconstruct real images as well as possible but fake images as poorly as possible. As in regular GANs, the generator’s goal is to fool the discriminator, here into having a low autoencoder loss on generated images.

Second, a balancing variable is introduced to automatically keep the generator’s and discriminator’s strengths in check. This is achieved by modifying the discriminator’s loss dynamically during training to put more emphasis on real images or on fake images as required. Concretely, the objectives for the discriminator and generator respectively are:

\mathcal{L}_D = \mathcal{L}(x) - k_t \mathcal{L}(G(z_D))

\mathcal{L}_G = \mathcal{L}(G(z_G))

The model tries to maintain an equilibrium between the autoencoder losses on real and fake images:

\mathbb{E} [ \mathcal{L}(G(z)) ] = \gamma \mathbb{E} [ \mathcal{L}(x) ]

This is done by updating the equilibrium variable k_t at every step as follows:

k_{t+1} = k_t + \lambda (\gamma \mathcal{L}(x) - \mathcal{L}(G(z_G)))
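In code, one step of this scheme might look roughly as follows (a sketch only, with the distinction between z_D and z_G dropped for brevity; the proportional gain corresponds to the \lambda in the update rule above, which the BEGAN paper sets to 0.001; the paper also clips k_t to [0, 1], which, judging from the negative k values reported in a later post, was apparently not done in these experiments):

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """Compute the BEGAN objectives and the updated equilibrium variable:
    L_D = L(x) - k * L(G(z)),  L_G = L(G(z)),
    k  <- k + lambda_k * (gamma * L(x) - L(G(z)))."""
    loss_D = loss_real - k * loss_fake
    loss_G = loss_fake
    new_k = k + lambda_k * (gamma * loss_real - loss_fake)
    return loss_D, loss_G, new_k
```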

The authors claim that these changes allow BEGAN to be trained smoothly without many tricks, such as batch-norm or choosing different activation functions for the discriminator and generator.

After trying it myself, I have to agree; it seems to just work. The architecture I used is essentially the same as the one presented in the paper although it is a bit smaller due to memory constraints. I used a learning rate of 0.0001, \gamma = 0.5, the ADAM optimizer, ELU activations (except fully-connected layers which have linear activations) and batch normalization on all the layers other than the output.

 
Generator
 - Input (64 dim noise)
 - FC (4096 units)
 - Reshape into 256x4x4
 - 3x3 Conv (64 filters)
 - 3x3 Conv (64)
 - Upsample (Nearest neighbor)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 Locally-connected (Output 3x32x32)
 
Discriminator
 - Input (3x32x32)
 - 3x3 Conv (64 filters)
 - 3x3 Conv (64)
 - 3x3 Conv (96) (Stride 2)
 - 3x3 Conv (96)
 - 3x3 Conv (96)
 - 3x3 Conv (128) (Stride 2)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - 3x3 Conv (192) (Stride 2)
 - 3x3 Conv (192)
 - 3x3 Conv (192)
 - 3x3 Conv (256) (Stride 2)
 - 3x3 Conv (256)
 - 3x3 Conv (256)
 - FC (4096 units) 
 - Reshape into 256x4x4 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample (Nearest neighbor)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - Upsample 
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 Conv (3) (Output 3x32x32)

After a short training run of 30,000 iterations, here are some results:

Although these samples are definitely not realistic by any stretch of the imagination, we can see that the model is making progress. It is a bit interesting that the model seems to be falling into certain modes, with groups of images looking similar to each other. Also, there are a number of artifacts on the images (black splotches). Training the model for a longer period of time would definitely bring better results. In the original paper, they trained for around 200,000 iterations, much more than this quick run.

For the next post, we will be moving on to the main task: inpainting using BEGAN.

 

First Attempt: Convolutional Neural Network

After finally getting this blog up, I will be detailing my experiences for the IFT6266 project.

Special thanks to Philip Paquette, whose comments on the student forum were invaluable to get started on Hades, and to Philip Lacaille, whose repo helped me organize my own code.

Without further ado, let’s proceed to the first experiment.

I used a convolutional neural network to predict the missing region directly from the visible pixels on the border. This is the naive approach and will serve as a baseline to compare the other models to.

The network architecture was inspired by VGGNet, mainly using 3×3 convolutions followed by pooling layers. I also added a locally connected layer as the output to give the model more flexibility (a sketch of this layer follows the first listing), and I used batch-norm on every layer except the last. The full architecture is as follows:

 - Input (3x64x64)
 - 3x3 Conv (32 filters)
 - 3x3 Conv (32)
 - 2x2 MaxPool
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 2x2 MaxPool
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - FC (3072 units)
 - Reshape into 3x32x32
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 1x1 Locally-connected (Output 3x32x32)
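Since a “1x1 locally-connected” layer is not a standard building block in most frameworks, here is a minimal sketch of what is meant, assuming PyTorch (the class is illustrative, not the project’s actual implementation): like a 1x1 convolution, but with separate channel-mixing weights at every spatial position.

```python
import torch
import torch.nn as nn

class LocallyConnected1x1(nn.Module):
    """1x1 'convolution' without weight sharing: each pixel gets its own
    channel-mixing weights and bias."""
    def __init__(self, in_ch, out_ch, height, width):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_ch, in_ch, height, width))
        self.bias = nn.Parameter(torch.zeros(out_ch, height, width))

    def forward(self, x):  # x: (N, in_ch, H, W)
        return torch.einsum('nihw,oihw->nohw', x, self.weight) + self.bias

out = LocallyConnected1x1(64, 3, 32, 32)(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 3, 32, 32])
```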

I also tried another architecture resembling a classic autoencoder. Instead of max pooling, it uses convolution layers with stride 2 so that the downsampling can be learned, as recommended for GANs. Upsampling is done using nearest-neighbor interpolation followed by convolutions. This one is as follows:

 - Input (3x64x64)
 - 3x3 Conv (64 filters)
 - 3x3 Conv (64)
 - 3x3 ConvStride2 (64)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 3x3 ConvStride2 (128)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - 3x3 ConvStride2 (128)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - 3x3 ConvStride2 (256)
 - 3x3 Conv (256)
 - 3x3 Conv (256)
 - FC (3072 units)
 - Reshape into 192x4x4
 - 3x3 Locally-Connected
 - 3x3 Conv (256)
 - 3x3 Conv (256)
 - NN-Upsample (double size)
 - 3x3 Conv (256)
 - 3x3 Conv (256)
 - NN-Upsample (double size)
 - 3x3 Conv (128)
 - 3x3 Conv (128)
 - NN-Upsample (double size) 
 - 3x3 Conv (96)
 - 3x3 Conv (96)
 - NN-Upsample (double size)
 - 3x3 Conv (64)
 - 3x3 Conv (64)
 - 1x1 Locally-Connected (Output 3x32x32)

Some sample images from the validation set:

As expected, the results aren’t great, but they do show that the model has been able to learn something. The blurriness of both models’ outputs is probably due to the pixel-wise reconstruction loss, as discussed in class.

Interestingly, the output of the CNN model is grainier but also seems to be more detailed than the autoencoder’s. I would guess that this happens due to the tighter bottleneck of the autoencoder and the fact that the CNN does no upsampling. (There are far fewer weights in the autoencoder-style model since its fully-connected layer is much smaller, with (256*4*4) * 3072 = 4096*3072 weights, as opposed to the CNN model, which has (64*16*16)*3072 = 16384*3072 weights in its FC layer. Also, the file of saved weights is about 6 times larger for the CNN than for the autoencoder.)

Next Steps

I stumbled on an interesting paper which tackles the inpainting task. It utilizes a combination of the regular reconstruction loss and an adversarial loss to train a model to fill in missing regions. I will try to reimplement their model. But first, as a step towards that, I will turn to GANs next.