This will be the final post for the project. We will recap our approach and present the results obtained.
The goal of the project was to investigate the use of captions for image inpainting. Does providing a caption help for this task? Are there large differences between images filled with captions and those filled using no captions?
Since the region to be inpainted is a large square in the center of the image and we are given the border of the image, the natural way to utilize this information is to use a convolutional neural network to encode the border information.
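To make the setup concrete, here is a minimal sketch of the masking step, assuming a centered square hole whose side is half the image side (the post's actual sizes may differ):

```python
import numpy as np

def mask_center(images, hole_frac=0.5):
    """Zero out a centered square hole and return (masked images, hole contents).

    `hole_frac` is an assumed hole-to-image side ratio (e.g. a 32x32 hole
    in a 64x64 image). images: array of shape (batch, height, width, channels).
    """
    _, h, w, _ = images.shape
    hh, hw = int(h * hole_frac), int(w * hole_frac)
    top, left = (h - hh) // 2, (w - hw) // 2
    masked = images.copy()
    # Keep the ground-truth hole around as the reconstruction target.
    hole = masked[:, top:top + hh, left:left + hw, :].copy()
    masked[:, top:top + hh, left:left + hw, :] = 0.0
    return masked, hole
```

The border (everything outside the hole) is what the convolutional encoder sees; the hole contents are what the generator must reconstruct.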
After some experimentation, the model I settled on was similar to a context encoder but using BEGAN instead of a regular GAN (see previous post for details).
To incorporate information about the captions, I decided to convert them into sentence embeddings (specifically, skip-thought vectors). I chose this method over, say, a recurrent neural network fed with word embeddings because I thought it would be simpler to incorporate into the model and probably as effective. Also, an RNN would be more costly memory-wise, constraining the size of the rest of the model.
The model utilizing the caption information is almost identical to the one without captions. The sentence embedding is incorporated at the fully-connected bottleneck in the middle of the model: the embedding is first fed through a fully-connected layer to reduce its dimensionality from 4800 to 128, and the result is concatenated with the existing 128-unit hidden representation. I had to reduce the number of filters in some of the other layers and use a smaller minibatch size of 48 to accommodate the extra memory cost.
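The conditioning step can be sketched as follows in plain numpy (the projection weights and ReLU activation are assumptions; only the dimensions 4800, 128, and the minibatch size of 48 come from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_caption(hidden, caption_embedding, W, b):
    """Project the 4800-d skip-thought vector down to 128 units (ReLU assumed)
    and concatenate with the 128-unit hidden code, giving a 256-unit code."""
    caption_code = np.maximum(caption_embedding @ W + b, 0.0)
    return np.concatenate([hidden, caption_code], axis=1)

# Shapes from the post: minibatch of 48, 4800-d embeddings, 128-unit bottleneck.
W = rng.standard_normal((4800, 128)) * 0.01
b = np.zeros(128)
hidden = rng.standard_normal((48, 128))
captions = rng.standard_normal((48, 4800))
code = condition_on_caption(hidden, captions, W, b)  # shape (48, 256)
```

The decoder then consumes the 256-unit code in place of the original 128-unit one.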
After training for 100,000 iterations, here are some results. For a fair comparison, I have also used the no-caption model at 100,000 iterations even though I ran it for longer.
(left: real image, center: no caption, right: with caption)
A white toilet sitting next to a bathroom sink.
A kitchen table is lined with cooking materials.
Flowers are in a vase of water on a tabletop.
The wildebeest herd walked through the zebra herd.
A person walking down a walkway next to a tree.
People sitting on benches next to each other.
A tennis player in white is in action with the ball.
A herd of zebra standing next to two giraffe on a lush green field.
A young boy doing a trick on a skateboard
A young elephant walks near a herd of other elephants.
Disappointingly, the no-caption model does significantly better than the with-caption model, which produces very blurry images, similar to the initial convnet attempt.
I am not too sure why this is the case. Perhaps the BEGAN model has to incorporate the extra information in another way to be effective. It could be related to the phenomenon I observed in the previous post, where using the discriminator to autoencode the whole image made the model perform poorly. During training, the equilibrium variable initially decreased, as it did with the full-image discriminator, although, unlike that model, it eventually stabilized at around -1.283 rather than decreasing indefinitely. Given more time, the next thing I would try is removing the discriminator's access to the captions to see if that solves the problem.
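For reference, the equilibrium variable mentioned above is the k term from the standard BEGAN update, sketched below (the hyperparameters gamma and lambda_k are assumptions, not values from the post; k is left unclipped here since the post's value drifted below zero, whereas the original BEGAN paper clamps it to [0, 1]):

```python
def began_update(k, d_loss_real, d_loss_fake, gamma=0.5, lambda_k=0.001):
    """One step of the BEGAN equilibrium update.

    k weights the fake term in the discriminator loss; `convergence` is the
    measure M = L(x) + |gamma * L(x) - L(G(z))| tracked during training.
    """
    balance = gamma * d_loss_real - d_loss_fake
    k = k + lambda_k * balance
    convergence = d_loss_real + abs(balance)
    return k, convergence
```

A persistently negative balance drives k downward, which is consistent with the drift observed here.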
We can also take a look at the loss plots from training. Since the actual curves are quite noisy, these are moving averages of the past 150 iterations at each timestep.
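The smoothing used for the plots is a trailing moving average, which can be sketched as:

```python
import numpy as np

def moving_average(values, window=150):
    """Average of the past `window` values at each timestep (shorter at the
    start, where fewer than `window` values exist), matching the smoothing
    described above."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for t in range(len(values)):
        lo = max(0, t - window + 1)
        out[t] = values[lo:t + 1].mean()
    return out
```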
The left plot is for the no-caption model and the right is for the
with-caption model. (red: discriminator loss, green: generator loss, blue: convergence measure for BEGAN)
Right from the start, the no-caption model does better: its generator quickly drops below a loss of 0.04, while the with-caption model stays stuck above that mark.
It is possible that with longer training the performance would eventually catch up and only the early performance suffers. This seems unlikely, though, since the loss plots show the curves flattening out.
In conclusion, for this type of model, it seems that conditioning on caption information makes the performance significantly worse. Further investigation would definitely be required. I incorporated the caption information in a simple manner which may not be effective for BEGAN.
As I started off without any prior experience with deep neural networks, this project has definitely taught me a great deal about the nuts and bolts of deep learning and how to make them work in practice. I found the project topic to be very interesting and challenging. I would have liked to try out other models such as the WGAN and DRAW, but, overall, it was a fun project.