BERT, a neural network published by Google in 2018, excels in natural language understanding.
It can be used for multiple different tasks, such as sentiment analysis or next sentence prediction, and has recently been
integrated into Google Search. This novel model has brought a big change to language modeling, as it outperformed all of its predecessors across a wide range of tasks. Whenever such breakthroughs in deep learning happen, people wonder how the
network manages to achieve such impressive results, and what it actually learned. A common way of looking into neural
networks is feature visualization. The ideas of feature visualization are borrowed from Deep Dream, where we can obtain
inputs that excite the network by maximizing the activation of neurons, channels, or layers of the network. This way, we
get an idea about which part of the network is looking for what kind of input.
In Deep Dream, inputs are changed through gradient descent to maximize activation values. This can be thought of as
similar to the initial training process: over many iterations, we optimize an objective, but instead of updating the network parameters, Deep Dream updates the input sample. This leads to somewhat psychedelic but very interesting images that reveal what kind of input these neurons react to. The original Deep Dream blogpost shows examples of this process, where a randomly initialized image is transformed by maximizing the activation of the output neuron corresponding to a chosen class. This can show what a network has learned about different classes or about individual neurons.
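As a rough sketch of this optimization loop (the model, target class, learning rate, and step count below are illustrative assumptions, not the settings used in the original blogpost), one can start from random noise and repeatedly nudge the image so that a chosen output neuron's activation grows:

```python
import torch
import torchvision.models as models

# Illustrative setup: any pretrained image classifier works as a stand-in.
model = models.resnet18(pretrained=True).eval()
target_class = 130  # hypothetical index of the class we want the image to excite

# Start from a random image and optimize the *input*, not the network weights.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    # Gradient ascent on the target activation, written as descent on its negative.
    loss = -logits[0, target_class]
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)  # keep pixel values in a valid range
```

The same loop can target a hidden neuron, a channel, or a whole layer instead of an output class by swapping out the activation that is maximized.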
Feature visualization works well for image-based models, but has not yet been widely explored for language models. This
blogpost will guide you through experiments we conducted with feature visualization for BERT. We show how we tried to get BERT to dream of highly
activating inputs, provide visual insights into why this did not work out as well as we hoped, and publish tools to
explore this research direction further. When dreaming for images, the input to the model is gradually changed. Language,
however, is made of discrete structures, i.e., tokens, which represent words or word pieces. Thus, there is no such gradual change to be made. For a single pixel in an input image, such a change could be a gradual shift from green to red: the green value would slowly go down while the red value would increase. In language, however, we cannot slowly go from the word “green” to the word “red”, as everything in between does not make sense. To still be able to use Deep Dream, we have to utilize the so-called Gumbel-Softmax trick, which has already been employed in a paper by Poerner et al. (2018). The trick itself was introduced by Jang et al. and Maddison et al. It allows us to soften the requirement for discrete inputs and instead use a linear combination of tokens as input to the model. To ensure that we do not end up with an arbitrary input far from any real token, it uses two mechanisms.
First, it constrains this linear combination so that the linear weights sum up to one. This, however, still leaves the
problem that we can end up with any linear combination of such tokens, including ones that are not close to real tokens in
the embedding space. The second mechanism addresses this: a temperature parameter controls the sparsity of this linear
combination. By slowly decreasing this temperature value, we can make the model first explore different linear combinations
of tokens, before deciding on one token.
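To make this concrete, here is a minimal sketch of the relaxation (it leaves out the Gumbel noise term of the full trick, and the embedding matrix, probe, and annealing schedule are illustrative stand-ins rather than our actual setup): instead of a hard one-hot token, the model receives a temperature-controlled softmax over the vocabulary, and we optimize the underlying logits.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 30522, 768  # BERT-base vocabulary and embedding size

# Stand-ins for BERT's word-embedding matrix and for the neuron whose
# activation we want to maximize (in the real experiment, an activation
# somewhere inside BERT).
embedding_matrix = torch.randn(vocab_size, hidden_dim)
probe = torch.randn(hidden_dim)

# Trainable logits for a single input position; their softmax is the
# "linear combination of tokens" described above.
token_logits = torch.randn(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([token_logits], lr=0.1)

temperature = 2.0
for step in range(500):
    optimizer.zero_grad()
    # The weights are non-negative and sum to one, so the input stays a
    # convex combination of real token embeddings.
    weights = F.softmax(token_logits / temperature, dim=-1)
    soft_embedding = weights @ embedding_matrix
    # Maximize the probed activation (descent on its negative).
    loss = -(soft_embedding @ probe)
    loss.backward()
    optimizer.step()
    # Slowly lower the temperature so the softmax becomes sparser and the
    # input converges toward a single real token.
    temperature = max(0.1, temperature * 0.995)

# After annealing, the highest-weighted token is the "dreamed" input.
dreamed_token_id = token_logits.argmax().item()
```

Lowering the temperature over the course of the optimization mirrors the behaviour described above: at high temperature the softmax spreads its weight over many tokens and can explore, while at low temperature it concentrates on a single token.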
The lack of success in dreaming up words that highly activate specific neurons was surprising to us. This method uses gradient descent and seemed to work for other models (see Poerner et al. 2018). However, BERT is a complex model, arguably much more complex than the models that have been
previously investigated with this method.