BigGAN (Link Bibliography)

BigGAN links:

  1. Faces

  2. #danbooru2019e621-256px-biggan

  3. ⁠, Nearcyan, Aydao, Shawn Presser⁠, Gwern Branwen (Tensorfork) (2021-01-19):

    [Website demonstrating samples from a modified trained on Danbooru2019 using TPUs for ~5m iterations for ~2 months on a pod; this modified ‘StyleGAN2-ext’, removes various regularizations which make StyleGAN2 data-efficient on datasets like ⁠, but hobble its ability to model complicated images, and scales the model up >2×. This is surprisingly effective given StyleGAN’s previous inability to approach BigGAN’s Danbooru2019, and TADNE shows off the entertaining results.

    The interface reuses Said Achmiz’s These Waifus Do Not Exist grid UI.

    Writeup⁠; see also: Colab notebook to search by CLIP embedding; ⁠/​​​​⁠/​​​​⁠, TADNE face editing⁠, =guided ponies]

    Screenshot of “This Anime Does Not Exist” infinite-scroll website.
  4. ⁠, Andrew Brock, Jeff Donahue, Karen Simonyan (2018-09-28):

    Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple “truncation trick,” allowing fine control over the trade-off between sample fidelity and variety by reducing the of the Generator’s input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance () of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.


  6. ⁠, Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena (2018-05-21):

    In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.

  7. ⁠, Andrew Brock, Theodore Lim, J. M. Ritchie, Nick Weston (2016-09-22):

    The increasingly photorealistic sample quality of generative image models suggests their feasibility in applications beyond image generation. We present the Neural Photo Editor, an interface that leverages the power of generative neural networks to make large, semantically coherent changes to existing images. To tackle the challenge of achieving accurate reconstructions without loss of feature quality, we introduce the Introspective Adversarial Network, a novel of the VAE and GAN. Our model efficiently captures long-range dependencies through use of a computational block based on weight-shared dilated convolutions, and improves generalization performance with Orthogonal Regularization, a novel weight regularization method. We validate our contributions on CelebA⁠, SVHN, and CIFAR-100, and produce samples and reconstructions with high visual fidelity.

  8. ⁠, Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, Vijay Chandrasekhar (2018-06-12):

    We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the—to our knowledge—first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets—mixture of Gaussians, CIFAR-10, STL-10⁠, CelebA and ImageNet—to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images. The code is available at https:    /​ ​​ ​​ ​​ ​    /​ ​​ ​​ ​​ ​    /​ ​​ ​​ ​​ ​yasinyazici    /​ ​​ ​​ ​​ ​EMA_GAN⁠.

  9. ⁠, Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, Simon Lacoste-Julien (2018-02-28):

    Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method () and Adam.


  11. Faces#psitruncation-trick

  12. ⁠, Sam McCandlish, Jared Kaplan, Dario Amodei () (2018-12-14):

    that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized.

    In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.

    The gradient noise scale (appropriately averaged over training) explains the vast majority (R2 = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games).

    …We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data…We’ve found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.

    Increasing parallelism makes it possible to train more complex models in a reasonable amount of time. We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales.

    …more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there’s some evidence that the increasing noise scale over training isn’t just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable. Second, tasks that are subjectively more difficult are also more amenable to parallelization…we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute.


  14. #biggan-imagenet-danbooru2018-1k

  15. #biggan-256px-danbooru2018-1k





  20. {#linkBibliography-(deoldify)-et-al-2019 .docMetadata}, Jason Antic (Deoldify), Jeremy Howard (, Uri Manor (Salk Institute) (2019-05-03):

    Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial models (GANs). However, GANs are notoriously difficult to train, due to requiring a large amount of data, needing many GPUs and a lot of time to train, and being highly sensitive to minor hyperparameter changes. has been working in recent years towards making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 worked on a transfer learning technique for generative modeling⁠.

    The pre-trained model that selected was this: Start with an image dataset and “crappify” the images, such as reducing the resolution, adding jpeg artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. started with a model that was pre-trained for ImageNet classification, and added a U-Net upsampling network, adding various modern tweaks to the regular U-Net. A simple fast was initially used: mean squared pixel error. This could be trained in just a few minutes. Then, the loss function was replaced was a combination of other loss functions used in the generative modeling literature (more details in the f8 video) and trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs—however it turned out that the results were so good that ended up not using a GAN for the final models….

    NoGAN Training: NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.

    Here is the NoGAN training process:

    1. Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner—with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at the at this stage, which is very important.

    2. Save Generated Images From Pretrained Generator.

    3. Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from that of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.

    4. Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.

    In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30–90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2–4 until there’s no more apparent benefit (around five repeats).

    Note: a small but important change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.

  21. #brock-et-al-2018


  23. ⁠, Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama (2020-02-20):

    Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, it is hard to tune their hyperparameters in order to maintain a fixed/​​​​preset level of training loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flood level. Our approach makes the loss float around the flood level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flood level. This can be implemented with one line of code and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to “random walk” with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and, as a byproduct, induces a curve of the test loss.

  24. ⁠, Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton (2020-02-13):

    This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ⁠. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.