[samples] Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on mel spectrogram, it is still challenging to synthesize high-fidelity audio for numerous speakers across varied recording environments.
In this work, we present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting. We introduce periodic nonlinearities [Snake] and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and substantially improves audio quality.
Based on our improved generator and the state-of-the-art discriminators, we scale our GAN vocoder from 14M parameters up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address the training instabilities specific to such scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves the state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments.
…Furthermore, it synthesizes 24 kHz high-fidelity speech 44.72× faster than real-time on a GPU.
…3.4 BigVGAN with Large Scale Training: In this subsection, we set out to explore the limits of universal neural vocoding by scaling up the generator’s model capacity to its maximum while maintaining the stability of GAN training and practical usability as a high-speed neural vocoder. We start scaling up the model with our improved generator using the comparable V1 configuration of HiFi-GAN with 14M parameters, denoted as BigVGAN-base. We grow BigVGAN-base by increasing the number of upsampling blocks and convolution channels for each block. BigVGAN-base upsamples the signal by 256× using 4 upsampling blocks with ratios of [8, 8, 2, 2]. Each upsampling block is accompanied by multiple residual layers with dilated convolutions, i.e., the AMP module. We further divide the 256× upsampling into 6 blocks [4, 4, 2, 2, 2, 2] for more fine-grained feature refinement. In addition, we increase the number of channels of the AMP module (analogous to MRF in HiFi-GAN) from 512 to 1,536. We denote the model with 1,536 channels and 112M parameters as BigVGAN.
We found that the default learning rate of 2 × 10⁻⁴ used in HiFi-GAN causes an early training collapse for BigVGAN, where the losses from the discriminator submodules converge to zero within the first several thousand iterations. Halving the learning rate to 1 × 10⁻⁴ reduced such failures. We also found that a large batch size helps reduce mode collapse, as it covers more modes per batch. We simply double the batch size from the usual 16 to 32 for a good trade-off between training efficiency and stability, as neural vocoders can require millions of steps to converge. Our required batch size is much smaller than those used for large-scale GAN training in the image domain (ie. 32 vs. 2,048), probably thanks to the strong conditioning in neural vocoding.
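As a rough sketch, the scaling described in this and the previous paragraph amounts to a configuration change along these lines (the key names below are hypothetical, not the authors’ actual configuration files):

```python
# Hypothetical config sketch contrasting BigVGAN-base with the full BigVGAN, based on the
# numbers quoted above; key names are illustrative, not the authors' actual configs.
bigvgan_base = {
    "upsample_rates": [8, 8, 2, 2],        # 4 blocks, 8*8*2*2 = 256x total upsampling
    "channels": 512,                       # AMP (MRF-like) module width; ~14M parameters
    "learning_rate": 2e-4,                 # HiFi-GAN default
    "batch_size": 16,
}
bigvgan = {
    "upsample_rates": [4, 4, 2, 2, 2, 2],  # same 256x, split into 6 finer-grained blocks
    "channels": 1536,                      # ~112M parameters
    "learning_rate": 1e-4,                 # halved to avoid early discriminator collapse
    "batch_size": 32,                      # doubled to cover more modes per batch
}
```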
In addition to the above efforts, we have explored other directions, including various ways to improve the model architecture, spectral normalization to stabilize GAN training, and data augmentation to improve model generalization. Unfortunately, all of these trials resulted in worse perceptual quality in our study. The details can be found in Appendix C. We hope these practical lessons will be useful to future research endeavors.
For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered as a theoretically sound solution, which, however, has not yet delivered on its promise in practice.
This paper introduces Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, diffused from observed or generated data, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step.
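As a rough illustration of the noise-injection step described above, here is a minimal DDPM-style sketch in PyTorch; the adaptive adjustment of the chain length and the authors’ exact parameterization are omitted, so this is an assumption-laden sketch rather than their implementation:

```python
import torch

def diffuse(x, t, betas):
    """Forward-diffuse a batch x to step t (minimal DDPM-style sketch, not the authors' code)."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)           # \bar{alpha}_t
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x + s * torch.randn_like(x)                    # sample from q(x_t | x_0)

def discriminator_inputs(real, fake, t_max, betas):
    # Sample a diffusion step per example from the mixture over steps 0..t_max-1;
    # the adaptive schedule that picks t_max itself is omitted here.
    t = torch.randint(0, t_max, (real.size(0),), device=real.device)
    return diffuse(real, t, betas), diffuse(fake, t, betas)   # both pass through the same noise process
```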
Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model-agnostic and domain-agnostic differentiable augmentation.
A rich set of experiments on diverse datasets show that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvement over strong GAN baselines for synthesizing photo-realistic images.
A multi-layer image is more valuable than a single-layer image from a graphic designer’s perspective. However, most image generation methods proposed so far focus on single-layer images. In this paper, we propose MontageGAN, a Generative Adversarial Network (GAN) framework for generating multi-layer images. Our method uses a two-step approach consisting of local GANs and a global GAN. Each local GAN learns to generate a specific image layer, and the global GAN learns the placement of each generated image layer. Through our experiments, we show the ability of our method to generate multi-layer images and estimate the placement of the generated image layers.
The cartoon domain has recently gained increasing popularity. Previous studies have attempted quality portrait stylization into the cartoon domain; however, this remains a great challenge because prior work has not properly addressed critical constraints, such as requiring a large number of training images or lacking support for abstract cartoon faces. Recently, a layer swapping method has been used for stylization requiring only a limited number of training images; however, its use cases are still narrow as it inherits the remaining issues. In this paper, we propose a novel method called Cross-domain Style mixing, which combines two latent codes from two different domains. Our method effectively stylizes faces into multiple cartoon characters at various face abstraction levels using only a single generator and without requiring a large number of training images.
Recent studies on StyleGAN show high performance on artistic portrait generation by transfer learning with limited data. In this paper, we explore more challenging exemplar-based high-resolution portrait style transfer by introducing a novel DualStyleGAN with flexible control of dual styles of the original face domain and the extended artistic portrait domain.
Different from StyleGAN, DualStyleGAN provides a natural way of style transfer by characterizing the content and style of a portrait with an intrinsic style path and a new extrinsic style path, respectively. The delicately designed extrinsic style path enables our model to modulate both the color and complex structural styles hierarchically to precisely pastiche the style example. Furthermore, a novel progressive fine-tuning scheme is introduced to smoothly transform the generative space of the model to the target domain, even with the above modifications on the network architecture.
Experiments demonstrate the superiority of DualStyleGAN over state-of-the-art methods in high-quality portrait style transfer and flexible style control.
We present Polarity Sampling, a theoretically justified plug-and-play method for controlling the generation quality and diversity of pre-trained deep generative networks (DGNs).
Leveraging the fact that DGNs are, or can be approximated by, continuous piecewise affine splines, we derive the analytical DGN output space distribution as a function of the product of the DGN’s Jacobian singular values raised to a power ρ. We dub ρ the polarity parameter and prove that ρ focuses the DGN sampling on the modes (ρ < 0) or anti-modes (ρ > 0) of the DGN output-space distribution.
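A naive sketch of the sampling rule this implies is given below; it is an illustration of the idea only (the paper uses an efficient estimator, whereas computing full Jacobians per latent as done here is practical only for toy generators; `G` is an assumed differentiable generator):

```python
import torch

def polarity_resample(G, z_pool, rho, n_out):
    """Resample a pool of latents with weights prod(singular values)^rho (sketch of the idea,
    not the paper's estimator). rho < 0 concentrates on modes, rho > 0 on anti-modes."""
    weights = []
    for z in z_pool:                                           # z_pool: (N, latent_dim)
        J = torch.autograd.functional.jacobian(lambda v: G(v.unsqueeze(0)).flatten(), z)
        s = torch.linalg.svdvals(J)                            # singular values of the Jacobian
        weights.append((s.clamp_min(1e-8).log().sum() * rho).exp())  # (prod s_i)^rho in log-space
    w = torch.stack(weights)
    idx = torch.multinomial(w / w.sum(), n_out, replacement=True)
    return z_pool[idx]
```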
We demonstrate that nonzero polarity values achieve a better precision-recall (quality-diversity) Pareto frontier than standard methods, such as truncation, for a number of state-of-the-art DGNs. We also present quantitative and qualitative results on the improvement of overall generation quality (eg. in terms of the Fréchet Inception Distance) for a number of state-of-the-art DGNs, including StyleGAN3, BigGAN-deep, NVAE, for different conditional and unconditional image generation tasks. In particular, Polarity Sampling redefines the state-of-the-art for StyleGAN2 on the FFHQ Dataset to FID 2.57, StyleGAN2 on the LSUN Car Dataset to FID 2.27 and StyleGAN3 on the AFHQv2 Dataset to FID 3.95.
Generative Adversarial Networks (GANs) have established themselves as a prevalent approach to image synthesis. Of these, StyleGAN offers a fascinating case study, owing to its remarkable visual quality and an ability to support a large array of downstream tasks.
This state-of-the-art report covers the StyleGAN architecture, and the ways it has been employed since its conception, while also analyzing its severe limitations. It aims to be of use for both newcomers, who wish to get a grasp of the field, and for more experienced readers that might benefit from seeing current research trends and existing tools laid out.
Among StyleGAN’s most interesting aspects is its learned latent space. Despite being learned with no supervision, it is surprisingly well-behaved and remarkably disentangled. Combined with StyleGAN’s visual quality, these properties gave rise to unparalleled editing capabilities. However, the control offered by StyleGAN is inherently limited to the generator’s learned distribution, and can only be applied to images generated by StyleGAN itself.
Seeking to bring StyleGAN’s latent control to real-world scenarios, the study of GAN inversion and latent space embedding has quickly gained in popularity. Meanwhile, this same study has helped shed light on the inner workings and limitations of StyleGAN. We map out StyleGAN’s impressive story through these investigations, and discuss the details that have made StyleGAN the go-to generator.
We further elaborate on the visual priors StyleGAN constructs, and discuss their use in downstream discriminative tasks.
Looking forward, we point out StyleGAN’s limitations and speculate on current trends and promising directions for future research, such as task and target specific fine-tuning.
[anime example] StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated.
In this paper, we show how StyleGAN can be adapted to work on raw, uncurated images collected from the Internet. Such image collections impose two main challenges to StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we propose a StyleGAN-based self-distillation approach, which consists of two main components: (1) a generative-based self-filtering of the dataset to eliminate outlier images, in order to generate an adequate training set, and (2) perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN’s “truncation trick” in the image synthesis process.
The presented technique enables the generation of high-quality images, while minimizing the loss in diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach to new challenging and diverse domains collected from the Internet.
Artificial intelligence (AI)-synthesized text, audio, image, and video are being weaponized for the purposes of nonconsensual intimate imagery, financial fraud, and disinformation campaigns. Our evaluation of the photorealism of AI-synthesized faces [of StyleGAN2] indicates that synthesis engines have passed through the uncanny valley and are capable of creating faces that are indistinguishable from—and more trustworthy than—real faces.
…Experiment 1: In this study, 315 participants classified, one at a time, 128 of the 800 faces as real or synthesized. Shown in Figure 2A is the distribution of participant accuracy (blue bars). The average accuracy is 48.2% (95% CI [47.1%, 49.2%]), close to chance performance of 50%.
…Experiment 2: In this study, 219 new participants, with training and trial-by-trial feedback, classified 128 faces taken from the same set of 800 faces as in experiment 1. Shown in Figure 2A [also] is the distribution of participant accuracy (orange bars). The average accuracy improved slightly to 59.0% (95% CI [57.7%, 60.4%]), with d’ = 0.46 and no response bias (β = 0.99). Despite providing trial-by-trial feedback, there was no improvement in accuracy over time, with an average accuracy of 59.3% (95% CI [57.8%, 60.7%]) for the first set of 64 faces and 58.8% (95% CI [57.4%, 60.3%]) for the second set of 64 faces.
[video; code] Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN’s performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets.
Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate ImageNet images at a resolution of 1024² at such a dataset scale.
We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes.
…Our contributions enable us to train a much larger model than previously possible while requiring less computation than prior art. Our model is 3× larger in terms of depth and parameter count than a standard StyleGAN3. However, to match the prior state-of-the-art performance of ADM at a resolution of 512² pixels, training the models on a single NVIDIA Tesla V100 takes 400 GPU-days compared to the previously required 1,914 V100-days.
The state-of-the-art photo upsampling method, PULSE, demonstrates that a sharp, high-resolution (HR) version of a given low-resolution (LR) input can be obtained by exploring the latent space of generative models. However, mapping an extreme LR input (16²) directly to an HR image (1024²) is too ambiguous to preserve faithful local facial semantics.
In this paper, we propose an enhanced upsampling approach, Pro-PULSE, that addresses the issues of semantic inconsistency and optimization complexity.
Our idea is to learn an encoder that progressively constructs the HR latent codes in the extended 𝑊+ latent space of StyleGAN. This design divides the complex 64× upsampling problem into several steps, and therefore small-scale facial semantics can be inherited from one end to the other.
In particular, we train two encoders: the base encoder maps latent vectors in 𝑊 space and serves as the foundation of the HR latent vector, while the second, scale-specific encoder operates in 𝑊+ space and gradually replaces the vector produced by the base encoder at each scale. This process produces intermediate side-outputs, which inject deep supervision into the training of the encoder.
Extensive experiments demonstrate superiority over the latest latent space exploration methods, in terms of efficiency, quantitative quality metrics, and qualitative visual results.
Videos show continuous events, yet most—if not all—video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be—time-continuous signals—and extend the paradigm of neural representations to build a continuous-time video generator.
For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides richer learning signal to the generator, making it possible to train directly on 1024² videos for the first time.
We build our model on top of StyleGAN2 and it is just 5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256² video synthesis benchmarks and one 1024² resolution benchmark. Videos and the source code are available at the project website: https://universome.github.io/stylegan-v.
Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. For this purpose, we introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQCats, among other experiments.
The success of StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images. However, such editing operations are either trained with semantic supervision or described using human guidance. In another development, the CLIP architecture has been trained with internet-scale image and text pairings and has been shown to be useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically labeled edit directions from StyleGAN, finding and naming meaningful edit operations without any additional human guidance. Technically, we propose two novel building blocks: one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels and hence we do not require any additional supervised text/attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible, and reveals interesting and non-trivial edit directions.
The inversion of real images into StyleGAN’s latent space is a well-studied problem. Nevertheless, applying existing approaches to real-world scenarios remains an open challenge, due to an inherent trade-off between reconstruction and editability: latent space regions which can accurately represent real images typically suffer from degraded semantic control. Recent work proposes to mitigate this trade-off by fine-tuning the generator to add the target image to well-behaved, editable regions of the latent space. While promising, this fine-tuning scheme is impractical for prevalent use as it requires a lengthy training phase for each new image.
In this work, we introduce this approach into the realm of encoder-based inversion. We propose HyperStyle, a hypernetwork that learns to modulate StyleGAN’s weights to faithfully express a given image in editable regions of the latent space. A naive modulation approach would require training a hypernetwork with over three billion parameters. Through careful network design, we reduce this to be in line with existing encoders. HyperStyle yields reconstructions comparable to those of optimization techniques with the near real-time inference capabilities of encoders. Lastly, we demonstrate HyperStyle’s effectiveness on several applications beyond the inversion task, including the editing of out-of-domain images which were never seen during training.
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming.
In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features.
Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs.
Furthermore, our method can be applied in fine-tuning pre-trained models, which saves both training time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with around only 1% of the model size and training data size relative to the recently proposed large DALL·E model.
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem. Different from existing paradigms that directly synthesize a full image from a single input (eg. a latent code), the new formulation enables a flexible local manipulation for different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (ie. TokenGAN). Particularly, the TokenGAN inputs two semantically different visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, the TokenGAN is able to control the image synthesis by assigning the styles to the content tokens through an attention mechanism with a Transformer. We conduct extensive experiments and show that the proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks, including FFHQ and LSUN CHURCH with different resolutions. In particular, the generator is able to synthesize high-fidelity images at 1024×1024 resolution, dispensing with convolutions entirely.
Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image to be generated. We introduce StyleCLIPDraw which adds a style loss to the CLIPDraw text-to-drawing synthesis model to allow artistic control of the synthesized drawings in addition to control of the content via text. Whereas performing decoupled style transfer on a generated image only affects the texture, our proposed coupled approach is able to capture a style in both texture and shape, suggesting that the style of the drawing is coupled with the drawing process itself. More results and our code are available at https://github.com/pschaldenbrand/StyleCLIPDraw.
In this paper, we perform an in-depth study of the properties and applications of aligned generative models. We refer to two models as aligned if they share the same architecture, and one of them (the child) is obtained from the other (the parent) via fine-tuning to another domain, a common practice in transfer learning. Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation. Here, we perform the first detailed exploration of model alignment, also focusing on StyleGAN.
First, we empirically analyze aligned models and provide answers to important questions regarding their nature. In particular, we find that the child model’s latent spaces are semantically aligned with those of the parent, inheriting incredibly rich semantics, even for distant data domains such as human faces and churches. Second, equipped with this better understanding, we leverage aligned models to solve a diverse set of tasks. In addition to image translation, we demonstrate fully automatic cross-domain image morphing. We further show that zero-shot vision tasks may be performed in the child domain while relying exclusively on supervision in the parent domain. We demonstrate qualitatively and quantitatively that our approach yields state-of-the-art results, while requiring only simple fine-tuning and inversion.
We present an algorithm for re-rendering a person from a single image under arbitrary poses. Existing methods often have difficulties in hallucinating occluded contents photo-realistically while preserving the identity and fine details in the source image.
We first learn to inpaint the correspondence field between the body surface texture and the source image with a human body symmetry prior. The inpainted correspondence field allows us to transfer/warp local features extracted from the source to the target view even under large pose changes. Directly mapping the warped local features to an RGB image using a simple CNN decoder often leads to visible artifacts.
Thus, we extend the StyleGAN generator so that it takes pose as input (for controlling poses) and introduces a spatially varying modulation for the latent space using the warped local features (for controlling appearances). We show that our method compares favorably against the state-of-the-art algorithms in both quantitative evaluation and visual comparison.
[video] Generative adversarial network (GAN)-based models possess a superior capability for high-fidelity image synthesis. There is a wide range of semantically meaningful directions in the latent representation space of well-trained GANs, and the corresponding latent space walks are meaningful for semantic controllability in the synthesized images.
To explore the underlying organization of a latent space, we propose an unsupervised Density-Preserving Latent Semantics Exploration model (DP-LaSE). The important latent directions are determined by maximizing the variations in intermediate features, while the correlation between the directions is minimized. Considering that latent codes are sampled from a prior distribution, we adopt a density-preserving regularization approach to ensure latent space walks are maintained in iso-density regions, since moving to a higher/lower density region tends to cause unexpected transformations. To further refine semantics-specific transformations, we perform subspace learning over intermediate feature channels, such that the transformations are limited to the most relevant subspaces.
Extensive experiments on a variety of benchmark datasets demonstrate that DP-LaSE is able to discover interpretable latent space walks, and specific properties of synthesized images can thus be precisely controlled.
The state-of-the-art StyleGAN2 network supports powerful methods to create and edit art, including generating random images, finding images “like” some query, and modifying content or style. Further, recent advancements enable training with small datasets. We apply these methods to synthesize card art, by training on a novel Yu-Gi-Oh dataset. While noise inputs to StyleGAN2 are essential for good synthesis, we find that, for small datasets, coarse-scale noise interferes with latent variables because both control long-scale image effects. We observe over-aggressive variation in art with changes in noise and weak content control via latent variable edits. Here, we demonstrate that training a modified StyleGAN2, where coarse-scale noise is suppressed, removes these unwanted effects. We obtain a superior FID; changes in noise result in local exploration of style; and identity control is markedly improved. These results and analysis lead towards a GAN-assisted art synthesis tool for digital artists of all skill levels, which can be used in film, games, or any creative industry for artistic ideation.
We propose a novel method for solving regression tasks using few-shot or weak supervision. At the core of our method is the fundamental observation that GANs are incredibly successful at encoding semantic information within their latent space, even in a completely unsupervised setting. For modern generative frameworks, this semantic encoding manifests as smooth, linear directions which affect image attributes in a disentangled manner. These directions have been widely used in GAN-based image editing.
We show that such directions are not only linear, but that the magnitude of change induced on the respective attribute is ~linear with respect to the distance traveled along them. By leveraging this observation, our method turns a pre-trained GAN into a regression model, using as few as two labeled samples. This enables solving regression tasks on datasets and attributes which are difficult to produce quality supervision for. Additionally, we show that the same latent-distances can be used to sort collections of images by the strength of given attributes, even in the absence of explicit supervision.
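A minimal sketch of how this observation turns a GAN into a few-shot regressor is given below, assuming a previously discovered latent direction `d` and two labeled latents; this is an illustration of the idea, not the authors’ code:

```python
import numpy as np

def fit_latent_regressor(d, w_a, y_a, w_b, y_b):
    """Fit a 1D linear regressor from signed distance along a latent direction d to an
    attribute value, using just two labeled latents (sketch of the idea described above)."""
    d = d / np.linalg.norm(d)
    t_a, t_b = w_a @ d, w_b @ d                   # signed distances along the direction
    slope = (y_b - y_a) / (t_b - t_a)
    intercept = y_a - slope * t_a
    return lambda w: slope * (w @ d) + intercept  # predict the attribute for any latent w
```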
Extensive experimental evaluations demonstrate that our method can be applied across a wide range of domains, leverage multiple latent direction discovery frameworks, and achieve state-of-the-art results in few-shot and low-supervision settings, even when compared to methods designed to tackle a single task.
[Github] We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, eg. detail appearing to be glued to image coordinates instead of the surfaces of depicted objects.
We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process.
The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
…G. Energy Consumption: The entire project consumed ~225 megawatt hours (MWh) of electricity. ~70% of it was used for exploratory runs, where we gradually built the new configurations; first in an unstructured manner and then specifically ironing out the new Alias-Free-T and Alias-Free-R configurations. Setting up the intermediate configurations between StyleGAN2 and our generators, as well as the key parameter ablations, was also quite expensive at ~15%. Training a single instance of Alias-Free-R at 1024×1024 is only slightly more expensive (0.9 MWh) than training StyleGAN2 (0.7 MWh).
Generative adversarial networks (GANs) learn to map noise latent vectors to high-fidelity image outputs. It is found that the input latent space shows semantic correlations with the output image space. Recent works aim to interpret the latent space and discover meaningful directions that correspond to human interpretable image transformations. However, these methods either rely on explicit scores of attributes (eg. memorability) or are restricted to binary ones (eg. gender), which largely limits the applicability of editing tasks, especially for free-form artistic tasks like style/anime editing.
In this paper, we propose an adversarial method, AdvStyle, for discovering interpretable directions in the absence of well-labeled scores or binary attributes. In particular, the proposed adversarial method simultaneously optimizes the discovered directions and the attribute assessor, using the target attribute data as positive samples while the generated ones serve as negative samples. In this way, arbitrary attributes can be edited by collecting positive data only, and the proposed method learns a controllable representation enabling manipulation of non-binary attributes like anime styles and facial characteristics. Moreover, the proposed learning strategy attenuates the entanglement between attributes, such that multi-attribute manipulation can be easily achieved without any additional constraint.
Furthermore, we reveal several interesting semantics with the involuntarily learned negative directions. Extensive experiments on 9 anime attributes and 7 human attributes demonstrate the effectiveness of our adversarial approach qualitatively and quantitatively. Code is available at GitHub.
Mostly used to stop CUDA ML model training from making my desktop unusable. Simply monitors keyboard/mouse idleness using xprintidle, and pauses the given process using SIGSTOP whenever the machine is in use (defined as no activity within the last second).
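A minimal sketch of the same idea (assuming an X11 session with `xprintidle` installed; this is an illustration, not the actual tool’s source):

```python
#!/usr/bin/env python3
# Pause a PID with SIGSTOP whenever the desktop is in use (X11 idle time < 1s, via xprintidle),
# and resume it with SIGCONT otherwise. Sketch of the described behavior, not the real tool.
import os, signal, subprocess, sys, time

pid = int(sys.argv[1])
while True:
    idle_ms = int(subprocess.check_output(["xprintidle"]).strip())
    os.kill(pid, signal.SIGSTOP if idle_ms < 1000 else signal.SIGCONT)
    time.sleep(1)
```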
GANs have largely increased the potential impact of generative models. We therefore propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the posterior finetuning and avoids pathologies of other methods, such as mode collapse and lack of flexibility. Furthermore, to prevent overfitting on small target domains, we introduce sparse subnetwork selection, that restricts the set of trainable neurons to those that are relevant for the target dataset. We perform comprehensive experiments on several challenging datasets using various GAN architectures (BigGAN, Progressive GAN, and StyleGAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs.
[blog] Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties.
Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions. A natural source for such attributes is the StyleSpace of StyleGAN, which is known to generate semantically meaningful dimensions in the image. However, because standard GAN training is not dependent on the classifier, it may not represent these attributes which are important for the classifier decision, and the dimensions of StyleSpace may represent irrelevant attributes. To overcome this, we propose a training procedure for a StyleGAN, which incorporates the classifier model, in order to learn a classifier-specific StyleSpace. Explanatory attributes are then selected from this space. These can be used to visualize the effect of changing multiple attributes per image, thus providing image-specific explanations.
We apply StylEx to multiple domains, including animals, leaves, faces and retinal images. For these, we show how an image can be modified in different ways to change its classifier output. Our results show that the method finds attributes that align well with semantic ones, generates meaningful image-specific explanations, and is human-interpretable, as measured in user studies.
Face image manipulation via three-dimensional guidance has been widely applied in various interactive scenarios due to its semantically-meaningful understanding and user-friendly controllability. However, existing 3D-morphable-model-based manipulation methods are not directly applicable to out-of-domain faces, such as non-photorealistic paintings, cartoon portraits, or even animals, mainly due to the formidable difficulties in building the model for each specific face domain. To overcome this challenge, we propose, as far as we know, the first method to manipulate faces in arbitrary domains using human 3DMM. This is achieved through two major steps: (1) disentangled mapping from 3DMM parameters to the latent space embedding of a pre-trained StyleGAN2 that guarantees disentangled and precise controls for each semantic attribute; and (2) cross-domain adaptation that bridges domain discrepancies and makes human 3DMM applicable to out-of-domain faces by enforcing a consistent latent space embedding. Experiments and comparisons demonstrate the superiority of our high-quality semantic manipulation method on a variety of face domains with all major 3D facial attributes controllable: pose, expression, shape, albedo, and illumination. Moreover, we develop an intuitive editing interface to support user-friendly control and instant feedback. Our project page is https://cassiepython.github.io/sigasia/cddfm3d.html .
We introduce DatasetGAN: an automatic procedure to generate massive datasets of high-quality semantically segmented images requiring minimal human effort. Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets, which are time-consuming to annotate. Our method relies on the power of recent GANs to generate realistic images. We show how the GAN latent code can be decoded to produce a semantic segmentation of the image. Training the decoder only needs a few labeled examples to generalize to the rest of the latent space, resulting in an infinite annotated dataset generator! These generated datasets can then be used for training any computer vision architecture just as real datasets are. As only a few images need to be manually segmented, it becomes possible to annotate images in extreme detail and generate datasets with rich object and part segmentations. To showcase the power of our approach, we generated datasets for 7 image segmentation tasks which include pixel-level labels for 34 human face parts, and 32 car parts. Our approach outperforms all semi-supervised baselines significantly and is on par with fully supervised methods, which in some cases require as much as 100× more annotated data than our method.
This undergraduate thesis shows the implementation of generative adversarial networks (GANs) to generate completely new illustrations, making use of comic images that have previously been studied, normalized, and filtered. The report reflects the evolution of the research, which takes as its starting point the results of previous research in which the first steps to obtain a valid dataset were proposed.
The set of steps to obtain the final result is presented: the work methodology used, the remote work configuration using Google Colab, the search and study of architectures, and the evolution and improvement of the different networks to achieve the final results.
To carry out the generation of comic illustrations, a set of models is implemented progressively, evolving from simpler models to more complex and current ones, in order to better assimilate the necessary knowledge. In this way, the use of progressively growing generative adversarial networks (ProGANs) is ultimately proposed. By using the ProGAN architecture, it is possible to improve the results compared to traditional methods, obtaining images of great similarity to the originals while maintaining originality and avoiding direct copying of images from the dataset.
In order to validate and demonstrate that the results obtained are replicable, a comparison between two different sets of images has been performed. Although both datasets have the same drawing style, they present substantial differences in their composition: results are first shown for full-body character images, and later for close-up illustrations of character faces.
We propose an unsupervised segmentation framework for StyleGAN generated objects. We build on two main observations. First, the features generated by StyleGAN hold valuable information that can be utilized towards training segmentation networks. Second, the foreground and background can often be treated to be largely independent and be composited in different ways.
For our solution, we propose to augment the StyleGAN2 generator architecture with a segmentation branch and to split the generator into a foreground and background network. This enables us to generate soft segmentation masks for the foreground object in an unsupervised fashion. On multiple object classes, we report comparable results against state-of-the-art supervised segmentation networks, while against the best unsupervised segmentation approach we demonstrate a clear improvement, both in qualitative and quantitative metrics.
This paper tackles a challenging problem of generating photorealistic images from semantic layouts in few-shot scenarios where annotated training pairs are hardly available but pixel-wise annotation is quite costly. We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior. Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks. With such mappings, we can generate an unlimited number of pseudo semantic masks from random noise to train an encoder for controlling a pre-trained StyleGAN generator. Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles. Qualitative and quantitative results with various datasets demonstrate improvement over previous approaches with respect to layout fidelity and visual quality in as few as one-shot or five-shot settings.
While GANs have shown success in realistic image generation, the idea of using GANs for other tasks unrelated to synthesis is underexplored. Do GANs learn meaningful structural parts of objects during their attempt to reproduce those objects? In this work, we test this hypothesis and propose a simple and effective approach based on GANs for semantic part segmentation that requires as few as one label example along with an unlabeled dataset. Our key idea is to leverage a trained GAN to extract pixel-wise representation from the input image and use it as feature vectors for a segmentation network. Our experiments demonstrate that GANs representation is “readily discriminative” and produces surprisingly good results that are comparable to those from supervised baselines trained with statistically-significantly more labels. We believe this novel repurposing of GANs underlies a new class of unsupervised representation learning that is applicable to many other tasks. More results are available at https://repurposegans.github.io/ .
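As a rough sketch of the pixel-wise representation this describes (assuming a list of intermediate generator feature maps is already available; illustrative only, not the authors’ code), one can upsample every feature map to the output resolution and concatenate them, yielding one feature vector per pixel on which a tiny few-shot segmenter is then trained:

```python
import torch
import torch.nn.functional as F

def pixelwise_features(gan_feature_maps, out_size):
    """Build per-pixel feature vectors from a GAN's intermediate activations (sketch of the
    repurposing idea above): upsample each map to the output resolution and concatenate."""
    ups = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
           for f in gan_feature_maps]                             # each f: (N, C_i, H_i, W_i)
    feats = torch.cat(ups, dim=1)                                 # (N, sum C_i, H, W)
    return feats.permute(0, 2, 3, 1).reshape(-1, feats.size(1))  # one feature vector per pixel
```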
We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, and can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model’s strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model’s inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at Github.
While we instantaneously recognize a face as attractive, it is much harder to explain what exactly defines personal attraction. This suggests that attraction depends on implicit processing of complex, culturally and individually defined features. Generative adversarial neural networks (GANs), which learn to mimic complex data distributions, can potentially model subjective preferences unconstrained by pre-defined model parameterization.
Here, we present generative brain-computer interfaces (GBCI), coupling GANs with brain-computer interfaces. GBCI first presents a selection of images and captures personalized attractiveness reactions toward the images via electroencephalography. These reactions are then used to control a ProGAN model, finding a representation that matches the features constituting an attractive image for an individual. We conducted an experiment (N = 30) to validate GBCI using a face-generating GAN and producing images that are hypothesized to be individually attractive. In double-blind evaluation of the GBCI-produced images against matched controls, we found GBCI yielded highly accurate results.
Thus, the use of EEG responses to control a GAN presents a valid tool for interactive information-generation. Furthermore, the GBCI-derived images visually replicated known effects from social neuroscience, suggesting that the individually responsive, generative nature of GBCI provides a powerful, new tool in mapping individual differences and visualizing cognitive-affective processing.
…Thus, negative generated images were evaluated as highly attractive for other people, but not for the participant themselves. Taken together, the results suggest that the GBCI was highly accurate in generating personally attractive images (83.33%). They also show that while both negative and positive generated images were evaluated as highly attractive for the general population (respectively M = 4.43 and 4.90 on a scale of 1–5), only the positive generated images (M = 4.57) were evaluated as highly personally attractive.
Qualitative results: In semi-structured post-test interviews, participants were shown the generated images that were expected to be found attractive/unattractive. Thematic analysis found predictions of positive attractiveness were experienced as accurate: There were no false positives (generated unattractive found personally attractive). The participants also expressed being pleased with results (eg. “Quite an ideal beauty for a male!”; “I would be really attracted to this!”; “Can I have a copy of this? It looks just like my girlfriend!”).
In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, provides similar embeddings. Conversely, CLIP-GLaSS takes a caption (or an image) as input and generates the image (or the caption) whose CLIP embedding is most similar to the input one. This optimal image (or caption) is produced via a generative network, after an exploration by a genetic algorithm. Promising results are shown, based on experimentation with the image generators BigGAN and StyleGAN2, and the text generator GPT2.
[Source code for working with the OpenAI CLIP zero-shot universal image classifier and the This Anime Does Not Exist.ai (TADNE) StyleGAN2-ext model: CLIP can use text descriptions to score images by how well they match the text description, and this scoring can be used to generate images matching the description by iteratively refining the pixels to make CLIP increase the description score (gradient ascent).]
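A minimal sketch of the pixel-refinement loop the annotation describes is given below, assuming the `openai/CLIP` package; it only illustrates the gradient-ascent idea, not the repository’s code (proper CLIP input normalization and the latent-space tricks are omitted):

```python
# CLIP-guided gradient ascent on pixels (illustration of the idea described above).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
text_feat = model.encode_text(clip.tokenize(["a watercolor landscape"]).to(device)).detach()
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

img = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)  # start from noise pixels
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(300):
    img_feat = model.encode_image(img)                  # note: CLIP's usual normalization omitted
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()                # maximize text-image similarity
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        img.clamp_(0, 1)                                # keep pixels in a valid range
```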
[Website demonstrating samples from a modified StyleGAN2 trained on Danbooru2019 using TRC TPUs for ~5m iterations for ~2 months on a TPUv3-32 pod; this modified ‘StyleGAN2-ext’ removes various regularizations which make StyleGAN2 data-efficient on datasets like FFHQ but hobble its ability to model complicated images, and scales the model up >2×. This is surprisingly effective given StyleGAN’s previous inability to approach BigGAN’s results on Danbooru2019, and TADNE shows off the entertaining results.]
Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited data regimes, training typically diverges, and therefore the generated samples are of low quality and lack diversity. Previous works have addressed training in low data setting by leveraging transfer learning and data augmentation techniques.
We propose a novel transfer learning method for GANs in the limited data domain by leveraging an informative data prior derived from self-supervised/supervised pre-trained networks trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of the data instance prior in large-scale unconditional image generation and image editing tasks.
Current GAN-based art generation methods produce unoriginal artwork due to their dependence on conditional input. Here, we propose Sketch-And-Paint GAN (SAPGAN), the first model which generates Chinese landscape paintings from end to end, without conditional input. SAPGAN is composed of two GANs: SketchGAN for generation of edge maps, and PaintGAN for subsequent edge-to-painting translation. Our model is trained on a new dataset of traditional Chinese landscape paintings never before used for generative research. A 242-person Visual Turing Test study reveals that SAPGAN paintings are mistaken as human artwork with 55% frequency, significantly outperforming paintings from baseline GANs. Our work lays a groundwork for truly machine-original art generation.
Image-to-Image (I2I) translation is a heated topic in academia, and it has also been applied in real-world industry for tasks like image synthesis, super-resolution, and colorization. However, traditional I2I translation methods train on data from two or more domains together. This requires a lot of computational resources. Moreover, the results are of lower quality and contain many more artifacts. The training process can be unstable when the data in different domains are not balanced, and mode collapse is more likely to happen. We propose a new I2I translation method that generates a new model in the target domain via a series of model transformations on a pre-trained StyleGAN2 model in the source domain. After that, we propose an inversion method to achieve the conversion between an image and its latent vector. By feeding the latent vector into the generated model, we can perform I2I translation between the source domain and target domain. Both qualitative and quantitative evaluations were conducted to prove that the proposed method can achieve outstanding performance in terms of image quality, diversity and semantic similarity to the input and reference images compared to state-of-the-art works.
GANs can generate photo-realistic images from the domain of their training data. However, those wanting to use them for creative purposes often want to generate imagery from a truly novel domain, a task which GANs are inherently unable to do. It is also desirable to have a level of control so that there is a degree of artistic direction rather than purely curation of random results. Here we present a method for interpolating between generative models of the StyleGAN architecture in a resolution dependent manner. This allows us to generate images from an entirely novel domain and do this with a degree of control over the nature of the output.
State-of-the-art models for high-resolution image generation, such as BigGAN and VQVAE-2, require an incredible amount of compute resources and/or time (512 TPU-v3 cores) to train, putting them out of reach for the larger research community. On the other hand, GAN-based image super-resolution models, such as ESRGAN, can not only upscale images to high dimensions, but are also efficient to train. In this paper, we present not-so-big-GAN (nsb-GAN), a simple yet cost-effective two-step training framework for deep generative models (DGMs) of high-dimensional natural images. First, we generate images in low-frequency bands by training a sampler in the wavelet domain. Then, we super-resolve these images from the wavelet domain back to the pixel-space with our novel wavelet super-resolution decoder network. The wavelet-based down-sampling method preserves more structural information than pixel-based methods, leading to significantly better generative quality of the low-resolution sampler (eg. 64×64). Since the sampler and decoder can be trained in parallel and operate on much lower dimensional spaces than end-to-end models, the training cost is substantially reduced. On ImageNet 512×512, our model achieves a Fréchet Inception Distance (FID) of 10.59—beating the baseline BigGAN model—at half the compute (256 TPU-v3 cores).
In my previous post about attempting to create an ukiyo-e portrait generator I introduced a concept I called “layer swapping” in order to mix two StyleGAN models. The aim was to blend a base model and another created from that using transfer learning, the fine-tuned model. The method was different to simply interpolating the weights of the two models as it allows you to control independently which model you got low and high resolution features from; in my example I wanted to get the pose from normal photographs, and the texture/style from ukiyo-e prints…after a recent Twitter thread popped up again on model interpolation, I realised that I had missed a really obvious variation on my earlier experiments. Rather than taking the low resolution layers (pose) from normal photos and high res layers (texture) from ukiyo-e I figured it would surely be interesting to try the other way round.
…I’ve shared an initial version of some code to blend two networks in this layer swapping manner (with some interpolation thrown into the mix) in my StyleGAN2 fork (see the blend_models.py file). There’s also an example Colab notebook to show how to blend some StyleGAN models, in the example I use a small faces model and one I trained on satellite images of the earth above.
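For illustration, the core of such layer swapping can be sketched like this, assuming PyTorch-style state_dicts whose synthesis-layer names embed the block resolution (e.g. "synthesis.b16.conv0.weight"); this is a sketch of the idea, not the actual blend_models.py:

```python
import copy
import re

def blend(state_low, state_high, split_res=32):
    """Take synthesis layers operating below `split_res` from state_low (e.g. pose from photos)
    and the rest from state_high (e.g. texture from ukiyo-e). Sketch, not blend_models.py."""
    blended = copy.deepcopy(state_high)
    for name, weight in state_low.items():
        match = re.search(r"\bb(\d+)\.", name)      # extract the block resolution from the name
        if match and int(match.group(1)) < split_res:
            blended[name] = weight.clone()
    return blended
```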
…It was originally Arfa who asked me to share some of the layer swapping code I had been working on. He followed up by combining both the weight interpolation and layer swapping ideas, combining a bunch of different models (with some neat visualisations). The results are pretty amazing; this sort of “resolution dependent model interpolation” is the logical generalisation of both the interpolation and swapping ideas. It looks like it gives a completely new axis of control over a generative model (assuming you have some fine-tuned models which can be combined). Take these example frames from one of the above videos:
On the left is the output of the anime model, on the right the My Little Pony model, and in the middle the mid-resolution layers have been transplanted from My Little Pony into anime. This essentially introduces middle resolution features such as the eyes and nose from My Little Pony into anime characters!
Going further: I think there’s lots of potential to look at these blending strategies further, in particular not only interpolating between models dependent on the resolution, but also differently for different channels. If you can identify the subset of neurons which correspond (for example) to the My Little Pony eyes, you could swap those specifically into the anime model, and be able to modify the eyes without affecting other features, such as the nose. Simple clustering of the internal activations has already been shown to be an effective way of identifying neurons which correspond to attributes in the image in the Editing in Style paper, so this seems pretty straightforward to try!
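A minimal sketch of this kind of resolution-dependent blending, assuming two generator checkpoints with identical architectures whose synthesis blocks are keyed by output resolution (the key pattern below is hypothetical; see the blend_models.py file in the fork mentioned above for an actual implementation):

```python
import re

def blend_state_dicts(base_sd, finetuned_sd, swap_at=32):
    """Blend two StyleGAN-style generator checkpoints layer by layer.

    Layers operating below `swap_at` resolution keep the base model's weights;
    layers at or above it take the fine-tuned model's weights.
    (Hypothetical key naming: synthesis blocks named like 'b<res>.*'.)
    """
    blended = {}
    for key, base_w in base_sd.items():
        m = re.search(r"b(\d+)\.", key)
        if m is None:
            blended[key] = base_w          # mapping network etc.: keep the base model
            continue
        res = int(m.group(1))
        # alpha = 0 -> base model, alpha = 1 -> fine-tuned model
        alpha = 1.0 if res >= swap_at else 0.0
        blended[key] = (1 - alpha) * base_w + alpha * finetuned_sd[key]
    return blended
```

Setting `swap_at` low or high controls whether the coarse layers (pose) or fine layers (texture) come from the fine-tuned model; replacing the hard 0/1 `alpha` with intermediate, per-resolution values gives the continuous “resolution dependent model interpolation” described above.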
A deep generative model such as a GAN learns to model a rich set of semantic and physical rules about the target distribution, but up to now, it has been obscure how such rules are encoded in the network, or how a rule could be changed. In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. To address the problem, we propose a formulation in which the desired rule is changed by manipulating a layer of a deep network as a linear associative memory. We derive an algorithm for modifying one entry of the associative memory, and we demonstrate that several interesting structural rules can be located and modified within the layers of state-of-the-art generative models. We present a user interface to enable users to interactively change the rules of a generative model to achieve desired effects, and we show several proof-of-concept applications. Finally, results on multiple datasets demonstrate the advantage of our method against standard fine-tuning methods and edit transfer algorithms.
A rich set of interpretable dimensions has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images. In order to identify such latent dimensions for image editing, previous methods typically annotate a collection of synthesized samples and train linear classifiers in the latent space. However, they require a clear definition of the target attribute as well as the corresponding manual annotations, limiting their applications in practice. In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner. In particular, we take a closer look into the generation mechanism of GANs and further propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights. With a lightning-fast implementation, our approach is capable of not only finding semantically meaningful dimensions comparably to the state-of-the-art supervised methods, but also resulting in far more versatile concepts across multiple GAN models trained on a wide range of datasets.
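One common reading of “directly decomposing the pre-trained weights” is to take, for each layer that projects the latent code, the eigenvectors of A^T A as candidate semantic directions. A minimal sketch under that assumption (not necessarily the paper’s exact procedure):

```python
import numpy as np

def closed_form_directions(weight):
    """Sketch of closed-form latent direction discovery: treat the weight A of a
    layer that projects the latent code (e.g. a StyleGAN style-affine layer,
    shape (out_dim, z_dim)) and return the eigenvectors of A^T A, sorted by
    eigenvalue, as candidate semantic directions."""
    A = np.asarray(weight)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]        # most "amplified" directions first
    return eigvecs[:, order].T               # each row is one candidate direction
```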
The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128×128 and 2–4× reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available.
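The core idea is compact enough to sketch: the same differentiable transformation is applied to every image the discriminator sees, real or generated, and it stays inside the generator’s computation graph. The `augment` policy below is a trivial stand-in for the paper’s color/translation/cutout policies, and the losses are generic non-saturating GAN losses:

```python
import torch
import torch.nn.functional as F

def augment(x):
    # Stand-in for DiffAugment's differentiable policies (color, translation,
    # cutout); here just a random per-image brightness shift, which is differentiable.
    return x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)

def d_loss(D, G, real, z):
    fake = G(z).detach()
    # Augment BOTH real and fake samples seen by the discriminator.
    return F.softplus(-D(augment(real))).mean() + F.softplus(D(augment(fake))).mean()

def g_loss(D, G, z):
    # The augmentation stays in the generator's graph, so gradients flow through it.
    return F.softplus(-D(augment(G(z)))).mean()
```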
Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that substantially stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.67.
Recent successes in Generative Adversarial Networks (GAN) have affirmed the importance of using more data in GAN training. Yet it is expensive to collect data in many domains such as medical applications. Data Augmentation (DA) has been applied in these applications.
In this work, we first argue that the classical DA approach could mislead the generator to learn the distribution of the augmented data, which could be different from that of the original data. We then propose a principled framework, termed Data Augmentation Optimized for GAN (DAG), to enable the use of augmented data in GAN training to improve the learning of the original distribution. We provide theoretical analysis to show that using our proposed DAG aligns with the original GAN in minimizing the Jensen-Shannon (JS) divergence between the original distribution and model distribution. Importantly, the proposed DAG effectively leverages the augmented data to improve the learning of discriminator and generator.
We conduct experiments to apply DAG to different GAN models: unconditional GAN, conditional GAN, self-supervised GAN and CycleGAN using datasets of natural images and medical images. The results show that DAG achieves consistent and considerable improvements across these models. Furthermore, when DAG is used in some GAN models, the system establishes state-of-the-art Fréchet Inception Distance (FID) scores.
Data augmentations have been widely studied to improve the accuracy and robustness of classifiers. However, the potential of image augmentation in improving GAN models for image synthesis has not been thoroughly investigated in previous studies.
In this work, we systematically study the effectiveness of various existing augmentation techniques for GAN training in a variety of settings. We provide insights and guidelines on how to augment images for both vanilla GANs and GANs with regularizations, improving the fidelity of the generated images substantially. Surprisingly, we find that vanilla GANs attain generation quality on par with recent state-of-the-art results if we use augmentations on both real and generated images. When this GAN training is combined with other augmentation-based regularization techniques, such as contrastive loss and consistency regularization, the augmentations further improve the quality of generated images.
We provide new state-of-the-art results for conditional generation on CIFAR-10 with both consistency loss and contrastive loss as additional regularizations.
Much of the fun of internet drama comes from its frivolousness, but sometimes an online shitfest points to something bigger. Last week, the AI-powered furry art site This Fursona Does Not Exist did just that, igniting a fandom firestorm while also highlighting an important debate about digital art. Trained on more than 55,000 images pulled (without permission) from a furry art forum, the algorithm was a simple case of art theft to some. For others, it was a chance to break out the popcorn. But legal scholars who spoke with Gizmodo said the conflict raises thorny questions about ownership in the age of AI—questions that may ultimately have to be answered in court.
…At least one person tried (and failed) to find proof that the algorithm was copying images from e621.net outright. And within days, the entire site was slapped with a DMCA copyright infringement complaint. (The company in whose name the DMCA notice was issued, according to Arfa, denied filing it and requested that it be withdrawn.) Some degree of backlash is understandable. Furry fandom has long been a close-knit community of independent creators supported by individual commissions. A project aimed at mass-producing fursonas—using original art as training material, no less—could be seen as a threat to creators’ livelihood. Some commenters accused Arfa of disrespect and asked for the choice to opt out of the project. Others complained that their work had been uploaded to e621 without their permission in the first place.
Description of 3 anime datasets for machine learning based on Danbooru: cropped anime faces, whole-single-character crops, and hand crops (with hand detection model).
Documentation of 3 anime datasets for machine learning based on Danbooru: 300k cropped anime faces (primarily used for StyleGAN/This Waifu Does Not Exist), 855k whole-single-character figure crops (extracted from Danbooru using AniSeg), and 58k hand crops (based on a dataset of 14k hand-annotated bounding boxes used to train a YOLOv3 hand detection model).
These datasets can be used for machine learning directly, or included as data augmentation: faces, figures, and hands are some of the most noticeable features of anime images, and by cropping images down to just those 3 features, they can enhance modeling of those by eliminating distracting context, zooming in, and increasing the weight during training.
[A StyleGAN2 showcase: TFDNE is high-quality GAN-generated furry (anthropomorphic animals) faces, trained on n = 55k faces cropped from the e621 furry image booru. For higher quality, the creator heavily filtered faces and aligned them, and upscaled using waifu2x. For display, it reuses Obormot’s “These Waifus Do Not Exist” scrolling grid code to display an indefinite number of faces rather than one at a time. (TFDNE is also available on Artbreeder for interactive editing/crossbreeding, and a Google Colab notebook for Ganspace-based editing.)]
These faces are “ambigrams”: images that are legible both upside-down and right-side up. Created with a machine learning system, they may be displayed in any orientation. In this project, a collection of 55 such ambigrammatic faces have been generated in high resolution, and printed as a limited edition deck of poker-sized cards.
…Ultimately, our Ambigrammatic Figures were synthesized with the StyleGAN2 generative adversarial network, using pre-trained weights from the Flickr-Faces-HQ Dataset (FFHQ), and enhanced using the waifu2x super-resolution library.
Our work uses the StyleGAN2 “projection” technique, in which the GAN attempts to find a given face in its latent space, starting its search from a “generic” “neutral” face located at the origin. We provide the StyleGAN an upside-down face as a query—and the projector tries its best to find it, but can never serve a perfect match, because it has been trained exclusively on right-side-up faces. In short, the GAN projector finds upside-down faces in the latent space (or “generatable manifold”) of right-side-up faces. Through the struggle to match an upside-down face using right-side-up ones, the GAN tends to converge on a face that can be looked at both ways.
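A minimal latent-optimization sketch of this kind of projection (not the official StyleGAN2 projector, and with a plain MSE distance standing in for the perceptual loss usually used): feed the upside-down target, start from an average latent, and descend.

```python
import torch

def project(G, target, steps=1000, lr=0.02, dist=torch.nn.functional.mse_loss):
    """Minimal latent-space projection sketch.

    `G` maps a latent w to an image; `target` is the (upside-down) query image.
    `G.mean_latent()` is a hypothetical helper returning the "average" latent;
    `dist` would normally be a perceptual (LPIPS-style) distance rather than MSE.
    """
    w = G.mean_latent().clone().requires_grad_(True)   # start from the "generic" face
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = dist(G(w), target)    # how far is the generated face from the query?
        loss.backward()
        opt.step()
    return w.detach()
```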
I have trained StyleGAN2 from scratch with a dataset of female portraits at 1024px resolution. The sample quality was further improved by tuning the parameters and augmenting the dataset with zoomed-in images, allowing the network to learn more details and to achieve FID metrics comparable to the results of the original work…I was curious how it would work on the human anatomy, so I decided to try to train SG2 with a dataset of head and shoulders portraits. To alleviate capacity issues mentioned in the SG2 paper, I preferred to use portraits without clothes (a substantial contributing factor to dataset variance); furthermore, the dataset was limited to just one gender in order to further reduce the dataset’s complexity.
…I haven’t quite been able to achieve the quality of SG2 trained with the FFHQ dataset. After more than 30,000 kimg, the samples are not yet as detailed as is desirable. For example, teeth look blurry and pupils are not perfectly round. Considering the size of my dataset as opposed to the FFHQ one, the cause is unlikely to be a lack of training data. Continuing the training does not appear to help, as is evident from the plateau in FIDs.
Overall, my experience with SG2 is well in line with what others are observing. Limiting the dataset to a single domain leads to major quality improvements. SG2 is able to model textures and transitions quite well. At the same time, it struggles as the complexity of the object increases, for instance with greater diversity in poses. It should be noted that SG2 is much more efficient for single-domain tasks compared to other architectures, reaching acceptable results much faster.
Normalization layers and activation functions are fundamental components in deep networks and typically co-locate with each other. Here we propose to design them using an automated approach. Instead of designing them separately, we unify them into a single tensor-to-tensor computation graph, and evolve its structure starting from basic mathematical functions. Examples of such mathematical functions are addition, multiplication and statistical moments. The use of low-level mathematical functions, in contrast to the use of high-level modules in mainstream NAS, leads to a highly sparse and large search space which can be challenging for search methods. To address the challenge, we develop efficient rejection protocols to quickly filter out candidate layers that do not work well. We also use multi-objective evolution to optimize each layer’s performance across many architectures to prevent overfitting.
Our method leads to the discovery of EvoNorms, a set of new normalization-activation layers with novel, and sometimes surprising structures that go beyond existing design patterns. For example, some EvoNorms do not assume that normalization and activation functions must be applied sequentially, nor need to center the feature maps, nor require explicit activation functions. Our experiments show that EvoNorms work well on image classification models including ResNets, MobileNets and EfficientNets but also transfer well to Mask R-CNN with FPN/SpineNet for instance segmentation and to BigGAN for image synthesis, outperforming BatchNorm and GroupNorm based layers in many cases.
This paper describes a simple technique to analyze Generative Adversarial Networks (GANs) and create interpretable controls for image synthesis, such as change of viewpoint, aging, lighting, and time of day. We identify important latent directions based on Principal Components Analysis (PCA) applied either in latent space or feature space. Then, we show that a large number of interpretable controls can be defined by layer-wise perturbation along the principal directions. Moreover, we show that BigGAN can be controlled with layer-wise inputs in a StyleGAN-like manner. We show results on different GANs trained on various datasets, and demonstrate good qualitative matches to edit directions found through earlier supervised approaches.
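A rough sketch of the PCA step, assuming a StyleGAN-style mapping network from z to an intermediate latent w (for BigGAN the paper instead applies PCA to early feature maps); names and shapes here are illustrative:

```python
import numpy as np

def find_pca_directions(G_mapping, num_samples=10000, z_dim=512):
    """Sample many latents, push them through the mapping network, and take the
    principal components of the resulting w distribution as candidate controls."""
    z = np.random.randn(num_samples, z_dim).astype(np.float32)
    w = G_mapping(z)                                   # (num_samples, w_dim)
    w_mean = w.mean(axis=0, keepdims=True)
    # Principal directions via SVD of the centered samples; rows of `components`
    # are the directions, ordered by explained variance.
    _, _, components = np.linalg.svd(w - w_mean, full_matrices=False)
    return w_mean, components

def edit(w, direction, strength, layers=slice(None)):
    """Perturb a (num_layers, w_dim) per-layer latent along one principal
    direction, optionally restricting the edit to a subset of style layers."""
    w_edit = w.copy()
    w_edit[layers] += strength * direction
    return w_edit
```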
We introduce a simple (one line of code) modification to the Generative Adversarial Network (GAN) training algorithm that materially improves results with no increase in computational cost: When updating the generator parameters, we simply zero out the gradient contributions from the elements of the batch that the critic scores as ‘least realistic’. Through experiments on many different GAN variants, we show that this ‘top-k update’ procedure is a generally applicable improvement. In order to understand the nature of the improvement, we conduct extensive analysis on a simple mixture-of-Gaussians dataset and discover several interesting phenomena. Among these is that, when gradient updates are computed using the worst-scoring batch elements, samples can actually be pushed further away from their nearest mode. We also apply our method to recent GAN variants and improve state-of-the-art FID for conditional generation from 9.21 to 8.57 on CIFAR-10.
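The “one line” is essentially a `topk` over critic scores before the generator loss; a sketch with a generic non-saturating loss (the exact loss and the schedule for annealing k are not shown here):

```python
import torch
import torch.nn.functional as F

def topk_generator_loss(D, G, z, k):
    """Top-k generator update: only the k samples the critic scores as most
    realistic contribute gradients; the rest of the batch is effectively dropped."""
    scores = D(G(z))                                    # critic scores for a batch of fakes
    topk_scores, _ = torch.topk(scores.flatten(), k)    # keep the k highest-scoring samples
    return F.softplus(-topk_scores).mean()              # non-saturating GAN loss on the top-k
```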
The latent spaces of GAN models often have semantically meaningful directions. Moving in these directions corresponds to human-interpretable image transformations, such as zooming or recoloring, enabling a more controllable generation process. However, the discovery of such directions is currently performed in a supervised manner, requiring human labels, pretrained models, or some form of self-supervision. These requirements severely restrict a range of directions existing approaches can discover. In this paper, we introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model. By a simple model-agnostic procedure, we find directions corresponding to sensible semantic manipulations without any form of (self-)supervision. Furthermore, we reveal several non-trivial findings, which would be difficult to obtain by existing methods, eg. a direction corresponding to background removal. As an immediate practical benefit of our work, we show how to exploit this finding to achieve competitive performance for weakly-supervised saliency detection.
The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.
Despite an impressive performance from the latest GAN for generating hyper-realistic images, GAN discriminators have difficulty evaluating the quality of an individual generated sample. This is because the task of evaluating the quality of a generated image differs from deciding if an image is real or fake. A generated image could be perfect except in a single area but still be detected as fake. Instead, we propose a novel approach for detecting where errors occur within a generated image. By collaging real images with generated images, we compute, for each pixel, whether it belongs to the real distribution or the generated distribution. Furthermore, we leverage attention to model long-range dependency; this allows detection of errors which are reasonable locally but not holistically. For evaluation, we show that our error detection can act as a quality metric for an individual image, unlike FID and IS. We leverage Improved Wasserstein, BigGAN, and StyleGAN to show that a ranking based on our metric correlates impressively with FID scores. Our work opens the door for better understanding of GANs and the ability to select the best samples from a GAN model.
Fascinated by virtual YouTubers, I put together a deep neural network system that makes becoming one much easier. More specifically, the network takes as input an image of an anime character’s face and a desired pose, and it outputs another image of the same character in the given pose.
Generative Adversarial Networks (GANs) are a type of generative model which have received much attention due to their ability to model complex real-world data. Despite their recent successes, the process of training GANs remains challenging, suffering from instability problems such as non-convergence, vanishing or exploding gradients, and mode collapse. In recent years, a diverse set of approaches have been proposed which focus on stabilizing the GAN training procedure. The purpose of this survey is to provide a comprehensive overview of the GAN training stabilization methods which can be found in the literature. We discuss the advantages and disadvantages of each approach, offer a comparative summary, and conclude with a discussion of open problems.
[Artbreeder is an interactive GAN generator website. Originally named “Ganbreeder” and providing only the 256px BigGAN generator, it now provides a variety of BigGAN & StyleGAN models, including the anime portrait StyleGAN model. (It is more general than the similar Waifu Labs, but my anime model is not as good.)
Users can generate random samples and explore slight variants of them to gradually explore the “latent space” and find interesting images, but they can also edit images more directly, upload existing images to find the most similar image produced by the model, etc. A popular website, it has generated >56m images from September 2019 to January 2020.]
Despite the recent advance of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, there lacks enough understanding of how GANs are able to map a latent code sampled from a random distribution to a photo-realistic image. Previous work assumes the latent space learned by GANs follows a distributed representation but observes the vector arithmetic phenomenon. In this work, we propose a novel framework, called InterFaceGAN, for semantic face editing by interpreting the latent semantics learned by GANs. In this framework, we conduct a detailed study on how different semantics are encoded in the latent space of GANs for face synthesis. We find that the latent code of well-trained generative models actually learns a disentangled representation after linear transformations. We explore the disentanglement between various semantics and manage to decouple some entangled semantics with subspace projection, leading to more precise control of facial attributes. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even vary the face pose as well as fix the artifacts accidentally generated by GAN models. The proposed method is further applied to achieve real image manipulation when combined with GAN inversion methods or some encoder-involved models. Extensive results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable facial attribute representation.
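For comparison with the unsupervised approaches above, a sketch of the linear-boundary recipe this line of work builds on: label sampled latents with an off-the-shelf attribute classifier run on the generated images, fit a linear classifier in latent space, and edit along its unit normal (a simplified reading, not the paper’s full conditional-manipulation procedure):

```python
import numpy as np
from sklearn.svm import LinearSVC

def find_attribute_direction(latents, labels):
    """Fit a linear boundary separating latent codes by a binary attribute
    (labels e.g. from an attribute classifier run on the generated images);
    the boundary's unit normal serves as the editing direction."""
    clf = LinearSVC().fit(latents, labels)
    n = clf.coef_[0]
    return n / np.linalg.norm(n)

def edit_latent(z, direction, strength):
    # Move along the hyperplane normal to change the corresponding attribute.
    return z + strength * direction
```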
[Design company Sizigi Studios discusses their creation of Waifu Labs, a deep learning GAN website for interactive generation of anime faces, and their experience running a prototype of it at the Anime Expo (AX) 2019 anime convention in Los Angeles, where it was a popular exhibit. Laptops were set up attached to printers in an enclosed booth, making a ‘vending machine’.
Challenges included: no electricity outlets and no WiFi. Multiple laptops were cycled through as batteries wore out, while a gaming PC ran the neural network GANs locally rather than in a cloud VM. The failed WiFi was bypassed by using a smartphone as a local router. Further bugs were discovered in the code while many users waited in a long line, but they were fixed in time, and the waifu vending machine was a success.]
[Waifu Labs is an interactive website for generating (1024px?) anime faces using a customized StyleGAN trained on Danbooru2018. Similar to Artbreeder, it supports face exploration and face editing, and at the end, a user can purchase prints of a particular face.]
We taught a world-class artificial intelligence how to draw anime. All the drawings you see were made by a non-human artist! Wild, right? It turns out machines love waifus almost as much as humans do.
We proudly present the next chapter of human history: lit waifu commissions from the world’s smartest AI artist. In less than 5 minutes, the artist learns your preferences to make the perfect waifu just for you.
An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to biased training data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise—these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (eg. centered objects), but that they also exhibit some capacity for generalization: by “steering” in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution. Thus, we conduct experiments to quantify the limits of GAN transformations and introduce techniques to mitigate the problem. Code is released on our project page: this URL.
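A sketch of what “steering” amounts to here, under the usual formulation: learn a latent direction w such that stepping alpha along it reproduces a simple pixel-space edit of the generated image; `edit_fn` is a stand-in for a transformation like a horizontal shift or zoom:

```python
import torch

def learn_steering_direction(G, edit_fn, z_dim=512, steps=2000, lr=1e-3, device="cpu"):
    """Learn a latent 'walk' w so that G(z + alpha * w) matches the target edit
    applied to G(z). `edit_fn(img, alpha)` is a stand-in for the image-space
    transformation (shift, zoom, brightness, ...)."""
    w = torch.zeros(1, z_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(8, z_dim, device=device)
        alpha = torch.empty(8, 1, device=device).uniform_(-3, 3)
        target = edit_fn(G(z), alpha)            # apply the edit in pixel space
        steered = G(z + alpha * w)               # move along the candidate direction
        loss = torch.nn.functional.mse_loss(steered, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```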
One of the main motivations for training high quality image generative models is their potential use as tools for image manipulation. Recently, generative adversarial networks (GANs) have been able to generate images of remarkable quality. Unfortunately, adversarially-trained unconditional generator networks have not been successful as image priors. One of the main requirements for a network to act as a generative image prior is being able to generate every possible image from the target distribution. Adversarial learning often experiences mode-collapse, which manifests in generators that cannot generate some modes of the target distribution. Another requirement often not satisfied is invertibility, i.e., having an efficient way of finding a valid input latent code given a required output image. In this work, we show that unlike earlier GANs, the very recently proposed style-generators are quite easy to invert. We use this important observation to propose style generators as general purpose image priors. We show that style generators outperform other GANs as well as Deep Image Prior as priors for image enhancement tasks. The latent space spanned by style-generators satisfies linear identity-pose relations. The latent space linearity, combined with invertibility, allows us to animate still facial images without supervision. Extensive experiments are performed to support the main contributions of this paper.
Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial models (GANs). However, GANs are notoriously difficult to train, due to requiring a large amount of data, needing many GPUs and a lot of time to train, and being highly sensitive to minor hyperparameter changes.
fast.ai has been working in recent years towards making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 fast.ai worked on a transfer learning technique for generative modeling.
The pre-trained model that fast.ai selected was this: Start with an image dataset and “crappify” the images, such as reducing the resolution, adding jpeg artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. fast.ai started with a model that was pre-trained for ImageNet classification, and added a U-Net upsampling network with various modern tweaks to the regular U-Net. A simple fast loss function was initially used: mean squared pixel error. This U-Net could be trained in just a few minutes. Then, the loss function was replaced with a combination of other loss functions used in the generative modeling literature (more details in the f8 video) and trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs—however it turned out that the results were so good that fast.ai ended up not using a GAN for the final models….
NoGAN Training: NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.
Here is the NoGAN training process:
Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner—with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at this stage, which is very important.
Save Generated Images From Pretrained Generator.
Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from those of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.
Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.
In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30–90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2–4 until there’s no more apparent benefit (around five repeats).
Note: a small but important change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.
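A minimal sketch of that loss-threshold rule (the threshold value, loss functions, and batch handling here are placeholders, not DeOldify’s exact code):

```python
def gan_step(critic, generator, batches, critic_loss_fn, gen_loss_fn,
             critic_opt, gen_opt, loss_threshold=0.5):
    """One GAN step with a critic 'catch-up' phase, as described above.
    `batches` yields (input, target) pairs for an image-to-image task."""
    # Let the critic catch up first: train it until its loss falls below the threshold.
    critic_loss = float("inf")
    while critic_loss > loss_threshold:
        x, target = next(batches)
        fake = generator(x).detach()
        critic_opt.zero_grad()
        loss = critic_loss_fn(critic, target, fake)
        loss.backward()
        critic_opt.step()
        critic_loss = loss.item()
    # Only then take a generator step, so the critic can provide constructive gradients.
    x, target = next(batches)
    gen_opt.zero_grad()
    g_loss = gen_loss_fn(critic, generator(x), target)
    g_loss.backward()
    gen_opt.step()
```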
The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.
We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing operations that can be applied to existing photographs. Taking the StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the results of the embedding algorithm provides valuable insights into the structure of the StyleGAN latent space. We propose a set of experiments to test what class of images can be embedded, how they are embedded, what latent space is suitable for embedding, and if the embedding is semantically meaningful.
…Going beyond faces, interestingly, we find that although the FFHQ StyleGAN generator is trained on a human face dataset, the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars. This reveals the effective embedding capability of the algorithm and the generality of the learned filters of the generator.
The use of Artificial Intelligence (AI) machines using deep learning neural networks to create material that facially looks like it should be protected by copyright is growing exponentially. From articles in national news media to music, film, poetry and painting, AI machines create material that has economic value and that competes with productions of human authors. The Article reviews both normative and doctrinal arguments for and against the protection by copyright of literary and artistic productions made by AI machines.
The Article finds that the arguments in favor of protection are flawed and unconvincing and that a proper analysis of the history, purpose, and major doctrines of copyright law all lead to the conclusion that productions that do not result from human creative choices belong to the public domain.
The Article proposes a test to determine which productions should be protected, including in case of collaboration between human and machine. Finally, the Article applies the proposed test to three specific fact patterns to illustrate its application.
While Generative Adversarial Networks (GANs) have seen huge successes in image synthesis tasks, they are notoriously difficult to adapt to different datasets, in part due to instability during training and sensitivity to hyperparameters. One commonly accepted reason for this instability is that gradients passing from the discriminator to the generator become uninformative when there isn’t enough overlap in the supports of the real and fake distributions. In this work, we propose the Multi-Scale Gradient Generative Adversarial Network (MSG-GAN), a simple but effective technique for addressing this by allowing the flow of gradients from the discriminator to the generator at multiple scales. This technique provides a stable approach for high resolution image synthesis, and serves as an alternative to the commonly used progressive growing technique. We show that MSG-GAN converges stably on a variety of image datasets of different sizes, resolutions and domains, as well as different types of loss functions and architectures, all with the same set of fixed hyperparameters. When compared to state-of-the-art GANs, our approach matches or exceeds the performance in most of the cases we tried.
I describe how I made the website ThisWaifuDoesNotExist.net (TWDNE) for displaying random anime faces generated by StyleGAN neural networks, and how it went viral.
Generating high-quality anime faces has long been a task neural networks struggled with. The invention of StyleGAN in 2018 has effectively solved this task and I have trained a StyleGAN model which can generate high-quality anime faces at 512px resolution. To show off the recent progress, I made a website, “This Waifu Does Not Exist” for displaying random StyleGAN 2 faces. TWDNE displays a different neural-net-generated face & plot summary every 15s. The site was popular and went viral online, especially in China. The model can also be used interactively for exploration & editing in the Artbreeder online service.
[This Person Does Not Exist is a StyleGAN-based noninteractive website, which uses the Nvidia-trained FFHQ face StyleGAN model to generate realistic 1024px faces (upgraded to StyleGAN2 in 2020). One face is displayed at a time, and a new face automatically generated every few seconds. It inspired a rash of copycats, including “This Waifu Does Not Exist” (TWDNE).]
Recently a talented group of researchers at Nvidia released the current state-of-the-art generative adversarial network, StyleGAN, over at https://github.com/NVlabs/stylegan. I have decided to dig into my own pockets and raise some public awareness for this technology. Faces are most salient to our cognition, so I’ve decided to put that specific pretrained model up. Their research group has also included pretrained models for cats, cars, and bedrooms in their repository that you can immediately use. Each time you refresh the site, the network will generate a new facial image from scratch from a 512-dimensional vector.
Compilation of failed neural network experiments in generating anime images, pre-StyleGAN/BigGAN.
My experiments in generating anime faces, tried periodically since 2015, succeeded in 2019 with the release of StyleGAN. But for comparison, here are the failures from some of my older GAN or other NN attempts; as the quality is worse than StyleGAN, I won’t bother going into details—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length for the StyleGAN results.
BigGAN unofficial (official BigGAN is covered elsewhere)
A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.
Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.
The anime face graveyard gives samples of my failures with earlier GANs for anime face generation, and I provide samples & model from a relatively large-scale BigGAN training run suggesting that BigGAN may be the next step forward to generating full-scale anime images.
A minute of reading could save an hour of debugging!
Generative Adversarial Networks (GAN) boast impressive capacity to generate realistic images. However, like much of the field of deep learning, they require an inordinate amount of data to produce results, thereby limiting their usefulness in generating novelty. In the same vein, recent advances in meta-learning have opened the door to many few-shot learning applications. In the present work, we propose Few-shot Image Generation using Reptile (FIGR), a GAN meta-trained with Reptile.
Our model successfully generates novel images on both MNIST and Omniglot with as few as 4 images from an unseen class. We further contribute FIGR-8, a new dataset for few-shot image generation, which contains 1,548,944 icons categorized in over 18,409 classes. Trained on FIGR-8, initial results show that our model can generalize to more advanced concepts (such as “bird” and “knife”) from as few as 8 samples from a previously unseen class of images and as few as 10 training steps through those 8 images.
This work demonstrates the potential of training a GAN for few-shot image generation and aims to set a new benchmark for future work in the domain.
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (eg. pose and identity when trained on human faces) and stochastic variation in the generated images (eg. freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
We present a novel CNN-based image editing strategy that allows the user to change the semantic information of an image over an arbitrary region by manipulating the feature-space representation of the image in a trained GAN model. We will present two variants of our strategy: (1) spatial conditional batch normalization (sCBN), a type of conditional batch normalization with user-specifiable spatial weight maps, and (2) feature-blending, a method of directly modifying the intermediate features. Our methods can be used to edit both artificial and real images, and both can be used together with any GAN with conditional normalization layers. We will demonstrate the power of our method through experiments on various types of GANs trained on different datasets. Code will be available at Github.
We propose a rejection sampling scheme using the discriminator of a GAN to correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm—called Discriminator Rejection Sampling (DRS)—that can be used on real data-sets.
Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the SAGAN model, state-of-the-art in the image generation task at the time of developing this work. On ImageNet, we train an improved baseline that increases the Inception Score from 52.52 to 62.36 and reduces the Fréchet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.
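In simplified, idealized form (omitting the paper’s practical epsilon and gamma adjustments), the scheme accepts a generated sample with probability exp(D(x) - M), where D(x) is the discriminator logit and M an estimate of its maximum over generated samples. A hedged sketch of that idealized version:

```python
import numpy as np

def drs_sample(G, D_logit, num_wanted, batch_size=64, num_burnin=1000):
    """Simplified discriminator rejection sampling sketch (idealized form).

    `G(n)` is assumed to return a batch of n generated samples as an array;
    `D_logit(x)` is the discriminator's pre-sigmoid output for each sample."""
    # Burn-in: estimate the maximum discriminator logit over generated samples.
    max_logit = max(float(np.max(D_logit(G(batch_size))))
                    for _ in range(num_burnin // batch_size))

    accepted = []
    while len(accepted) < num_wanted:
        x = G(batch_size)
        logits = np.asarray(D_logit(x), dtype=np.float64).ravel()
        max_logit = max(max_logit, float(logits.max()))   # keep refining the estimate
        p_accept = np.exp(logits - max_logit)             # acceptance probabilities in (0, 1]
        keep = np.random.rand(len(p_accept)) < p_accept
        accepted.extend(x[keep])
    return accepted[:num_wanted]
```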
Training Generative Adversarial Networks (GANs) is notoriously challenging.
We propose and study an architectural modification, self-modulation, which improves GAN performance across different data sets, architectures, losses, regularizers, and hyperparameter settings. Intuitively, self-modulation allows the intermediate feature maps of a generator to change as a function of the input noise vector. While reminiscent of other conditioning techniques, it requires no labeled data.
In a large-scale empirical study we observe a relative decrease of 5%–35% in FID. Furthermore, all else being equal, adding this modification to the generator leads to improved performance in 124⁄144 (86%) of the studied settings.
Self-modulation is a simple architectural change that requires no additional parameter tuning, which suggests that it can be applied readily to any GAN.
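A sketch of what such a self-modulated layer might look like: an unconditional batch norm whose per-channel scale and shift are small MLPs of the input noise vector (layer sizes here are illustrative):

```python
import torch
import torch.nn as nn

class SelfModulatedBN(nn.Module):
    """Self-modulation sketch: the generator's intermediate feature maps are
    re-scaled and shifted by gamma(z), beta(z) computed from the input noise
    vector, with no labels required."""
    def __init__(self, num_channels, z_dim, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        self.gamma = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, num_channels))
        self.beta = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, num_channels))

    def forward(self, h, z):
        gamma = self.gamma(z).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        # Scale initialized around 1 via the (1 + gamma) parameterization.
        return (1 + gamma) * self.norm(h) + beta
```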
Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator’s internal representation, we can effectively modulate the discriminator’s accuracy and maintain useful and informative gradients.
We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods.
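A rough sketch of the bottleneck for the GAN case: the discriminator classifies a stochastic code z rather than raw features, and a Lagrange multiplier beta enforces KL(q(z|x) || N(0, I)) <= I_c by dual gradient ascent. The encoder, layer sizes, constraint value, and GAN loss below are illustrative, not the paper’s exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDBDiscriminator(nn.Module):
    """Discriminator with a variational information bottleneck on its features."""
    def __init__(self, encoder, feat_dim, z_dim=128):
        super().__init__()
        self.encoder = encoder                     # maps x -> feature vector
        self.to_mu = nn.Linear(feat_dim, z_dim)
        self.to_logvar = nn.Linear(feat_dim, z_dim)
        self.classifier = nn.Linear(z_dim, 1)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()
        return self.classifier(z), kl

def vdb_d_loss(D, real, fake, beta, i_c=0.5, beta_lr=1e-5):
    """Discriminator loss with the constraint KL <= i_c, enforced via a Lagrange
    multiplier beta updated by dual gradient ascent."""
    real_logits, kl_real = D(real)
    fake_logits, kl_fake = D(fake.detach())
    kl = 0.5 * (kl_real + kl_fake)
    gan_loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    loss = gan_loss + beta * (kl - i_c)
    new_beta = max(0.0, beta + beta_lr * (kl.item() - i_c))   # dual update of the multiplier
    return loss, new_beta
```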
The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts.
To further enhance the visual quality, we thoroughly study three key components of SRGAN (network architecture, adversarial loss, and perceptual loss) and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery.
Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge.
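The relativistic piece is easy to sketch: the critic is asked whether a real image looks more realistic than the average fake (and vice versa), rather than scoring realness in absolute terms. A generic relativistic-average loss of the kind ESRGAN borrows (logit shapes and reductions are illustrative):

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(C, real, fake):
    """Relativistic average discriminator loss: the critic predicts whether a
    real image is more realistic than the average fake. `C` returns raw logits."""
    real_logits, fake_logits = C(real), C(fake.detach())
    d_real = real_logits - fake_logits.mean()
    d_fake = fake_logits - real_logits.mean()
    return F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

def relativistic_g_loss(C, real, fake):
    """Symmetric generator loss: fakes should look more realistic than the
    average real, and reals less realistic than the average fake."""
    real_logits, fake_logits = C(real), C(fake)
    d_real = real_logits - fake_logits.mean()
    d_fake = fake_logits - real_logits.mean()
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)) + \
           F.binary_cross_entropy_with_logits(d_real, torch.zeros_like(d_real))
```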
Cartoon Set is a collection of random, 2D cartoon avatar images. The cartoons vary in 10 artwork categories, 4 color categories, and 4 proportion categories, with a total of ~10^13 possible combinations. We provide sets of 10k and 100k randomly chosen cartoons and labeled attributes…We’ve also used the dataset to research cross-domain image translation: “XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings”.
Github repo with screenshot samples of style2paints, a neural network for colorizing anime-style illustrations (trained on Danbooru2018), with or without user color hints, which was available as an online service in 2018. style2paints produces high-quality colorizations often on par with human colorizations. Many examples can be seen on Twitter or the Github repo:
Sketch or line art colorization is a research field with substantial market demand. Different from photo colorization, which strongly relies on texture information, sketch colorization is more challenging as sketches may not have texture. Even worse, color, texture, and gradient have to be generated from the abstract sketch lines. In this paper, we propose a semi-automatic learning-based framework to colorize sketches with proper color, texture as well as gradient. Our framework consists of two stages. In the first drafting stage, our model guesses color regions and splashes a rich variety of colors over the sketch to obtain a color draft. In the second refinement stage, it detects unnatural colors and artifacts, and tries to fix and refine the result. Compared to existing approaches, this two-stage design effectively divides the complex colorization task into two simpler and goal-clearer subtasks. This eases the learning and raises the quality of colorization. Our model resolves artifacts such as water-color blurring, color distortion, and dull textures.
We build an interactive software system based on our model for evaluation. Users can iteratively edit and refine the colorization. We evaluate our learning model and the interactive system through an extensive user study. Statistics show that our method outperforms state-of-the-art techniques and industrial applications in several aspects, including visual quality, ability of user control, user experience, and other metrics.
Generative adversarial networks (GANs) learn a deep generative model that is able to synthesise novel, high-dimensional data samples. New data samples are synthesised by passing latent samples, drawn from a chosen prior distribution, through the generative model. Once trained, the latent space exhibits interesting properties, that may be useful for down stream tasks such as classification or retrieval. Unfortunately, GANs do not offer an “inverse model”, a mapping from data space back to latent space, making it difficult to infer a latent representation for a given data sample. In this paper, we introduce a technique, inversion, to project data samples, specifically images, to the latent space using a pre-trained GAN. Using our proposed inversion technique, we are able to identify which attributes of a dataset a trained GAN is able to model and quantify GAN performance, based on a reconstruction loss. We demonstrate how our proposed inversion technique may be used to quantitatively compare performance of various GAN models trained on three image datasets. We provide code for all of our experiments, Github.
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, eg. CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face challenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) aiming at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture, StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of the object based on given text description, yielding low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture, StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generators and discriminators in a tree-like structure; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating multiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantly outperform other state-of-the-art methods in generating photo-realistic images.
Automatic generation of facial images has been well studied after the Generative Adversarial Network (GAN) came out. There exists some attempts applying the GAN model to the problem of generating facial images of anime characters, but none of the existing work gives a promising result. In this work, we explore the training of GAN models specialized on an anime facial image dataset. We address the issue from both the data and the model aspect, by collecting a more clean, well-suited dataset and leverage proper, empirical application of DRAGAN. With quantitative analysis and case studies we demonstrate that our efforts lead to a stable and high-quality model. Moreover, to assist people with anime character design, we build a website (http://make.girls.moe) with our pre-trained model available online, which makes the model easily accessible to general public.
Since its appearance, Generative Adversarial Networks (GANs) have received a lot of interest in the AI community. In image generation, several projects showed how GANs are able to generate photorealistic images, but the results so far did not look adequate for the quality standards of the visual media production industry. We present an optimized image generation process based on Deep Convolutional Generative Adversarial Networks (DCGANs), in order to create photorealistic high-resolution images (up to 1024×1024 pixels). Furthermore, the system was fed with a limited dataset of images, less than two thousand images. All these results give more clues about future exploitation of GANs in Computer Graphics and Visual Effects.
Gatys et al 2015 recently introduced a neural algorithm that renders a content image in the style of another image, achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical application. Fast approximations with feed-forward neural networks have been proposed to speed up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed set of styles and cannot adapt to arbitrary new styles.
In this paper, we present a simple yet effective approach that for the first time enables arbitrary style transfer in real-time. At the heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features.
Our method achieves speed comparable to the fastest existing approach, without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as content-style trade-off, style interpolation, color & spatial controls, all using a single feed-forward neural network.
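The AdaIN operation itself is a two-line statistic swap: re-normalize each channel of the content features so that it carries the per-channel mean and standard deviation of the style features.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization on (B, C, H, W) feature maps
    (e.g. from a VGG-style encoder): align the content features' channel-wise
    mean and std with those of the style features."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Because the operation has no learned, style-specific parameters, a new style only requires new feature statistics, which is what makes arbitrary styles possible in a single feed-forward pass.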
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.
Referring to existing illustrations helps novice drawers to realize their ideas.
To find such helpful references from a large image collection, we first build a semantic vector representation of illustrations by training convolutional neural networks. As the proposed vector space correctly reflects the semantic meanings of illustrations, users can efficiently search for references with similar attributes. Besides the search with a single query, a semantic morphing algorithm that searches the intermediate illustrations that gradually connect two queries is proposed.
Several experiments were conducted to demonstrate the effectiveness of our methods.
Japanese Woodblock Print Search: Ukiyo-e Search provides an incredible resource: The ability to both search for Japanese woodblock prints by simply taking a picture of an existing print AND the ability to see similar prints across multiple collections of prints.
…The Ukiyo-e.org database and image similarity analysis engine, created by John Resig to aid researchers in the study of Japanese woodblock prints, was launched in December 2012. The database currently contains over 213,000 prints from 24 institutions and, as of September 2013, has received 3.4 million page views from 150,000 people.
The database has the following major features:
A database of Japanese woodblock print images and metadata aggregated from a variety of museums, universities, libraries, auction houses, and dealers around the world.
An indexed text search engine of all the metadata provided by the institutions about the prints.
An image search engine of all the images in the database, searchable by uploading an image of a print.
Each print image is analyzed and compared against all other print images in the database. Similar prints are displayed together for comparison and analysis.
Multiple copies of the same print are automatically lined up with each other and made viewable in a gallery for easy comparison.
The entire web site, and all artist information contained within it, is available in both English and Japanese, aiding international researchers.
These features, available in the Ukiyo-e.org database, are already providing researchers with substantial benefit. New copies of prints have been located by scholars at museums. Museums have been able to correct unattributed prints, finding the correct artist. Prints have been identified by lay people who cannot read Japanese and/or are unable to interpret the imagery depicted in a print.
It is challenging to reconcile information from numerous databases, many of which are in different languages. Equally challenging is finding and utilizing an effective image similarity search engine, one capable of working with images of different sizes, colors, or even in black-and-white.
The Ukiyo-e.org database is already having an important impact on Japanese woodblock print studies and may have implications for visual art research and digital humanities at large.
[TRC is an easy-to-apply cloud credit program which grants free access to up to hundreds of GCP TPUs (typically ~100 pre-emptible individual TPUv2s and a scattering of TPUv2-8s and TPUv3s as of 2020) and sometimes whole TPU pods to researchers & hobbyists like me; I relied on TRC credits to train a variety of GPT-2-1.5b models which are infeasible on consumer GPUs. It took seconds to apply with an email address, they replied in hours with credits, and were highly responsive thereafter as we encountered various TPU issues.]