[Slight variant on Jukebox] Modeling long-term dependencies for audio signals is a particularly challenging problem, as even small time scales yield on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures became good at modeling dependencies over longer time scales, but the quadratic cost of attention limits how far they can scale.
We propose a generative autoregressive architecture that can model audio waveforms over a large context, greater than 500,000 samples. Our model learns time dependencies by first extracting a latent representation with a CNN front-end and then modeling dependencies over these representations with Transformer encoders, trained fully end-to-end: the network is thus free to learn whatever representations best predict the next sample.
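A minimal sketch of this design in PyTorch (layer sizes, strides, and the quantized next-sample head are illustrative assumptions, not the paper's configuration): a strided CNN front-end compresses the waveform into latent frames, a causally-masked Transformer encoder models dependencies across them, and a linear head predicts the next sample, all trained end-to-end.

```python
# Hypothetical sketch: CNN front-end + Transformer encoder for next-sample
# prediction over raw audio. Sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class AudioTransformerLM(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=6, n_bins=256):
        super().__init__()
        # CNN front-end: strided 1-D convolutions shrink the waveform into a
        # much shorter sequence of latent frames (here 256x shorter).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=16, stride=16), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=16, stride=16), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_bins)  # predict a quantized next sample (e.g. 8-bit mu-law)

    def forward(self, wav):                      # wav: (batch, 1, n_samples)
        z = self.frontend(wav).transpose(1, 2)   # (batch, n_frames, d_model)
        n = z.size(1)
        # Causal mask: each latent frame attends only to earlier frames.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.encoder(z, mask=mask)
        return self.head(h)                      # (batch, n_frames, n_bins)

model = AudioTransformerLM()
logits = model(torch.randn(2, 1, 2**16))         # 65,536-sample context
print(logits.shape)
```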
Unlike previous works that compared across different time scales to show improvement, we evaluate on a standard dataset with a matched number of parameters and context length. We achieve state-of-the-art performance compared with approaches such as WaveNet, SaSHiMi, and SampleRNN on a standard benchmark for modeling long-term structure.
This work points to an exciting direction for the field, since the gains in context modeling can be scaled up with more data, and potentially improved further with billions or trillions of parameters.
We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox (Dhariwal et al 2020): a music generation system containing a language model trained on codified audio from 1M songs.
To determine if Jukebox’s representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from conventional MIR models which are pre-trained on tagging, we find that using representations from Jukebox as input features yields 30% stronger performance on average across four MIR tasks: tagging, genre classification, emotion recognition, and key detection.
For key detection, we observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches. We interpret the strength of Jukebox’s representations as evidence that modeling audio instead of tags provides richer representations for MIR.
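A hedged sketch of this probing setup: frozen, pooled features from the pretrained model are fed to a shallow classifier for a downstream MIR task. The `extract_features` helper, the pooling choice, and the ~4,800-dimensional feature size are placeholders standing in for whatever Jukebox activations are actually used.

```python
# Hypothetical probing sketch: shallow classifier on frozen features from a
# pretrained codified-audio language model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def extract_features(waveforms):
    # Placeholder: in practice, run the pretrained model and mean-pool a
    # chosen layer's activations over time (assumed ~4,800-dim here).
    return rng.standard_normal((len(waveforms), 4800)).astype(np.float32)

waveforms = [None] * 200                        # stand-in for 200 audio clips
labels = rng.integers(0, 10, size=200)          # e.g. 10 genre classes

X = extract_features(waveforms)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)       # shallow model on frozen features
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```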
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos.
VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.
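A rough sketch of the two-stage idea (shapes, downsampling factors, and the nearest-neighbour quantization are illustrative assumptions; axial attention and the GPT itself are omitted): a 3D-convolutional encoder downsamples a clip to a small latent grid, the latents are snapped to a codebook, and the flattened code indices plus separate t/h/w position embeddings form the token sequence an autoregressive prior would model.

```python
# Hypothetical VideoGPT-style sketch: 3D-conv downsampling, codebook
# quantization, and flattening to tokens with spatio-temporal positions.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 3, 16, 64, 64            # a short RGB video clip
codebook_size, d_latent, d_model = 1024, 64, 256

encoder = nn.Sequential(                     # downsample time 2x, space 4x (illustrative)
    nn.Conv3d(C, 64, kernel_size=4, stride=(2, 4, 4), padding=1), nn.ReLU(),
    nn.Conv3d(64, d_latent, kernel_size=3, stride=1, padding=1),
)
codebook = nn.Embedding(codebook_size, d_latent)

video = torch.randn(B, C, T, H, W)
z = encoder(video)                           # (B, d_latent, T', H', W')
_, _, Tq, Hq, Wq = z.shape
z_flat = z.permute(0, 2, 3, 4, 1).reshape(-1, d_latent)
# Nearest-neighbour code assignment (the VQ bottleneck).
dists = torch.cdist(z_flat, codebook.weight)
codes = dists.argmin(dim=-1).view(B, Tq, Hq, Wq)

# Flatten to a token sequence and add separate t/h/w position embeddings.
tokens = codes.view(B, -1)                   # (B, T'*H'*W') indices for the GPT
pos_t, pos_h, pos_w = nn.Embedding(Tq, d_model), nn.Embedding(Hq, d_model), nn.Embedding(Wq, d_model)
t_idx, h_idx, w_idx = torch.meshgrid(torch.arange(Tq), torch.arange(Hq), torch.arange(Wq), indexing="ij")
pos = (pos_t(t_idx) + pos_h(h_idx) + pos_w(w_idx)).view(1, -1, d_model)
tok_emb = nn.Embedding(codebook_size, d_model)(tokens) + pos
print(tokens.shape, tok_emb.shape)           # the sequence the autoregressive prior models
```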
Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF101 and the Tumblr GIF dataset (TGIF).
We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html.
[Paper; samples; followup paper probing Jukebox as pretraining for music analysis (posing similar difficulties in extracting the right embedding as iGPT). An album made using it is Shadow Planet] A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies. One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
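A quick back-of-the-envelope check of these sequence lengths (assuming 44.1 kHz CD audio and the 8x/32x/128x compression factors described further below):

```python
# Sample counts for the context lengths mentioned above (44.1 kHz audio assumed).
sample_rate = 44_100            # samples per second
song_seconds = 4 * 60           # a typical 4-minute song

raw_timesteps = sample_rate * song_seconds
print(f"raw audio timesteps: {raw_timesteps:,}")        # ~10.6 million

# Compressing raw audio by some factor (as the autoencoder does) shrinks the
# sequence an autoregressive model must handle proportionally.
for factor in (8, 32, 128):
    print(f"{factor}x compression -> {raw_timesteps // factor:,} codes")
```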
We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now, in raw audio, our models must learn to tackle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors in short-, medium-, or long-term timing.
…Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and it shows impressive results at generating high-fidelity images…We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8×, 32×, and 128×, respectively, with a codebook size of 2,048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.
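A minimal sketch of the vector-quantization bottleneck at one level of such a hierarchy (the strided 1-D conv encoder/decoder and latent width are hypothetical; the straight-through trick is the standard VQ-VAE gradient estimator, not Jukebox's exact code):

```python
# Single-level VQ bottleneck sketch: a strided 1-D conv encoder compresses the
# waveform (here 8x) and each latent frame snaps to its nearest entry in a
# 2,048-code codebook with a straight-through gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQLevel(nn.Module):
    def __init__(self, codebook_size=2048, d_latent=64, hop=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, d_latent, kernel_size=hop, stride=hop)      # 8x downsample
        self.decoder = nn.ConvTranspose1d(d_latent, 1, kernel_size=hop, stride=hop)
        self.codebook = nn.Embedding(codebook_size, d_latent)

    def forward(self, wav):                                    # wav: (B, 1, n_samples)
        z = self.encoder(wav).transpose(1, 2)                  # (B, n_codes, d_latent)
        dists = torch.cdist(z.reshape(-1, z.size(-1)), self.codebook.weight)
        codes = dists.argmin(-1).view(z.shape[:2])             # discrete tokens for the prior
        zq = self.codebook(codes)
        zq_st = z + (zq - z).detach()                          # straight-through estimator
        recon = self.decoder(zq_st.transpose(1, 2))
        commit = F.mse_loss(z, zq.detach())                    # commitment term of the VQ loss
        return recon, codes, commit

level = VQLevel()
recon, codes, commit = level(torch.randn(1, 1, 44_100))        # one second of audio
print(codes.shape, recon.shape)                                # ~5,512 codes at 8x compression
```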
[Figure: Jukebox architecture]
…The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, substantially improving the audio quality. We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to ~24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively. Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
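The quoted context lengths follow directly from the hop sizes; a quick check (assuming 44.1 kHz raw audio):

```python
# Rough check of the quoted context lengths: 8192 codes per prior at each hop size.
context_codes = 8192
sample_rate = 44_100
for name, hop in (("top", 128), ("middle", 32), ("bottom", 8)):
    seconds = context_codes * hop / sample_rate
    print(f"{name:>6} level: {seconds:5.1f} s of audio")   # ~23.8, ~5.9, ~1.5
```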
…While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a substantial gap between these generations and human-created music. For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render 1 minute of audio through our models, and thus they cannot yet be used in interactive applications.
[A large dataset of 7131 musical songs generated by OpenAI’s Jukebox neural network, classified & searchable by model (including samples generated by models early in training & of low quality), type of song (the model continuing an existing piece of music, generating in the style of a specific artist, given brand-new lyrics, no lyrics at all), genre (hip hop/jazz/country/rock/pop/metal/etc), artist (everyone from Elton John to Rage to Eagles), and sampling temperature (top-k, governing tradeoff between randomness and predictability).]
We introduce Jukebox, a model that generates music with singing in the raw audio domain.
We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes.
We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
We are releasing thousands of non cherry-picked samples, along with model weights and code.
There are two prevailing technical theories about what it will take to reach AGI. In one, all the necessary techniques already exist; it’s just a matter of figuring out how to scale and assemble them. In the other, there needs to be an entirely new paradigm; deep learning, the current dominant technique in AI, won’t be enough. Most researchers fall somewhere between these extremes, but OpenAI has consistently sat almost exclusively on the scale-and-assemble end of the spectrum. Most of its breakthroughs have been the product of sinking dramatically greater computational resources into technical innovations developed in other labs.
Brockman and Sutskever deny that this is their sole strategy, but the lab’s tightly guarded research suggests otherwise. A team called “Foresight” runs experiments to test how far they can push AI capabilities forward by training existing algorithms with increasingly large amounts of data and computing power. For the leadership, the results of these experiments have confirmed its instincts that the lab’s all-in, compute-driven strategy is the best approach. For roughly six months, these results were hidden from the public because OpenAI sees this knowledge as its primary competitive advantage. Employees and interns were explicitly instructed not to reveal them, and those who left signed nondisclosure agreements. It was only in January that the team, without the usual fanfare, quietly posted a paper on one of the primary open-source databases for AI research. People who experienced the intense secrecy around the effort didn’t know what to make of this change. Notably, another paper with similar results from different researchers had been posted a month earlier.
…One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI. [See DALL·E 1, CLIP.]
…The man driving OpenAI’s strategy is Dario Amodei, the ex-Googler who now serves as research director. When I meet him, he strikes me as a more anxious version of Brockman. He has a similar sincerity and sensitivity, but an air of unsettled nervous energy. He looks distant when he talks, his brows furrowed, a hand absentmindedly tugging his curls. Amodei divides the lab’s strategy into two parts. The first part, which dictates how it plans to reach advanced AI capabilities, he likens to an investor’s “portfolio of bets.” Different teams at OpenAI are playing out different bets. The language team, for example, has its money on a theory postulating that AI can develop a substantial understanding of the world through mere language learning. The robotics team, in contrast, is advancing an opposing theory that intelligence requires a physical embodiment to develop. As in an investor’s portfolio, not every bet has an equal weight. But for the purposes of scientific rigor, all should be tested before being discarded. Amodei points to GPT-2, with its remarkably realistic auto-generated texts, as an instance of why it’s important to keep an open mind. “Pure language is a direction that the field and even some of us were somewhat skeptical of”, he says. “But now it’s like, ‘Wow, this is really promising.’” Over time, as different bets rise above others, they will attract more intense efforts. Then they will cross-pollinate and combine. The goal is to have fewer and fewer teams that ultimately collapse into a single technical direction for AGI. This is the exact process that OpenAI’s latest top-secret project has supposedly already begun.
We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large scale image generation. To this end, we scale and enhance the autoregressive priors used in VQ-VAE to generate synthetic samples of much higher coherence and fidelity than possible before. We use simple feed-forward encoder and decoder networks, making our model an attractive candidate for applications where the encoding and/or decoding speed is critical. Additionally, VQ-VAE requires sampling an autoregressive model only in the compressed latent space, which is an order of magnitude faster than sampling in the pixel space, especially for large images. We demonstrate that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state of the art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GAN’s known shortcomings such as mode collapse and lack of diversity.
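To see why sampling only in the compressed latent space is an order of magnitude faster, compare autoregressive step counts; the sizes below are illustrative assumptions, not the paper's exact latent resolutions.

```python
# Rough comparison of autoregressive sampling steps in pixel space vs. a
# compressed latent space (sizes are illustrative).
h = w = 256
pixel_steps = h * w * 3                        # one step per RGB sub-pixel
latent_steps = 32 * 32 + 64 * 64               # hypothetical two-level latent grids
print(f"pixel-space steps:  {pixel_steps:,}")  # 196,608
print(f"latent-space steps: {latent_steps:,}") # 5,120
print(f"speedup factor:     ~{pixel_steps // latent_steps}x")
```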
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length.
In this paper we introduce sparse factorizations of the attention matrix which reduce this to O(n√n). We also introduce (1) a variation on architecture and initialization to train deeper networks, (2) the recomputation of attention matrices to save memory, and (3) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers.
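One of the proposed factorizations is a strided pattern: roughly, one head attends to the previous l positions and another to every l-th earlier position, so with l ≈ √n each row touches O(√n) entries while the two heads together still cover the full history. A toy NumPy sketch of building those masks (sequence length and stride are illustrative):

```python
# Toy construction of a strided sparse-attention pattern: a "local" head over
# the previous l positions and a "strided" head over every l-th earlier position.
import numpy as np

def strided_masks(n, l):
    i = np.arange(n)[:, None]              # query positions
    j = np.arange(n)[None, :]              # key positions
    causal = j <= i
    local = causal & (i - j < l)           # previous l positions
    strided = causal & ((i - j) % l == 0)  # every l-th earlier position
    return local, strided

n, l = 16, 4                               # l chosen near sqrt(n)
local, strided = strided_masks(n, l)
print("attended per row (local | strided):", (local | strided).sum(1))
print("dense causal per row:              ", np.tril(np.ones((n, n), int)).sum(1))
```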
We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64.
We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.