
frank’s image generation model, explained

Last week, I released a new feature for @nostalgebraist-autoresponder that generates images. Earlier I promised a post explaining how the model works, so here it is.

I’ll try to make this post as accessible as I can, but it will be relatively technical.

Why so technical? The interesting thing (to me) about the new model is not that it makes cool pictures – lots of existing models/techniques can do that – it’s that it makes a new kind of picture which no other model can make, as far as I know. As I put it earlier:

As far as I know, the image generator I made for Frank is the first neural image generator anyone has made that can write arbitrary text into the image!! Let me know if you’ve seen another one somewhere.

The model is solving a hard machine learning problem, which I didn’t really believe could be solved until I saw it work. I had to “pull out all the stops” to do this one, building on a lot of prior work. Explaining all that context for readers with no ML background would take a very long post.

tl;dr for those who speak technobabble: the new image generator is OpenAI-style denoising diffusion, with a 128x128 base model and a 128->256 superresolution model, both with the same set of extra features added. The extra features are: a transformer text encoder with character-level tokenization and T5 relative position embeddings; a layer of image-to-text and then text-to-image cross-attention between each resnet layer in the lower-resolution parts of the U-Net’s upsampling stack, using absolute axial position embeddings in image space; a positional “line embedding” in the text encoder that does a cumsum of newlines; and information about the diffusion timestep injected in two places, as another embedding fed to the text encoder, and injected with AdaGN into the queries of the text-to-image cross-attention. I used the weights of the trained base model to initialize the parts of the superresolution model’s U-Net that deal with resolutions below 256.

This post is extremely long, so the rest is under a readmore

The task

The core of my bot is a text generator. It can only see text.

People post a lot of images on tumblr, though, and the bot would miss out on a lot of key context if these images were totally invisible to it.

So, long ago, I let my bot “see” pictures by sending them to AWS Rekognition’s DetectText endpoint. This service uses a scene text recognition (STR) model to read text in the image, if it exists. (“STR” is the term for OCR when the pictures aren’t necessarily printed text on paper.)

If Rekognition saw any text in the image, I let the bot see the text, between special delimiters so it knows it’s an image.

For example, when Frank read the OP of this post, this is what the generator model saw:

#1 fipindustries posted:

i was perusing my old deviant art page and i came across a thing of beauty.

the ultimate “i was a nerdy teen in the mid 2000′s starter pack”. there was a challenge in old deviant art where you had to show all the different characters that had inspired an OC of yours. and so i came up with this list

=======
“Inspirations Meme” by Phantos
peter
=======

(This is actually less information than I get back from AWS. It also gives me bounding boxes, telling me where each line of text is in the image. I figured GPT wouldn’t be able to do much with this info, so I exclude it.)
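
For the curious, the glue code for this is conceptually very simple. Here’s a minimal sketch (the function name and details are illustrative, not the bot’s actual code; it just shows the standard Rekognition DetectText call and the delimiter format above):

    import boto3

    def image_to_transcript(image_bytes):
        """Read text in an image with Rekognition, formatted the way the generator sees it."""
        client = boto3.client("rekognition")
        response = client.detect_text(Image={"Bytes": image_bytes})
        # Rekognition returns both whole lines and individual words; keep the lines,
        # and ignore the bounding box info entirely
        lines = [
            d["DetectedText"]
            for d in response["TextDetections"]
            if d["Type"] == "LINE"
        ]
        if not lines:
            return ""
        return "=======\n" + "\n".join(lines) + "\n======="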

Images are presented this way, also, in the tumblr dataset I use to finetune the generator.

As a result, the generator knows that people post images, and it knows a thing or two about what types of images people post in what contexts – but only through the prism of what their STR transcripts would look like.

This has the inevitable – but weird and delightful – result that the generator starts to invent its own “images,” putting them in its posts. These invented images are transcripts without originals (!). Invented tweets, represented the way STR would view a screenshot of them, if they existed; enigmatically funny strings of words that feel like transcripts of nonexistent memes; etc.

So, for a long time, I’ve had a vision of “completing the circuit”: generating images from the transcripts, images which contain the text specified in the transcripts. The novel pictures the generator is imagining itself seeing, through the limited prism of STR.

It turns out this is very difficult.

Image generators: surveying the field

We want to make a text-conditioned image generation model, which writes the text into the generated image.

There are plenty of text-conditioned image generators out there: DALL-E, VQGAN+CLIP, (now) GLIDE, etc. But they don’t write the text, they just make an image the text describes. (Or, they may write text on occasion, but only in a very limited way.)

When you design a text-conditioned image generation method, you make two nearly independent choices:

  1. How do you generate images at all?
  2. How do you make the images depend on the text?

That is, all these methods (including mine) start with some well-proven approach for generating images without the involvement of text, and then add in the text aspect somehow.

Let’s focus on the first part first.

There are roughly 4 distinct flavors of image generator out there. They differ largely in how they provide signal to the model during training about which images are plausible. A survey:

1. VAEs (variational autoencoders).

These have an “encoder” part that converts raw pixels to a compressed representation – e.g. 512 floating-point numbers – and a “decoder” part that converts the compressed representation back into pixels.

The compressed representation is usually referred to as “the latent,” a term I’ll use below.

During training, you tell the model to make its input match its output; this forces it to learn a good compression scheme. To generate a novel image, you ignore the encoder part, pick a random value for the latent, and turn it into pixels with the decoder.

That’s the “autoencoder” part. The “variational” part is an extra term in the loss that tries to make the latents fill up their N-dimensional space in a smooth, uniform way, rather than squashing all the training images into small scrunched-up pockets of space here and there. This increases the probability that a randomly chosen latent will decode to a natural-looking image, rather than garbage.
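
In code, the whole training objective is just a couple of lines. A minimal sketch in PyTorch (the standard VAE recipe, not any specific implementation):

    import torch
    import torch.nn.functional as F

    def vae_loss(encoder, decoder, x):
        # the encoder outputs the parameters of a Gaussian over the latent
        mu, logvar = encoder(x)
        # "reparameterization trick": sample a latent in a differentiable way
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # "autoencoder" part: the decoder's output should match the input
        recon = F.mse_loss(decoder(z), x)
        # "variational" part: keep the latents close to a standard normal,
        # so a randomly drawn z at generation time decodes to something sensible
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl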

VAEs on their own are not as good as the other methods, but provide a foundation for VQ-autoregressive methods, which are now popular. (Though see this paper)

2. GANs (generative adversarial networks).

Structurally, these are like VAEs without the encoder part. They just have a latent, and a decoder that turns the latent into pixels.

How do you teach the decoder what images ought to look like? In a GAN, you train a whole separate model called the “discriminator,” which looks at pixels and tries to decide whether they’re a real picture or a generated one.

During training, the “G” (generator) and the “D” (discriminator) play a game of cat-and-mouse, where the G tries to fool the D into thinking its pictures are real, and the D tries not to get fooled.

To generate a novel image, you do the same thing as with a VAE: pick a random latent and feed it through the G (here, ignoring the D).

GANs are generally high-performing, but famously finicky/difficult to train.
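
The cat-and-mouse game, in schematic PyTorch (a sketch of the vanilla recipe; real GANs pile lots of tricks on top of this, which is part of why they’re finicky):

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, opt_G, opt_D, real_images, latent_dim):
        batch = real_images.shape[0]

        # discriminator step: real images should score "real", generated ones "fake"
        z = torch.randn(batch, latent_dim, device=real_images.device)
        fake_images = G(z).detach()
        real_logits, fake_logits = D(real_images), D(fake_images)
        d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                  + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # generator step: try to make D call the generated images "real"
        z = torch.randn(batch, latent_dim, device=real_images.device)
        gen_logits = D(G(z))
        g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()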

3. VQVAEs (vector quantized VAEs) + autoregressive models.

These have two parts (you may be noticing a theme).

First, you have a “VQVAE,” which is like a VAE, with two changes to the nature of the latent: it’s localized, and it’s discrete.

Localized: instead of one big floating-point vector, you break the image up into little patches (typically 8x8), and the latent takes on a separate value for each patch.

Discrete: the latent for each patch is not a vector of floating-point numbers. It’s an element of a finite set: a “letter” or “word” from a discrete vocabulary.

Why do this? Because, once you have an ordered sequence of discrete elements, you can “do GPT to it!” It’s just like text!

Start with (say) the upper-leftmost patch, and generate (say) the one to its immediate right, and then the one to its immediate right, etc.

Train the model to do this in exactly the same way you train GPT on text, except it’s seeing representations that your VQVAE came up with.
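
The “vector quantized” part boils down to a nearest-neighbor lookup into a learned codebook; after that, an image is just a grid of token indices you can flatten and feed to a GPT-style model. A rough sketch (ignoring the straight-through gradient trick and other training details):

    import torch

    def quantize(patch_features, codebook):
        # patch_features: (n_patches, d) from the encoder; codebook: (vocab_size, d)
        # each patch gets the index of its nearest codebook vector -- a discrete "word"
        dists = torch.cdist(patch_features, codebook)   # (n_patches, vocab_size)
        return dists.argmin(dim=-1)                     # (n_patches,) token ids

    # then an image is a token sequence, and you train exactly like GPT on text:
    #   tokens = quantize(encoder(image), codebook)
    #   loss = cross_entropy(transformer(tokens[:-1]), tokens[1:])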

These models are quite powerful and popular; see (the confusingly named) “VQ-VAE” and “VQ-VAE-2.”

They get even more powerful in the form of “VQGAN,” an unholy hybrid where the VQ encoder part is trained like a GAN rather than like a VAE, plus various other forbidding bells and whistles.

Somehow this actually works, and in fact works extremely well – at the current cutting edge.

(Note: you can also just “do GPT” to raw pixels, quantized in a simple way with a palette. This hilarious, “so dumb it can’t possibly work” approach is called “Image GPT,” and actually does work OK, but can’t scale above small resolutions.)

4. Denoising diffusion models.

If you’re living in 2021, and you want to be one of the really hip kids on the block – one of the kids who thinks VQGAN is like, sooooo last year – then these are the models for you. (They were first introduced in 2020, but came into their own with two OpenAI papers in 2021.)

Diffusion models are totally different from the above. They don’t have two separate parts, and they use a radically different latent space that is not really a “compressed representation.”

How do they work? First, let’s talk about (forward) diffusion. This just means taking a real picture, and steadily adding more random pixel noise to it, until it eventually becomes purely random static.

Here’s what this looks like (in its “linear” and “cosine” variants), from OA’s “Improved denoising diffusion probabilistic models”:

[image]

OK, that’s … a weird thing to do. I mean, if turning dogs into static entertains you, more power to you, your hobby is #valid. But why are we doing it in machine learning?

Because we can train a model to reverse the process! Starting with static, it gradually removes the noise step by step, revealing a dog (or anything).

There are a few different ways you can parameterize this, but in all of them, the model learns to translate frame n+1 into a probability distribution (or just a point prediction) for frame n. Applying this recursively, you recover the first frame from the last.
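
In the common “predict the noise” parameterization, training is surprisingly simple: noise a real image to a random timestep, and ask the model to guess the noise you added. A minimal sketch (noise-schedule details omitted):

    import torch
    import torch.nn.functional as F

    def diffusion_training_loss(model, x0, alphas_cumprod):
        # x0: batch of real images; alphas_cumprod: (T,) tensor from the noise schedule
        batch = x0.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (batch,), device=x0.device)
        a = alphas_cumprod[t].view(batch, 1, 1, 1)

        # forward diffusion: blend the image with Gaussian noise, more heavily at larger t
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise

        # the model sees the noised image plus the timestep, and predicts the noise;
        # at sampling time you apply this recursively, from pure static back to an image
        return F.mse_loss(model(x_t, t), noise)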

This is another bizarre idea that sounds like it can’t possibly work. All it has at the start is random noise – this is its equivalent of the “latent,” here.

(Although – since the sampling process is stochastic, unless you use a specific deterministic variant called DDIM – arguably the random draws at every sampling step are an additional latent. A different random seed will give you a different image, even from the same starting noise.)

Through the butterfly effect, one arrangement of random static gradually “decodes to” a dog, and another one gradually “decodes to” a bicycle, or whatever. It’s not that the one patch of RGB static is “more doglike” than the other; it just so happens to send the model on a particular self-reinforcing trajectory of imagined structure that spirals inexorably towards dog.

But it does work, and quite well. How well? Well enough that the 2nd 2021 OA paper on diffusion was titled simply, “Diffusion Models Beat GANs on Image Synthesis.”

Conditioning on text

To make an image generator that bases the image on text, you pick one of the approaches above, and then find some way to feed text into it.

There are essentially 2 ways to do this:

The hard way: the image model can actually see the text

This is sort of the obvious way to do it.

You make a “text encoder” similar to GPT or BERT or w/e, that turns text into an encoded representation. You add a piece to the image generator that can look at the encoded representation of the text, and train the whole system end-to-end on text/image pairs.

If you do this by using a VQVAE, and simply feed in the text as extra tokens “before” all the image tokens – using the same transformer for both the “text tokens” and the VQ “image tokens” – you get DALL-E.

If you do this by adding a text encoder to a diffusion model, you get … my new model!! (Well, that’s the key part of it, but there’s more)

My new model, or GLIDE. Coincidentally, OpenAI was working on the same idea around the same time as me, and released a slightly different version of it called GLIDE.
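
For concreteness: the piece that lets my diffusion model “look at” the encoded text is cross-attention, where the image features are the queries and the text encoder’s outputs are the keys and values. A stripped-down sketch (not my exact layer, which also injects timestep info into the queries):

    import torch
    import torch.nn as nn

    class TextToImageCrossAttention(nn.Module):
        def __init__(self, image_dim, text_dim, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(image_dim, heads,
                                              kdim=text_dim, vdim=text_dim,
                                              batch_first=True)

        def forward(self, image_features, text_features):
            # image_features: (batch, h*w, image_dim) -- one "token" per spatial position
            # text_features:  (batch, text_len, text_dim) -- output of the text encoder
            out, _ = self.attn(query=image_features,
                               key=text_features,
                               value=text_features)
            return image_features + out   # residual connection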

—–

This text-encoder approach is fundamentally more powerful than the other one I’ll describe next. But also much harder to get working, and it’s hard in a different way for each image generator you try it with.

Whereas the other approach lets you take any image generator, and give it instant wizard powers. Albeit with limits.

Instant wizard powers: CLIP guidance

CLIP is an OpenAI text-image association model trained with contrastive learning, which is a mindblowingly cool technique that I won’t derail this post by explaining. Read the blog post, it’s very good.

The relevant tl;dr is that CLIP looks at texts and images together, and matches up images with texts that would be reasonable captions for them on the internet. It is very good at this. But, this is the only thing it does. It can’t generate anything; it can only look at pictures and text and decide whether they match.

So here’s what you do with CLIP (usually).

You take an existing image generator, from the previous section. You take a piece of text (your “prompt”). You pick a random compressed/latent representation, and use the generator to make an image from it. Then ask CLIP, “does this match the prompt?”

At this point, you just have some randomly chosen image. So, CLIP, of course, says “hell no, this doesn’t match the prompt at all.”

But CLIP also tells you, implicitly, how to change the latent representation so the answer is a bit closer to “yes.”

How? You take CLIP’s judgment, which is a complicated nested function of the latent representation: schematically,

judgment = clip(text, image_generator(latent))

All the functions are known in closed form, though, so you can just … analytically take the derivative with respect to “latent,” chain rule-ing all the way through “clip” and then through “image_generator.”

That’s a lot of calculus, but thankfully we have powerful chain rule calculating machines called “pytorch” and “GPUs” that just do it for you.

You move latent a small step in the direction of this derivative, then recompute the derivative again, take another small step, etc., and eventually CLIP says “hell yes” because the picture looks like the prompt.

This doesn’t quite work as stated, though, roughly because the raw CLIP gradients can’t break various symmetries like translation/reflection that you need to break to get a natural image with coherent pieces of different-stuff-in-different-places.

(This is especially a problem with VQ models, where you assign a random latent to each image patch independently, which will produce a very unstructured and homogeneous image.)

To fix this, you add “augmentations” like randomly cropping/translating the image before feeding it to CLIP. You then use the averaged CLIP derivatives over a sample of (say) 32 randomly distorted images to take each step.
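
Put together, CLIP guidance is a little gradient-ascent loop over the latent, with random augmentations applied before CLIP looks at anything. Roughly (a sketch; image_generator, clip_score, and random_augment are stand-ins for the real components):

    import torch

    def clip_guided_latent(image_generator, clip_score, random_augment,
                           text, latent, steps=500, lr=0.05, n_aug=32):
        latent = latent.clone().requires_grad_(True)
        opt = torch.optim.Adam([latent], lr=lr)
        for _ in range(steps):
            image = image_generator(latent)
            # average CLIP's judgment over many randomly cropped/distorted views
            views = torch.cat([random_augment(image) for _ in range(n_aug)])  # assumes augment keeps the batch dim
            loss = -clip_score(text, views).mean()   # ascend on the match score
            opt.zero_grad()
            loss.backward()    # chain rule through CLIP and the image generator
            opt.step()
        return image_generator(latent).detach()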

A crucial and highly effective augmentation – for making different-stuff-in-different-places – is called “cutouts,” and involves blacking out everything in the image but a random rectangle. Cutouts is hugely helpful, but also causes some glitches, and is (I believe) the cause of the phenomenon where “AI-generated” images often put a bunch of distinct unrelated versions of a scene onto the same canvas.

This CLIP-derivative-plus-augmentations thing is called CLIP guidance. You can use it with whichever image generator you please.

The great thing is you don’t need to train your own model to do the text-to-image aspect – CLIP is already a greater text-to-image genius than anything you could train, and its weights are free to download. (Except for the forbidden CLIPs, the best and biggest CLIPs, which are OA’s alone. But you don’t need them.)

For the image generator, a natural choice is the very powerful VQGAN – which gets you VQGAN+CLIP, the source of most of the “AI-generated images” you’ve seen papered all over the internet in 2021.

You know, the NeuralBreeders, or the ArtBlenders, or whatever you’re calling the latest meme one. They’re all just VQGAN+CLIP.

Except, sometimes they’re a different thing, pioneered by RiversHaveWings: CLIP-guided diffusion. Which is just like VQGAN+CLIP, except instead of VQGAN, the image generator is a diffusion model.

(You can also do something different called CLIP-conditioned diffusion, which is cool but orthogonal to this post)

Writing text … ?

OK but how do you get it to write words into the image, though.

None of the above was really designed with this in mind, and most of it just feels awkward for this application.

For instance…

Things that don’t work: CLIP guidance

CLIP guidance is wonderful if you don’t want to write the text. But for writing text, it has many downsides:

  • CLIP can sort of do some basic OCR, which is neat, but it’s not nearly good enough to recognize arbitrary text. So, you’d have to finetune CLIP on your own text/image data.
  • CLIP views images at a small resolution, usually 224x224. This is fine for its purposes, but may render some text illegible.
  • Writing text properly means creating a coherent structure of parts in the image, where their relation in space matters. But the augmentations, especially cutouts, try to prevent CLIP from seeing the image globally. The pictures CLIP actually sees will generally be crops/cutouts that don’t contain the full text you’re trying to write, so it’s not clear you even want CLIP to say “yes.” (You can remove these augmentations, but then CLIP guidance loses its magic and starts to suck.)

I did in fact try this whole approach, with my own trained VQVAE, and my own finetuned CLIP.

This didn’t really work, in exactly the ways you’d expect, although the results were often very amusing. Here’s my favorite one – you might even be able to guess what the prompt was:

[image]

OK, forget CLIP guidance then. Let’s do it the hard way and use a text encoder.

I tried this too, several times.

Things that don’t work: DALL-E

I tried training my own DALL-E on top of the same VQVAE used above. This was actually the first approach I tried, and where I first made the VQVAE.

(Note: that VQVAE itself can auto-encode pictures from tumblr splendidly, so it’s not the problem here.)

This failed more drastically. The best I could ever get was these sort of “hieroglyphics”:

[image]

This makes sense, given that the DALL-E approach has steep downsides of its own for this task. Consider:

  • The VQVAE imposes an artificial “grain” onto the image, breaking it up into little patches of (typically) 8x8 pixels. When text is written in an image, the letters could be aligned anywhere with respect to this “grain.”

    The same letters will look very different if they’re sitting in the middle of a VQ patch, vs. if they’re sitting right on the edge between two, or mostly in one patch and partly in another. The generator has to learn the mapping from every letter (or group of letters) to each of these representations. And then it has to do that again for every font size! And again for every font!
  • Learning to “do GPT” on VQ patches is generally just harder than learning to do stuff on raw pixels, since the relation to the image is more abstract. I don’t think I had nearly enough data/compute for a VQ-autoregressive model to work.

Things that don’t work: GANs with text encoders

OK, forget DALL-E … uh … what if we did a GAN, I guess?? where both the G and the D can see the encoded text?

This was the last thing I tried before diffusion. (StyleGAN2 + DiffAug, with text encoder.) It failed, in boring ways, though I tried hard.

GANs are hard to train and I could never get the thing to “use the text” properly.

One issue was: there is a lot of much simpler stuff for the G and D to obsess over, and make the topic of their game, before they have to think about anything as abstract as text. So you have to get pretty far into GAN training before the text would matter at all, and only at that point does the text encoder start being relevant.

But I think a deeper issue was that VAE/GAN-style latent states don’t really make sense for text. I gave the G both the usual latent vector and a text encoding, but this effectively implies that every possible text should be compatible with every possible image.

For that to make sense, the latent should have a contextual meaning conditional on the text, expressing a parameterization of the space of “images consistent with this text.” But that intuitively seems like a relatively hard thing for an NN to learn.

Diffusion

Then I was on the EleutherAI discord, and RiversHaveWings happened to say this:

[image]

And I thought, “oh, maybe it’s time for me to learn this new diffusion stuff. It won’t work, but it will be educational.”

So I added a text encoder to a diffusion model, using cross-attention. Indeed, it didn’t work.

Things that don’t work: 256x256 diffusion

For a long time, I did all my diffusion experiments at 256x256 resolution. This seemed natural: it was the biggest size that didn’t strain the GPU too much, and it was the smallest size I’d feel OK using in the bot. Plus I was worried about text being illegible at small resolutions.

For some reason, I could never get 256x256 text writing to work. The models would learn to imitate fonts, but they’d always write random gibberish in them.

I tried a bunch of things during this period that didn’t fix the problem, but which I still suspect were very helpful later:

  • Timestep embeddings: at some point, RiversHaveWings pointed out that my text encoder didn’t know the value of the diffusion timestep. This was bad b/c presumably you need different stuff from the text at different noise levels. I added that. Also added some other pieces like a “line embedding,” and timestep info injected into the cross-attn queries.
  • Line embeddings: I was worried my encoder might have trouble learning to determine which tokens were on which line of text. So I added an extra positional embedding that expresses how many newlines have happened so far (sketched in code just after this list).
  • Synthetic data: I made a new, larger synthetic, grayscale dataset of text in random fonts/sizes on flat backgrounds of random lightness. This presented the problem in a much crisper, easier to learn form. (This might have helped if I’d had it for the other approaches, although I went back and tried DALL-E on it and still got hieroglyphics, so IDK.)
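
The line embedding, sketched: it’s just a cumulative sum over newline tokens, used as an index into an embedding table (assuming character-level tokens and a hypothetical newline_id):

    import torch
    import torch.nn as nn

    class LineEmbedding(nn.Module):
        def __init__(self, max_lines, dim, newline_id):
            super().__init__()
            self.emb = nn.Embedding(max_lines, dim)
            self.newline_id = newline_id

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) of character-level token ids
            # line index = how many newlines have happened so far
            is_newline = (token_ids == self.newline_id).long()
            line_idx = is_newline.cumsum(dim=-1).clamp(max=self.emb.num_embeddings - 1)
            return self.emb(line_idx)   # (batch, seq_len, dim), added to the token embeddings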

Baby’s first words: 64x64 diffusion

A common approach with diffusion models is to make 2 of them, one at low resolution, and one that upsamples low-res images to a higher resolution.

At wit’s end, I decided to try this, starting with a 64x64 low-res model. I trained it with my usual setup, and …

it can write!!!

It can write, in a sense … but with misspellings. Lots of misspellings. Epic misspellings.

One of my test prompts, which I ran on all my experimental models for ease of comparison, was the following (don’t ask):

the
what string
commit
String evolved
LEGGED

Here are two samples from the model, both with this prompt. (I’ve scaled them up in GIMP just so they’re easier to see, which is why they’re blurry.)

[image]
[image]

Interestingly, the misspellings vary with the conditioning noise (and the random draws during sampling since I’m not using DDIM). The model has “overly noisy/uncertain knowledge” as opposed to just being ignorant.

Spelling improves: relative positional embeddings

At this point, I was providing two kinds of position info to the model:

- Which line of text is this? (line embedding)

- Which character in the string is this, counting from the first one onward? (Absolute pos embedding)

I noticed that the model often got spelling right near the beginning of lines, but degraded later in them. I hypothesized that it was having trouble reconstructing relative position within a line from the absolute positions I was giving it.

cfoster0 on discord suggested I try relative positional embeddings, which together with the line embedding should convey the right info in an easy-to-use form.

I tried this, using the T5 version of relative positional embeddings.
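
The gist of relative positional embeddings: instead of telling each token “you are character #37,” the attention layer gets a learned bias that depends only on the (signed, clipped) distance between the query and key positions. A simplified sketch (this just clips distances; the actual T5 scheme also buckets large distances logarithmically):

    import torch
    import torch.nn as nn

    class RelativePositionBias(nn.Module):
        def __init__(self, num_heads, max_distance=128):
            super().__init__()
            self.max_distance = max_distance
            # one learned bias per head per clipped relative distance
            self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

        def forward(self, seq_len):
            pos = torch.arange(seq_len)
            rel = pos[None, :] - pos[:, None]                 # signed distance key - query
            rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
            # (num_heads, seq_len, seq_len), added to the attention logits
            return self.bias(rel).permute(2, 0, 1)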

This dramatically improved spelling. Given that test prompt, this model spelled it exactly right in 2 of 4 samples I generated:

[image]
[image]

I showed off the power of this new model by thanking discord user cfoster0 for their suggestion:

[image]

(At some point slightly before this, I also switched from a custom BPE tokenizer to character-level tokenization, which might have helped.)

Doing it for real: modeling text in natural images

OK, we can write text … at least, in the easiest possible setting: tiny 64x64 images that contain only text on a flat background, nothing else.

The goal, though, is to make “natural” images that just so happen to contain text.

Put in a transcript of a tweet, get a screenshot of a tweet. Put in a brief line of text, get a movie still with the text as the subtitle, or a picture of a book whose title is the text, or something. This is much harder:

  • I have fewer real tumblr images (around 169k) than synthetic text images (around 407k, and could make more if needed)
  • The real data is much more diverse and complex
  • The real data introduces new ways the image can depend on the text
  • Much of the real data is illegible at 64x64 resolution

Let’s tackle the resolution issue first. On the synthetic data, we know 64x64 works, and 256x256 doesn’t work (even with relative embeds.)

What about 128x128, though? For some reason, that works just as well as 64x64! It’s still small, and ideally I’d want to make images bigger than that, but it makes legibility less of a concern.

OK, so I can generate text that looks like the synthetic dataset, at 128x128 resolution. If I just … finetune that model on my real dataset, what happens?

It works!

The model doesn’t make recognizable objects/faces/etc most of the time, which is not surprising given the small size and diverse nature of the data set.

But it does learn the right relationships between text and image, without losing its ability to write text itself. It does misspell things sometimes, about as often as it did on the synthetic data, but that seems acceptable.

Here’s a generated tweet from this era:

[image]

The prompt for this was a real STR transcript of this tweet. (Sorry about the specific choice of image here, it’s just a tweet that ended up in my test split and was thus a useful testing prompt)

At this point, I was still doing everything in monochrome (with monochrome noise), afraid that adding color might screw things up. Does it, though?

Nope! So I re-do everything in color, although the synthetic font data is still monochrome (but now diffused with RGB noise). Works just as well.

(Sometime around this point, I added extra layers of image-to-text cross-attn before each text-to-image one, with an FF layer in the middle. This was inspired by another cfoster0 suggestion, and I thought it might help the model use image context to guide how it uses the text.

This is called “weave attn” in my code. I don’t know if it’s actually helpful, but I use it from here on.)

One last hurdle: embiggening

128x128 is still kinda small, though.

Recall that, when I originally did 64x64, the plan was to make a second “superresolution” model later to convert small images into bigger ones.

(You do this, in diffusion, by simply giving the model the [noiseless] low-res image as an extra input, alongside the high-res but noised image that is an input to any diffusion model. In my case, I also fed it the text, using the same architecture as elsewhere.)
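
Concretely, “giving the model the low-res image as an extra input” usually just means upsampling it to the target size and concatenating it with the noisy high-res input along the channel dimension, so the U-Net’s first conv sees 6 channels instead of 3. A sketch of that wiring (not my exact code):

    import torch
    import torch.nn.functional as F

    def superres_model_input(noisy_highres, lowres):
        # noisy_highres: (batch, 3, 256, 256) -- the usual diffusion input at timestep t
        # lowres:        (batch, 3, 128, 128) -- the noiseless image to be upsampled
        lowres_up = F.interpolate(lowres, size=noisy_highres.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # concatenate along channels; the U-Net just takes 6 input channels
        return torch.cat([noisy_highres, lowres_up], dim=1)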

How was that going? Not actually that well, even though it felt like the easy part.

My 128 -> 256 superresolution models would get what looked like great metrics, but when I looked at the upsampled images, they looked ugly and fuzzy, a lot like low-quality JPEGs.

I had warm-started the text encoder part of the super-res model with the encoder weights from the base model, so it should know a lot about text. But it wasn’t very good at scaling up small, closely printed text, which is the most important part of its job.

I had to come up with one additional trick, to make this last part work.

My diffusion models use the standard architecture for such models, the “U-net,” so called for its U shape.

It takes the image, processes it a bit, and then downsamples it to half the resolution. It processes it there, then downsamples it again, etc – all the way down to 8x8. Then it goes back up again, to 16x16, etc. When it reaches the original resolution, it spits out its prediction for the noise.

Therefore, most of the structure of my 256-res model looks identical to the structure of my 128-res model. Only it’s “sandwiched” between a first part that downsamples from 256, and a final part that upsamples to 256.

The trained 128 model knows a lot about how these images tend to look, and about writing text. What if I warm-start the entire middle of the U-Net with weights of the 128 model?

That is, at initialization, my 256 super-res model would just be my 128 model sandwiched inside two extra parts, with random weights for the “bread” only.
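
Mechanically, the warm start is just copying over every parameter whose name and shape match between the two models, and leaving the rest (the “bread”) at random init. A sketch, assuming the shared layers keep the same parameter names (in practice you may need to remap a few):

    import torch

    def warm_start_from_base(superres_model, base_checkpoint_path):
        base_state = torch.load(base_checkpoint_path, map_location="cpu")  # assumed to be a raw state_dict
        own_state = superres_model.state_dict()
        # copy only the parameters that exist in both models with the same shape
        compatible = {
            name: tensor
            for name, tensor in base_state.items()
            if name in own_state and own_state[name].shape == tensor.shape
        }
        superres_model.load_state_dict(compatible, strict=False)
        return superres_model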

I can imagine lots of reasons this might not work, but it was easy to try, and in fact it did work!

Super-res models initialized in this way rapidly learned to do high-quality upsampling, both of text and non-text image elements.

At this point, I had the model (or rather, the two models) I would deploy in the bot.

Using it in practice: rejection sampling

To use this model in practice, the simplest workflow would be:

  1. Generate a single 128x128 image from the prompt
  2. Using the prompt and the 128x128 image, upsample to 256x256
  3. We’re done

However, recall that we have access to an STR model, which we can ask to read images.

In some sense, the point of all this work is to “invert” the STR model, making images from STR transcripts. If this worked perfectly, feeding the image we make through STR would always return the original prompt.

The model isn’t that good, but we can get it closer by using this workflow instead:

  1. Generate multiple 128x128 images from the prompt
  2. Read all the 128x128 images with STR
  3. Using some metric like n-gram similarity, measure how close the transcripts are to the original prompt, and remove the “worst” images from the batch
  4. Using the prompt and the 128x128 images that were kept in step 3, upsample to 256x256
  5. Feed all the 256x256 images through STR
  6. Pick the 256x256 image that most closely matches the prompt
  7. We’re done

For step 3, I use character trigram similarity and a slightly complicated pruning heuristic with several thresholds. The code for this is here.
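
For reference, character trigram similarity is just the overlap between the sets of 3-character substrings of the prompt and of the STR transcript. A minimal Jaccard-style version (the deployed heuristic is more involved, as noted):

    def char_trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def trigram_similarity(prompt, transcript):
        a, b = char_trigrams(prompt), char_trigrams(transcript)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    # rank a batch of generated images by how well STR "reads back" the prompt:
    #   scores = [trigram_similarity(prompt, str_read(img)) for img in images]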

Why did diffusion work?

A few thoughts on why diffusion worked for this problem, unlike anything else:

- Diffusion doesn’t have the problem that VQ models have, where the latent exists on an arbitrary grid, and the text could have any alignment w/r/t the grid.

- Unlike VQ models, and GAN-type models with a single vector latent, the “latent” in diffusion isn’t trying to parameterize the manifold of plausible images in any “nice” way. It’s just noise.

Since noise works fine without adding some sort of extra “niceness” constraint, we don’t have to worry about the constraint being poorly suited to text.

- During training, diffusion models take partially noised real images as inputs, rather than getting a latent and having to invent the entire image de novo. And it only gets credit for making this input less noised, not for any of the structure that’s already there.

I think this helps it pick up nuances like “what does the text say?” more quickly than other models. At some diffusion timesteps, all the obvious structure (that other models would obsess over) has already been revealed, and the only way to score more points is to use nuanced knowledge.

In a sense, diffusion learns the hard stuff and the easy stuff in parallel, rather than in stages like other models. So it doesn’t get stuck in a trap where it over-trains itself to one stage, and then can’t learn the later stages, because the loss landscape has a barrier in between (?). I don’t know how to make this precise, but it feels true.

Postscript: GLIDE

Three days before I deployed this work in the bot, OpenAI released its own text-conditioned diffusion model, called GLIDE. I guess it’s an idea whose time has come!

Their model is slightly different in how it joins the text encoder to the U-net. Instead of adding cross-attn, it simply appends the output of the text encoder as extra positions in the standard attention layers, which all diffusion U-Nets have in their lower-resolution middle layer(s).

I’m not sure if this would have worked for my problem. (I don’t know because they didn’t try to make their model write text – it models the text-image relation more like CLIP and DALL-E.)

In any event, it makes bigger attn matrices than my approach, of size (text_len + res^2)^2 rather than my (text_len * res^2). The extra memory needed might be prohibitive for me in practice, not sure.
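
For a rough sense of scale, at a 16x16 attention resolution with, say, 384 text tokens (illustrative numbers, not the real configs):

    text_len, res = 384, 16                  # illustrative, not the actual configs
    cross_attn = text_len * res**2           # my approach:     98,304 entries per head
    glide_style = (text_len + res**2)**2     # GLIDE-style:    409,600 entries per head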

I haven’t tried their approach, and it’s possible it would beat mine in a head-to-head comparison on this problem. If so, I would want to use theirs instead.

The end

Thanks for reading this giant post!

Thanks again to people in EleutherAI discord for help and discussion.

You can see some of the results in this tag.
