
Language-Conditioned Absolute Unit NNs

Proposal for applying the AUNN neural net architecture to reconstruction of historical documents using pretrained large language models.

As an application of my proposed AUNN MLP neural net architecture which handles arbitrary-modality data, I sketch out a system for plugging large language models (LLMs) into AUNNs. The advantage is that this efficiently provides a highly-informative Greco-Roman language prior for reconstruction of the text of damaged Herculaneum papyri using advanced imaging modalities like X-rays.

Because there is so little raw data, and obtaining more will remain infeasible until convincing reconstructions justify the risk of excavating more fragile papyri (a chicken-and-egg bootstrap problem), it is critical to use all available sources of information jointly & end-to-end.

Since an AUNN concentrates all raw data about all papyri into a single model, it can generate embeddings of the implicit reconstructed text at given locations in a papyrus. These embeddings are differentiable and can be passed into a frozen Greco-Roman large language model to be scored for their plausibility as real natural language, and then run backwards to update the AUNN weights to emit more plausible embeddings.

This constrains the naive raw/physics-only reconstructions (which are highly under-determined by the raw data) to the vanishingly small subset of reconstructions consistent with our extensive data on Greek/Latin natural language, and can potentially produce meaningful reconstructions out of reach of conventional approaches using naive priors & separate analyses.

AUNNs can be combined with other pretrained models or data sources, treating the AUNN as a condensed summary of world knowledge to query as needed (background).

Herculaneum Papyri

Going back to the motivating example of highly-complex multimodal Herculaneum papyrus data, we can observe that scan data may always be insufficient to perfectly reconstruct each letter on a papyrus—when treated in isolation. However, being able to reconstruct the letter corresponding to a tiny bit of carbonized papyrus is unnecessary, as we really want the document as a whole. Paradoxically, the latter may be a vastly easier task.

We are in a situation similar to cryptanalysis, where a process has ‘encrypted’ most of the original information, baffling all conventional methods and appearing secure. Long experience has taught militaries & intelligence agencies never to underestimate what information can be extracted from apparently harmless things or side-channels, and to insist on extreme measures like non-reused one-time pads or complete physical destruction of storage media—“attacks only get better”. Given this, cryptanalysis can succeed in decrypting even the most convoluted brief ciphertexts to recover (much of) the plaintext; beyond notable successes like the WWII defeat of Enigma/Lorenz by cryptographers like Alan Turing, using colossal amounts of compute to try as many possibilities as possible while looking for subtle statistical deviations, we can also point to the Venona project, whose analysts (by thinking hard & using domain knowledge) managed to defeat only-briefly-reused one-time pads. So maybe the Herculaneum papyri can yield their secrets too.1

Language Priors

Why can breaking cryptography work so well? Why don’t you wind up with a morass of equally plausible decrypts?

Meaning is rare. One answer is that meaningful messages are astronomically, absurdly, unlikely; in the Library of Babel of all possible text strings, getting a single coherent word in a short message is rare, and getting an entire coherent, contextually-appropriate message is so vanishingly rare that you can be certain you cracked the cipher. If you are cracking an Enigma message in the middle of WWII, humanity will be long gone before your random decryption attempts accidentally decode a Nazi order to ‘a submarine pack in the middle of the Atlantic’ when the real plaintext was about ‘a sauerkraut shortage in Paris’. Simple statistical properties of natural language, like how common the letter ‘e’ is in English (the equivalent of 0-grams), let you distinguish almost all gibberish mistaken decrypts from progress towards a correct decrypt; and once one has cracked the Nazi settings for the day, all the other messages will also decrypt coherently. Indeed, frequencies are so powerful that even a child can decipher a substitution cipher simply by knowing that the most common symbol is ‘e’ and the next few are probably vowels, and so on; Persi Diaconis offers an even more fun example of an undergraduate exercise in decrypting a real prison gang cipher using just 1-gram character-level transitions in English and a simple Markov chain optimizer, which tweaks a substitution key to try to make the ‘decrypt’ of the gang message match the English statistics (even though the message turns out to actually be “a mix of English, Spanish and prison jargon”!). If 0-gram & 1-gram approaches can work so well, then one is no longer surprised that n-gram approaches can crack much harder ciphers—and these days, neural language models are far better at modeling language than n-grams.
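For intuition, the Diaconis-style exercise fits in a few dozen lines of Python. This is only a sketch of the technique, not the actual course code: the reference corpus `reference_text` and the `ciphertext` are placeholders left to the reader.

```python
# A minimal sketch of the Diaconis-style attack: score candidate substitution
# keys by bigram (character-transition) statistics from a reference corpus and
# hill-climb with random swaps under a Metropolis acceptance rule.
# `reference_text` and `ciphertext` are placeholders the reader must supply.
import math
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def bigram_logprobs(reference_text):
    """Estimate log P(next char | char) from a large plaintext corpus."""
    text = "".join(c for c in reference_text.lower() if c in ALPHABET)
    counts = Counter(zip(text, text[1:]))
    totals = Counter(text[:-1])
    # Add-one smoothing: unseen transitions are unlikely, not impossible.
    return {(a, b): math.log((counts[(a, b)] + 1) / (totals[a] + len(ALPHABET)))
            for a in ALPHABET for b in ALPHABET}

def score(key, ciphertext, logprobs):
    """Log-likelihood of the candidate decrypt under the bigram model."""
    table = dict(zip(ALPHABET, key))
    plain = "".join(table.get(c, c) for c in ciphertext.lower())
    return sum(logprobs[pair] for pair in zip(plain, plain[1:]) if pair in logprobs)

def crack(ciphertext, logprobs, iters=20_000):
    key = list(ALPHABET)
    random.shuffle(key)
    current = score(key, ciphertext, logprobs)
    for _ in range(iters):
        i, j = random.sample(range(len(key)), 2)
        key[i], key[j] = key[j], key[i]          # propose swapping two letters of the key
        proposal = score(key, ciphertext, logprobs)
        # Metropolis rule: always accept improvements, sometimes accept worse keys.
        if proposal > current or random.random() < math.exp(proposal - current):
            current = proposal
        else:
            key[i], key[j] = key[j], key[i]      # revert the swap
    return "".join(key), current
```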

Must be real Latin/Greek. A similar principle applies to any Herculaneum papyrus decoding. The prior probability of a papyrus containing real Greco-Roman text (Latin, Greek, etc.) is ~1. If the text doesn’t sound like the Latin/Greek an elite aristocrat would have had in his personal library, and is instead something like random gibberish,2 then it does not matter what the supposed decoding algorithm is, nor how many physicists swear that it’s correct; it has failed. No argument or validation can convince us that what was written on the papyrus really was random gibberish, rather than the method being incorrect. We may not be able to check the external evidence of possible decodes to verify that some object is where a decode says it is, given that most things have been lost or dug up already, and archaeological verification might take decades. But we can say much more about what the internal evidence of the decodes must be: they have to be long (papyri being too expensive to leave mostly blank), well-written (because they were curated into an elite library), scribed to a high standard (many were probably gifts from other aristocrats or copied specifically for that library), as a group highly novel (because only a tiny fraction of the Greco-Roman corpus has survived, and it would be extremely unlikely for all of the decodes to be texts we already possess), and, where not novel, differing in many small places (because we have definitely not purged all textual corruptions), etc. And we have enough Greco-Roman texts (as well as human-authored texts in general) that we can, and have, trained n-gram & large neural language models (LLMs) on them (eg. the Pythia/Ithaca NNs predict missing letters in Greek inscriptions). The information from the LLM could make the difference between reconstructing only the occasional letter and reconstructing entire fragments: compared to the irresponsibly naive approach of assuming a flat uninformative prior over individual letters, the extremely rich informative prior of an LLM with near-human expectations about entire texts is worth a mountain of carbonized papyri.

Optimizing For Posterior Probability

Predict real Latin/Greek. So the question becomes: how do we benefit from this knowledge? We could just score decodes & reject ones which are insufficiently probable, but we didn’t need NNs for that. We want to use the NN knowledge much more directly, and ideally, use it as supervision for the AUNN operating on raw data: coherent language is an incredibly powerful constraint—almost no physically-possible decodes yield coherent language. If the parameters of an AUNN trained on a pile of raw data yield a gibberish decode on a papyrus, then the parameters are wrong, and they need to be completely rewritten until they reach coherency on it; and then they can be trained on the next papyrus. If it’s vanishingly unlikely to decrypt a coherent but wrong message once, then it is even more unlikely to do so across multiple papyri.
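For concreteness, the score-and-reject baseline dismissed above is only a few lines with a frozen causal language model. This is a sketch under assumptions: the checkpoint path and the candidate decodes are placeholders, not a real published Greco-Roman model.

```python
# Rank candidate decodes by their negative log-likelihood under a frozen causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/greco-roman-lm")   # hypothetical checkpoint
llm = AutoModelForCausalLM.from_pretrained("path/to/greco-roman-lm").eval()

@torch.no_grad()
def decode_nll(candidate: str) -> float:
    """Mean per-token negative log-likelihood; lower = more plausible language."""
    ids = tokenizer(candidate, return_tensors="pt").input_ids
    return llm(ids, labels=ids).loss.item()

candidates = ["...decode A...", "...decode B..."]     # placeholder candidate decodes
plausible = sorted(candidates, key=decode_nll)[:10]   # keep the least-gibberish decodes
```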

So we want to optimize the AUNN to jointly minimize both its loss on the prediction/reconstruction of the raw data and the negative log-likelihood of the decoded text according to a LLM, thereby instilling into the AUNN an expert knowledge of all Greco-Roman text.
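Schematically (where λ is a weighting hyperparameter introduced here only for illustration):

$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{raw}}(\theta) \;+\; \lambda \cdot \mathbb{E}\big[-\log p_{\text{LLM}}(\text{decoded text}_\theta)\big]$$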

This looks like a common setup these days in multimodal learning, where we wish to plug a vision model into a LLM (eg. Flamingo). You freeze the LLM to preserve its linguistic & world knowledge, avoiding low-quality information from the other modality erasing it, and use its embeddings. Because the frozen LLM is still fully-differentiable, it can be used in end-to-end training (we just do not bother updating its parameters).

Backprop backwards. This means we can go in any direction: while we usually only think of doing backprop to optimize the model parameters, given a fixed input, to minimize the loss on a fixed output, we can fix any two of the three (input/output/parameters) and backprop to optimize the third. We could, and often do in generative models3, fix the output & parameters and do gradient ascent to find the input which most closely yields that output. This gives us a way to feed the LLM back into the AUNN: if we hook it up right, the AUNN can ‘generate language’ (an embedding) from an input papyrus and feed this into the LLM, which scores the plausibility (log-likelihood) of that output language and can be run backwards to improve (by gradient ascent) the plausibility of the AUNN’s ‘language’; the AUNN’s parameters (but not the frozen LLM’s) then get updated to better predict both raw data & language.
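As a generic illustration of the fix-two-optimize-the-third pattern (the inversion case from the footnote), a minimal PyTorch sketch; `frozen_model`, its `input_dim` attribute, and `target` are all hypothetical stand-ins:

```python
# Hold the parameters and the target output fixed, and optimize the *input* instead.
import torch

def invert(frozen_model, target, steps=500, lr=0.05):
    for p in frozen_model.parameters():
        p.requires_grad_(False)                  # parameters stay fixed
    x = torch.randn(1, frozen_model.input_dim, requires_grad=True)  # the free input
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(frozen_model(x), target)
        opt.zero_grad()
        loss.backward()                          # gradients flow into x, not the weights
        opt.step()
    return x.detach()                            # the input which best yields `target`
```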

Embed & generate. More concretely: we can modify the AUNN by adding an extra ‘head’ to it, ie. picking a late layer to feed into additional layers which output an embedding.4 What is this embedding going to do, since it’s not being trained to predict the raw data? First, the embedding is passed into a new learned adapter layer (or layers) which connects it to the frozen LLM. The LLM is then unrolled to generate a full language sample, such as 1 complete line of papyrus text, yielding a total likelihood. This likelihood is the objective to maximize by gradient ascent, backpropping through the frozen LLM & adapter into the AUNN to nudge the embedding towards something that would yield a more likely line of text. To avoid forgetting the raw data, the AUNN is also running the usual raw-data predictions at random indices. The two losses are combined in a weighted sum to force joint optimization of the two objectives, and a combined gradient ascent+descent step then updates the AUNN parameters.
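A minimal PyTorch sketch of one such joint update, under assumptions the text does not pin down: `aunn` (with a hypothetical `embed_head`), `adapter`, and `llm` are assumed modules; `adapter` is taken to emit a single soft-prompt token of shape (batch, 1, d_model); `llm` is a HuggingFace-style causal LM whose generate() accepts inputs_embeds (a recent transformers version); and `lam` weights the language term against the raw-data term.

```python
import torch
import torch.nn.functional as F

for p in llm.parameters():
    p.requires_grad_(False)          # frozen LLM: keep its linguistic knowledge intact

opt = torch.optim.Adam(list(aunn.parameters()) + list(adapter.parameters()), lr=1e-4)

def joint_step(raw_indices, raw_targets, text_indices, lam=0.1, gen_len=64):
    # 1. Usual AUNN objective: predict raw scan data at random indices.
    raw_loss = F.mse_loss(aunn(raw_indices), raw_targets)

    # 2. Language objective: embed a papyrus location, adapt it into the LLM's
    #    embedding space, unroll ~one line of text, and use the LLM's own
    #    log-likelihood of that line as the score to improve.
    prefix = adapter(aunn.embed_head(text_indices))               # (batch, 1, d_model)
    out_ids = llm.generate(inputs_embeds=prefix, max_new_tokens=gen_len)
    # generate() is non-differentiable, so re-score the sampled tokens in a
    # second forward pass; gradients then flow back through `prefix` into the AUNN.
    seq = torch.cat([prefix, llm.get_input_embeddings()(out_ids[:, :-1])], dim=1)
    logits = llm(inputs_embeds=seq).logits
    lm_loss = F.cross_entropy(
        logits[:, -out_ids.size(1):].reshape(-1, logits.size(-1)),
        out_ids.reshape(-1))

    # 3. Weighted sum forces the AUNN to satisfy both constraints at once:
    #    descent on the raw-data loss, ascent on the text's log-likelihood.
    loss = raw_loss + lam * lm_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return raw_loss.item(), lm_loss.item()
```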

So, over many updates, the AUNN starts out emitting embeddings that decode to gibberish, but is constantly pushed to yield slightly more coherent output each time—not pushed towards any particular text, because there is no label & we are not regressing on a specific desired Greco-Roman output, just pushing it to be more coherent in general.

Could one add more such supervision to the AUNN? Surely there’s more information about ancient texts which could be exploited (consider paleography). Well, one could certainly imagine adding image supervision, using an image generative model trained on images of actual manuscripts or synthetic scroll images, so that the decodes are further constrained to have plausibly-sized lines, realistic layouts, and so on. I suspect that would add a lot of complexity for little gain compared to the linguistic supervision, however—there are probably not many candidate decodes which are ruled in or out only by the exact length of lines on the papyrus.

Decoding Many Reads

Decoding windows. Because there are no contexts or memories, and the entire knowledge of the papyrus is stored inside the AUNN’s parameters, there’s not really a need to specify exactly “where to look”, and given how twisted & degraded the papyri are, there’s not necessarily a usefully precise notion of ‘where’ a text is to begin with. The carbonized ink may stick out to various depths from the attached papyrus roll, there may be useful signals on the outside or inside of neighboring pieces of papyrus, the text can be predicted by drawing on text before & after, and so on. The joint constraint will probably lead to indices generating embeddings which summarize ‘nearby text’, in some sense.5

So one can read the entire scroll by generating text reads from many indices and stitching them together. The number of disagreements & variants can help indicate how certain the AUNN is of the decoding. (Perhaps with another level of gradient ascent and/or regression: denoise a combined transcript, and then use that as the target.)
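A rough sketch of the stitching step, where `read_at` is a hypothetical wrapper around the embed–adapt–generate pipeline above (returning a decoded string for a window around an index), and the linear index-to-character alignment is a deliberate oversimplification:

```python
# Query many overlapping indices, decode each, and use per-position agreement
# among the reads as a crude certainty estimate.
from collections import Counter

def transcribe(indices, read_at, stride=1):
    reads = {i: read_at(i) for i in indices}            # many overlapping decodes
    votes = {}                                          # char position -> Counter of readings
    for i, text in reads.items():
        for offset, ch in enumerate(text):
            votes.setdefault(i * stride + offset, Counter())[ch] += 1
    transcript, confidence = [], []
    for pos in sorted(votes):
        ch, n = votes[pos].most_common(1)[0]            # majority vote per position
        transcript.append(ch)
        confidence.append(n / sum(votes[pos].values())) # agreement as a certainty proxy
    return "".join(transcript), confidence
```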

Language-Condition Everything

One can take this general idea in many directions (as multimodal vision-language research already has). For example, the embedding does not have to go straight into a LLM; one could imagine, say, a NeRF model which reconstructs a scene from LIDAR scan data but feeds its rendered images into the vision-half of CLIP to yield an embedding scored against the text-half, allowing text-guided reconstruction where the original data is uninformative: make it reconstruct 3D objects that are “more bouba/kiki” (or any other adjective which can be recognized by a model and meaningfully constrains the space of possibilities).
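A hedged sketch of that CLIP-guidance loop, where `renderer` (a differentiable model producing a CLIP-normalized 224×224 render) and `reconstruction_loss` (the ordinary data-fit term) are placeholders:

```python
# Render from a differentiable reconstruction model, embed the render with
# frozen CLIP, and push it toward a text prompt on top of the data-fit term.
import torch
import clip  # https://github.com/openai/CLIP

model, preprocess = clip.load("ViT-B/32", device="cpu")   # frozen CLIP, kept on CPU for simplicity
for p in model.parameters():
    p.requires_grad_(False)

target = model.encode_text(clip.tokenize(["a rounded, bouba-like object"]))
opt = torch.optim.Adam(renderer.parameters(), lr=1e-3)

for _ in range(1000):
    image = renderer()                                   # differentiable (1, 3, 224, 224) render
    sim = torch.cosine_similarity(model.encode_image(image), target)
    loss = reconstruction_loss(renderer) - 0.1 * sim.mean()   # fit the data, but also "look bouba"
    opt.zero_grad()
    loss.backward()
    opt.step()
```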


  1. Cryptographer Ralph Merkle has argued along these lines for the possibility, at least in principle, of cryonics.↩︎

  2. Or worse, something like contemporary English, which would indicate something has gone terribly wrong.↩︎

  3. So in a GAN which turns a latent z into a photorealistic fake image, we could take an actual real photo and run the GAN backwards to find the latent z which makes the GAN generate that image; and now we know the z of that image and all the knowledge encoded into that z, which can tell us what the model thinks that image contains, how to edit the image by tweaking specific parts of that z, what other images it is similar to, and so on.↩︎

  4. Usually, the ante-penultimate layer or a combination of high layers; the final layer before the prediction typically turns out to be a bad choice for embeddings because the model has already thrown away most of the general information as it prioritizes the final prediction step.↩︎

  5. One risk is that the AUNN will cheap out—“neural nets are lazy”—and satisfy the joint optimization by a degenerate solution like overfitting on a single plausible Greco-Roman text output & simply outputting that as a constant embedding, thereby letting it learn the raw data without any distractions. This can be fought by adding a third contrastive loss, using the indices (and papyrus IDs): nearby indices (and within an ID) should have similar embeddings, but indices far from each other (or in separate IDs) should have distant embeddings.↩︎
