Skip to main content

tutorial directory


“GPT-2 Preference Learning for Music Generation”, Branwen 2019

GPT-2-preference-learning: “GPT-2 Preference Learning for Music Generation”⁠, Gwern Branwen (2019-12-16; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.

Standard language generation neural network models, like GPT-2⁠, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.

A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO⁠.

OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation⁠, I experimented with GPT-2 preference learning for unconditional music and poetry generation.

I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≈ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n ≈ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 46%).

This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the former, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.

Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output⁠.

“GPT-2 Neural Network Poetry”, Branwen & Presser 2019

GPT-2: “GPT-2 Neural Network Poetry”⁠, Gwern Branwen, Shawn Presser (2019-03-03; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Demonstration tutorial of retraining OpenAI’s GPT-2 (a text-generating Transformer neural network) on large poetry corpuses to generate high-quality English verse.

In February 2019, following up on my 2015–2016 text-generation experiments with char-RNNs⁠, I experiment with the cutting-edge Transformer NN architecture for language modeling & text generation. Using OpenAI’s GPT-2-117M (117M) model pre-trained on a large Internet corpus and nshepperd’s finetuning code, I retrain GPT-2-117M on a large (117MB) Project Gutenberg poetry corpus. I demonstrate how to train 2 variants: “GPT-2-poetry”, trained on the poems as a continuous stream of text, and “GPT-2-poetry-prefix”, with each line prefixed with the metadata of the PG book it came from. In May 2019, I trained the next-largest GPT-2, GPT-2-345M, similarly, for a further quality boost in generated poems. In October 2019, I retrained GPT-2-117M on a Project Gutenberg corpus with improved formatting, and combined it with a contemporary poem dataset based on Poetry Foundation’s website⁠.

With just a few GPU-days on 1080ti GPUs, GPT-2-117M finetuning can produce high-quality poetry which is more thematically consistent than my char-RNN poems, capable of modeling subtle features like rhyming, and sometimes even a pleasure to read. I list the many possible ways to improve poem generation and further approach human-level poems. For the highest-quality AI poetry to date, see my followup pages, “GPT-3 Creative Writing”⁠/​“GPT-3 Non-Fiction”⁠.

For anime plot summaries, see TWDNE⁠; for generating ABC-formatted folk music, see “GPT-2 Folk Music” & “GPT-2 Preference Learning for Music and Poetry Generation”⁠; for playing chess, see “A Very Unlikely Chess Game”⁠; for the Reddit comment generator, see SubSimulatorGPT-2⁠; for fanfiction, the Ao3⁠; and for video games, the walkthrough model⁠. For OpenAI’s GPT-3 followup, see “GPT-3: Language Models are Few-Shot Learners”⁠.

“This Waifu Does Not Exist”, Branwen 2019

TWDNE: “This Waifu Does Not Exist”⁠, Gwern Branwen (2019-02-19; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

I describe how I made the website (TWDNE) for displaying random anime faces generated by StyleGAN neural networks, and how it went viral.

Generating high-quality anime faces has long been a task neural networks struggled with. The invention of StyleGAN in 2018 has effectively solved this task and I have trained a StyleGAN model which can generate high-quality anime faces at 512px resolution. To show off the recent progress, I made a website, “This Waifu Does Not Exist” for displaying random StyleGAN 2 faces. TWDNE displays a different neural-net-generated face & plot summary every 15s. The site was popular and went viral online, especially in China. The model can also be used interactively for exploration & editing in the Artbreeder online service⁠.

TWDNE faces have been used as screensavers, user avatars, character art for game packs or online games⁠, painted watercolors⁠, uploaded to Pixiv, given away in streams⁠, and used in a research paper (Noguchi & Harada 2019). TWDNE results also helped inspired Sizigi Studio’s online interactive waifu GAN⁠, Waifu Labs⁠, which generates even better anime faces than my StyleGAN results.

“Making Anime Faces With StyleGAN”, Branwen 2019

Faces: “Making Anime Faces With StyleGAN”⁠, Gwern Branwen (2019-02-04; ⁠, ⁠, ⁠, ; backlinks; similar):

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/​2 neural networks, and tips/​scripts for effective StyleGAN use.

Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN⁠, whose source code was released in February 2019.

I show off my StyleGAN 1/​2 CC-0-licensed anime faces & videos, provide downloads for the final models & anime portrait face dataset⁠, provide the ‘missing manual’ & explain how I trained them based on Danbooru2017 /  ​ 2018 with source code for the data preprocessing⁠, document installation & configuration & training tricks⁠.

For application, I document various scripts for generating images & videos⁠, briefly describe the website “This Waifu Does Not Exist” I set up as a public demo & its followup This Anime Does Not (TADNE) (see also Artbreeder), discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg. Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.

The anime face graveyard gives samples of my failures with earlier GANs for anime face generation, and I provide samples & model from a relatively large-scale BigGAN training run suggesting that BigGAN may be the next step forward to generating full-scale anime images.

A minute of reading could save an hour of debugging!

“Making Anime With BigGAN”, Branwen 2019

BigGAN: “Making Anime With BigGAN”⁠, Gwern Branwen (2019-02-04; ⁠, ⁠, ⁠, ; backlinks; similar):

Experiments in using BigGAN to generate anime faces and whole anime images; semi-successful.

Following my StyleGAN anime face experiments⁠, I explore BigGAN⁠, another recent GAN with SOTA results on one of the most complex image domains tackled by GANs so far (ImageNet). BigGAN’s capabilities come at a steep compute cost, however.

Using the unofficial BigGAN-PyTorch reimplementation, I experimented in 2019 with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.

For followup experiments, Shawn Presser⁠, I and others (collectively, “Tensorfork”) have used Tensorflow Research Cloud TPU credits & the compare_gan BigGAN reimplementation. Running this at scale on the full Danbooru2019 dataset in May 2020, we have reached the best anime GAN results to date (later exceeded by This Anime Does Not Exist).

“Internet Search Tips”, Branwen 2018

Search: “Internet Search Tips”⁠, Gwern Branwen (2018-12-11; ⁠, ⁠, ⁠, ; backlinks; similar):

A description of advanced tips and tricks for effective Internet research of papers/​books, with real-world examples.

Over time, I developed a certain google-fu and expertise in finding references, papers, and books online. Some of these tricks are not well-known, like checking the Internet Archive (IA) for books. I try to write down my search workflow, and give general advice about finding and hosting documents, with demonstration case studies⁠.

“Candy Japan’s New Box A/B Test”, Branwen 2016

Candy-Japan: “Candy Japan’s new box A/B test”⁠, Gwern Branwen (2016-05-06; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.

I analyze an A/​B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.

“Easy Cryptographic Timestamping of Files”, Branwen 2015

Timestamping: “Easy Cryptographic Timestamping of Files”⁠, Gwern Branwen (2015-12-04; ⁠, ⁠, ⁠, ; backlinks; similar):

Scripts for convenient free secure Bitcoin-based dating of large numbers of files/​strings

Local archives are useful for personal purposes, but sometimes, in investigations that may be controversial, you want to be able to prove that the copy you downloaded was not modified and you need to timestamp it and prove the exact file existed on or before a certain date. This can be done by creating a cryptographic hash of the file and then publishing that hash to global chains like centralized digital timestampers or the decentralized Bitcoin blockchain. Current timestamping mechanisms tend to be centralized, manual, cumbersome, or cost too much to use routinely. Centralization can be overcome by timestamping to Bitcoin; costing too much can be overcome by batching up an arbitrary number of hashes and creating just 1 hash/​timestamp covering them all; manual & cumbersome can be overcome by writing programs to handle all of this and incorporating them into one’s workflow. So using an efficient cryptographic timestamping service (the OriginStamp Internet service), we can write programs to automatically & easily timestamp arbitrary files & strings, timestamp every commit to a Git repository, and webpages downloaded for archival purposes. We can implement the same idea offline, without reliance on OriginStamp, but at the cost of additional software dependencies like a Bitcoin client.

“RNN Metadata for Mimicking Author Style”, Branwen 2015

RNN-metadata: “RNN Metadata for Mimicking Author Style”⁠, Gwern Branwen (2015-09-12; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.

Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.

I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.

More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs⁠, the Transformer NN architecture, by finetuning training OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.

“When Should I Check The Mail?”, Branwen 2015

Mail-delivery: “When Should I Check The Mail?”⁠, Gwern Branwen (2015-06-21; ⁠, ⁠, ⁠, ; backlinks; similar):

Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis⁠, model comparison, optimizing check times with a loss function⁠, and optimal data collection.

Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/​postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).

“The Sort --key Trick”, Branwen 2014

Sort: “The sort --key Trick”⁠, Gwern Branwen (2014-03-03; ⁠, ⁠, ⁠, ; backlinks; similar):

Commandline folklore: sorting files by filename or content before compression can save large amounts of space by exposing redundancy to the compressor. Examples and comparisons of different sorts.

Programming folklore notes that one way to get better lossless compression efficiency is by the precompression trick of rearranging files inside the archive to group ‘similar’ files together and expose redundancy to the compressor, in accordance with information-theoretical principles. A particularly easy and broadly-applicable way of doing this, which does not require using any unusual formats or tools and is fully compatible with the default archive methods, is to sort the files by filename and especially file extension.

I show how to do this with the standard Unix command-line sort tool, using the so-called “sort --key trick”, and give examples of the large space-savings possible from my archiving work for personal website mirrors and for making darknet market mirror datasets where the redundancy at the file level is particularly extreme and the sort --key trick shines compared to the naive approach.

“A/B Testing Long-form Readability on”, Branwen 2012

AB-testing: “A/B testing long-form readability on”⁠, Gwern Branwen (2012-06-16; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.

To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/​B tests since June 2012. As expected, most do not show any meaningful difference.

“Silk Road 1: Theory & Practice”, Branwen 2011

Silk-Road: “Silk Road 1: Theory & Practice”⁠, Gwern Branwen (2011-07-11; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

History, background, visiting, ordering, using, & analyzing the drug market Silk Road 1

The cypherpunk movement laid the ideological roots of Bitcoin and the online drug market Silk Road; balancing previous emphasis on cryptography, I emphasize the non-cryptographic market aspects of Silk Road which is rooted in cypherpunk economic reasoning, and give a fully detailed account of how a buyer might use market information to rationally buy, and finish by discussing strengths and weaknesses of Silk Road, and what future developments are predicted by cypherpunk ideas.

“Archiving GitHub”, Branwen 2011

Archiving-GitHub: “Archiving GitHub”⁠, Gwern Branwen (2011-03-20; ⁠, ; similar):

Scraping and downloading Haskell-related repositories from GitHub

Tutorial of how to write a Haskell program to scrape Haskell-related repositories on GitHub and download them for offline installation, search, reference, and source code analysis, using TagSoup & curl⁠.


“Archiving URLs”, Branwen 2011

Archiving-URLs: “Archiving URLs”⁠, Gwern Branwen (2011-03-10; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.

Links on the Internet last forever or a year, whichever comes first. This is a major problem for anyone serious about writing with good references, as link rot will cripple several% of all links each year, and compounding.

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.

As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).

“Writing a Wikipedia RSS Link Archive Bot”, Branwen 2009

Wikipedia-RSS-Archive-Bot: “Writing a Wikipedia RSS Link Archive Bot”⁠, Gwern Branwen (2009-11-02; ⁠, ⁠, ; backlinks; similar):

Archiving using Wikipedia Recent Changes RSS feed (obsolete).

Continuation of the 2009 Haskell Wikipedia link archiving bot tutorial, extending it from operating on a pre-specified list of articles to instead archiving links live by using TagSoup parsing Wikipedia Recent Changes for newly-added external links which can be archived using WebCite in parallel. (Note: these tutorials are obsolete. WebCite is largely defunct, doing archiving this way is not advised, and WP link archiving is currently handled by Internet Archive-specific plugins by the WMF. For a more general approach suitable for personal use, see the writeup of archiver-bot in Archiving URLs⁠.)

“Writing a Wikipedia Link Archive Bot”, Branwen 2008

Wikipedia-Archive-Bot: “Writing a Wikipedia Link Archive Bot”⁠, Gwern Branwen (2008-09-26; ⁠, ⁠, ; backlinks; similar):

Haskell: tutorial on writing a daemon to archive links in Wikipedia articles with TagSoup and WebCite; obsolete.

This is a 2008 tutorial demonstrating how to write a Haskell program to automatically archive Internet links into WebCite & Internet Archive to avoid linkrot, by parsing WP dumps, downloading & parsing WP articles for external links with the TagSoup HTML parsing library, using the WebCite/​IA APIs to archive them, and optimizing runtime. This approach is suitable for one-off crawls but not for live archiving using the RSS feed; for the next step, see Wikipedia RSS Archive Bot for a demonstration of how one could write a RSS-oriented daemon.