GPT-2 Preference Learning for Music Generation

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.
statistics, NN, fiction, shell, GPT, tutorial, poetry, music
2019-12-16–2020-04-18 finished certainty: likely importance: 7

Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.

A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO.

OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.

I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n ≅ 7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models, more thorough data-cleaning, or more stringent sample curation. My blind ratings using n ≅ 200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 44%).
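As a quick sanity check that the 93/210 win-rate is statistically indistinguishable from chance, an exact two-sided binomial test can be sketched in a few lines of Python (stdlib only; an illustration, not part of the original analysis):

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of every
    outcome no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(pr for pr in pmf if pr <= pmf[k] + 1e-12)

wins, total = 93, 210
print(f"win rate: {wins/total:.1%}")                 # ~44.3%
print(f"two-sided p: {binom_two_sided(wins, total):.3f}")
```

The p-value comes out well above conventional significance thresholds, consistent with no detectable advantage at this sample size.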

This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the first, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.

Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples. I propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.

Neural nets for generating text typically treat it as a prediction problem: predict the next word given the previous text, and maximize the probability of a correct prediction of the next word. This can efficiently train large NNs on large text corpuses and can generate surprisingly good text on average, as in my past poetry/music generation projects with char-RNNs or GPT-2. But it generates only average text, like the corpus on average, and has persistent problems with artifacts like repetition or lack of global coherence & themes due to greedy myopic generation word-by-word. Prediction is fundamentally different from control & optimization: a likelihood-trained NN is a passive observer, simply trying to predict.

The inline trick? There are some ways to control the generated text, like the ‘inline trick’, where metadata (such as the author or source) is prepended to the raw text (as in my char-RNN/GPT-2 poetry, where I control the style by inserting the author during training & prompting with a desired author; or, on a broader scale, the use of ‘genre’ metadata), but these approaches seem limited. What if we want to generate the best text, like the best poems or best music? Would the inline trick work if we trained on a corpus of rated text and prompted the NN with ‘5 stars:’…? Probably not: things like ‘goodness’ are too subtle compared to author or genre, even if we had many megabytes of rated text to train on.
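To make the inline trick concrete, here is a minimal sketch of the data formatting it implies (the function, corpus, and exact prefix format are illustrative inventions, not the actual training pipeline):

```python
def add_inline_metadata(author, text):
    """The 'inline trick': prepend metadata to each training text so the
    model learns to condition on it; at sampling time, prompt with the
    desired metadata to steer the style. (Format is illustrative only.)"""
    return f"{author}:\n{text}"

corpus = [("Walt Whitman", "I celebrate myself, and sing myself..."),
          ("Emily Dickinson", "Because I could not stop for Death...")]
training_texts = [add_inline_metadata(a, t) for a, t in corpus]

# To generate in a given style, prompt with that author's prefix:
prompt = "Walt Whitman:\n"
print(training_texts[0].startswith(prompt))  # → True
```

The question in the text is whether replacing `author` with a quality label like `"5 stars"` would work; the suspicion is that ‘goodness’ is too subtle a signal for this trick.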

Why Preference Learning?

“…while a proof of P=NP might hasten a robot uprising, it wouldn’t guarantee one. For again, what P≟NP asks is not whether all creativity can be automated, but only creativity whose fruits can quickly be verified by computer programs. To illustrate, suppose we wanted to program a computer to create new Mozart-quality symphonies and Shakespeare-quality plays. If P=NP via a practical algorithm, then these feats would reduce to the seemingly easier problem of writing a computer program to recognize great works of art. And interestingly, P=NP might also help with the recognition problem: for example, by letting us train a neural network that reverse-engineered the expressed artistic preferences of hundreds of human experts. But how well that neural network would perform is an empirical question outside the scope of mathematics.”

Scott Aaronson, 2017

Likeable, not likely. An alternative is to treat it as a reinforcement learning problem. From a RL perspective, likelihood training is a kind of ‘imitation learning’, where the NN learns to ‘copy’ an expert, and its flaws are as expected from imitation learning when one tries to apply it: the NN has never seen its own completions, and has no way of recovering from errors (sometimes dubbed ‘exposure bias’), which aren’t represented in the dataset; and the completions it is imitating are of both high and low quality, so it must attempt to imitate the bad as well as the good. Imitating human experts: limited. Unsurprisingly, its output is often bad, and if sampling goes a little haywire, it may then ‘explode’. It is also not surprising if an imitation-learning NN has bizarre blind spots: there is nothing in the process which seeks out blind spots and fixes them, after all, since fixing them doesn’t improve predictions on the fixed corpus.

Reward good text, but what defines ‘good’? A better approach is to enable trial-and-error learning: have an ‘agent’ or ‘generator’ NN try to learn how to generate text which maximizes total long-term reward over the entire sequence, regardless of each individual word’s probability (only the final result matters). But who defines the ‘reward’? You can’t write a simple rule defining good poetry or music, so you ask humans, presumably; but no one is patient enough to rate millions of text snippets, which is how many samples you would need for standard deep reinforcement learning on large complex language NNs. That’s an issue with neural networks: they are good at supervised learning, where the right answer can be defined, but not so good at trial-and-error, where one is told how good a sequence was but not what the right answer was.

Train a NN to imitate human critics, not experts. In preference learning, we convert our intractable RL problem into a supervised learning problem: we try to learn the utility function or reward function from humans, instead of attempting to conjure up some nonexistent definition of good poetry or expecting a NN to somehow learn what good poetry is from a giant pile of mediocre, good, great, or just plain terrible poetry. There are many examples of reward/preference learning, but the kind I am using here was introduced by OpenAI in Christiano et al 2017, where humans looked at short video snippets from simple video games and picked the ones which looked more correct; the sequences need not be video or images but can be text, and in Ziegler et al 2019, they experiment with training GPT-2 to generate better text using human ratings of factors like ‘descriptiveness’ (which would be hard to write a rule for). A related paper, Peng et al 2020, explores finetuning GPT-2 to not generate ‘offensive’ or norm-violating text, using a NN classifier trained previously (amusingly, along with n = 1000 other labeled samples); Peng et al 2020 did not do pure RL training, but combined the normative classifier as a RL loss with sentence-level likelihood finetuning on a science fiction text corpus, and was able to halve the norm-violation rate.

Bootstrap NN critic from human criticism. Christiano et al ask: if NNs are good at supervised learning like prediction, and we would need millions of human ratings to get anywhere with the RL approach that might fix our imitation learning problems… why not have a NN learn to predict human ratings, and use that instead? Since NNs are so good at supervised learning, they should be able to learn to predict human ratings relatively easily. (Because it can be hard to rate everything on a scale of 1–10, we can instead ask the humans for better/worse-than pairwise comparisons, which, if there are enough ratings, allows inferring an underlying latent variable through one of many statistical models like the Bradley-Terry model, and amounts to the same thing.) So, in addition to the NN doing the RL learning, we have a second ‘reward model’ or ‘critic’ NN learn to predict a small set of human forced-choice ratings which choose between two possible text snippets (eg thousands); this NN then rates as many snippets as necessary for the original NN doing reinforcement learning (eg millions). Similar to GANs (agent = Generator, reward model = Discriminator), the reward model distills the human rater’s preferences into a NN which can be used arbitrarily often. Now it’s OK that the RL-based NN is slow and needs millions of trials to learn from its errors, since we can run the reward model as many times as necessary.
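To illustrate how a latent quality variable can be inferred from pairwise comparisons, here is a minimal Bradley-Terry-style fit by stochastic gradient ascent (an illustrative sketch only; the actual reward model learns this implicitly from the forced-choice labels):

```python
import math, random

def fit_bradley_terry(comparisons, n_items, lr=0.1, steps=2000):
    """Fit latent quality scores s_i such that
    P(i beats j) = sigmoid(s_i - s_j), by stochastic gradient ascent
    on the log-likelihood of the observed (winner, loser) pairs."""
    s = [0.0] * n_items
    for _ in range(steps):
        winner, loser = random.choice(comparisons)
        p = 1 / (1 + math.exp(-(s[winner] - s[loser])))
        g = lr * (1 - p)      # gradient of log P(winner beats loser)
        s[winner] += g
        s[loser]  -= g
    return s

random.seed(0)
# Item 0 beats 1 most of the time; item 1 beats 2 most of the time.
data = [(0, 1)] * 9 + [(1, 0)] + [(1, 2)] * 9 + [(2, 1)]
scores = fit_bradley_terry(data, 3)
print(scores)   # expect scores[0] > scores[1] > scores[2]
```

A handful of noisy pairwise votes suffices to recover a consistent ranking, which is why forced-choice comparisons ‘amount to the same thing’ as absolute ratings.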

Iterate as necessary. Since the reward model has only an imperfect idea of human preferences, it will make errors and may even be ‘fooled’ by the agent (much as a Generator may defeat a Discriminator permanently), but one can then take the agent’s outputs and get a human rating of them, fixing the problem and improving the reward model, forcing the agent to find a better strategy in the next iteration of the process. This process can repeat as many times as necessary, and all of these steps can run in parallel:

Ziegler et al 2019: “Figure 1: Our training processes for reward model and policy. In the online case, the processes are interleaved.”

For Music or Poetry

Greedy generation. Christiano’s preference learning seems like it could help in generating the kinds of natural language which have proven tricky, due to global properties for which word-by-word imitation is flawed.

Problem: repetition. Language generation models trained by maximum likelihood have long had serious problems with falling into repetition, and with having any kind of ‘theme’ or ‘message’ or even just basic consistency. The stereotypical neural text sample from a char-RNN or Transformer model is made up of individual sentences which are flawlessly spelled, perfectly grammatical, a little confusingly obtuse, and completely unrelated to 10 sentences previously, reminiscent of schizophrenic ‘word salad’, degenerating after a few pages into endless repetition of “the the the”.

How can NNs be so good & so bad? The repetition is particularly perplexing, because highly sophisticated char-RNN or Transformer models appear to encode all sorts of semantics and knowledge about the world and achieve difficult tasks, and yet fall prey to a pathology the humblest template algorithm manages to avoid, and which still has no good solution. Unlike supervised seq2seq tasks, more sophisticated decoding search strategies like beam search help only a little in language generation, and can make things much worse by triggering repetition faster. (The recent nucleus sampling is a patch, and one can still induce repetition with low top-p settings; other proposals appear to be better, but it is unknown whether any is a full solution.) But because it is so simple, a reward model should be able to detect repetition easily: how hard could it be to penalize using a BPE 20 times in a row?

Global coherency and themes are harder, but they are still something one expects a reward model to be able to pick up on eventually, noticing when a sample has wandered off course in an illogical way: even if each individual word is a reasonably likely next word, the ending will be highly unlikely given the beginning, and a model looking at the big picture can detect that inconsistency.
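To underscore how trivial the repetition pathology is to detect outside a NN, a rule-based version of the hoped-for penalty can be sketched in a few lines (illustrative only; the hope is that the reward model learns something like this on its own):

```python
def max_run_length(tokens):
    """Length of the longest run of a single repeated token."""
    longest = run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def repetition_penalty(tokens, limit=20):
    """-1 reward if any BPE repeats `limit`+ times in a row, else 0."""
    return -1.0 if max_run_length(tokens) >= limit else 0.0

degenerate = [464, 262] + [262] * 25   # "...the the the the..." collapse
assert repetition_penalty(degenerate) == -1.0
assert repetition_penalty([1, 2, 3, 1, 2, 3]) == 0.0
```

Global coherence admits no such one-liner, which is exactly why a learned reward model is interesting.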


The Ziegler et al 2019 codebase, models, and datasets (but not the rating tools) were released by OpenAI in September 2019 for public use, and I began working on adapting it to poetry & music.


The OA source code can be downloaded from Github as usual:

git clone 'https://github.com/openai/lm-human-preferences' && cd ./lm-human-preferences/

The necessary Python 3 packages are listed in the Pipfile.

One unusual requirement is gsutil: the OA datasets are stored on a Google Cloud Storage bucket. This bucket is public, but must be accessed through special credentials; if you are getting invalid_grant: Bad Request errors, you are running into this issue, and you need to get special credentials, perhaps via gcloud auth login.

At least in theory, if you have your Google credential ducks in a row and correctly pip installed all the dependencies, you should be able to run the combined-run example from their README:

experiment=descriptiveness
experiment_name=testdesc-$(date +%y%m%d%H%M)
./launch.py train_policy $experiment $experiment_name

This will automatically download GPT-2-124M & the ‘descriptiveness’ dataset (which uses snippets from BookCorpus), train a reward model using 8 GPUs (OA used 8×V100) for 1 epoch on best-of-4 comparisons of book-passage completions, rated by how physically descriptive or evocative of the scene they are; and then attempt to train a PPO policy for ~2k steps/iterations to optimize fiction generation for descriptiveness.


The OA codebase user ought to be aware of a few things before running on a generic new dataset:

  1. GCP permissions: as discussed above, the OA datasets may not download unless one has the correct gsutil credentials generated

  2. Python/Parallelism Issues: I ran into 2 errors which ultimately terminated at a call to mpiexec.

    • Python 2 vs 3: The first was a Python interpreter version error: it was somehow calling a Python 2 interpreter even though my virtualenv was set to Python 3, and running explicitly with python3 didn’t help, so I patched the source code as follows to force Python 3:

      diff --git a/lm_human_preferences/utils/ b/lm_human_preferences/utils/
      index 30f3440..62f4fe3 100644
      --- a/lm_human_preferences/utils/
      +++ b/lm_human_preferences/utils/
      @@ -11,7 +11,7 @@ def launch(name, f, *, namespace='safety', mode='local', mpi=1) -> None:
               with open('/tmp/pickle_fn', 'wb') as file:
                   cloudpickle.dump(f, file)
      -        subprocess.check_call(['mpiexec', '-n', str(mpi), 'python', '-c', 'import sys; import pickle; \
          pickle.loads(open("/tmp/pickle_fn", "rb").read())()'])
      +        subprocess.check_call(['mpiexec', '-n', str(mpi), 'python3', '-c', 'import sys; import pickle; \
          pickle.loads(open("/tmp/pickle_fn", "rb").read())()'])
           raise Exception('Other modes unimplemented!')
    • Failing on 1 GPU: The README claims that running on 1 GPU should be possible, but when I tried running on 1 GPU (so I could keep finetuning GPT-2 on my other GPU), mpiexec always failed.

      I suspect that the mpiexec call may need to be removed entirely. I avoided the problem by always running on both GPUs, and doing finetuning in the gaps between iterations, when I was busy with ratings or other things.

  3. Disabling GCP: the OpenAI GCP bucket is hardwired, and aside from that, it’d be a pain to set up one’s own GCP bucket, set its permissions, and work with it when training locally rather than on a cloud GPU instance.

    The loading can be fixed by editing the launch configuration to specify local file paths. To disable saving to GCP, I did another edit:

    diff --git a/lm_human_preferences/language/ b/lm_human_preferences/language/
    index f149a0c..99827fa 100644
    --- a/lm_human_preferences/language/
    +++ b/lm_human_preferences/language/
    @@ -10,7 +10,7 @@ class TrainedModel():
         def __init__(self, name, *, savedir=None, scope=None):
         self.name = name
             self.scope = scope
    -        self.savedir = savedir if savedir else os.path.join('gs://gpt-2/models/', name)
    +        self.savedir = savedir if savedir else name
             if name == 'test':
                 self.encoding = encodings.Test
  4. encoding of BPEs: once you can load a local dataset, you need to create said dataset, of course. Unfortunately, the codebase doesn’t make life easy for you, as the dataset must follow strict length limits & already be BPE-encoded. Rating is complicated enough to require a separate section.

  5. Hardwired English Heuristic Loss:

    The single most frustrating bug I encountered in this code is due to a ‘clever’ hand-engineered feature added to try to filter out bad samples early. The code by default looks for a period (.) within the first n BPEs, and if there is not one, the sample is automatically penalized −1!

    I didn’t notice this in the poetry runs, but when I switched over to music, it became a huge problem with ABC samples: as it turns out, ABC notation does not require periods, and most ABC music samples will have none. So every single sample was automatically rated −1, rendering training impossible. This turns out to be mentioned briefly in the paper, but I had completely overlooked the implications until I reread it while trying to understand how the ABC (but not poetry) reward model could be so badly mistaken:

    To make the labeling task more natural, we select excerpts that start and end with a period. When sampling continuations that will be presented to humans, we use rejection sampling to ensure there is a period between tokens 16 and 24 and then truncate at that period. [This is a crude approximation for “end of sentence.” We chose it because it is easy to integrate into the RL loop, and even a crude approximation is sufficient for the intended purpose of making the human evaluation task somewhat easier.] During the RL finetuning, we penalize continuations that don’t have such a period by giving them a fixed reward of −1.

    This ‘feature’ is specified in the task configuration at the beginning, but it’s unclear how to disable it entirely. To guarantee that it cannot interfere, I patched it out:

    diff --git a/lm_human_preferences/ b/lm_human_preferences/
    @@ -398,7 +399,7 @@ def make_score_fn(hparams, score_model):
         def score_fn(queries, responses):
             responses = postprocess(responses)
    -        score = penalize(responses, unpenalized_score_fn(queries, responses))
    +        score = unpenalized_score_fn(queries, responses)
             return score, responses, dict(score=score)
         score_fn.stat_schemas = dict(score=Schema(tf.float32, (None,)))
         return score_fn
  6. symlink model directory: since I was retraining the baseline models as I went (particularly to fix dataset issues), it’s convenient to symlink over to the regular GPT-2 repo’s model directory, instead of dealing with copying over fresh checkpoints. (Saves disk space too.) Something like ln -s ../gpt-2/models/irish-nospaces3/ 117M-irish works.

  7. config changes: all data-related parameters are hardwired, and must be manually set.

    The length of prefixes/queries/conditioning and the length of all samples must be exactly right; further, the size of the dataset (the n of ratings) must be manually specified, and, even further, the specified n must be an exact multiple of the reward model’s minibatch size (it can, however, be lower than the actual n inside the dataset, so one doesn’t need to delete ratings if one has rated a few more than an exact multiple).

    So for example, if one is training the reward model with a minibatch of n = 8 and one has n = 11,203 total ratings, that is not an exact multiple of 8 (11,203⁄8 = 1400.375), and one would instead specify n = 11,200 (which is both close to the full count & an exact multiple: 11,200⁄8 = 1,400).
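    This rounding-down is mechanical; a one-line sketch (the function name is my own, not from the codebase):

```python
def usable_ratings(n_total, batch_size):
    """Largest multiple of the reward model's minibatch size that is
    <= the total number of ratings collected (the extra ratings are
    simply left unused, not deleted)."""
    return n_total - (n_total % batch_size)

assert usable_ratings(11_203, 8) == 11_200   # 11203/8 = 1400.375 -> 1400*8
assert usable_ratings(16_900, 10) == 16_900  # already an exact multiple
```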

  8. Zom­bie processes:

    Make sure GPU is GCed
    OOM crashes are not uncommon during reward model training, puzzlingly, and one will typically kill a diverged process with Ctrl-C; however, these may leave zombie processes tying up GPU VRAM! Particularly if you are tinkering with settings like length or minibatch size, this is a great danger: you may make a change, get an OOM crash (which leaves zombies), and any subsequent change you make will look like a failure. This caused me great trouble at least twice, as I began trying to debug which (harmless) config change now triggered instant OOMs.

    To avoid this, I suggest getting in the habit of always running nvidia-smi after a training run, so you can check that mpiexec has not left any orphans (and if so, you can put them out of their misery).

ABC Music configuration

Config by source editing. All of the hyperparameters & dataset metadata are defined in the launch configuration; there are no relevant CLI options. It is structured in two parts, for the reward model and then the agent; the configuration is a cascade of increasingly-specialized objects. So for the reward model for the descriptiveness experiment, the books_task object is specialized by _books_task, which is further specialized by descriptiveness; and likewise for the agent/PPO training.

Hijacking existing config. For my ABC music, instead of defining a new cascade, I simply hijacked the descriptiveness-related variables. I begin with the reward model in books_task, cutting the conditioning down to the minimum which does not crash the code, 2, expanding the response length considerably to cover entire ABC music pieces, and changing the base model name to the ABC GPT-2 I trained normally:

 books_task = combos(
-    bind('query_length', 64),
+    bind('query_length', 2), # must be a minimum of 2 (but why?)
     bind('query_dataset', 'books'),
-    bind('response_length', 24),
-    bind('start_text', '.'), # Start the context at the beginning of a sentence
+    bind('response_length', 256),
+    bind('start_text', ''), # no conditioning aside from the 'X:' prefix tiled in by the PPO trainer
     bind('end_text', '.'), # End the context at the end of a sentence.
     bind('truncate_token', 13), # Encoding of '.' -- end completions at the end of a sentence.
     bind('truncate_after', 16), # Make sure completions are at least 16 tokens long.

-    bind('policy.temperature', 0.7),
-    bind('policy.initial_model', '124M'),
+    bind('policy.temperature', 1.0),
+    bind('policy.initial_model', '117M-irish'),

The training code needs to be modified for the rating data type (pairwise) and for my limited compute resources (2×1080 Ti instead of OA’s 8×V100): I have to cut down the minibatch size & rollout batch size:

 def get_train_reward_experiments():
     _shared = combos(
-        bind('labels.type', 'best_of_4'),
+        bind('labels.type', 'best_of_2'),
         bind('normalize_after', True),
         bind('normalize_before', True),
         bind('normalize_samples', 256),
@@ -58,9 +58,9 @@ def get_train_reward_experiments():
     _books_task = combos(
         bind_nested('task', books_task),
-        bind('batch_size', 32),
-        bind('rollout_batch_size', 512),
+        bind('batch_size', 10),
+        bind('rollout_batch_size', 226),

Finally, I specify my local dataset & manually specify its corpus size as a multiple of the minibatch size (this must be updated every time I add ratings, or the new ones won’t be trained on):

@@ -75,8 +75,8 @@ def get_train_reward_experiments():
     descriptiveness = combos(

-        bind('labels.source', 'gs://lm-human-preferences/labels/descriptiveness/offline_5k.json'),
-        bind('labels.num_train', 4_992),
+        bind('labels.source', 'irish.json'),
+        bind('labels.num_train', 16900),
         bind('run.seed', 1)

The agent model is easier to configure, because I need only adjust for compute:

 def get_experiments():
     train_reward_experiments = get_train_reward_experiments()

     _books_task = combos(
         bind_nested('task', books_task),

-        bind('ppo.lr', 1e-5),
-        bind('ppo.total_episodes', 1_000_000),
-        bind('ppo.batch_size', 512),
+        bind('ppo.lr', 1e-6), # original: 5e-5
+        bind('ppo.total_episodes', 1_000_000),
+        # original: 1_000_000; note, this is *episodes*, not *steps*; each step consists of _n_ episodes
+        bind('ppo.batch_size', 18), # original: 512

I also change the KL penalty hyperparameters, as the defaults appear to far too harshly punish divergence from the baseline model for ABC music and effectively disable exploration:

@@ -139,9 +138,9 @@ def get_experiments():

     descriptiveness = combos(
-        bind('rewards.kl_coef', 0.15),
+        bind('rewards.kl_coef', 0.02),
         bind('rewards.adaptive_kl', 'on'),
-        bind('', 6.0),
+        bind('', 25.0),

For ABC music specifically, I made some further changes to the rest of the code:

  • conditioning on X: for generating ABC music during training: all ABC music samples begin with an ‘X:’ field giving a numeric ID. I figured that if I had to condition on at least 2 BPEs, I might as well specify the X: and make it more likely that samples will be valid:

    diff --git a/lm_human_preferences/ b/lm_human_preferences/
    index db02c98..b349717 100644
    --- a/lm_human_preferences/
    +++ b/lm_human_preferences/
    @@ -282,6 +282,7 @@ class PPOTrainer():
             step_started_at = time.time()
             queries = self.sample_queries()
    +        queries = np.tile([55,25], (queries.shape[0],1)) # Irish ABC prefix: 'X:' (ie for the initial numeric ID)
             rollouts = self.policy.respond(queries, length=self.hparams.task.response_length)
             responses = rollouts['responses']
  • in regular generation of samples from a trained agent/policy model, the default settings are a temperature of 1 & top-k = 40; the latter is fine, but the former is too high, and I lower it to 0.8. (The code claims to support nucleus sampling, with a top_p argument, but when I changed that, it simply broke.) The diff:

    diff --git a/lm_human_preferences/language/ b/lm_human_preferences/language/
    index 96e56e9..76e56a3 100644
    --- a/lm_human_preferences/language/
    +++ b/lm_human_preferences/language/
    @@ -5,7 +5,7 @@ from lm_human_preferences.utils import core as utils
     def sample_sequence(*, step, model_hparams, length, batch_size=None, context=None,
    -                    temperature=1, top_k=0, top_p=1.0, extra_outputs={}, cond=None):
    +                    temperature=0.8, top_k=40, top_p=1.0, extra_outputs={}, cond=None):
         Sampling from an autoregressive sequence model.

My full diff/patch for running ABC music training is available to look at in case there is any ambiguity.


./launch.py train_policy descriptiveness irish-combined-20191222.17 --mpi 2 ; sleep 4s; nvidia-smi
Remember to check the nvidia-smi output after a crash or interrupt, to make sure your GPU VRAM has been released & zombie processes aren’t eating it.


The OA codebase comes with no built-in support for doing ratings; OA used the Scale labeling service, which exposes an API, and presumably felt there was not much point in providing the glue code. So, I rolled my own.

Data Formatting

The JSON schema. The input data format is JSON: it is an array of hashmap objects with, in the simplest case of best-of-two/pairwise ratings, 4 fields (the first 3 of which are not strings but integer arrays, where each integer is assumed to be a BPE): the conditioning text query, the first candidate string sample0, the second candidate string sample1, and the rating best, which is a single integer, where 0 = first sample won, 1 = second sample won, and so on. How does this handle ties? Ties don’t seem to be expressed as the integer 3 as one would guess. For ties, I simply encode them as two ratings with each sample winning once, which should be roughly equivalent.

Hard­wired n

The integer arrays must have exactly the lengths defined in the config, so if a sample is too long or short, it must be truncated or padded to fit.

Record dubious samples. The JSON parser code here appears to not be strict, so you can append additional fields if you want. Because of issues with adversarial samples, or ABC samples being syntactically invalid & not compiling to MIDI, and my concerns about what their inclusion might do to training dynamics (perhaps they should be excluded entirely?), I add a field, broken, to allow filtering blacklisted samples out and distinguishing their ratings from my hand-ratings.
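Putting the schema & tie-handling together, constructing rating records can be sketched like so (make_rating/make_tie are my own illustrative helpers, not part of the codebase; the token arrays here are dummies):

```python
import json

def make_rating(query, sample0, sample1, best, broken=0):
    """One pairwise rating in the codebase's JSON schema: three BPE
    integer arrays plus the integer `best` (0 = sample0 won, 1 = sample1
    won); `broken` is my own extra bookkeeping field."""
    return {"query": query, "sample0": sample0,
            "sample1": sample1, "best": best, "broken": broken}

def make_tie(query, a, b):
    """Encode a tie as two ratings with each sample winning once."""
    return [make_rating(query, a, b, best=0),
            make_rating(query, a, b, best=1)]

pair = make_tie([0, 0], [55, 25, 198], [55, 25, 199])
print(json.dumps(pair, indent=1))
```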

So, a full and complete example of a valid JSON ratings dataset with n = 1 pairwise ratings for ABC music would look like this:

  {"query": [0,0],
  "sample0": [   27,   91,  437, 1659, 5239,   91,   29,   55,   25,23349,  198,   51,
      25,14874, 1252,25308,22495,  198,   44,   25,   17,   14,   19,  198,
      43,   25,   16,   14,   23,  198,   42,   25,   34,  198,   38,  535,
      33,   91,   32,   17,   32,   17,   91,   38, 4339,   33,   91,   66,
      17,   89,   17,   91,   38,  535,   33,   91,   32,   17,   32,   17,
      91,   38, 4339,   33,   91,   66,   17,   32,   17,   91,  198,   38,
    4339,   33,   91,   66,   17,   32,   17,   91,   38, 4339,   33,   91,
      66,   17,   32,   17,   91,   38, 4339,   33,   91,   66,   17,   32,
      17,   91,   38, 4339,   33,   91,   66,   17,   32,   17,15886,   59,
      77,   27,   91,  437, 1659, 5239,   91,   29,  198,   27,   91,  437,
    1659, 5239,   91,   29,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27],
  "sample1": [   27,   91,  437, 1659, 5239,   91,   29,   55,   25,14208, 2816,  198,
      51,   25,   47,13218,34831,  338,  198,   44,   25,   19,   14,   19,
     198,   43,   25,   16,   14,   23,  198,   42,   25,   35,   76, 1228,
     198,   91,   25,   33,  198,   91,   93,   32,   17,   67,   32,   61,
      38, 6242,   67,   91,    7,   18,   32, 4339,   37, 2782,   33,   66,
      32,   91,12473,   93,   36,   17,   38,   18,   32,   91,   33,   67,
      70, 5036,   17,36077,   91,  198,   93,   32,   17,   67,   32,   61,
      38, 6242,   67,   91,    7,   18,   32, 4339,   37, 2782,   33,   66,
      32,   91,   33,   67,   93,   67,   17,  276,   33,   67,   91, 8579,
      38, 1961,   18,   25,   91,  198,   91,   25,   32,   91,   67,   17,
      69, 9395,16344,   91, 2782,   69, 2934,   17,36077,   91,   93,   67,
      17,   69, 9395,16344,   91,   70,   69,  891,   67,   17,36077,   91,
     198,   93,   32,   17,   69, 9395,16344,   91, 2782,   69, 2934,   17,
   36077,   91,   93,   70,   18,19082,   33,   67,   91, 8579,   38, 1961,
      18,   25,   91,   59,   77,   27,   91,  437, 1659, 5239,   91,   29,
     198,   27,   91,  437, 1659, 5239,   91,   29,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27],
  "best": 1,
  "broken": 1}

Dissecting JSON example. In this ABC music example, it appears that sample0 is invalid for some reason: it either couldn't be compiled to MIDI by abc2midi, Timidity couldn't compile the MIDI to WAV, or it contained a string that violated the manual blacklist compiled from diverged adversarial samples. So, sample0 was automatically marked as the loser and sample1 won. Both samples have been padded out with the BPE 27. Checking the BPE encoding (which can be done conveniently with jq . encoder.json), BPE 0 is !, and the BPE encoding has odd handling of spaces so it's unclear how to pad with spaces. 27 was chosen arbitrarily.
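
A cross-check on that choice: GPT-2's byte-level BPE assigns the printable ASCII bytes consecutive token IDs starting at ! = 0 (matching "BPE 0 is !"), so a single ASCII character's token ID is just an offset, and the arbitrarily-chosen 27 turns out to be <. A minimal sketch (ascii_token_id is an illustrative helper, not part of the codebase):

```python
# GPT-2's byte-level BPE gives the printable ASCII bytes consecutive token IDs
# starting at '!' = 0, so a single character's ID is a fixed offset.
def ascii_token_id(ch: str) -> int:
    return ord(ch) - ord('!')

print(ascii_token_id('!'))  # 0 ("BPE 0 is !")
print(ascii_token_id('<'))  # 27, the padding token chosen above
print(ascii_token_id('|'))  # 91, ubiquitous in the token dumps as the ABC barline
```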

For scripting purposes, we'd like a CLI filter which takes text and prints out the BPE encoding. I hacked up the encoding script from the nshepperd codebase to make a small tool which reads from stdin, converts, and pads to our target length of 256:

#!/usr/bin/env python3

import argparse
import numpy as np
import sys
import encoder
from load_dataset import load_dataset

parser = argparse.ArgumentParser(
    description='Pre-encode text files into tokenized training set.')
parser.add_argument('--model_name', metavar='MODEL', type=str, default='117M', help='Pretrained model name')
parser.add_argument('--combine', metavar='CHARS', type=int, default=50000, help='Concatenate files with <|endoftext|> separator into chunks of this minimum size')
parser.add_argument('in_text', metavar='PATH', type=str, help='Input file, directory, or glob pattern (utf-8 text).')

target_length = 256

def main():
    args = parser.parse_args()
    enc = encoder.get_encoder(args.model_name)
    chunks = load_dataset(enc, args.in_text, args.combine)
    with np.printoptions(threshold=sys.maxsize):
        result = chunks[0][0:target_length]
        if len(result) != target_length:
            padding = [27] * target_length
            result = np.concatenate((result, padding))
            result = result[0:target_length]
        print(np.array2string(result, separator=','))

if __name__ == '__main__':
    main()

Interactive rating

For both poetry and ABC music, there is no need for a GUI or web interface. A Bash script suffices.

Parallelize & pre-compile the ABC. For a rating script, we want to minimize latency and avoid doing any processing in the main thread, so all rated files are precompiled to WAVs before rating begins. All stages of the generated files are left in /tmp/ for easier debugging or picking out good pieces. The original text version of the ratings is saved to an additional text file for reference, since BPEs are hard to read.

To avoid repeatedly evaluating the same piece, which would happen occasionally with random samples, I shuffle the samples, store them in an array, and proceed in a pairwise fashion to evaluate #1 vs #(n⁄2+1) (where n is the list length) etc, so no comparison overlaps or duplicates another.
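
The no-overlap pairing scheme is easier to see in a few lines of Python (pair_samples is an illustrative name; the Bash script below does the same with an array and index arithmetic):

```python
import random

def pair_samples(samples, seed=None):
    # Shuffle, then pair item i with item i + n/2: every sample appears in
    # exactly one comparison, so no comparison overlaps or duplicates another.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return [(shuffled[i], shuffled[i + half]) for i in range(half)]

pairs = pair_samples(range(100), seed=0)
used = [s for pair in pairs for s in pair]
print(len(pairs))      # 50 comparisons from 100 samples
print(len(set(used)))  # 100: each sample used once and only once
```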

Auto-fail broken samples. While rating, each music piece is automatically checked (auto-rated) for validity as a WAV (no WAV with a meaningful filesize = failed sample), and its ABC against a hand-written blacklist of adversarial examples. (I considered including ‘compressibility’ as a criterion—pipe samples into gzip and fail samples which compress too much—since I noticed bad samples were typically highly repetitive aside from their metadata block, but didn't get around to it.) If both pieces in a comparison fail one of the checks, they are judged to be a tie (which is implemented as a pair of ratings). This saves an enormous amount of time & effort when extracting ratings from throughout a run, as many will be either broken or adversarial.
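
The compressibility criterion I didn't get around to might have looked something like this (a sketch; the 0.2 ratio threshold and too_compressible name are illustrative guesses, not tuned or tested values):

```python
import gzip

# Degenerate samples are highly repetitive, so they gzip far better than real
# pieces; fail anything whose compressed size is a small fraction of the original.
def too_compressible(abc_text: str, threshold: float = 0.2) -> bool:
    raw = abc_text.encode('utf-8')
    ratio = len(gzip.compress(raw)) / len(raw)
    return ratio < threshold

degenerate = '|=d=g=g=d=g=g' * 200   # the kind of loop the blacklist catches
normal = 'X:1\nT:Sample\nM:4/4\nK:D\n|:A2d2 f2df|g2bg f2df|e2ce d2cd|B2GB A4:|\n'
print(too_compressible(degenerate))  # True
print(too_compressible(normal))      # False
```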

I print out the ABC pieces as well as play them—I find it helpful to see them while listening. The ABC pieces are played at slightly higher speed than normal, for ~10s each. (Because the generation is autoregressive, a piece which starts off badly probably isn't going to wind up being stellar, so there's no point in rating 3–4× fewer pieces by insisting on listening to entire pieces before rating. It's more important to get through a lot of ratings than to make each rating perfect.)

My rating script requires parallel, abc2midi, timidity, & mpv to be installed. The script is less than elegant but (mostly) works:

set +o posix

# $ bash 1000 irish-samples-1.txt irish-1

N="$1" # currently needs to be a multiple of 8
CORPUS="$2" # eg "/home/gwern/wiki/docs/ai/2019-03-06-gpt2-poetry-1000samples.txt"
JSON="$3" # output-file prefix, eg "irish-1" (per the usage example above)

encode() {
    TMP_FILE=$(mktemp /tmp/XXXXXX.txt)
    echo "$@" >> $TMP_FILE
    ENCODED=$(PYTHONPATH=src python3 --model_name 117M  $TMP_FILE)
    echo "$ENCODED"; rm "$TMP_FILE"; }
export -f encode

generateJson() {
    echo "{\"query\": [0,0], \"sample0\": $2, \"sample1\": $3, \"best\": $1}," >> $JSON-encoded.json;
    ## Store a backup copy of the plain text for easier consultation
    echo "{\"query\": [0,0], \"sample0\": $4, \"sample1\": $5, \"best\": $1}," >> $JSON-text.json; }
generateJsonBroken() {
    echo "{\"query\": [0,0], \"sample0\": $2, \"sample1\": $3, \"best\": $1, \"broken\": 1}," >> $JSON-encoded.json;
    echo "{\"query\": [0,0], \"sample0\": $4, \"sample1\": $5, \"best\": $1, \"broken\": 1}," >> $JSON-text.json; }

rm -rf /tmp/music-samples/; mkdir /tmp/music-samples/
cat "$CORPUS" | sed -e 's/===.*/<|endoftext|>/g' -e 's/⏎/\n/g' | \
    csplit --quiet --elide-empty-files --suppress-matched --prefix /tmp/music-samples/sample- - '/<|endoftext|>/' '{*}'

# Pre-compute all versions for speed; this also helps debugging since all stages can be inspected on disk in /tmp/music-samples/
generateEncoded() {
    POEM="$1"
    echo "Starting encoding: $@"
    FIRST=$(cat $POEM)
    encode "<|endoftext|>$FIRST\n<|endoftext|>" >> $POEM.encoded
    abc2midi "$POEM" -o $POEM.midi -Q 130
    timidity -A125 -G5-20 $POEM.midi -Ow -o $POEM.wav; }
export -f generateEncoded
ls /tmp/music-samples/sample-* | shuf | head -$1 | parallel generateEncoded

filterMusic () {
    fgrep -i -e "2D2|G2D2G2D2|G2" -e "=e'|=e'=a'=a'=a'=g'=e'|=e'" -e "a' a' a' a'" -e "a=g|=f=g=a=c'=a=g|=f" \
     -e "|=c'=d'=c'=c'=a=g|=c'=d'=c'=c'=a=g|" -e "|=c=e=g=c'=g=e|=c=e=g=c'=g=e|" -e "|=d=c'=a=g=d=e|=d=c'=a=g=d=e|=d" \
     -e '(3B)(3B)(3B)(3B)(3B)' -e ',2B,2B,2|B,2B,2B,2|' -e ',C,C,C,C,C,C,C,C,C,C,C,C,C,' -e ',G,|G,A,B,G,A,B,G,|' \
     -e ',|G,2A,2G,2G,A,|G,2A,2G,2A' -e '-ghhathan-ghhathan-ghhathan' -e '////////////' \
     -e '/2B/2B/2B/2B/2B/2B/2B' -e '/g/g/g/g/g/g/g/g/g' -e '222222222' -e '2A2A2A2A2G2A2A2A2G2A2A2A2' \
     -e '2A2G2A2G2A2A2' -e '2D2D2D2D2D2D2D2D2' -e '2F2A2G2A2A2G2A2A2' -e '2G,|G,2G,A,2G,A,2G,|C2G' \
     -e '2G2E2|C2G2A2G2E2|' -e '2G2G2G2G2G2G2G2G2' -e '2c/2c/2c/2c/2c/2c/' -e '2d/2d/2d/2d/2d/2d/2d/2d/' \
     -e '2g/2g/2g/2g/2g/2g/2g/2g/' -e '2|G2G2G2G2|' -e '4g/4a/4g/4a/4g/4a/4g' -e '=A|=c=A=A2=c=A=G=A|=c=A=A2=c=A=G=A|' \
     -e '=D2=D2|=D2=D2=D2|' -e '=E=F=G=F=F=F=F=F=E=F' -e '=G,|=G,=A,=A,=A,=G,=G,|=G,' -e '=G2|=G2=G2=G2=G2=G2|=G2' \
     -e '=G2|=G2=G2=G2|=G2' -e '=G=G=G=G=G=G=G=G=G=G' -e '=G|=A=c=A=A2=G|=A=c=A=A2=G|=A' -e '=G|=G=G=G=G=G=G=G|=G' \
     -e '=G|=G=G=G=G=G=G|' -e '=a=a=a=a=a=a=a' -e '=b=d=b=d=b=d=b=d=b=d' -e '=c|=d=c=A=c=A=c|=d=c=A=c=d=d|' \
     -e '=e=d=g=d=e=d=g=d|=e=d=g' -e '=g|=a=f=a=g=e=g|=a' -e '=g|=d=f=f=f=f=g|=d=f=f=f=f=g|=' -e 'A2G2A2G2A2G2A2G2A2A2G2A2G2A' \
     -e 'A2|=A2G2A2|=A' -e 'AcAcAcAcAcAcAcA' -e 'B,B,B,B,B,B,B' -e 'B/B/B/B/B/B/B/B' -e 'B=G=A|=B=c=d=B=c=B=G=A|=B=c=d' \
     -e 'BcB|BcBcBcB|BcB' -e 'CA,CA,CA,CA,CA,CA,CA,CA,CA,' -e 'D2|=D2=D2=C2|=C2=D2=D2|' -e 'DADDADDADDA' \
     -e 'EGGAGEDC|EGGAGACD|E' -e 'G,G,G,G,G,G,G' -e 'G,G,G,G,G,G,G,G' -e 'G,G,G,G,G,G,G,G,G,G,' \
     -e 'G,|G,2G,G,|G,2G,G,|G' -e 'G,|G,G,G,|G,G,G,|' -e 'G/G/G/G/G/G/G/G/' -e 'G2|G2G2G2|G2' \
     -e 'G=A=c=G=A=c=G=A=c=G=A=c=G=A' -e 'G|=G=A=G=G2=G|=G=A=G=G2=G|' -e 'G|=G=G=G=G=G=G=G=G|' \
     -e '\n\n\n' -e '\n|\n|\n|\n|\n|' -e '^A|^A^A^A^A^A^A^A^A|^' -e '^D|^D^D^D^D^D^D^D|^' \
     -e '^f=f=f^f=f^f=f^d=f^f=f^' -e '^g|^g=f=f^d^d^c^c^g|^g=f' -e 'a a a a' -e 'a=a|=a=a=a=a=a=a|=' \
     -e 'aaeaaeaaeaaeaaeaaea' -e 'abbabbaba' -e 'b b b b' -e 'b=a=g|=b=a=g=b=a=g|=b=a' -e 'c/2c/2c/2c/2c/2c/2c/' \
     -e 'c/c/c/c/c/c/c/c/c/c/c' -e 'cccccccccccccccccc' -e 'e/e/e/e/e/e/e/e/e/e' -e 'f=a=a|=c=e=e=f=a=a|=c=e' \
     -e 'f=e|=f=g=f=e=g=f=e=g=f' -e 'fBfBfBfBfBfBfBfBfBfB' -e 'f^d|^c^d^f^g^f^d|^c' -e 'g g g g g' \
     -e 'g=e=g|=a=e=e=a=a=g=e=g|=a=e=' -e 'g=g^g|^g^g=g^g=g^g=g^g|' -e 'g=g|=a=g=f=e=g=g|=d' \
     -e 'g=g|=d=g=g=d=g=g|' -e 'g|=d=g=g=b=g=g|=d=g=g=b=g=g|=d' -e '|(3DDDD2(3DDDD2|(3DDDD2(3DDDD2|' -e '|(G,G,G,G,G' \
     -e '|=A=F=A=F=A=F=A=F|=A=F=A' -e '|=A=G=A=C2=G|=A=G=A=C2=G|=A=G=A=C2=G|' -e '|=E=G=G=E=F=A=A=F|=E=G=G=E=F=A=A=F|' \
     -e '|=E=G=G=E=G=A=G=F|=E=G=G=E=G=A=G=F|' -e '|=G,2=G,2=G,2|=G,2' -e '|=G=A=G=c=G=G|=G=A=G=c=A=G|' \
     -e '|=G=E=E=G=A=B=c=A|' -e '|=G=E=E=G=G=E=G=E|=G=E=' -e '|=G=G=G=G=G=G=G=G|' -e '|=G=G=G=G=G=G=G|=G=G=G=G=G=G=G|' \
     -e '|=G=G=G=G|=G=G=G=G|' -e '|=a=f=a=a=f=a|=a=f=a=a=f=a|' -e '|=a=g=f=e=g=g|=d=g=g=d=g=g|=a=g=' -e '|=a=g=g=g=f=e|=d=g=g=e=g=g|' \
     -e '|=b=a=g=b=a=g|' -e '|=c=c=c=g=c=e=c=g=c|=c' -e '|=c=d=c=B=c=d=e=d|=c=d=c=B=c=d=e=d|' -e '|=c=g=e=g=c=g=e=g|=c=g=e=g=c=g=e=g|' \
     -e '|=d=c=e=d=c=e|=d=c=e=d=c=e|=d' -e '|=d=f=g=f=g=f=d=c|=d=f=g=f=g=f=d=c|' -e '|=d=g=g=d=g=g|=a=g=f=e=g=g|=d=g=g=d=g=g|' \
     -e '|=d=g=g=d=g=g|=a=g=f=e=g=g|=d=g=g=d=g=g|=a=g=g=g=f=e|' -e '|=d=g=g=d=g=g|=a=g=g=g=f=e|=d=g=g=d' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g' \
     -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|=d' \
     -e '|=e=d=e=d=e=d=e=d|=e=d=e=d=e=d=e=d|' -e '|=e>=f=g>=e=d>=c=c>=d|=e>=' -e '|=g=e=g=g=e=g|=g=e=g=g=e=g|' \
     -e '|=g=f=e=d=c=d=e=f|=g=f=e=d=c=d=e=f|' -e '|A,A,A,A,A,A,A,|A,A' -e '|C2CD2E|C2CD2E|C' \
     -e '|C2ED2E|C2ED2E|' -e '|D2D2D2|D2D2D2|D2D2D2|' -e '|D2E2D2D2|D2E2D2D2|D2' -e '|D2E2D2|E2D2C2|D2E2D2|' \
     -e '|EDDD|EDDD|EDDD|' -e '|EDEEDE|EDEEDE|EDEDED|' -e '|G,2G,2G,2|G,' -e '|G,A,A,A,A,2G,G,|G,A,A' \
     -e '|G,A,G,A,G,A,G,A,|G,A,G,A,' -e '|G,B,CDDB,|G,B,CDDB,|G,B,CDDB,|' -e '|G,ED|EG,G,|G,ED|' \
     -e '|G,EEEDEGG2|G,EEEDEGG2|G' -e '|G,G,A,G,A,G,F,G,|G,G,A,G,A,G,F,' -e '|G2A2A2G2A2|G2A2G2A2G2' \
     -e '|G2G2A2|G2G2A2|G' -e '|G2G2G2G2G2G2|' -e '|G2G2G2G2|G2A2G2A2|G2A2G2A2|' -e '|GB\n|GB\n|GB\n|GB\n|GB\n|GB\n|' \
     -e '|GGGGGGGGGGGGGGGGGGGG|' -e '|^A=c^c^A^G=F|^A=c^c^A^G=F|' -e '|^G^A^G^G2^G|^G^A^G^G2^G|' \
     -e '|^G^G^G^G^G^G^G^G^G|' -e '|^c2^A2^A^G^A=c|^c2^A2^A^G^A=c|' -e '|^g=f^d^c^a^a|^g=f^' \
     -e '|^g^a^a^a^g=f|^g^a^a^a^g=f|' -e '|^g^f^g^f^d^d|^g^f^g^f^d^d|' -e '|f/a/g/f/e/d/|f/a/g/f/e/d/|f/a/g/f/e/d/|f/a/g' \
     -e '|gggggg|gggggg|'
} # }}}}

echo "[" >> $JSON-encoded.json; echo "[" >> $JSON-text.json; # "](((((

# should we simply auto-rate all pieces where possible (skipping pairs where both are valid) and not ask for any manual ratings?
# to avoid duplicate comparisons, we split the set of items in half and select from the top & bottom half in each loop;
# if there are 100 shuffled items, we compare #1 & #51, then #2 & #52 etc; thus, every item will be used once and only once
POEMS=()
for file in $(ls /tmp/music-samples/sample-*.encoded | sed -e 's/\.encoded//' | shuf); do
    if [[ -f $file ]]; then POEMS+=("$file"); fi
done

I=0
LENGTH=$((${#POEMS[@]} / 2))
for ITERATION in $(seq 1 $LENGTH); do
    POEM=${POEMS[$I]}
    POEM2=${POEMS[$((I + LENGTH))]}
    echo "POEM: ${POEMS[I]}"

    FIRST=$(cat $POEM)
    FIRST_ENCODED=$(cat $POEM.encoded)

    SECOND=$(cat $POEM2)
    SECOND_ENCODED=$(cat $POEM2.encoded)

    # if first but not second is broken, second is the winner; if second but not first is broken, first wins;
    # and if both are broken, then we insert a pair where both win to tell the model that they are equally bad.
    # The check is a >100kb WAV file; if the ABC file is syntactically broken or too short to bother rating or
    # something goes wrong with abc2midi/timidity, then there will be no or small WAV files, so this checks most
    # errors. The other major error case is long repetitive degenerate ABC pieces generated by the model, so we
    # have a 'filterMusic' blacklist for snippets which show up in degenerate pieces.
    if ([ ! $(wc -c < $POEM.wav) -ge 100000 ] || [[ -n $(echo "$FIRST" | filterMusic) ]]) && \
           ([ ! $(wc -c < $POEM2.wav) -ge 100000 ] || [[ -n $(echo "$SECOND" | filterMusic) ]]); then
        # both broken: record a tie as a pair of ratings in which each sample wins once
        generateJsonBroken 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
        generateJsonBroken 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    elif [ ! $(wc -c < $POEM.wav) -ge 100000 ] || [[ -n $(echo "$FIRST" | filterMusic) ]]; then
        generateJsonBroken 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    elif [ ! $(wc -c < $POEM2.wav) -ge 100000 ] || [[ -n $(echo "$SECOND" | filterMusic) ]]; then
        generateJsonBroken 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    elif [ -z "$SKIP_ALL" ]; then
            echo -e "\n\e[1m--------------------------------------------\e[0m"
            echo "$FIRST"
            timeout 10s mpv --af=scaletempo=scale=1.1:speed=pitch $POEM.wav
            sleep 1s

            echo "============================================="
            echo "$SECOND"
            timeout 9s mpv --af=scaletempo=scale=1.1:speed=pitch $POEM2.wav
            echo "" # print a newline to make output easier to read and divide from the foregoing

            echo -e "[$I] 1: \e[1mFirst\e[0m wins | 2: \e[1mSecond\e[0m wins | 3: Equal | \
                          r: stop & auto-Rate Rest | x: e\e[1mX\e[0mit immediately"
            read -N 1 RATING
            case "$RATING" in
                1) generateJson 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                2) generateJson 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                # a tie is a pair of ratings in which each sample wins once:
                3) generateJson 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
                   generateJson 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                r) SKIP_ALL=1 ;;
                x) break ;;
                *) ;; # anything else: skip this pair without rating
            esac
    fi
    I=$((I + 1))
done
echo "]" >> $JSON-text.json; echo "]" >> $JSON-encoded.json

When I run the PPO in a screen session, I can extract the full terminal history, with all printed-out samples, to rate (C-a C-[ C-Space, PgUp to the beginning of the run, then C-Space C-> to save the terminal transcript to /tmp/screen-exchange), and filter out the samples, selecting only unique samples (important with divergence) of 42 characters or more for rating:

fgrep -v -e ppo -e 'k =' -e 'score =' -e 'kl =' -e 'total =' /tmp/screen-exchange | \
  sed -e 's/^X:$//' | sort --unique | sed -e 's/^/X:/' | sed -e 's/<|endoftext|>/\n/g'  | \
  sed -r '/^.{,42}$/d' | sed -e 's/^/<|endoftext|>\n===================\n/g'  -e 's/⏎/\n/g'| \
  egrep -v "^$" > $TARGET

## add the newly-encoded JSON ratings to the master dataset, remembering to close brackets:
emacs -nw abc-01-encoded.json irish.json
## update `` with the new dataset # of ratings, or else they won't be used
fgrep 'best' irish.json | wc --lines
# 16901
emacs -nw

Alternately, I could generate samples from a checkpoint (assuming it's not too far diverged):

./ sample --mpi 2 --save_dir /tmp/save/train_policy/irish-combined-20191223.18/ --savescope policy \
  --temperature 0.9 --nsamples 2000 --batch_size 30 | tee --append rlsamples-combinedabc-06.txt


~1 day iterations. Early on, each iteration required a few hours at most on my 2×1080ti (a few minutes for the reward model, then hours for PPO), and the PPO would diverge within ~3k iterations. As ratings accumulated, training the reward model began taking up to an hour (occasionally crashing with random OOMs at the end), and PPO began taking up to 24h, sometimes diverging as late as 9k iterations.

Example terminal output of an ABC music (combined model) PPO run, at 8,000 steps (~30 wallclock hours/60 GPU-hours)
Example TensorBoard logs of an ABC music (combined model) PPO run, in the process of diverging after two ‘bounces’ (full screenshot of an example divergence).

n = 7k ratings / 40 hours. I ran ~23 iterations of training and then rating samples; not all iterations diverged, because some crashed, or I accidentally killed them, or I decided they were showing signs of divergence (like a collapse in the entropy of the policy). Excluding ‘broken’ auto-ratings, I rated n = 7,429 pairs of ABC music (an overestimate of the actual ratings, due to my implementing ties as double-samples); including auto-ratings, I had n = 25,508. I found I was able to get through ~200 ratings per session (~1h) before my brain began leaking out my ears, so I sometimes had to take breaks. Since each comparison takes ~20s total (~10s per sample), this required a total of >40 hours of concentrated rating. (I initially tried doing other things while rating to lessen the time burden, but quickly discovered it was impossible to remember the first musical piece to compare it to the second piece if I was doing anything at all like reading.)

Divergences. The constant divergence created a lot of problems, and I tried to deal with them by automatically blacklisting examples with pathological patterns, but this did not help. Since the OA paper did not report any divergence issues, I tried going back to the OA setup by increasing the KL regularization, but while this generated different dynamics (instead of a gradual ‘double bounce’, there is a long steady state followed by a single abrupt collapse), it did not fix the issue:

Example TensorBoard of the combined model, diverging in a single bounce despite full KL regularization

Diverged examples:


KL regularization didn't help. Finally, I gave up: after 3 months & 7k ratings, if it wasn't working, it wasn't going to start working just because I spent a few more weeks adding more ratings. I ran one last iteration, stopping it at ~7k iterations, not long before I expected it to diverge but before entropy had collapsed too much (final RL training run TensorBoard log). Some samples from the final model:

“Bourrée à six de Briantes” sample (2020-01-25):
“100 GPT-2 Preference-Learning-Tuned Tunes” sample (2020-01-25):

Model & Data

Available for download:

  1. All ratings & samples (26MB; mirror)

  2. the final policy model:

    rsync -v rsync:// ./

Blind Ratings

No improvement from RL. I was not impressed by the quality of RL samples either during training or when sampled afterward; they did not strike me as clear improvements. (In contrast, the ‘spaceless’ ABC data-cleaning change made an immediate difference to samples.) To evaluate the final samples, I used the 7k-iteration checkpoint to generate samples (temperature: 0.95), compared them to the 117M ABC spaceless baseline (with top-p = 0.95), and adapted my rating script to load from the two sample files, randomize left/right presentation, and record which file won. I expanded the rating time to 20s per piece to allow more in-depth comparison.

I rated ~200 pairs, and the result was that I preferred the RL samples in 93 of 210 comparisons, or ~44%. If anything, the RL-finetuned samples were slightly worse than the baseline.
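
Is 93 wins out of 210 even distinguishable from a coin flip? An exact sign test (a minimal sketch; binom_two_sided is an illustrative helper, not from the codebase) gives a two-sided p-value of roughly 0.1, so there is no statistically significant difference in either direction:

```python
from math import comb

# Exact two-sided sign test: the probability, under a 50% null, of any outcome
# at most as likely as the observed 93 wins out of 210 comparisons.
def binom_two_sided(k, n):
    probs = [comb(n, i) * 0.5**n for i in range(n + 1)]
    return sum(p for p in probs if p <= probs[k] + 1e-12)

print(binom_two_sided(93, 210))   # roughly 0.1: not a significant difference
```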


“I have attempted science” (, 2019)

Despite considerable personal effort over 3 months, I did not achieve any improvement in sample quality, and the project failed. Since the technique does work in some cases, how could I have fixed it? Hindsight. In retrospect, I would have done a few things differently:

  1. prefix completions: the root cause seems to be the reward model not learning adequately. Even initialized from a good music-generation model, esthetics may be difficult to learn from few n with paired comparisons where the pairs are completely dissimilar.

    The OA tasks, on the other hand, made heavy use of completions: samples which share a long prefix, and then diverge. Because the prefixes are identical, such pairs differ far less than 2 random samples would, and so the same rating is much more informative. It's a kind of statistical-power issue, similar to using within-subject designs rather than random people—the results are the same, but you need orders of magnitude smaller n.

    I avoided conditional samples because it made the programming much easier to not have to count BPEs or slowly generate 2 completions for each possible prefix; I could use random pairs of samples collected from anywhere, and it mapped directly onto my goal of unconditional generation (if I used conditional generation, where would the prefixes come from?). These all seemed like good enough reasons at the time, but given the final results, this may have been a mistake.

    Another idea (made much more difficult by the rigidity of the inputs & config) is to use “curriculum learning” (eg ): there are at least two straightforward ways of providing easier sub-tasks than generating a full music piece. First, the required length can be gradually expanded over training—once it learns to generate 5s of music that the critic can't distinguish, require it to generate 10s, etc.

    Second, real music can be used as a crutch by providing the generator with a decreasing prefix from real music as a ‘seed’: once it can append 1 note successfully, require it to append 2 notes, then 3 notes, and so on, until the prefix is 0-length and it is generating music sequences from scratch. (This can be done with or without using a supervised log-likelihood loss for training the NN to generate the prefix.)

  2. more hyperparameter tuning: there's no support for hyperparameter optimization in the codebase, but instead of setting the hyperparameters based on my experience & intuition, I could have run a more formal hyperparameter search, grid-searching it manually. Since the reward model typically takes less than an hour to train, a few hundred runs would have been feasible over the 3 months of my project, and I would have much more confidence that the reward model was squeezing as much out of the ratings as possible.

    As it is, I'm left with a nagging doubt—was the LR just too high, or too low, and could the reward model have gotten good enough to provide a useful signal to the PPO and train a genuine improvement to the music generation?

  3. tried crowdsourcing: I didn't want to involve third parties until I knew it would work, or to try to set up a website for interactive generation/rating, but crowdsourcing may have been necessary to collect a decent-sized dataset. While it would not have gone hugely viral like other DL projects have, a few thousand visitors rating a dozen comparisons each would've gone a long way.

  4. checked auto-ratings more: the auto-ratings seemed like a great idea at the time—if the model kept generating samples with similar pathological behavior, or if they were syntactically broken, why hand-rate them at all? But now I have misgivings. Banning the pathological samples was probably OK, but did I screw up the reward model by banning broken samples? After all, they made up the overwhelming majority of the corpus at the end, so I may have inadvertently produced a ‘class imbalance’-style problem: the reward model wound up focusing entirely on trying to understand syntactic flaws, rather than esthetic flaws.
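
The statistical-power argument in point #1 can be illustrated with a toy simulation (all numbers illustrative): model each sample's apparent quality as a true effect plus ‘nuisance’ variation (key, style, subject...) plus rater noise; with a shared prefix, the nuisance term is common to both samples and cancels out of the comparison, so each rating is more often correct:

```python
import random

# Toy model: 'better' truly exceeds 'worse' by a small effect, but both carry
# large nuisance variation. A shared prefix means a shared nuisance term, which
# cancels in the comparison; random pairs get independent nuisance terms.
def correct_rate(shared_prefix, effect=0.3, nuisance=1.0, trials=20_000):
    rng = random.Random(0)
    correct = 0
    for _ in range(trials):
        if shared_prefix:
            n0 = n1 = rng.gauss(0, nuisance)    # common to both completions
        else:
            n0, n1 = rng.gauss(0, nuisance), rng.gauss(0, nuisance)
        better = effect + n0 + rng.gauss(0, 0.5)  # plus rater noise
        worse = n1 + rng.gauss(0, 0.5)
        correct += better > worse
    return correct / trials

print(correct_rate(shared_prefix=True))   # ratings mostly pick the better sample
print(correct_rate(shared_prefix=False))  # much closer to a coin flip
```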

Oh well.


A bridge too far. I initially began with RL training of the poetry model, but after a few iterations, I abandoned this idea. The samples were too hard to rate quickly, and were heavily biased towards modernist Poetry-Foundation-based pieces, which, lacking formal structure, had to be judged largely on semantics—but every poem sample is broken differently, requiring world-knowledge to fix, and how likely was it that the rating model could hope to pick up on these issues from mere pairwise comparisons? Even the OA paper didn't try to fix semantics, settling for generic tasks like ‘descriptiveness’ or ‘summarization’ (as opposed to finetuning for tasks like ‘make accurate statements about the world’). So I switched to improving the ABC music, reasoning that fixing issues like too much repetition ought to improve music quality, while not being nigh-impossible for the reward model to learn from comparisons.

I did pick out a few good poetry samples along the way:

A man is sitting at the foot of his host.
Every eye is filled with watching and dread.
And he speaks with a strange and fearful sound.
All, all are silent and no sound,
Save what the Spirit sings
In the calm hour of your pensive sleep;
A measured, silent, and mysterious sweep
Of the invisible,

"What voice is this?" the Captain cries
In a voice of troubled joy,
As he deems the gathering night-wind, and
The voices of the years away,
But knows the toot of the sands of tears
And the burden of our own despair.
He calls for his comrades and his wife
To wait upon him in the tent;
But they do not, as he says to them,
Because they fear to die;
They turn and journey on their way,
For they realize their fate.
Why does the Captain wait and wait
In silent and unseeing wait?
He has not come for his reward,
Nor will it be too late.

..."The Farm", by James Thomas Stevens [Nature, Landscapes & Pastorals]
The yard is an x of sun
shot across and then dividing
black with silver mesh in,
then crossing through it into salt,
becoming a tiny dream
so it can sometimes dream of water
across the tree. The house
is a part of the yard
and the sun is going down
past the barn. At home
the house is a field of cream.
A few lampshins
flare at the door.
A door is at the threshold
and when the house of cheese turns brown
the house turns red.
The house is an x of sun
and when the house of feed turns red
the house turns green.

...I hear him with pleasure roar through the wood,
A melody as of rushing from the main;
He sings as he treads the bound of human things,
Borne on the wings of the blast as I sweep along,
The music of all whose names I love to hear,
Like music heard in a wind that murmurs near,
The music of all who hear.

I think of thee I know, oh, come from far,
From this green world and tracks by unknown lands;
Come to me, all that still is beautiful,
Come all that dwell in nature, all resigned,
And all that glows with beauty bright and free,
Yea, all that glitters like your beautiful eyes,
And all that lives like your beautiful hair,
And all that mocks at with a tranquil air,
Weeneth more of the sweetness of your voice,
Wandering on as it wanders still and free,
With earth in every drop and spot on earth,
By night and day and starry night.

Perfect is this life,
And end is death. And so to end is life.
How many of us
Have come to this, and died
Like birds. Here, in a quiet room within
A sombre room, where even the gravest dead
In all the ills of life are counted down.
In the broad company of light and death,
I watched a silent procession of them die;
And one by one, by three,
Passed slowly out into the waiting dark.
We left behind us in a room of grief:
Her voice, her hands I laid upon mine eyes,
Stretched over mine, and strove to think in vain
We loved together in a world of tears.


Data Increases


1–10m labels too expensive to buy. If we need 70k labels to get good performance on a relatively straightforward task like summarization (which can be solved to a considerable degree just by copying selected parts of the input), it's easy to imagine that we might need an order of magnitude or two more data for subtler tasks like music. 1–10 million ratings is totally infeasible for one person on their own, and would cost far too much to pay a data labeler for as well5.

Crowdsourcing scales to 10m+! Could we overcome the lack of ratings by using crowdsourcing? Such sample sizes appear to be entirely feasible with crowdsourcing: the global public is interested in AI and generative art, and can contribute a lot of time en masse, donating millions of interactions, and the necessary infrastructure does not require enormous resources (many successful projects were done by hobbyists or interns). Some examples:

  1. : within 2 months of launch, the turn-based GPT-2-text-dialogue game AI Dungeon 2 (AD2) had racked up >100m turns.

  2. : >1m unique visitors within the first month, spending several minutes on average and looking at dozens of faces. TWDNE is only one of a number of “This X Does Not Exist” sites, usually based on StyleGAN models, inspired by , and the total number of visitors to TXDNE sites is likely into the tens of millions.

    • : an anime face generator similar to TWDNE, which similarly went viral. Sizigi Labs estimates 325,000 sessions 2019-10-30–2020-01-28 (well after launch & virality), at ~2 minutes/session; their analytics were broken at launch, but “if I had to guess, we’re somewhere 1-3MM lifetime [users].” Given how popular it was during its virality and the number of links & mentions I’ve seen on social media, I definitely believe it had at least as many unique users as TWDNE did.
  3. : the homepage reports generating 56,391,540 images between its launch ~2019-09-09–2020-01-27; the standard breeding interface shows 6 possible images, so that corresponds to ~9m user actions/clicks.

  4. : when made available by OA for a weekend in April 2019 for public play, there were 42,723 DoTA2 games against 30,937 players, taking a total of 93,796 man-hours.

  5. : OA reported in the first day of MuseNet availability: “In the last 24 hours, we saw: 100k plays total of pre-generated MuseNet songs / 38,000 MuseNet samples co-composed / 29k unique MuseNet concert listeners”.

    MuseNet samples are typically 1–3m long, and the concert was 3 hours long, suggesting thousands of man-hours spent listening to/generating MuseNet samples in the first day. Presumably the counts increased by at least another order of magnitude in the following week as they ran a competition for the best generated sample of the day.


Off-policy RL/semi-supervised learning. We do not necessarily need explicit ratings from humans if we can leverage existing algorithms and datasets to construct synthetic or pseudo-rating datasets. They do not need to be perfect or human-quality to greatly reduce how many human ratings are needed, similar to how pretraining GPT-2 on ABC for transfer learning makes preference learning feasible at all on that domain. From an RL perspective, PPO may be an ‘on-policy’ algorithm which can learn only from rewards on samples it just generated, but the reward model itself can learn from ratings on samples generated by any process: it is ‘off-policy’. The samples could be generated by humans, by non-GPT-2 NNs, or by non-NN algorithms entirely.

To kick-start the learning process, you could ‘pretrain’ the reward model by generating lots of music from low-quality generative sources and then marking them all as the loser in a set of comparisons with higher-quality sources (such as real music). For example, one could define a few music generators (random ASCII characters, n-grams, char-RNNs at various temperatures) to generate a million fake music sequences, take the real music from the ABC Irish music corpus, and create comparisons with the real music always the winner. If there is popularity data on the real music, then this too can be used to pre-generate comparisons (just have the more popular of two pieces win each comparison). The pretraining comparisons can reflect as much additional information as you think you can get away with. Along with popularity ratings to make distinctions between comparisons of the real music, why not order the comparisons by data-source quality as well? eg. random < n-gram < char-RNN < GPT-2 < GPT-2-PPO-tuned. There might be mistaken comparisons (perhaps sometimes the n-grams really do beat the char-RNNs), but this is amenable to fixing by active learning on the persistently misclassified comparisons, should it be an issue. This immediately provides an enormous corpus for the preference classifier, and when it has finished training on that, one can bring the human into the loop and start generating/comparing/retraining as in regular preference learning.

More generally, you can see the pretraining + preference learning as a form of semi-supervised learning, with an initial unsupervised bootstrap phase followed by supervised learning as necessary:

  1. use unsupervised learning methods to create generative models based on a corpus
  2. sample from the generative models to create fake datapoints
  3. create m comparisons by pairing random fake datapoints with real ones (the real datapoint always winning)
  4. train a reward model on those comparisons
  5. begin regular preference learning
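As a concrete sketch of steps 1–3, one could build the pretraining comparison set like this (the generator names, toy samples, and `make_comparisons` helper are all hypothetical illustrations, not anything from the actual codebase):

```python
import random

# Sources ordered from worst to best, per the bootstrap heuristic above:
# random ASCII < n-grams < char-RNN < real music.
QUALITY_ORDER = ["random", "ngram", "char-rnn", "real"]

def make_comparisons(samples_by_source, n_pairs, rng=random.Random(0)):
    """Pair samples drawn from two different sources; the sample from the
    higher-quality source is always recorded as the winner."""
    comparisons = []
    for _ in range(n_pairs):
        a, b = rng.sample(QUALITY_ORDER, 2)          # two distinct sources
        sample_a = rng.choice(samples_by_source[a])
        sample_b = rng.choice(samples_by_source[b])
        winner = a if QUALITY_ORDER.index(a) > QUALITY_ORDER.index(b) else b
        comparisons.append({"a": sample_a, "b": sample_b,
                            "winner": sample_a if winner == a else sample_b})
    return comparisons

# Toy stand-ins for the million fake sequences + real ABC corpus:
corpus = {"random":   ["xq9#v", "kd;a2"],
          "ngram":    ["abcab", "cabca"],
          "char-rnn": ["X:1\nT:Reel", "X:2\nT:Jig"],
          "real":     ["X:1\nT:The Butterfly"]}
pretraining_set = make_comparisons(corpus, n_pairs=1000)
```

The resulting synthetic comparisons can then be fed to the reward model exactly as if a human had made them, before any human ratings are collected.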

Architectural Improvements

I believe the current Christiano blackbox preference-learning approach (/) could be improved to make it more compute-efficient, sample-efficient, and simpler. There are two ways that seem particularly relevant for music/text generation:

  1. directly optimize reward by backprop: The optimization of the reward does not require taking a blackbox approach where the ‘environment’ is not modeled, requiring an agent like PPO; the ‘environment’ is simply the reward model, which is a neural network and can be queried, differentiated, and optimized over like any other.
  2. directly model quality score: The reward model can be improved in flexibility, interpretability, and efficiency by explicitly treating it as a Bradley-Terry model, and training the NN to predict the intrinsic ‘quality’ score (rather than raw comparisons), which can be easily estimated by standard statistical methods given a dataset of ratings.

The new architecture & training would go like this when combined:

  1. Data collection:

    1. do Pairwise Ratings on a corpus, with enough overlap to form a ranking
    2. run a B-T Ranking algorithm to infer the latent quality score for each datapoint
    3. Supervised (Re)Training of the reward model on data→score
  2. Policy improvement:

    1. for each datapoint (either randomly-generated or from a corpus):

      1. Encode it into the text embedding
      2. run iterations of Gradient Ascent on the reward model to optimize the embedded text sequence, until an iteration limit i is hit or until its reward (quality) is higher than the average reward (quality) of the previous corpus
    2. replace the previous corpus with the new Improved Corpus

    3. [optional] (Re)Train a generator/agent-model by likelihood-training/imitation-learning on the new corpus (‘amortized inference’)

Optimization by Backprop, not Blackbox

Here I propose changing the agent/generator model architecture to explicitly optimize the reward model’s utility/reward score, by removing the agent/generator entirely and instead improving possible sequences by gradient ascent on the (differentiable) reward model. There is no need to build a redundant agent model when the reward model is differentiable and can be used to directly specify how an input sequence ought to change to improve it.

This simplifies the overall architecture greatly, avoids expensive & unstable & complex blackbox training of DRL agents, and enables easy generation of both high-scoring & highly-diverse (thus informative) sequences for an oracle to rate, which can then be fed back into the reward model for further training. To the extent an agent/generator is necessary to efficiently generate many sequences, it can be trained quickly & stably by imitation learning on a corpus of datapoints optimized by the model.

While running PPO against the reward model, I concluded that, compared to other approaches I’ve seen for optimizing the outputs of a NN, the blackbox preference learning approach has 2 major flaws:

  1. Compute-Inefficient: it is slow and memory-hungry (I have to use GPT-2-117M to fit reasonable minibatches onto 2×1080ti, and even then iterations can take days)
  2. Single-Divergence-Prone: it ‘mode collapses’ into adversarial samples, typically highly repetitive, and typically eventually only one adversarial sample

Slow feedback: 1 day, 1 counter-example. This makes iteration slow because of the double-whammy: each run takes days before any score improvements or divergence, and when it diverges, it typically yields only a handful of usable adversarial datapoints to rate & retrain on. Thus, the frustrating experience of seeing each run end in just one adversarial sample, which may be only somewhat different from the previous run.

Mode collapse. Thinking about this, a blackbox RL approach doesn’t seem quite right. For an RL problem, it’s fine to find only a single path which leads to a high reward. To put it in GAN terms, this ‘mode collapse’ onto a single adversarial example is, as far as the agent/Generator is concerned, a 100% valid solution. The ‘environment’ has no memory, and cannot penalize the agent/Generator for repetition. If there exists any string “XYZ” which, on its own or appended to any other string, causes the reward model/Discriminator to always output the maximal reward, then why does the agent/Generator need to learn anything else? It won. But that’s not really what we want. We want it to sample from the full distribution of high-quality sequences. Unfortunately, mode collapse is not solved in GANs, and I can’t think of any easy way to fix it in this preference learning either.

Ask the reward model how to edit samples. One approach to avoiding both those issues is to drop the blackbox optimizer approach entirely—which incentivizes wasting a ton of compute to find a single adversarial example—and instead optimize datapoints directly. It seems like a waste to go to all this effort to build a differentiable surrogate (reward) model of the environment (the human), and then treat it like just another blackbox. But it’s not—that’s the whole point of preference learning! Since GPT-2 is differentiable, it ought to be possible to backprop through it to do planning and optimization like . Typically, we hold the inputs and outputs fixed and use backprop to adjust the model, but one can instead hold the model fixed and adjust the inputs based on backprop to give a desired output: in this case, hold a GPT-2 reward model fixed, and adjust textual inputs to make the output, the reward, larger, by backpropping from the output back through the model to the current input. This is an approach which , and ‘optimizing images’ to maximize a CNN’s probability of classification as a ‘dog’ or ‘cat’ etc has long been done as a way of visualizing what a CNN has learned. For example, one could generate high-scoring music pieces by generating a random sequence, text-embedding it into the vector for the reward model, and then doing gradient ascent on the vector. (No PPO cluster required.) This is equivalent to doing planning/revising: at each iteration, GPT-2 ‘considers’ the sequence as a whole and can make global changes, rather than only local changes to the final entry in the sequence; over many iterations, it can ‘edit’ a sequence repeatedly, rather than being forced to generate the entire sequence in a single shot like PPO is. This could be a lot faster, since it exploits the whitebox nature of a learned reward model instead of treating it as a high-variance blackbox.
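As a toy illustration of holding the model fixed and ascending on the input: here a hand-written differentiable ‘reward model’ with an analytic gradient stands in for the GPT-2 reward network (whose gradient would come from autodiff in practice); the `IDEAL` vector and all dimensions are made up for the example.

```python
import random

# Toy stand-in for a differentiable reward model: reward peaks when the
# input embedding matches a fixed 'ideal' vector. In the real setting this
# would be the GPT-2 reward network, differentiated by a framework.
IDEAL = [0.3, -1.2, 0.8, 0.0]

def reward(embedding):
    return -sum((e - i) ** 2 for e, i in zip(embedding, IDEAL))

def reward_grad(embedding):
    # Analytic gradient of the toy reward with respect to the *input*.
    return [-2 * (e - i) for e, i in zip(embedding, IDEAL)]

def gradient_ascent(embedding, lr=0.1, steps=200):
    """Hold the reward model fixed; adjust the input to raise the reward."""
    x = list(embedding)
    for _ in range(steps):
        g = reward_grad(x)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x

rng = random.Random(0)
start = [rng.uniform(-2, 2) for _ in IDEAL]   # random initial 'sequence'
optimized = gradient_ascent(start)            # planning by editing the input
```

The ‘edit’ happens entirely in embedding space, with no agent network and no RL rollouts; each step considers the whole sequence at once.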

Example: PPLM. A similar approach to optimizing GPT-2 outputs has since been published by Uber as “PPLM” (; ). PPLM uses the gradients from GPT-2 and a control NN to do simultaneous gradient ascent, trying to optimize an input to maximize both likelihoods, thereby maintaining sensible English text (due to GPT-2) which still maximizes the parallel target (such as a ‘positivity’ goal).

Another possibility would be to try beam search (although it has produced bad results in NNs, as discussed in the nucleus sampling paper, perhaps due to the log-likelihood training encouraging repetition) or the expert iteration/MCTS training from AlphaGo Zero. MCTS was originally introduced for planning in general MDPs and isn’t inherently limited to two-player games; the “rules” of generating sequence data are trivial (anything ASCII, in this case), and the discriminator provides a well-defined reward. So instead of a NN which directly generates a next character, it could instead (given a particular prefix/history) output values for the 128 ASCII values, run MCTS search for a while, produce a refined value for each character, and retrain the NN towards the refined values; every minibatch of the generator, one generates a bunch of examples for the human to judge, providing a new minibatch for the discriminator. Hence, tree-iteration learning-from-preferences deep RL. With music we don’t necessarily need the stable self-play that tree iteration provides, since I’m not too clear conceptually what one would expect self-play to deliver (it is inherently a human-defined problem, as opposed to Go, where the criterion is external and not human preferences), but given the AlphaZero & Anthony’s Hex results, this could be considerably more computation-efficient, by providing much more supervision at each timestep instead of just a little bit of supervision from the end result of win/lose with REINFORCE. Possibly also more human-sample-efficient?

Bootstrap. Ratings can be done pairwise on the various optimized sequences (random pairs of high-scoring sequences, although before/after comparisons might be more informative), and then the reward model trained.

Amortized inference. If gradient ascent is too slow for routine use, then one can just distill the reward model by training the GPT-2 on successively better corpuses in the usual efficient quick likelihood-training (imitation learning) way, for a sort of ‘expert iteration’: generate improved versions of a corpus by generating & selecting new datapoints above a threshold (perhaps using a corpus of human datapoints as starting points), and train to generate that.

Automatic editing via gradient ascent. Gradient ascent can be used to control and optimize text in various ways. For example, fiction could be edited in one region (say, to change a character’s name from “Miriam” to “Mary”) and then the edited region could be held fixed during gradient ascent, while the rest of the unedited text is free to vary; this would propagate the edits, because self-contradictory text is unlikely while self-consistent text is more likely. (Because it is a more likely text for the protagonist to be consistently named “Mary” throughout the entire text, rather than named “Mary” in one place and “Miriam” everywhere else.) One could produce multiple versions of a text by speculative edits—what if this character was a pirate? what if the protagonist lost this battle instead of winning? what if a banquet scene was deleted entirely?—and select the best one. One could also do this on heterogeneous sets of text, such as collaboratively-edited works like the : there is no linear structure like a novel, but one could take an edited entry and concatenate it with an arbitrary second entry, do ascent on the second entry to make it more consistent, and save the modified version; iterated repeatedly over the entire set of entries, one would witness the wiki ‘growing’ organically—changes in one entry, like a new SCP or character, will automatically pop up elsewhere in logical ways, with all seams gradually smoothed over into one evolved whole.

Parallel generation of counter-examples. If nothing else, I think it would help with the adversarial instances. Part of the problem with them is that each PPO run seems to collapse into a single specific adversarial instance. I can do a bunch of ratings which penalize all variants of that instance, which fixes it, but then I must wait another day or two, and then that run collapses into a new single adversarial instance. The reward model seems to gradually get better, and the adversarial instances seem to gradually increase in complexity, but the process is slow and serial. The gradient ascent approach may also run into the problem that it will find adversarial instances for the reward model, but at least it will do so in parallel: if I can run a minibatch of n = 11 GPT-2-117M reward models, each starting with a different random initial sequence being optimized, and do gradient ascent on each in parallel, they will probably find multiple adversarial instances in parallel, while the PPO would only find the one it collapses on. So one would get a lot more useful adversarial instances to rate per run.

Optimizing embeddings = nonsense? One of the most likely drawbacks to such gradient ascent approaches on the embedding is the possibility that the maximized embedding will not then convert back to any kind of sensible discrete symbol sequence, a failure mode which has caused trouble in attempts to do and on T5 models (requiring adding VAE autoencoder-like constraints to make the latent space ‘smooth’, and—at least with extremely low-dimensional latent spaces like n = 2–10—tokens can be too separated to be reached by gradient-following).

Bradley-Terry Preference Learning

Christiano et al 2017 introduced a deep reinforcement learning architecture for learning “I know it when I see it” subjectively-defined reward functions from human feedback: a human makes comparisons of actions/datapoints/episodes to select the ‘better’ one, a NN is trained to predict the better one based on these comparisons, and another NN is RL-trained based on the predicted comparisons interpreted as a reward. Since the human is unable to write down a conventional reward function in software, the predictor NN (analogous to a Discriminator in a GAN or a ‘critic’ in actor-critic RL) learns the reward function by example, and then the RL agent NN (analogous to a Generator in a GAN) learns by trial-and-error what sequences will optimize this complex reward function, with the human feedback providing additional guidance on new parts of the problem as the pair of NNs bootstrap into better performance. This was demonstrated on video-game and robotic-style simulations, but appears equally applicable to other sequence problems where reward functions are impossible to write and existing losses like maximum likelihood are imperfect for generation (such as music or poetry composition).

As originally framed, the predictor merely does comparisons, receiving & providing binary feedback. This is justified as being implicitly equivalent to a standard pair-comparison/competition model, the (akin to the famous ELO), where each datapoint has a latent variable on a common cardinal scale (often, like a , scaled to for convenience), producing a total order which efficiently extracts all possible information from the comparisons.
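Concretely, the Bradley-Terry model assigns each datapoint i a latent quality q_i, and models the outcome of a comparison as a logistic function of the quality difference:

```latex
P(i \succ j) \;=\; \frac{e^{q_i}}{e^{q_i} + e^{q_j}} \;=\; \frac{1}{1 + e^{-(q_i - q_j)}}
```

So fitting a B-T model is just logistic regression on quality differences; the q_i are identified only up to an additive constant, and an Elo rating is essentially such a q on a rescaled logistic scale.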

I suggest that this is not necessarily the case, as examples from GANs indicate that such a preference-learning architecture may be learning something odder (such as memorizing comparisons), and that the architecture could be improved by removing the implicitness of the B-T ranking and computing the B-T rankings directly (which can be done even with non-overlapping comparisons by using a Bayesian model with priors and using covariates such as the predictor’s own estimates), thereby providing absolute quality scores for correctness of comparisons, more efficient regression, RL rewards, and meaningful interpretable scores for downstream uses.

The motivation for the double-critic architecture in the current blackbox approach is that the data being collected from humans is pairwise, and so one trains the critic to predict comparisons. This outside training loop then has an inner G/agent training loop etc. The double training loop is necessary to collect ratings from new areas of statespace that the G/agent can now access, but also, GAN-style, to avoid the D/critic becoming too powerful and saturating the loss. (In the original Christiano implementation, only the most recent n = 3,000 comparisons are stored, to avoid such problems.)

But just because the input is pairwise doesn’t mean that the output must also be pairwise. (After all, many things, like tournaments, turn a series of pairwise comparisons into final scalar values like ‘rank’.) It could instead be a scalar indicating global rank, with the D/critic performing regression. GANs and DRL are closely connected (// eg. is analogous to imitation-learning + RL-finetuning), and in both fields, a richer reward signal is always better, allowing for stabler, faster training to better final performance. And a global rank is more informative than a comparison.

Full Bradley-Terry Training

Extract the rankings fast. A Bradley-Terry (B-T) model is simple and easy to estimate on even large samples, and can easily produce cardinal rankings. Each datapoint gets an estimated cardinal ranking in standard deviations of a hypothetical latent Gaussian. The D/critic is then trained to do regression from a single input to the estimated latent variable of quality.
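A minimal maximum-likelihood B-T fit, as a pure-Python sketch (a real analysis would use an existing implementation, such as the R `BradleyTerry2` package, or a Bayesian model to get the posterior uncertainties; the toy data here is invented for illustration):

```python
import math
import random

def fit_bradley_terry(comparisons, n_items, lr=0.01, steps=1000):
    """Gradient ascent on the Bradley-Terry log-likelihood,
    P(i beats j) = 1/(1 + exp(-(q_i - q_j))), estimating each
    datapoint's latent quality q_i from (winner, loser) index pairs."""
    q = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(q[winner] - q[loser])))
            grad[winner] += 1.0 - p_win   # push the winner's quality up...
            grad[loser]  -= 1.0 - p_win   # ...and the loser's down
        q = [qi + lr * gi for qi, gi in zip(q, grad)]
        mean = sum(q) / n_items           # q is identified only up to a
        q = [qi - mean for qi in q]       # constant, so center it at 0
    return q

# Toy check: 3 items with true qualities -1 < 0 < +1, noisy comparisons.
rng = random.Random(0)
true_q = [-1.0, 0.0, 1.0]
comps = []
for _ in range(300):
    i, j = rng.sample(range(3), 2)
    p_ij = 1.0 / (1.0 + math.exp(-(true_q[i] - true_q[j])))
    comps.append((i, j) if rng.random() < p_ij else (j, i))
scores = fit_bradley_terry(comps, 3)   # recovers the ordering of true_q
```

The recovered scores are exactly the regression targets for the D/critic: one cardinal number per datapoint, on a common latent scale.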

So the new loop would look like this:

  1. run off-the-shelf B-T ranking over a dataset of comparisons of datapoints

  2. extract the estimated latent variables for each datapoint

  3. until convergence, supervised training of a D/critic NN to predict the latent for each datapoint

  4. until convergence, RL training of a G/agent NN with the D/critic NN

  5. sample n new datapoints from the trained G/agent NN and add to the dataset

  6. run B-T ranking over the augmented dataset

  7. ask the oracle for ratings of the m datapoints with the largest posterior uncertainty or some proxy thereof like standard error (which will usually be the new datapoints)

    • active sampling or bandit algorithms can be used to maximize the informativeness

Is Preference Learning a Bradley-Terry Model?

What’s the dif­fer­ence?

Not extracting all information. By using only comparisons, each predictor training step is less meaningful, and even if the predicted variable is still the outcome of comparisons, not fitting a B-T model means that one can’t train on comparisons between all datapoints (since one needs the B-T model to predict, based on the global ranking, what the outcome would be).

Pairwise yields incoherent ranking? It is also unclear that the preference-learning architecture is implicitly estimating a B-T model. (I am not familiar with any paired-comparison approaches which optimize ML models to predict a fixed set of comparisons, or which work purely on disconnected comparisons.) Because no global ranking is ever constructed, no comparisons can be trained on other than the exact ones the human made, and that may not be enough training signal to force inferring a global ranking, rather than merely learning locally-consistent pairwise comparisons which are nevertheless globally inconsistent (with cycles like rock-paper-scissors). The predictor may be learning something much simpler, such as features which distinguish within each fixed pair, without learning what we thought it was learning—generalizable quality features which allow a meaningful global ranking across all pairs.

What does a D do? In a GAN, you have real and fake datapoints being compared; the D attempts to regress the probability of each point being a winner/loser, so to speak, producing a log probability (in the original formulation); does D learn generic features of quality or realism? Apparently not, because even a ; and in , when I use a well-trained StyleGAN D to rank real data, the rankings are strange, with outliers ranked both low & high, suggesting that garbage data can be ranked extremely confidently by the D simply because it could easily memorize them as outliers. So, we have a case where a D/critic is trained on comparison data from an oracle (real vs fake), is useful for training, outputs a variable which looks exactly like an ELO and even has an ELO-like theoretical interpretation—and is completely ungeneralizable and not learning anything remotely like a cardinal score or even a transformation thereof like an ELO. What is going on? Apparently the D is memorizing real datapoints, and pushing the G away from them and toward nearby potential datapoints.

Ds just memorize? Why can’t this be the same thing for the preference-learning D? It is given a small dataset consisting of fixed pairs of good/bad datapoints, and it memorizes bad datapoints within a fixed pair, latching on to some feature or other (possibly important features, but they could also be the ‘non-robust features’ involved in adversarial learning) in order to memorize just within that pair (if it can overfit…), and this then pushes the G away from trajectories that look like bad datapoints, producing useful training just like in a GAN.

Weak learning signal. This would be consistent with the paper’s reported success, but would have a different interpretation: the D is not learning any generic quality metric, is not implicitly ranking all datapoints on a common scale of reward, and is not equivalent to a B-T model. It is merely memorizing some datapoints or some ungeneralizable non-robust features which happen to let it distinguish within the pairs. As such, it can’t provide a stable ranking within or across iterations or datasets, and its feedback is of limited value (since once the G/agent has moved sufficiently far away from the penalized memorized datapoints, they no longer provide a training signal for more improvement, and new relatively-bad datapoints must be learned and penalized).

Active Learning

Estimated rankings can prioritize comparisons. As implemented, preference learning is (potentially, assuming it is equivalent to B-T) more sample-efficient than a naive B-T: each datapoint appears once in a unique comparison (rather than in multiple comparisons with multiple other datapoints), and so each comparison is potentially maximally efficient (in the sense that each additional comparison involving a datapoint provides the predictor less information than the first one did). A naive B-T, like the usual frequentist implementation, requires multiple comparisons to connect all datapoints via a chain of comparisons, and may be undefined if any datapoints are ‘unconnected’.

A Bayesian B-T model mitigates this by having priors on any new datapoint, which provide a meaningful estimate even with few or no comparisons. (With no comparisons, the posterior mean is simply the prior mean, presumably something like 0.) The estimates aren’t informative, but they are well-defined and can be used for sampling strategies.

The lack of comparisons can also be partly made up for by using covariates. There are two particularly relevant covariates which could be used:

  1. the predictor’s own ratings of each datapoint:

    Since the predictor should be able to reach high accuracy, its estimate before any comparisons should be quite accurate and reduce the posterior uncertainty considerably (despite there being no comparisons). This can be particularly useful for a sampling strategy, because it can help discard samples which are estimated as low-quality and not informative about the best samples that we want to reach.

  2. the current epoch/iteration:

    Since we hope the generator/agent is also improving, the iteration a datapoint was generated in is relevant: early datapoints should be bad, intermediate datapoints should be medium, and recent datapoints should be the best. The first few comparisons inside a batch give a strong indication of how good the batch is overall, and the quality can also be extrapolated from earlier iterations by fitting a progress curve (like a log or spline).

An example of a sampling algorithm would be a best-arm racing algorithm. Since in this scenario we’re trying to teach the NN to generate the best datapoints, we don’t value variance reduction elsewhere; we want certainty about the best datapoints, in order to penalize the NN for generating any inferior datapoints. A simple posterior-sampling racing algorithm for B-T might go like this:

  1. take the arm/datapoint with the highest posterior mean ranking, which is estimated to be the best;
  2. sample from the posterior a possible ranking for every other datapoint;
  3. compare the best known datapoint with the highest posterior-sample;
  4. update.

Best-arm bandit. This explores datapoints based on their remaining posterior probability of being the best. (I used this once to .) This can be applied to the k best datapoints for batch evaluation, etc.
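A sketch of that racing step, with independent Gaussian posteriors per datapoint standing in for the full B-T posterior (all the means/SDs here are made up for illustration):

```python
import random

rng = random.Random(0)

def pick_comparison(post_means, post_sds):
    """Return the pair (champion, challenger) to show the human rater."""
    # Step 1: the datapoint with the highest posterior mean is the champion.
    champion = max(range(len(post_means)), key=lambda i: post_means[i])
    # Step 2: Thompson-style draw of a plausible quality for everyone else.
    draws = {i: rng.gauss(post_means[i], post_sds[i])
             for i in range(len(post_means)) if i != champion}
    # Step 3: race the champion against the highest draw.
    challenger = max(draws, key=draws.get)
    return champion, challenger

# Three old, well-measured datapoints plus one new, uncertain one:
means = [0.1, 0.5, 1.2, 0.9]
sds   = [0.1, 0.1, 0.1, 1.5]   # index 3 is new, so its posterior is wide
champ, chall = pick_comparison(means, sds)
```

The wide posterior of the new datapoint means it frequently wins the Thompson draw and gets raced against the champion, which is exactly the exploration behavior wanted: uncertainty about possibly-best datapoints drives which comparison the human is asked for next.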

So a training loop could go like this: begin iteration #11 by generating 1,000 new samples from iteration #10’s G/agent model; score each with the D/critic; insert the 1,000 into the dataset with their estimated score and iteration=10 covariate; do the B-T regression with `Comparison[NA1][NA2] ~ (iteration1 + criticEstimate1) − (iteration2 + criticEstimate2)` (pseudocode) to estimate posterior distributions of estimates for all datapoints (missingness of comparisons doesn’t matter, the model can still be fit); run the racing algorithm, finding that new sample #551 has a critic score of +5SD, giving a posterior estimate exceeding all other datapoints (despite not having been ever compared yet), and that new sample #998 gets picked by posterior sampling; ask the user to compare #551 and #998; record the result; refit the B-T for an updated ranking; retrain the D/critic; retrain the G/agent; begin iteration #12; etc.

We efficiently home in on the best datapoints without necessarily requiring any ‘redundant’ comparisons, while providing informative, stable cardinal rankings for the D/critic based on an ordering of the entire dataset, enabling it to provide more meaningful rewards to the G/agent. To the extent that we engage in ‘redundant’ comparisons, unlike in the preference learning approach, those comparisons must have been necessary.

It’s an adaptive procedure, so it’s hard to say exactly how it would differ from preference learning. Depending on how much the G improves each iteration, and how accurate the D is, and thus how much posterior overlap there is between different batches and different datapoints within each batch, it could look a lot like the current heuristic approach of doing only unique comparisons once within a batch and throwing away (never again comparing with) prior batches, or it could look quite different, and change with each iteration as necessary:

  • If the G improves relatively slowly, so there’s a great deal of overlap between successive batches, and/or the D is only weakly correlated with measured rankings, then the procedure might need to sample a lot of comparisons between old/new batches in order to improve estimates of the progress curve and all datapoints within the new batch, and it might want many comparisons toward the tail of highest-ranked datapoints (which is not a bad thing, because that’s where we should prioritize improvements, since that’s where the G is moving towards, and it’s less important to accurately estimate less-highly-ranked datapoints).
  • If the G or Ds are intermediate, I think the dynamics might look more like sampling mostly pairs within the new batch, mostly unique comparisons, and a few comparisons with old batches to finetune the mean of the new batch.
  • If the D + G progress so rapidly that rankings don’t overlap at all a priori, then few or no comparisons with the old batches are necessary: the D covariate predicted-rankings eliminate most of the posterior uncertainty despite no comparisons being available, and the G progress means that the old datapoints (while still useful for G training in teaching it the full spectrum of datapoints) are unlikely to be anywhere near the best datapoints and so aren’t worth measuring more accurately, so comparisons focus on the most uncertain pairs in the new batch.

Advantages & Disadvantages

This could have a lot of advan­tages:

  1. sim­pli­fied: the D/critic NN is con­cep­tu­ally sim­pli­fied—in­stead of 3-way clas­si­fi­ca­tion on a dou­ble input cor­re­spond­ing to an implicit global rank­ing, it is just a sin­gle input for regres­sion on a qual­ity score

  2. mem­o­ry-­ef­fi­cient: before, a dou­ble input takes up mem­o­ry, even with tied weights, only to yield a sin­gle com­par­ison; in the same space, 2 regres­sion mod­els could be run, each with a dif­fer­ent input + tar­get qual­ity rat­ing. If, to save mem­ory (crit­i­cal with GPT-2), a sin­gle input is used instead, now there must be two sep­a­rate passes for each input, and each pass merely trains one-half of the com­par­i­son.

    This could be par­tic­u­larly use­ful if one tries to use a large Trans­former model like GPT-2-345M where mem­ory con­sump­tion becomes a seri­ous bar­rier to run­ning it at all… (At 345M with the Siamese archi­tec­ture, we’re down to n = 1 mini­batch­es!)

  3. sam­ple-­ef­fi­cient: many com­par­isons will be use­less, or for a given pair, they will quickly cease to be infor­ma­tive; a qual­ity rat­ing is infor­ma­tive regard­less of what might’ve been used as a com­par­ison, pro­vid­ing richer feed­back on each input (anal­o­gous to Alp­haZero switch­ing to a regres­sion tar­get)

    • possibly better ‘off-policy’ learning: related to saturating, a D/critic trained from a corpus (eg initializing a D/critic by taking a dataset of real and GPT-2-generated poems, and labeling all comparisons as victories for the human poems) might destroy G/agent training if it provides only comparison feedback
    • better value function/reward signal for any other approach leveraging the NNs (like over a tree of sequences), too
    • humans or other datasets can supply cardinal ratings directly when those are available
  4. compute-efficient:

    • possibly more D compute-efficiency: by training on comparisons, the D/critic must, implicitly, learn an equivalent of a quality rating in order to provide accurate predictions of a human comparison for all possible pairs, but it does so only indirectly
    • G gets more data & becomes compute-efficient: a richer reward signal for each sample will of course be quite useful for the G, instead of saturating: there is intrinsically not much information in comparisons, and moving from, say, a 99.99% to a 99.999% win rate is not helpful, regardless of whether these scores are log-transformed
  5. interpretable: an absolute cardinal quality variable provides an objective loss for understanding training progress (useful for tasks which don’t have one, like poetry generation!), which is also interpretable and could be useful outside of the task (eg ranking poems for recommendation or data-cleaning)

    • for example, one could get insight into a trained G/agent by generating a number of samples and ranking them
    • one could also test out various rating approaches, like how much conditioning is necessary. Given a dataset of pure comparisons, one cannot experiment with unconditional generation because one doesn’t know what the outcome of any comparison would be; once one extracts the latent variables from a total ranking, though, one knows the distribution of outcomes and can simulate arbitrary comparison datasets.
  6. principled uncertainty: enables active learning via the B-T posterior uncertainty, without any need to extract uncertainty estimates of any kind from the D/critic NN; human ratings can be acquired more efficiently, or datapoints selectively pulled from a large dataset. (Imagine a huge dump of poems from Project Gutenberg or elsewhere, of wildly varying quality: with a regression-style D/critic NN, you can do a single pass over it to select the top k% of poems, use the estimates as pseudo-datapoints, insert them into the B-T model, and ask humans for the most informative comparisons; with a comparison D/critic NN, it is harder to see how to usefully import a large unlabeled corpus.)
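
To make the regression + active-learning loop concrete, here is a minimal self-contained sketch (my own illustration, with hypothetical function names like `fit_bradley_terry`; not the OpenAI codebase): fit a B-T model to simulated pairwise comparisons by gradient ascent on the log-likelihood, recover cardinal quality scores usable as regression targets for a D/critic, and then pick the next most informative comparison as the pair closest to a coin flip.

```python
import itertools, math, random

def fit_bradley_terry(n_items, comparisons, lr=0.01, steps=1000):
    """Fit latent quality scores s by gradient ascent on the Bradley-Terry
    log-likelihood, where P(i beats j) = sigmoid(s[i] - s[j])."""
    s = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            p = 1.0 / (1.0 + math.exp(-(s[winner] - s[loser])))
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        s = [x + lr * g for x, g in zip(s, grad)]
        mean = sum(s) / n_items  # the scale's location is arbitrary: pin the mean at 0
        s = [x - mean for x in s]
    return s  # cardinal scores, usable as regression targets for a D/critic

def most_informative_pair(s):
    """Crude active learning: the next comparison to show a human is the pair
    whose predicted outcome is closest to a coin flip (~1 bit of information)."""
    return min(itertools.combinations(range(len(s)), 2),
               key=lambda ij: abs(s[ij[0]] - s[ij[1]]))

# Simulate ratings: 4 samples with true qualities 0..3, 300 noisy comparisons.
random.seed(0)
true_q = [0.0, 1.0, 2.0, 3.0]
comparisons = []
for _ in range(300):
    i, j = random.sample(range(4), 2)
    p_i_wins = 1.0 / (1.0 + math.exp(-(true_q[i] - true_q[j])))
    comparisons.append((i, j) if random.random() < p_i_wins else (j, i))
scores = fit_bradley_terry(4, comparisons)
```

With enough comparisons the fitted scores recover the true ranking, and `most_informative_pair` will nominate one of the closely-matched pairs rather than a foregone conclusion like best-vs-worst.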

The main downsides I can see:

  • the latent variables are not necessarily 100% stable, as the distribution can drift, yielding anything from ‘rating inflation’ to ‘rating deflation’

    The B-T estimates a distribution arbitrarily defined as 𝒩(0, 1); if the B-T sees only selected datapoints at the beginning, it might be that after the G/agent trains enough, the B-T step would be looking at datapoints which are much better than a mean of 0, so there might be new datapoints all the way out at (what used to be) +100SDs, say. This then leads to the B-T estimate the next cycle shifting the mean/SD to restore the conventional 𝒩(0, 1). So the regression target for the D/critic’s predictions of old datapoints may gradually shift over time, precisely because the richer latent variables don’t saturate the way simple pairwise comparisons would.

    I believe this would be a minor problem, easily solved by retraining the D/critic NN each iteration, which is necessary just to handle novel datapoints anyway; since improvements will be small each iteration, the retraining should easily be able to keep up. If not, one could define particular datapoints as a ‘zero point’, which provides a fixed point of reference for future players, even if they are far better; for example, the anchor could be random outputs.

  • (frequentist) B-T might require more comparisons for a total order: a datapoint has to be compared with other datapoints which themselves have comparisons if it is to be globally ranked at all, while a comparison D/critic can work with two entirely disjoint sets of comparisons which don’t overlap. (This can be avoided by using priors & covariates in a Bayesian B-T model.)
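
The ‘zero point’ anchoring can be sketched in a few lines (a hypothetical helper, assuming the latent scores are refit from scratch each iteration):

```python
def renormalize(scores, anchor_idxs, anchor_value=0.0):
    """Re-anchor drifting latent scores: shift every score so the mean score
    of some fixed reference datapoints (eg random G outputs rated early on)
    always equals anchor_value, giving the D/critic a stable regression
    target across iterations even as overall quality inflates."""
    anchor_mean = sum(scores[i] for i in anchor_idxs) / len(anchor_idxs)
    return [s + (anchor_value - anchor_mean) for s in scores]

# After the G improves, a refit might place everything higher on the scale;
# re-anchoring pins the random-output anchor (index 0 here) back at 0.
drifted = [5.0, 6.0, 7.5, 9.0]
stable = renormalize(drifted, anchor_idxs=[0], anchor_value=0.0)
# stable == [0.0, 1.0, 2.5, 4.0]
```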

All in all, I think this version of preference learning could be simpler, easier to implement, and faster to train. The potentially better sampling is nice, but my guess is that the D providing richer feedback (for both the G and downstream users) is the biggest advantage of this approach: a comparison is a bit, and a bit is worth only a little bit.
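
To put numbers on that last point (a quick back-of-the-envelope in Python):

```python
import math

# Upper bounds on information per human judgment: a pairwise comparison
# resolves at most log2(2) = 1 bit, while a cardinal rating on a k-point
# scale can resolve up to log2(k) bits (if all k levels are used equally).
def max_bits(n_outcomes):
    return math.log2(n_outcomes)

comparison_bits = max_bits(2)         # 1.0 bit per comparison
ten_point_rating_bits = max_bits(10)  # ~3.3 bits per 1-10 rating
```

These are upper bounds (real raters are noisy and don’t use all levels equally), but they show why a cardinal rating can be worth several comparisons.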

  1. I don’t know why they used 124M instead of 117M.↩︎

  2. Such a shortcut is rather against the spirit of DRL, where as much as possible should be learned from data. If such shortcuts are necessary (as I do find with ABC), the domain knowledge ought to be localized in the rating code (where it achieves the stated goals of being easy to implement & easing human raters’ burden), and not the training code.↩︎

  3. KL penalties are commonly used in RL. To try to explain it informally: think of each GPT-2 model, like the original GPT-2 model vs its RL-trained version, as emitting probabilities over the ~50k possible BPE outputs, and graph that as a bar chart (probability vs BPE). For any particular input, the 50k possible BPEs will form a spiky bar chart: the model predicts some BPEs are far more likely than others. The KL distance between those two models, then, is like subtracting the original GPT-2’s bar chart from the new RL GPT-2’s bar chart and adding up the differences at each possible bar (not just the biggest bar). So the KL distance measures the difference of opinion about every possible BPE, from the most to the least likely (including the ones you would never select while sampling). You want there to be some difference between the bar charts (otherwise what’s the point?), but not too big a difference: the original GPT-2 model was usually right already, fairly small KL distances can change the final outputs quite a bit qualitatively, and if the new model is making too many changes, it’s probably gone wrong. So keeping the KL distance small is a good way to maintain high performance while still finetuning based on experience.↩︎

  4. I use pairwise rather than best-of-4 because it simplifies the rating process & maximizes the information per sample read/listened to; if I understand the implementation correctly, however, it also reduces the memory requirements, because the reward-model training unrolls n models to train them all simultaneously on an n-way rating.↩︎

  5. Assuming a labeler could get through 2 ratings per minute (optimistic, given how exhausting I found even an hour), a million ratings would require >8,333 man-hours, or, at a rock-bottom total-cost-per-hour of $10, >$83,333. And that might not be enough.↩︎

  6. !Margin: Weak supervision: implicit quality ratings.↩︎

  7. eg my simple , just re-estimates the entire B-T model each interaction, rather than attempting any caching or incremental updating of a stored model, because it takes a fraction of a second to fit. A fully Bayesian model can be fit via in a few seconds, which is negligible in a DRL context.↩︎