GPT-2 Preference Learning for Music Generation

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.
statistics, NN, fiction, shell, GPT, tutorial, poetry, music
2019-12-16–2020-04-18 finished certainty: likely importance: 7

Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic word-by-word generation, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.

A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO.

OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used char-RNNs & GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.

I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n≅7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n≅200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 44%).

This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the former, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.

Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.

Neural nets for generating text typically treat it as a prediction problem: predict the next word given the previous text, and maximize the probability of a correct prediction of the next word. This can efficiently train large NNs on large text corpuses and can generate surprisingly good text on average, as in my past poetry/music generation projects with char-RNNs or GPT-2—but it generates only average text, like the corpus on average, and has persistent problems with artifacts like repetition or lack of global coherence & themes due to greedy myopic word-by-word generation. Prediction is fundamentally different from control & optimization: a likelihood-trained NN is a passive observer, simply trying to predict.

The inline trick? There are some ways to control the generated text, like the ‘inline trick’, where metadata (such as the author or source) is prepended to the raw text (as in my char-RNN/GPT-2 poetry, where I control the style by inserting the author during training & prompting with a desired author, or, on a broader scale, the use of ‘genre’ metadata), but these approaches seem limited. What if we want to generate the best text, like the best poems or best music? Would the inline trick work if we trained on a corpus of rated text and prompted the NN with ‘5 stars:’…? Probably not—things like ‘goodness’ are too subtle compared to author or genre, even if we had many megabytes of rated text to train on.
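The inline trick itself is trivial at the data-preparation stage: prefix each training text with its metadata, then prompt with the same metadata at sampling time. A minimal sketch (the `| ` delimiter and tiny corpus here are illustrative assumptions, not my exact format):

```python
def format_inline(author: str, text: str) -> str:
    """Prepend metadata so the model learns to condition on it;
    at sampling time, prompting with e.g. 'Keats| ' then steers style.
    (The '| ' delimiter is an illustrative choice.)"""
    return f"{author}| {text}"

corpus = [("Keats", "A thing of beauty is a joy for ever..."),
          ("Whitman", "I celebrate myself, and sing myself...")]
# One training line per (metadata, text) pair:
training_text = "\n".join(format_inline(a, t) for a, t in corpus)
```

The same formatting works for any discrete metadata field; the open question above is whether a coarse ‘quality’ field would be learnable the way author or genre is.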

Why Preference Learning?

“…while a proof of P=NP might hasten a robot uprising, it wouldn’t guarantee one. For again, what P≟NP asks is not whether all creativity can be automated, but only creativity whose fruits can quickly be verified by computer programs. To illustrate, suppose we wanted to program a computer to create new Mozart-quality symphonies and Shakespeare-quality plays. If P=NP via a practical algorithm, then these feats would reduce to the seemingly easier problem of writing a computer program to recognize great works of art. And interestingly, P=NP might also help with the recognition problem: for example, by letting us train a neural network that reverse-engineered the expressed artistic preferences of hundreds of human experts. But how well that neural network would perform is an empirical question outside the scope of mathematics.”

Scott Aaronson, 2017

Likeable, not likely. An alternative is to treat it as a reinforcement learning problem. From a RL perspective, likelihood training is a kind of ‘imitation learning’, where the NN learns to ‘copy’ an expert, and its flaws are as expected of imitation learning: the NN has never seen its own completions, and has no way of recovering from errors, which aren’t represented in the dataset, and the completions it is imitating are of both high and low quality, so it must attempt to imitate the bad as well as the good. Imitating human experts: limited. Unsurprisingly, its output is often bad, and if sampling goes a little haywire, it may then ‘explode’. It is also not surprising if an imitation-learning NN has bizarre blind spots—there is nothing in the process which seeks out blind spots and fixes them, after all, since fixing them doesn’t improve predictions on the fixed corpus.

Reward good text—but what defines ‘good’? A better approach is to enable trial-and-error learning: have an ‘agent’ or ‘generator’ NN try to learn how to generate text which maximizes total long-term reward over the entire sequence, regardless of each individual word’s probability (only the final result matters). But who defines the ‘reward’? You can’t write a simple rule defining good poetry or music, so you ask humans, presumably—but no one is patient enough to rate millions of text snippets, which is how many samples you would need for standard deep reinforcement learning on large complex language NNs. That’s an issue with neural networks: they are good at supervised learning, where the right answer can be defined, but not so good at trial-and-error, where one is told only how good a sequence was, not what the right answer was.

Train a NN to imitate human critics, not experts. In preference learning, we convert our intractable RL problem into a supervised learning problem: we try to learn the utility function or reward function from humans, instead of attempting to conjure up some nonexistent definition of good poetry or expecting a NN to somehow learn what good poetry is from a giant pile of mediocre, good, great, or just plain terrible poetry. There are many examples of reward/preference learning, but the kind of preference learning I am using here was introduced by OpenAI in Christiano et al 2017, where humans looked at short video snippets from simple video games, and picked the ones which looked more correct; the sequences need not be video or images but can be text, and in Ziegler et al 2019, they experiment with training GPT-2 to generate better text using human ratings of factors like ‘descriptiveness’ (which would be hard to write a rule for). A related paper, Peng et al 2020, explores finetuning GPT-2 to not generate ‘offensive’ or norm-violating text, using a NN classifier trained previously (which used, amusingly, along with n = 1000 other labeled samples); Peng et al 2020 did not do pure RL training, but combined the normative classifier as a RL loss with sentence-level likelihood finetuning on a science fiction text corpus, and was able to halve the norm-violation rate.

Bootstrap a NN critic from human criticism. Christiano et al ask: if NNs are good at supervised learning like prediction, and we would need millions of human ratings to get anywhere with the RL approach that might fix our imitation learning problems… why not have a NN learn to predict human ratings, and use that instead? Since NNs are so good at supervised learning, they should be able to learn to predict human ratings relatively easily. (Because it can be hard to rate everything on a scale of 1–10, we can instead ask the humans to make better/worse-than pairwise comparisons, which, if there are enough ratings, allows inferring an underlying latent variable through one of many statistical models like the Bradley-Terry model, and amounts to the same thing.) So, in addition to the NN doing the RL learning, we have a second ‘reward model’ or ‘critic’ NN learn to predict a small set of human forced-choice ratings which choose between two possible text snippets (eg thousands); this NN then rates as many snippets as necessary for the original NN doing reinforcement learning (eg millions). Similar to GANs (agent=Generator, reward model=Discriminator), the reward model distills the human rater’s preferences into a NN which can be used arbitrarily often. Now it’s OK that the RL-based NN is slow and needs millions of trials to learn from its errors, since we can run the reward model as many times as necessary.
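The pairwise comparisons correspond to a Bradley-Terry-style model: the reward model assigns each snippet a scalar, and the probability that one snippet beats the other is the logistic function of the difference. A minimal sketch of the resulting per-comparison loss (plain Python standing in for the actual GPT-2-based reward model; names are mine):

```python
import math

def pairwise_loss(r0: float, r1: float, best: int) -> float:
    """Cross-entropy loss for one forced-choice comparison under a
    Bradley-Terry-style model: P(sample0 wins) = sigmoid(r0 - r1),
    where r0/r1 are the reward model's scalar scores and `best` is
    0 or 1 (which sample the human picked)."""
    p0 = 1.0 / (1.0 + math.exp(r1 - r0))
    return -math.log(p0 if best == 0 else 1.0 - p0)

# Agreeing confidently with the rater is cheap; disagreeing is expensive:
agree    = pairwise_loss(2.0, -2.0, best=0)
disagree = pairwise_loss(2.0, -2.0, best=1)
assert agree < disagree
```

Minimizing this loss over the rating dataset recovers the latent ‘quality’ scalar, which is exactly what the RL agent then maximizes.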

Iterate as necessary. Since the reward model has only an imperfect idea of human preferences, it will make errors and may even be ‘fooled’ by the agent (much as a Generator may defeat a Discriminator permanently), but one can then take the agent’s outputs and get human ratings of them, fixing the problem and improving the reward model, forcing the agent to find a better strategy in the next iteration of the process. This process can repeat as many times as necessary, and all of these steps can run in parallel:

Ziegler et al 2019: “Figure 1: Our training processes for reward model and policy. In the online case, the processes are interleaved.”

For Music or Poetry

Greedy generation. Christiano-style preference learning seems like it could help in generating the kinds of natural language which have proven tricky: those whose quality depends on global properties, for which word-by-word imitation is flawed.

Problem: repetition. Language generation models trained by maximum likelihood have long had serious problems with falling into repetition, and with having any kind of ‘theme’ or ‘message’ or even just basic consistency. The stereotypical neural text sample from a char-RNN or Transformer model is made up of individual sentences which are flawlessly spelled, perfectly grammatical, a little confusingly obtuse, and completely unrelated to 10 sentences previously, reminiscent of schizophrenic ‘word salad’—before degenerating after a few pages into endless repetition of “the the the”.

How can NNs be so good & so bad? The repetition is particularly perplexing, because highly sophisticated char-RNN or Transformer models appear to encode all sorts of semantics and knowledge about the world and achieve difficult tasks, and yet fall prey to a pathology the humblest or -style template algorithm manages to avoid, and which still has no good solution. Unlike supervised seq2seq tasks, more sophisticated decoding search strategies like help only a little in language generation, and can make things much worse by triggering repetition faster. (The recent is a patch, and one can still induce repetition with low top-p settings; appears to be better, but it is unknown if it’s a full solution.) But because it is so simple, a reward model should be able to detect it easily—how hard could it be to penalize using a BPE 20 times in a row?
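The degenerate-repetition case really is trivially detectable by an explicit rule, which is what a reward model ought to learn with little data; a sketch of the sort of check involved (function names are mine):

```python
def max_run_length(tokens) -> int:
    """Length of the longest run of one repeated token (BPE)."""
    longest = run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def repetition_penalty(tokens, threshold: int = 20) -> float:
    """Reward of -1 if any BPE repeats `threshold`+ times in a row, else 0."""
    return -1.0 if max_run_length(tokens) >= threshold else 0.0

assert repetition_penalty([1, 2, 3] * 10) == 0.0   # varied text: unpenalized
assert repetition_penalty([5] * 25) == -1.0        # "the the the...": penalized
```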

Global coherency and themes are harder, but they are still something one expects a reward model to be able to pick up on eventually, noticing when a sample has wandered off course in an illogical way: even if each individual word is a reasonably likely next word, the ending will be highly unlikely given the beginning, and a model looking at the big picture can detect that inconsistency.


The Ziegler et al 2019 codebase, models, and datasets (but not the rating tools) were released by OpenAI in September 2019 for public use, and I began working on adapting it to poetry & music.


The OA source code can be downloaded from Github as usual:

git clone '' && cd ./lm-human-preferences/

The necessary Python 3 packages are listed in the Pipfile.

One unusual requirement is gsutil: the OA datasets are stored on a Google Cloud Storage bucket. This bucket is public, but must be accessed through special credentials—if you are getting invalid_grant: Bad Request errors, you are running into this issue, and you need to get special credentials, perhaps via gcloud auth login.

At least in theory, if you have your Google credential ducks in a row and correctly pip installed all the dependencies, you should be able to run the combined-run example from their README:

experiment=descriptiveness
experiment_name=testdesc-$(date +%y%m%d%H%M)
./ train_policy $experiment $experiment_name

This will automatically download GPT-2-124M & the ‘descriptiveness’ dataset as defined in Ziegler et al 2019 (which uses snippets from BookCorpus), to train a reward model using 8 GPUs (OA used 8×V100) on best-of-4 comparisons of book passage completions, based on how physically descriptive or evocative of the scene they are, for 1 epoch; and then attempt to train a PPO policy for ~2k steps/iterations to optimize fiction generation for descriptiveness.


The OA codebase user ought to be aware of a few things before running on a generic new dataset:

  1. GCP permissions: as discussed above, the OA datasets may not download unless one has the correct gsutil credentials generated

  2. Python/Parallelism Issues: I ran into 2 errors, both of which ultimately terminated at a call to mpiexec.

    • Python 2 vs 3: The first was a Python interpreter version error, where it was somehow calling a Python 2 interpreter even though my virtualenv was set to Python 3, and running explicitly with python3 didn’t help, so I patched the source code as follows to force Python 3:

      diff --git a/lm_human_preferences/utils/ b/lm_human_preferences/utils/
      index 30f3440..62f4fe3 100644
      --- a/lm_human_preferences/utils/
      +++ b/lm_human_preferences/utils/
      @@ -11,7 +11,7 @@ def launch(name, f, *, namespace='safety', mode='local', mpi=1) -> None:
               with open('/tmp/pickle_fn', 'wb') as file:
                   cloudpickle.dump(f, file)
      -        subprocess.check_call(['mpiexec', '-n', str(mpi), 'python', '-c', 'import sys; import pickle; \
          pickle.loads(open("/tmp/pickle_fn", "rb").read())()'])
      +        subprocess.check_call(['mpiexec', '-n', str(mpi), 'python3', '-c', 'import sys; import pickle; \
          pickle.loads(open("/tmp/pickle_fn", "rb").read())()'])
           raise Exception('Other modes unimplemented!')
    • Failing on 1 GPU: The README claims that running on 1 GPU should be possible, but when I tried running on 1 GPU (so I could keep finetuning GPT-2 on my other GPU), mpiexec always failed.

      I suspect that the mpiexec call may need to be removed entirely. I avoided it by always running on both GPUs, and doing finetuning in the gaps between iterations, when I was busy with ratings or other things.

  3. Disabling GCP: the OpenAI GCP bucket is hardwired, and aside from that, it’d be a pain to set up a GCP bucket, set its permissions, and work with it when training locally rather than on a cloud GPU instance.

    The loading can be fixed by editing the source to specify local file paths. To disable saving to GCP, I did another edit:

    diff --git a/lm_human_preferences/language/ b/lm_human_preferences/language/
    index f149a0c..99827fa 100644
    --- a/lm_human_preferences/language/
    +++ b/lm_human_preferences/language/
    @@ -10,7 +10,7 @@ class TrainedModel():
         def __init__(self, name, *, savedir=None, scope=None):
             self.name = name
             self.scope = scope
    -        self.savedir = savedir if savedir else os.path.join('gs://gpt-2/models/', name)
    +        self.savedir = savedir if savedir else name
             if name == 'test':
                 self.encoding = encodings.Test
  4. encoding of BPEs: once you can load a local dataset, you need to create said dataset, of course. Unfortunately, the codebase doesn’t make life easy for you, as the dataset must follow strict length limits & already be BPE-encoded. Rating is complicated enough to require a separate section.

  5. Hardwired English Heuristic Loss:

    The single most frustrating bug I encountered in this code is due to a ‘clever’ hand-engineered feature added to try to filter out bad samples early. The code by default looks for a period (.) within the first n BPEs, and if there is not one, the sample is automatically penalized −1!

    I didn’t notice this in the poetry runs, but when I switched over to music, it became a huge problem with ABC samples—as it turns out, ABC does not require use of periods, and most ABC music samples will have no periods. So every single sample was automatically rated −1, rendering training impossible. This turns out to be mentioned briefly in the paper, but I had completely overlooked the implications until I reread it trying to understand how the ABC (but not poetry) reward model could be so badly mistaken:

    To make the labeling task more natural, we select excerpts that start and end with a period. When sampling continuations that will be presented to humans, we use rejection sampling to ensure there is a period between tokens 16 and 24 and then truncate at that period. [This is a crude approximation for “end of sentence.” We chose it because it is easy to integrate into the RL loop, and even a crude approximation is sufficient for the intended purpose of making the human evaluation task somewhat easier.] During the RL finetuning, we penalize continuations that don’t have such a period by giving them a fixed reward of −1.

    This ‘feature’ is specified in at the beginning, but it’s unclear how to disable it entirely. To guarantee that it cannot interfere, I patched it out:

    diff --git a/lm_human_preferences/ b/lm_human_preferences/
    @@ -398,7 +399,7 @@ def make_score_fn(hparams, score_model):
         def score_fn(queries, responses):
             responses = postprocess(responses)
    -        score = penalize(responses, unpenalized_score_fn(queries, responses))
    +        score = unpenalized_score_fn(queries, responses)
             return score, responses, dict(score=score)
         score_fn.stat_schemas = dict(score=Schema(tf.float32, (None,)))
         return score_fn
  6. symlink model directory: since I was retraining the baseline models as I went, particularly to fix the , it’s convenient to symlink over to the regular GPT-2 repo’s model directory, instead of dealing with copying over fresh checkpoints. (Saves disk space too.) Something like ln -s ../gpt-2/models/irish-nospaces3/ 117M-irish works.

  7. config changes: all data-related parameters are hardwired, and must be manually set.

    The length of prefixes/queries/conditioning and the length of all samples must be exactly right; further, the size of the dataset (the n of ratings) must be manually specified, and even further, the specified n must be an exact multiple of the reward model’s minibatch size (it can, however, be lower than the actual n inside the dataset, so one doesn’t need to delete ratings if one has rated a few more than an exact multiple).

    So for example, if one is training the reward model with a minibatch of n = 8 and one has n = 11,203 total ratings, that is not an exact multiple of 8 (11,203⁄8 = 1400.375), and one would instead specify n = 11,200 (which is both lower & an exact multiple: 11,200⁄8 = 1,400).

  8. Zombie processes:

    Make sure GPU is GCed.
    OOM crashes are not uncommon during reward model training, puzzlingly, and one will typically kill a diverged process with Control-C; however, these may leave zombie processes tying up GPU VRAM! Particularly if you are tinkering with settings like length or minibatch, this is a great danger—you may make a change, get an OOM crash (which leaves zombies), and any subsequent change you make will look like a failure. This caused me great trouble at least twice, as I began trying to debug which (harmless) config change now triggered instant OOMs.

    To avoid this, I suggest getting in the habit of always running nvidia-smi after a training run so you can check that it has not left any orphans (and if so, you can put them out of their misery).
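The exact-multiple requirement for labels.num_train from item #7 is easy to fumble by hand; a quick sketch of computing the largest usable value (the function name is mine):

```python
def usable_num_train(n_ratings: int, batch_size: int) -> int:
    """Largest exact multiple of the reward model's minibatch size that is
    <= the actual number of ratings; the excess ratings are simply ignored
    by training rather than deleted from the dataset."""
    return (n_ratings // batch_size) * batch_size

assert usable_num_train(11_203, 8) == 11_200   # 11203/8 = 1400.375, so round down
assert usable_num_train(11_200, 8) == 11_200   # exact multiples pass through
```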

ABC Music configuration

Config by source editing. All of the hyperparameters & dataset metadata is defined in; there are no relevant CLI options. It is structured in two parts, for the reward model and then the agent; the configuration is a cascade of increasingly-specialized objects. So for the reward model for the descriptiveness experiment, the books_task object is specialized by _books_task, which is further specialized by descriptiveness; and likewise for the agent/PPO training.

Hijacking existing config. For my ABC music, instead of defining a new cascade, I simply hijacked the descriptiveness-related variables. I begin with the reward model in books_task, by cutting the conditioning down to the minimum which causes the code to not crash, 2, and expanding the response length considerably to cover entire ABC music pieces, and I change the base model name to the ABC GPT-2 I trained normally:

 books_task = combos(
-    bind('query_length', 64),
+    bind('query_length', 2), # must be a minimum of 2 (but why?)
     bind('query_dataset', 'books'),
-    bind('response_length', 24),
-    bind('start_text', '.'), # Start the context at the beginning of a sentence
+    bind('response_length', 256),
+    bind('start_text', ''), # no conditioning aside from 'X:' in
     bind('end_text', '.'), # End the context at the end of a sentence.
     bind('truncate_token', 13), # Encoding of '.' -- end completions at the end of a sentence.
     bind('truncate_after', 16), # Make sure completions are at least 16 tokens long.

-    bind('policy.temperature', 0.7),
-    bind('policy.initial_model', '124M'),
+    bind('policy.temperature', 1.0),
+    bind('policy.initial_model', '117M-irish'),

The training code needs to be modified for the rating data type (pairwise) and for my limited compute resources (2×1080ti instead of OA’s 8×V100)—I have to cut down minibatch size & rollout batch size:

 def get_train_reward_experiments():
     _shared = combos(
-        bind('labels.type', 'best_of_4'),
+        bind('labels.type', 'best_of_2'),
         bind('normalize_after', True),
         bind('normalize_before', True),
         bind('normalize_samples', 256),
@@ -58,9 +58,9 @@ def get_train_reward_experiments():
     _books_task = combos(
         bind_nested('task', books_task),
-        bind('batch_size', 32),
-        bind('rollout_batch_size', 512),
+        bind('batch_size', 10),
+        bind('rollout_batch_size', 226),

Finally, I specify my local dataset & manually specify its corpus size as a multiple of the minibatch size (this must be updated every time I add ratings or they won’t be trained on):

@@ -75,8 +75,8 @@ def get_train_reward_experiments():
     descriptiveness = combos(

-        bind('labels.source', 'gs://lm-human-preferences/labels/descriptiveness/offline_5k.json'),
-        bind('labels.num_train', 4_992),
+        bind('labels.source', 'irish.json'),
+        bind('labels.num_train', 16900),
         bind('run.seed', 1)

The agent model is easier to configure, because I need only adjust for compute:

 def get_experiments():
     train_reward_experiments = get_train_reward_experiments()

     _books_task = combos(
         bind_nested('task', books_task),

-        bind('', 1e-5),
-        bind('ppo.total_episodes', 1_000_000),
-        bind('ppo.batch_size', 512),
+        bind('', 1e-6), # original: 5e-5
+        bind('ppo.total_episodes', 1_000_000),
+        # original: 1_000_000; note, this is *episodes*, not *steps*; each step consists of _n_ episodes
+        bind('ppo.batch_size', 18), # original: 512

I also change the KL coefficient, as it appears to far too harshly punish divergence from the baseline model for ABC music and effectively disables exploration:

@@ -139,9 +138,9 @@ def get_experiments():

     descriptiveness = combos(
-        bind('rewards.kl_coef', 0.15),
+        bind('rewards.kl_coef', 0.02),
         bind('rewards.adaptive_kl', 'on'),
-        bind('', 6.0),
+        bind('', 25.0),
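The rewards.kl_coef being lowered here enters Ziegler et al 2019’s objective as a penalty on divergence from the likelihood-trained baseline: the agent maximizes the reward model’s score minus kl_coef times the (approximate, per-token) KL divergence between the tuned policy and the original model. A minimal sketch of that reward shaping in plain Python (function name and toy numbers are mine):

```python
def shaped_reward(score, logp_policy, logp_baseline, kl_coef=0.02):
    """Reward-model score minus the KL penalty, as in Ziegler et al 2019:
    R = r(x) - kl_coef * sum_t (log pi(a_t|s_t) - log rho(a_t|s_t)).
    Too large a kl_coef punishes any divergence from the baseline model
    so harshly that exploration is effectively disabled."""
    kl = sum(p - b for p, b in zip(logp_policy, logp_baseline))
    return score - kl_coef * kl

# Same drifted policy, two coefficients: the larger one eats more of the score.
logp_pi, logp_rho = [-1.0, -1.2], [-2.0, -2.5]
assert shaped_reward(1.0, logp_pi, logp_rho, kl_coef=0.02) > \
       shaped_reward(1.0, logp_pi, logp_rho, kl_coef=0.15)
```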

For ABC music specifically, I made some further changes to the rest of the code:

  • conditioning on X: for generating ABC music during training: all ABC music samples begin with an X: field, as an ID. I figured that if I had to condition on at least 2 BPEs, I might as well specify the X: and make it more likely that samples will be valid:

    diff --git a/lm_human_preferences/ b/lm_human_preferences/
    index db02c98..b349717 100644
    --- a/lm_human_preferences/
    +++ b/lm_human_preferences/
    @@ -282,6 +282,7 @@ class PPOTrainer():
             step_started_at = time.time()
             queries = self.sample_queries()
    +        queries = np.tile([55,25], (queries.shape[0],1)) # Irish ABC prefix: 'X:' (ie for the initial numeric ID)
             rollouts = self.policy.respond(queries, length=self.hparams.task.response_length)
             responses = rollouts['responses']
  • in regular generation of samples from a trained agent/policy model, the default settings are a temperature of 1 & top-k = 40; the latter is fine but the former is too high, and I lower it to 0.8. (The code claims to support nucleus sampling, with a top_p argument, but when I changed that, it simply broke.) The diff:

    diff --git a/lm_human_preferences/language/ b/lm_human_preferences/language/
    index 96e56e9..76e56a3 100644
    --- a/lm_human_preferences/language/
    +++ b/lm_human_preferences/language/
    @@ -5,7 +5,7 @@ from lm_human_preferences.utils import core as utils
     def sample_sequence(*, step, model_hparams, length, batch_size=None, context=None,
    -                    temperature=1, top_k=0, top_p=1.0, extra_outputs={}, cond=None):
    +                    temperature=0.8, top_k=40, top_p=1.0, extra_outputs={}, cond=None):
         Sampling from an autoregressive sequence model.

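For reference, temperature & top-k sampling as set in this diff work like so: logits are divided by the temperature, all but the k largest are discarded, and a token is drawn from the renormalized remainder. A plain-Python sketch (an illustrative reimplementation, not the codebase’s TensorFlow version):

```python
import math, random

def sample_top_k(logits, temperature=0.8, top_k=40, rng=random):
    """Draw one token id: rescale logits by temperature, zero out all but
    the top_k largest, renormalize, and sample from what remains."""
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[:top_k][-1]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(l - m) if l >= cutoff else 0.0 for l in scaled]
    r = rng.random() * sum(probs)
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point underrun

# top_k=1 degenerates to greedy argmax, regardless of temperature:
assert sample_top_k([0.1, 3.0, -1.0], top_k=1) == 1
```

Lowering the temperature below 1 sharpens the distribution toward the model’s favorite tokens, which is why 0.8 reduces haywire samples at some cost in diversity.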
My full diff/patch for running ABC music training is available to look at in case there is any ambiguity.


./ train_policy descriptiveness irish-combined-20191222.17 --mpi 2 ; sleep 4s; nvidia-smi

Remember to check the nvidia-smi output after a crash or interrupt to make sure your GPU VRAM has been released & zombie processes aren’t eating it.


The OA codebase comes with no built-in support for doing ratings; they used an external labeling service which exposes an API, and presumably felt there was not much point in providing the glue code. So, I rolled my own.

Data Formatting

The JSON schema. The input data format is JSON: it is an array of hashmap objects with, in the simplest case of best-of-two/pairwise ratings, 4 fields (the first 3 of which are not strings but integer arrays, where each integer is assumed to be a BPE): the conditioning text query, the first candidate sample0, the second candidate sample1, and the rating best, which is a single integer where 0 = first sample won / 1 = second sample won, and so on. How does this handle ties? Ties don’t seem to be expressed as the integer 3 as one would guess. For ties, I simply encode a tie as two ratings with each sample winning once, which should be roughly equivalent.
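A sketch of emitting records in this schema, including the tie-handling convention just described (the helper function is mine, not part of the codebase):

```python
import json

def make_records(query, sample0, sample1, outcome):
    """Build rating records in the expected JSON schema; `outcome` is
    0 (first sample won), 1 (second won), or 'tie', which is encoded
    as two records with each sample winning once."""
    base = {"query": query, "sample0": sample0, "sample1": sample1}
    if outcome == "tie":
        return [dict(base, best=0), dict(base, best=1)]
    return [dict(base, best=outcome)]

# The dataset is an array of such hashmap objects:
records = make_records([0, 0], [27, 91], [27, 92], "tie")
assert [r["best"] for r in records] == [0, 1]
dataset = json.dumps(records)
```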

Hardwired n

The integer array lengths must be the length defined in the config, and so if a sample is too long or short, it must be truncated or padded to fit.
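Truncating/padding to the hardwired length is mechanical; a sketch (the helper is mine; 27 is the arbitrary padding BPE used in my dataset, as seen in the example record further below):

```python
def fit_length(tokens, length, pad=27):
    """Truncate or right-pad a BPE integer array to exactly `length`;
    27 is the (arbitrarily chosen) padding BPE used in my dataset."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))

assert fit_length([1, 2], 5) == [1, 2, 27, 27, 27]        # padded
assert len(fit_length(list(range(300)), 256)) == 256      # truncated
```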

Record dubious samples. The JSON parser code here appears to not be strict, so you can append additional fields if you want. Because of issues with adversarial samples or ABC samples being syntactically invalid & not compiling to MIDI, and my concerns about what inclusion of them might do to training dynamics (perhaps they should be excluded entirely?), I add a field, broken, to allow filtering blacklisted samples out and distinguishing their ratings from my hand-ratings.

So, a full and complete example of a valid JSON ratings dataset with n = 1 pairwise rating for ABC music would look like this:

  {"query": [0,0],
  "sample0": [   27,   91,  437, 1659, 5239,   91,   29,   55,   25,23349,  198,   51,
      25,14874, 1252,25308,22495,  198,   44,   25,   17,   14,   19,  198,
      43,   25,   16,   14,   23,  198,   42,   25,   34,  198,   38,  535,
      33,   91,   32,   17,   32,   17,   91,   38, 4339,   33,   91,   66,
      17,   89,   17,   91,   38,  535,   33,   91,   32,   17,   32,   17,
      91,   38, 4339,   33,   91,   66,   17,   32,   17,   91,  198,   38,
    4339,   33,   91,   66,   17,   32,   17,   91,   38, 4339,   33,   91,
      66,   17,   32,   17,   91,   38, 4339,   33,   91,   66,   17,   32,
      17,   91,   38, 4339,   33,   91,   66,   17,   32,   17,15886,   59,
      77,   27,   91,  437, 1659, 5239,   91,   29,  198,   27,   91,  437,
    1659, 5239,   91,   29,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27],
  "sample1": [   27,   91,  437, 1659, 5239,   91,   29,   55,   25,14208, 2816,  198,
      51,   25,   47,13218,34831,  338,  198,   44,   25,   19,   14,   19,
     198,   43,   25,   16,   14,   23,  198,   42,   25,   35,   76, 1228,
     198,   91,   25,   33,  198,   91,   93,   32,   17,   67,   32,   61,
      38, 6242,   67,   91,    7,   18,   32, 4339,   37, 2782,   33,   66,
      32,   91,12473,   93,   36,   17,   38,   18,   32,   91,   33,   67,
      70, 5036,   17,36077,   91,  198,   93,   32,   17,   67,   32,   61,
      38, 6242,   67,   91,    7,   18,   32, 4339,   37, 2782,   33,   66,
      32,   91,   33,   67,   93,   67,   17,  276,   33,   67,   91, 8579,
      38, 1961,   18,   25,   91,  198,   91,   25,   32,   91,   67,   17,
      69, 9395,16344,   91, 2782,   69, 2934,   17,36077,   91,   93,   67,
      17,   69, 9395,16344,   91,   70,   69,  891,   67,   17,36077,   91,
     198,   93,   32,   17,   69, 9395,16344,   91, 2782,   69, 2934,   17,
   36077,   91,   93,   70,   18,19082,   33,   67,   91, 8579,   38, 1961,
      18,   25,   91,   59,   77,   27,   91,  437, 1659, 5239,   91,   29,
     198,   27,   91,  437, 1659, 5239,   91,   29,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,   27,
      27,   27,   27,   27],
  "best": 1,
  "broken": 1}

Dissecting JSON example. In this ABC music example, sample0 is invalid for some reason: either abc2midi couldn't compile it to MIDI, Timidity couldn't compile the MIDI to WAV, or it contained a string that violated the manual blacklist compiled from diverged adversarial samples. So, sample0 was automatically marked as the loser and sample1 won. Both samples have been padded out with the BPE 27. Checking the BPE encoding (which can be done conveniently with jq . encoder.json), BPE 0 is !, and the BPE encoding has odd handling of spaces, so it's unclear how to pad with spaces; 27 was chosen arbitrarily.
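The padding and record format above can be sketched in a few lines of Python (an illustrative sketch, not the codebase's own code; the helper names are mine):

```python
import json

PAD = 27            # padding BPE token; chosen arbitrarily, as discussed above
TARGET_LENGTH = 256 # fixed sample length expected by the reward-model trainer

def pad_tokens(tokens, target_length=TARGET_LENGTH, pad=PAD):
    """Right-pad (or truncate) a BPE token list to a fixed length."""
    return (list(tokens) + [pad] * target_length)[:target_length]

def make_rating(sample0, sample1, best, broken=False):
    """Build one pairwise-rating record in the JSON format shown above."""
    record = {"query": [0, 0],
              "sample0": pad_tokens(sample0),
              "sample1": pad_tokens(sample1),
              "best": best}
    if broken:  # extra non-standard field, tolerated by the lax JSON parser
        record["broken"] = 1
    return record

print(json.dumps(make_rating([27, 91, 437], [27, 91, 437, 1659], best=1, broken=True)))
```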

For scripting purposes, we'd like a CLI filter which takes text and prints out the BPE encoding. I hacked up the encoder script from the nshepperd codebase to make one which reads from stdin, converts, and pads to our target length of 256:

#!/usr/bin/env python3

import argparse
import numpy as np
import sys
import encoder
from load_dataset import load_dataset

parser = argparse.ArgumentParser(
    description='Pre-encode text files into tokenized training set.')
parser.add_argument('--model_name', metavar='MODEL', type=str, default='117M', help='Pretrained model name')
parser.add_argument('--combine', metavar='CHARS', type=int, default=50000, help='Concatenate files with <|endoftext|> separator into chunks of this minimum size')
parser.add_argument('in_text', metavar='PATH', type=str, help='Input file, directory, or glob pattern (utf-8 text).')

target_length = 256

def main():
    args = parser.parse_args()
    enc = encoder.get_encoder(args.model_name)
    chunks = load_dataset(enc, args.in_text, args.combine)
    with np.printoptions(threshold=sys.maxsize):
        result = chunks[0][0:target_length] # np.zeros(24)
        if len(result) != target_length:
            padding = [27] * target_length
            result = np.concatenate((result, padding))
            result = result[0:target_length]
        print(np.array2string(result, separator=','))

if __name__ == '__main__':
    main()
Interactive rating

For both poetry and ABC music, there is no need for a GUI or web interface. A Bash script suffices.

Parallelize & pre-compile the ABC. For a rating script, we want to minimize latency, and avoid doing any processing in the main thread, so all rated files are precompiled to WAVs before rating begins. All stages of the generated files are left in /tmp/ for easier debugging or picking out good pieces. The original text version of the ratings is saved to an additional text file for reference, since BPEs are hard to read.

To avoid repeatedly evaluating the same piece, which would happen occasionally with random samples, I shuffle the samples, store them in an array, and proceed in a pairwise fashion, evaluating #1 vs #(n⁄2+1) (where n is the list length) and so on, so no comparison overlaps or duplicates another.
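This pairing scheme can be sketched in Python (illustrative only; the helper name is mine):

```python
import random

def pair_up(samples, seed=None):
    """Shuffle, then compare item i with item i + n/2: every sample appears
    in exactly one comparison, so no comparison overlaps or duplicates another."""
    xs = list(samples)
    random.Random(seed).shuffle(xs)
    half = len(xs) // 2
    return [(xs[i], xs[i + half]) for i in range(half)]

pairs = pair_up(range(100), seed=0)
```

With 100 shuffled samples this yields 50 disjoint comparisons, each item used once and only once.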

Auto-fail broken samples. While rating, each music piece is automatically checked (auto-rated) for validity as a WAV (no WAV with a meaningful filesize = failed sample), and its ABC against a hand-written blacklist of adversarial examples. (I considered including 'compressibility' as a criterion—pipe samples into gzip and fail samples which compress too much—since I noticed bad samples were typically highly repetitive aside from their metadata block, but didn't get around to it.) If both pieces in a comparison fail one of the checks, they are judged to be a tie (which is implemented as a pair of ratings). This saves an enormous amount of time & effort when extracting ratings from throughout a run, as many will be either broken or adversarial.
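The unimplemented 'compressibility' criterion could have looked something like this (a sketch of the idea only; the 0.2 threshold is a guess and would need tuning on real samples):

```python
import gzip

def too_compressible(abc_text, threshold=0.2):
    """Degenerate, highly-repetitive ABC compresses to a tiny fraction of its
    original size, so a low gzip ratio is a cheap badness heuristic.
    threshold=0.2 is a hypothetical cutoff, not a tuned value."""
    raw = abc_text.encode('utf-8')
    if not raw:
        return True
    return len(gzip.compress(raw)) / len(raw) < threshold
```

A degenerate sample like `"|G2G2G2G2|" * 200` compresses to a few percent of its size and would be auto-failed, while a normal tune with a metadata block and varied bars stays well above the cutoff.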

I print out the ABC pieces as well as play them—I find it helpful to see them while listening. The ABC pieces are played at slightly higher speed than normal, for ~10s each. (Because the generation is autoregressive, a piece which starts off badly probably isn't going to wind up being stellar, so there's no point in rating 3–4× fewer pieces by insisting on listening to entire pieces before rating. It's more important to get through a lot of ratings than to make each rating perfect.)

My rating script requires parallel, abc2midi, timidity, & mpv to be installed. The script is less than elegant but (mostly) works:

set +o posix

# $ bash 1000 irish-samples-1.txt irish-1

N="$1" # currently needs to be a multiple of 8
CORPUS="$2" # "/home/gwern/wiki/docs/ai/2019-03-06-gpt2-poetry-1000samples.txt"
JSON="$3" # prefix for the output JSON files (eg 'irish-1')

encode() {
    TMP_FILE=$(mktemp /tmp/XXXXXX.txt)
    echo "$@" >> $TMP_FILE
    ENCODED=$(PYTHONPATH=src python3 --model_name 117M  $TMP_FILE)
    echo "$ENCODED"; rm "$TMP_FILE"; }
export -f encode

generateJson() {
    echo "{\"query\": [0,0], \"sample0\": $2, \"sample1\": $3, \"best\": $1}," >> $JSON-encoded.json;
    ## Store a backup copy of the plain text for easier consultation
    echo "{\"query\": [0,0], \"sample0\": $4, \"sample1\": $5, \"best\": $1}," >> $JSON-text.json; }
generateJsonBroken() {
    echo "{\"query\": [0,0], \"sample0\": $2, \"sample1\": $3, \"best\": $1, \"broken\": 1}," >> $JSON-encoded.json;
    echo "{\"query\": [0,0], \"sample0\": $4, \"sample1\": $5, \"best\": $1, \"broken\": 1}," >> $JSON-text.json; }

rm -rf /tmp/music-samples/; mkdir /tmp/music-samples/
cat "$CORPUS" | sed -e 's/===.*/<|endoftext|>/g' -e 's/⏎/\n/g' | \
    csplit --quiet --elide-empty-files --suppress-matched --prefix /tmp/music-samples/sample- - '/<|endoftext|>/' '{*}'

# Pre-compute all versions for speed; this also helps debugging since all stages can be inspected on disk in /tmp/music-samples/
generateEncoded() {
    POEM="$1"
    echo "Starting encoding: $POEM"
    FIRST=$(cat $POEM)
    encode "<|endoftext|>$FIRST\n<|endoftext|>" >> $POEM.encoded
    abc2midi "$POEM" -o $POEM.midi -Q 130
    timidity -A125 -G5-20 $POEM.midi -Ow -o $POEM.wav; }
export -f generateEncoded
ls /tmp/music-samples/sample-* | shuf | head -$N | parallel generateEncoded

filterMusic () {
    fgrep -i -e "2D2|G2D2G2D2|G2" -e "=e'|=e'=a'=a'=a'=g'=e'|=e'" -e "a' a' a' a'" -e "a=g|=f=g=a=c'=a=g|=f" \
     -e "|=c'=d'=c'=c'=a=g|=c'=d'=c'=c'=a=g|" -e "|=c=e=g=c'=g=e|=c=e=g=c'=g=e|" -e "|=d=c'=a=g=d=e|=d=c'=a=g=d=e|=d" \
     -e '(3B)(3B)(3B)(3B)(3B)' -e ',2B,2B,2|B,2B,2B,2|' -e ',C,C,C,C,C,C,C,C,C,C,C,C,C,' -e ',G,|G,A,B,G,A,B,G,|' \
     -e ',|G,2A,2G,2G,A,|G,2A,2G,2A' -e '-ghhathan-ghhathan-ghhathan' -e '////////////' \
     -e '/2B/2B/2B/2B/2B/2B/2B' -e '/g/g/g/g/g/g/g/g/g' -e '222222222' -e '2A2A2A2A2G2A2A2A2G2A2A2A2' \
     -e '2A2G2A2G2A2A2' -e '2D2D2D2D2D2D2D2D2' -e '2F2A2G2A2A2G2A2A2' -e '2G,|G,2G,A,2G,A,2G,|C2G' \
     -e '2G2E2|C2G2A2G2E2|' -e '2G2G2G2G2G2G2G2G2' -e '2c/2c/2c/2c/2c/2c/' -e '2d/2d/2d/2d/2d/2d/2d/2d/' \
     -e '2g/2g/2g/2g/2g/2g/2g/2g/' -e '2|G2G2G2G2|' -e '4g/4a/4g/4a/4g/4a/4g' -e '=A|=c=A=A2=c=A=G=A|=c=A=A2=c=A=G=A|' \
     -e '=D2=D2|=D2=D2=D2|' -e '=E=F=G=F=F=F=F=F=E=F' -e '=G,|=G,=A,=A,=A,=G,=G,|=G,' -e '=G2|=G2=G2=G2=G2=G2|=G2' \
     -e '=G2|=G2=G2=G2|=G2' -e '=G=G=G=G=G=G=G=G=G=G' -e '=G|=A=c=A=A2=G|=A=c=A=A2=G|=A' -e '=G|=G=G=G=G=G=G=G|=G' \
     -e '=G|=G=G=G=G=G=G|' -e '=a=a=a=a=a=a=a' -e '=b=d=b=d=b=d=b=d=b=d' -e '=c|=d=c=A=c=A=c|=d=c=A=c=d=d|' \
     -e '=e=d=g=d=e=d=g=d|=e=d=g' -e '=g|=a=f=a=g=e=g|=a' -e '=g|=d=f=f=f=f=g|=d=f=f=f=f=g|=' -e 'A2G2A2G2A2G2A2G2A2A2G2A2G2A' \
     -e 'A2|=A2G2A2|=A' -e 'AcAcAcAcAcAcAcA' -e 'B,B,B,B,B,B,B' -e 'B/B/B/B/B/B/B/B' -e 'B=G=A|=B=c=d=B=c=B=G=A|=B=c=d' \
     -e 'BcB|BcBcBcB|BcB' -e 'CA,CA,CA,CA,CA,CA,CA,CA,CA,' -e 'D2|=D2=D2=C2|=C2=D2=D2|' -e 'DADDADDADDA' \
     -e 'EGGAGEDC|EGGAGACD|E' -e 'G,G,G,G,G,G,G' -e 'G,G,G,G,G,G,G,G' -e 'G,G,G,G,G,G,G,G,G,G,' \
     -e 'G,|G,2G,G,|G,2G,G,|G' -e 'G,|G,G,G,|G,G,G,|' -e 'G/G/G/G/G/G/G/G/' -e 'G2|G2G2G2|G2' \
     -e 'G=A=c=G=A=c=G=A=c=G=A=c=G=A' -e 'G|=G=A=G=G2=G|=G=A=G=G2=G|' -e 'G|=G=G=G=G=G=G=G=G|' \
     -e '\n\n\n' -e '\n|\n|\n|\n|\n|' -e '^A|^A^A^A^A^A^A^A^A|^' -e '^D|^D^D^D^D^D^D^D|^' \
     -e '^f=f=f^f=f^f=f^d=f^f=f^' -e '^g|^g=f=f^d^d^c^c^g|^g=f' -e 'a a a a' -e 'a=a|=a=a=a=a=a=a|=' \
     -e 'aaeaaeaaeaaeaaeaaea' -e 'abbabbaba' -e 'b b b b' -e 'b=a=g|=b=a=g=b=a=g|=b=a' -e 'c/2c/2c/2c/2c/2c/2c/' \
     -e 'c/c/c/c/c/c/c/c/c/c/c' -e 'cccccccccccccccccc' -e 'e/e/e/e/e/e/e/e/e/e' -e 'f=a=a|=c=e=e=f=a=a|=c=e' \
     -e 'f=e|=f=g=f=e=g=f=e=g=f' -e 'fBfBfBfBfBfBfBfBfBfB' -e 'f^d|^c^d^f^g^f^d|^c' -e 'g g g g g' \
     -e 'g=e=g|=a=e=e=a=a=g=e=g|=a=e=' -e 'g=g^g|^g^g=g^g=g^g=g^g|' -e 'g=g|=a=g=f=e=g=g|=d' \
     -e 'g=g|=d=g=g=d=g=g|' -e 'g|=d=g=g=b=g=g|=d=g=g=b=g=g|=d' -e '|(3DDDD2(3DDDD2|(3DDDD2(3DDDD2|' -e '|(G,G,G,G,G' \
     -e '|=A=F=A=F=A=F=A=F|=A=F=A' -e '|=A=G=A=C2=G|=A=G=A=C2=G|=A=G=A=C2=G|' -e '|=E=G=G=E=F=A=A=F|=E=G=G=E=F=A=A=F|' \
     -e '|=E=G=G=E=G=A=G=F|=E=G=G=E=G=A=G=F|' -e '|=G,2=G,2=G,2|=G,2' -e '|=G=A=G=c=G=G|=G=A=G=c=A=G|' \
     -e '|=G=E=E=G=A=B=c=A|' -e '|=G=E=E=G=G=E=G=E|=G=E=' -e '|=G=G=G=G=G=G=G=G|' -e '|=G=G=G=G=G=G=G|=G=G=G=G=G=G=G|' \
     -e '|=G=G=G=G|=G=G=G=G|' -e '|=a=f=a=a=f=a|=a=f=a=a=f=a|' -e '|=a=g=f=e=g=g|=d=g=g=d=g=g|=a=g=' -e '|=a=g=g=g=f=e|=d=g=g=e=g=g|' \
     -e '|=b=a=g=b=a=g|' -e '|=c=c=c=g=c=e=c=g=c|=c' -e '|=c=d=c=B=c=d=e=d|=c=d=c=B=c=d=e=d|' -e '|=c=g=e=g=c=g=e=g|=c=g=e=g=c=g=e=g|' \
     -e '|=d=c=e=d=c=e|=d=c=e=d=c=e|=d' -e '|=d=f=g=f=g=f=d=c|=d=f=g=f=g=f=d=c|' -e '|=d=g=g=d=g=g|=a=g=f=e=g=g|=d=g=g=d=g=g|' \
     -e '|=d=g=g=d=g=g|=a=g=f=e=g=g|=d=g=g=d=g=g|=a=g=g=g=f=e|' -e '|=d=g=g=d=g=g|=a=g=g=g=f=e|=d=g=g=d' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g' \
     -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|' -e '|=d=g=g=d=g=g|=d=g=g=d=g=g|=d' \
     -e '|=e=d=e=d=e=d=e=d|=e=d=e=d=e=d=e=d|' -e '|=e>=f=g>=e=d>=c=c>=d|=e>=' -e '|=g=e=g=g=e=g|=g=e=g=g=e=g|' \
     -e '|=g=f=e=d=c=d=e=f|=g=f=e=d=c=d=e=f|' -e '|A,A,A,A,A,A,A,|A,A' -e '|C2CD2E|C2CD2E|C' \
     -e '|C2ED2E|C2ED2E|' -e '|D2D2D2|D2D2D2|D2D2D2|' -e '|D2E2D2D2|D2E2D2D2|D2' -e '|D2E2D2|E2D2C2|D2E2D2|' \
     -e '|EDDD|EDDD|EDDD|' -e '|EDEEDE|EDEEDE|EDEDED|' -e '|G,2G,2G,2|G,' -e '|G,A,A,A,A,2G,G,|G,A,A' \
     -e '|G,A,G,A,G,A,G,A,|G,A,G,A,' -e '|G,B,CDDB,|G,B,CDDB,|G,B,CDDB,|' -e '|G,ED|EG,G,|G,ED|' \
     -e '|G,EEEDEGG2|G,EEEDEGG2|G' -e '|G,G,A,G,A,G,F,G,|G,G,A,G,A,G,F,' -e '|G2A2A2G2A2|G2A2G2A2G2' \
     -e '|G2G2A2|G2G2A2|G' -e '|G2G2G2G2G2G2|' -e '|G2G2G2G2|G2A2G2A2|G2A2G2A2|' -e '|GB\n|GB\n|GB\n|GB\n|GB\n|GB\n|' \
     -e '|GGGGGGGGGGGGGGGGGGGG|' -e '|^A=c^c^A^G=F|^A=c^c^A^G=F|' -e '|^G^A^G^G2^G|^G^A^G^G2^G|' \
     -e '|^G^G^G^G^G^G^G^G^G|' -e '|^c2^A2^A^G^A=c|^c2^A2^A^G^A=c|' -e '|^g=f^d^c^a^a|^g=f^' \
     -e '|^g^a^a^a^g=f|^g^a^a^a^g=f|' -e '|^g^f^g^f^d^d|^g^f^g^f^d^d|' -e '|f/a/g/f/e/d/|f/a/g/f/e/d/|f/a/g/f/e/d/|f/a/g' \
     -e '|gggggg|gggggg|'
} # }}}}

echo "[" >> $JSON-encoded.json; echo "[" >> $JSON-text.json; # "](((((

# should we simply auto-rate all pieces where possible (skipping pairs where both are valid) and not ask for any manual ratings?
# to avoid duplicate comparisons, we split the set of items in half and select from the top & bottom half in each loop;
# if there are 100 shuffled items, we compare #1 & #51, then #2 & #52 etc; thus, every item will be used once and only once
POEMS=()
for file in $(ls /tmp/music-samples/sample-*.encoded | sed -e 's/\.encoded//' | shuf); do
    if [[ -f $file ]]; then POEMS+=("$file"); fi
done

I=0
LENGTH=$((${#POEMS[@]} / 2))
for ITERATION in $(seq 1 $LENGTH); do
    POEM=${POEMS[$I]}
    POEM2=${POEMS[$((I + LENGTH))]}
    echo "POEM: $POEM"

    FIRST=$(cat $POEM)
    FIRST_ENCODED=$(cat $POEM.encoded)

    SECOND=$(cat $POEM2)
    SECOND_ENCODED=$(cat $POEM2.encoded)

    # if first but not second is broken, second is the winner; if second but not first is broken, first wins;
    # and if both are broken, then we insert a pair where both win to tell the model that they are equally bad.
    # The check is a >100kb WAV file; if the ABC file is syntactically broken or too short to bother rating or
    # something goes wrong with abc2midi/timidity, then there will be no or small WAV files, so this checks most
    # errors. The other major error case is long repetitive degenerate ABC pieces generated by the model, so we
    # have a 'filterMusic' blacklist for snippets which show up in degenerate pieces.
    if ( [ ! $(wc -c < $POEM.wav) -ge 100000 ] || [[ -n $(echo "$FIRST" | filterMusic) ]] ) && \
       ( [ ! $(wc -c < $POEM2.wav) -ge 100000 ] || [[ -n $(echo "$SECOND" | filterMusic) ]] ); then
        # both broken: record a tie as a pair of ratings, one win each
        generateJsonBroken 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
        generateJsonBroken 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    elif [ ! $(wc -c < $POEM.wav) -ge 100000 ] || [[ -n $(echo "$FIRST" | filterMusic) ]]; then
        generateJsonBroken 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    elif [ ! $(wc -c < $POEM2.wav) -ge 100000 ] || [[ -n $(echo "$SECOND" | filterMusic) ]]; then
        generateJsonBroken 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
    else
        if [ -z "$SKIP_ALL" ]; then
            echo -e "\n\e[1m--------------------------------------------\e[0m"
            echo "$FIRST"
            timeout 10s mpv --af=scaletempo=scale=1.1:speed=pitch $POEM.wav
            sleep 1s

            echo "============================================="
            echo "$SECOND"
            timeout 9s mpv --af=scaletempo=scale=1.1:speed=pitch $POEM2.wav
            echo "" # print a newline to make output easier to read and divide from the foregoing

            echo -e "[$I] 1: \e[1mFirst\e[0m wins | 2: \e[1mSecond\e[0m wins | 3: Equal | \
                          r: stop & auto-Rate Rest | x: e\e[1mX\e[0mit immediately"
            read -N 1 RATING
            case "$RATING" in
                1) generateJson 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                2) generateJson 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                # a tie is implemented as a pair of ratings, one win each:
                3) generateJson 0 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND"
                   generateJson 1 "$FIRST_ENCODED" "$SECOND_ENCODED" "$FIRST" "$SECOND" ;;
                r) SKIP_ALL=1 ;; # stop asking; auto-rate the rest
                x) break ;;      # exit immediately
                *) ;;            # any other key: skip this pair
            esac
        fi
    fi
    I=$((I + 1))
done
echo "]" >> $JSON-text.json; echo "]" >> $JSON-encoded.json

When I run the PPO in a screen session, I can extract the full terminal history, with all printed-out samples, to rate (C-a C-[ C-Space, PgUp to beginning of run, then C-space C-> to save the terminal transcript to /tmp/screen-exchange), and filter out the samples, selecting only unique samples (important with divergence) of 42 characters or more for rating:

fgrep -v -e ppo -e 'k =' -e 'score =' -e 'kl =' -e 'total =' /tmp/screen-exchange | \
  sed -e 's/^X:$//' | sort --unique | sed -e 's/^/X:/' | sed -e 's/<|endoftext|>/\n/g'  | \
  sed -r '/^.{,42}$/d' | sed -e 's/^/<|endoftext|>\n===================\n/g'  -e 's/⏎/\n/g'| \
  egrep -v "^$" > $TARGET

## add the newly-encoded JSON ratings to the master dataset, remembering to close brackets:
emacs -nw abc-01-encoded.json irish.json
## update `` with the new dataset # of ratings, or else they won't be used
fgrep 'best' irish.json | wc --lines
# 16901
emacs -nw

Alternately, I could generate samples from a checkpoint (assuming it's not too far diverged):

./ sample --mpi 2 --save_dir /tmp/save/train_policy/irish-combined-20191223.18/ --savescope policy \
  --temperature 0.9 --nsamples 2000 --batch_size 30 | tee --append rlsamples-combinedabc-06.txt


~1 day iterations. Early on, each iteration required a few hours at most on my 2×1080ti (a few minutes for the reward model, then hours for PPO), and the PPO would diverge within ~3k iterations. As ratings accumulated, training the reward model began taking up to an hour (occasionally crashing with random OOMs at the end), and PPO began taking up to 24h, sometimes diverging as late as 9k iterations.

Example terminal output of an ABC music (combined model) PPO run, at 8,000 steps (~30 wallclock hours/60 GPU-hours)
Example TensorBoard logs of an ABC music (combined model) PPO run, in the process of diverging after two 'bounces' (full screenshot of an example divergence).

n = 7k ratings / 40 hours. I ran ~23 iterations of training and then rating samples; not all iterations diverged, because they crashed or I accidentally killed them or I decided they were showing signs of divergence (like a collapse in the entropy of the policy). Excluding 'broken' auto-ratings, I rated n = 7,429 pairs of ABC music (this is an overestimate of the actual ratings, due to my implementing ties as double-samples); including auto-ratings, I had n = 25,508. I found I was able to get through ~200 ratings per session (~1h) before my skull began leaking out my ears, so I sometimes had to take breaks. Since each comparison takes ~20s total (~10s per sample), this required a total of >40 hours of concentrated rating. (I initially tried doing other things while rating to lessen the time burden, but quickly discovered it was impossible to remember the first musical piece to compare it to the second piece if I was doing anything at all like reading.)

Divergences. The constant divergence created a lot of problems, and I tried to deal with them by automatically blacklisting examples with pathological patterns, but this did not help. Since the OA paper did not report any divergence issues, I tried going back to the OA setup by increasing the KL regularization, but while this generated different dynamics (instead of a gradual 'double bounce', there is a long steady state followed by a single abrupt collapse), it did not fix the issue:

Example TensorBoard of the combined model, diverging in a single bounce despite full KL regularization

Di­verged ex­am­ples:


KL regularization didn't help. Finally, I gave up: after 3 months & 7k ratings, if it wasn't working, it wasn't going to start working just because I spent a few more weeks adding more ratings. I ran one last iteration, stopping it at ~7k iterations, not long before I expected it to diverge but before entropy had collapsed too much (final RL training run TensorBoard log). Some samples from the final model:

"Bourrée à six de Briantes" sample (2020-01-25):
"100 GPT-2 Preference-Learning-Tuned Tunes" sample (2020-01-25):

Model & Data

Available for download:

  1. All ratings & samples (26MB; mirror)

  2. the final policy model:

    rsync -v rsync:// ./

Blind Ratings

No improvement from RL. I was not impressed by the quality of RL samples either during training or when sampled afterward; they did not strike me as clear improvements. (In contrast, the 'spaceless' ABC data-cleaning change made an immediate difference to samples.) To evaluate the final samples, I used the 7k-iteration checkpoint to generate samples (temperature: 0.95) and compared them to the 117M ABC spaceless baseline (with top-p = 0.95), adapting my rating script to load from the two sample files, randomize left/right presentation, and record which file won. I expanded the rating time to 20s per piece to allow more in-depth comparison.

I rated ~200 pairs, and the result was that I preferred the RL samples in 93 of 210 comparisons, or ~44%. If anything, the RL-finetuned samples were slightly worse than the baseline.
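As a sanity check (mine, not part of the original analysis), an exact two-sided binomial test using only the Python stdlib confirms that 93 wins out of 210 is statistically indistinguishable from a fair coin:

```python
from math import comb

def binom_two_sided(k, n):
    """Exact two-sided binomial test against p = 0.5 (symmetric case:
    double the smaller tail probability)."""
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2**n
    return min(1.0, 2 * tail)

win_rate = 93 / 210                 # ~0.44: RL samples won under half the comparisons
p_value = binom_two_sided(93, 210)  # well above 0.05
```

The p-value comes out above conventional significance thresholds, so the blind comparisons detect no improvement (and, if anything, a slight deficit) from the RL finetuning.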


“I have at­tempted sci­ence” (, 2019)

Despite considerable personal effort over 3 months, I did not achieve any improvement in sample quality, and the project failed. Since the technique does work in some cases, how could I have fixed it? Hindsight. In retrospect, I would have done a few things differently:

  1. prefix completions: the root cause seems to be the reward model not learning adequately. Even initialized from a good music-generation model, esthetics may be difficult to learn from few n with paired comparisons where the pairs are completely dissimilar.

    The OA tasks, on the other hand, made heavy use of completions: samples which share a long prefix, and then diverge. Because they are mostly identical, they differ far less than 2 random samples would, and so the same rating is much more informative. It's just a kind of statistical power issue, similar to using identical twins rather than random people—the results are the same, but you need orders of magnitude less n.

    I avoided conditional samples because it made the programming much easier to not have to count BPEs or slowly generate 2 completions for each possible prefix; I could use random pairs of samples collected from anywhere, and it mapped directly onto my goal of unconditional generation (if I used conditional generation, where do the prefixes come from?). These all seemed like good enough reasons at the time, but given the final results, this may have been a mistake.

    Another idea (made much more difficult by the rigidity of the inputs & config) is to use "curriculum learning": there are at least two straightforward ways of providing easier sub-tasks than generating a full music piece. First, the required length can be gradually expanded over training—once it learns to generate 5s of music that the critic can't distinguish, require it to generate 10s, etc.

    Second, real music can be used as a crutch by providing the generator with a decreasing prefix from real music as a 'seed': once it can append 1 note successfully, require it to append 2 notes, then 3 notes, and so on, until the prefix is 0-length and it is generating music sequences from scratch. (This can be done with or without using a supervised log-likelihood loss for training the NN to generate the prefix.)

  2. more hyperparameter tuning: there's no support for hyperparameter optimization in the codebase, but instead of setting the hyperparameters based on my experience & intuition, I could have run a more formal hyperparameter search, grid-searching it manually. Since the reward model typically takes less than an hour to train, a few hundred runs would have been feasible over the 3 months of my project, and I would have much more confidence that the reward model was squeezing as much out of the ratings as possible.

    As it is, I'm left with the nagging doubt—was the LR just too high, or too low, and could the reward model have gotten good enough to provide a useful signal to the PPO and train a genuine improvement to the music generation?

  3. tried crowdsourcing: I didn't want to involve third parties until I knew it would work, or try to set up a website for interactive generation/rating, but crowdsourcing may have been necessary to collect a decent-sized dataset. While it would not have gone hugely viral like other DL projects have, a few thousand visitors rating a dozen comparisons each would've gone a long way.

  4. checked auto-ratings more: the auto-ratings seemed like a great idea at the time—if the model kept generating samples with similar pathological behavior, or if they were syntactically broken, why hand-rate them at all? But now I have misgivings. Banning the pathological samples was probably OK, but did I screw up the reward model by banning broken samples? After all, they made up the overwhelming majority of the corpus at the end, so I may have inadvertently produced a 'class imbalance'-style problem: the reward model wound up focusing entirely on trying to understand syntactic flaws, rather than esthetic flaws.
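A driver for the manual grid search suggested in #2 would only be a few lines; here train_reward_model is a purely hypothetical stand-in (its shape and hidden 'optimum' are invented for illustration) for retraining the reward model with given settings and returning held-out validation accuracy:

```python
import itertools
import random

def train_reward_model(lr, batch_size, seed=0):
    # Hypothetical stand-in: a real version would invoke the codebase's
    # reward-model training and score held-out comparisons. Here we just
    # pretend lr=1e-4 is the optimum and add a little rating noise.
    rng = random.Random(hash((lr, batch_size, seed)))
    penalty = abs(lr - 1e-4) / 1e-4
    return 0.75 - 0.02 * penalty + rng.uniform(-0.003, 0.003)

grid = {'lr': [1e-5, 3e-5, 1e-4, 3e-4, 1e-3],
        'batch_size': [8, 16, 32]}
results = sorted(((train_reward_model(lr, bs), lr, bs)
                  for lr, bs in itertools.product(grid['lr'], grid['batch_size'])),
                 reverse=True)
best_acc, best_lr, best_bs = results[0]
```

At under an hour per cell, a 15-cell grid like this is a weekend of compute; a few hundred cells over 3 months would have been entirely feasible.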

Oh well.


A bridge too far. I initially began with RL training of the GPT-2 poetry model, but after a few iterations, I abandoned this idea. The samples were too hard to rate quickly, and were heavily biased towards modernist Poetry-Foundation-based pieces, which, lacking formal structure, had to be judged largely on semantics—but every poem sample is broken differently, requiring world-knowledge to fix, and how likely was it that the rating model could hope to pick up on these issues from mere pairwise comparisons? Even the OA paper didn't try to fix semantics, settling for generic tasks like 'descriptiveness' or 'summarization' (as opposed to finetuning for tasks like 'make accurate statements about the world'). So I switched to improving the ABC music, reasoning that fixing issues like too much repetition ought to improve music quality, while not being nigh-impossible for the reward model to learn from comparisons.

I did pick out a few good poetry samples along the way:

A man is sitting at the foot of his host.
Every eye is filled with watching and dread.
And he speaks with a strange and fearful sound.
All, all are silent and no sound,
Save what the Spirit sings
In the calm hour of your pensive sleep;
A measured, silent, and mysterious sweep
Of the invisible,

"What voice is this?" the Captain cries
In a voice of troubled joy,
As he deems the gathering night-wind, and
The voices of the years away,
But knows the toot of the sands of tears
And the burden of our own despair.
He calls for his comrades and his wife
To wait upon him in the tent;
But they do not, as he says to them,
Because they fear to die;
They turn and journey on their way,
For they realize their fate.
Why does the Captain wait and wait
In silent and unseeing wait?
He has not come for his reward,
Nor will it be too late.

..."The Farm", by James Thomas Stevens [Nature, Landscapes & Pastorals]
The yard is an x of sun
shot across and then dividing
black with silver mesh in,
then crossing through it into salt,
becoming a tiny dream
so it can sometimes dream of water
across the tree. The house
is a part of the yard
and the sun is going down
past the barn. At home
the house is a field of cream.
A few lampshins
flare at the door.
A door is at the threshold
and when the house of cheese turns brown
the house turns red.
The house is an x of sun
and when the house of feed turns red
the house turns green.

...I hear him with pleasure roar through the wood,
A melody as of rushing from the main;
He sings as he treads the bound of human things,
Borne on the wings of the blast as I sweep along,
The music of all whose names I love to hear,
Like music heard in a wind that murmurs near,
The music of all who hear.

I think of thee I know, oh, come from far,
From this green world and tracks by unknown lands;
Come to me, all that still is beautiful,
Come all that dwell in nature, all resigned,
And all that glows with beauty bright and free,
Yea, all that glitters like your beautiful eyes,
And all that lives like your beautiful hair,
And all that mocks at with a tranquil air,
Weeneth more of the sweetness of your voice,
Wandering on as it wanders still and free,
With earth in every drop and spot on earth,
By night and day and starry night.

Perfect is this life,
And end is death. And so to end is life.
How many of us
Have come to this, and died
Like birds. Here, in a quiet room within
A sombre room, where even the gravest dead
In all the ills of life are counted down.
In the broad company of light and death,
I watched a silent procession of them die;
And one by one, by three,
Passed slowly out into the waiting dark.
We left behind us in a room of grief:
Her voice, her hands I laid upon mine eyes,
Stretched over mine, and strove to think in vain
We loved together in a world of tears.


Data Increases


1–10m labels too expensive to buy. If we need 70k labels to get good performance on a relatively straightforward task like summarization (which can be solved to a considerable degree just by copying selected parts of the input), it's easy to imagine that we might need an order of magnitude or two more data for subtler tasks like music. 1–10 million ratings is totally infeasible for one person on their own, and would cost far too much to pay a data labeler for as well5.

Crowdsourcing scales to 10m+! Could we overcome the lack of ratings by using crowdsourcing? Such sample sizes appear to be entirely feasible with crowdsourcing: the global public is interested in AI and generative art, and can contribute a lot of time en masse, donating millions of interactions, and the necessary infrastructure does not require enormous resources (many successful projects were done by hobbyists or interns). Some examples:

  1. : within 2 months of launch, the turn-based GPT-2-text-dialogue game AD2 had racked up >100m turns.

  2. : >1m unique visitors within the first month, spending several minutes on average and looking at dozens of faces. TWDNE is only one of a number of "This X Does Not Exist" sites, usually based on StyleGAN models, and the total number of visitors to TXDNE sites is likely into the tens of millions.

    • : an anime face generator similar to TWDNE, it similarly went viral. Sizigi Labs estimates 325,000 sessions 2019-10-30–2020-01-28 (well after launch & virality), at ~2 minutes/session; their analytics were broken at launch but "if I had to guess, we're somewhere 1-3MM lifetime [users]." Given how popular it was during its virality and the number of links & mentions I've seen on social media, I definitely believe it had at least as many unique users as TWDNE did.
  3. : the homepage reports generating 56,391,540 images between its launch ~2019-09-09 and 2020-01-27; the standard breeding interface shows 6 possible images, so that corresponds to ~9m user actions/clicks.

  4. : when made available by OA for a weekend in April 2019 for public play, there were 42,723 DoTA2 games against 30,937 players, taking a total of 93,796 man-hours.

  5. : OA reported in the first day of MuseNet availability: "In the last 24 hours, we saw: 100k plays total of pre-generated MuseNet songs / 38,000 MuseNet samples co-composed / 29k unique MuseNet concert listeners".

    MuseNet samples are typically 1–3m long, and the concert was 3 hours long, suggesting man-hours in the first day listening/generating MuseNet samples. Presumably the counts increased by at least another order of magnitude in the following week as they ran a competition for the best generated sample of the day.


Off-policy RL/semi-supervised learning. We do not necessarily need explicit ratings from humans if we can leverage existing algorithms and datasets to construct synthetic or pseudo-rating datasets. They do not need to be perfect or human-quality to potentially greatly reduce how many human ratings are needed, similar to how pretraining GPT-2 for generating ABC for transfer learning makes preference-learning feasible at all on that domain. From an RL perspective, PPO may be an ‘on-policy’ algorithm which can learn only from rewards on samples it just generated, but the reward model itself can learn from ratings on samples generated by any process, and is ‘off-policy’. The samples could be generated by humans, or generated by non-GPT-2 NNs, or generated by non-NN algorithms entirely.

To kick-start the learning process, you could ‘pretrain’ the reward model by generating lots of music from low-quality generative sources and then marking them all as the loser in a set of comparisons with higher-quality sources (such as real music). For example, one could define a few music generators (random ASCII characters, n-grams, char-RNN at various temperatures) to generate a million fake music sequences, take the real music from the ABC Irish music corpus, and create comparisons with the real music always the winner. If there is popularity data on the real music, then this too can be used to pre-generate comparisons (just have the more popular of two pieces win each comparison). The pretraining comparisons can reflect as much additional information as you think you can get away with. Along with popularity ratings to make distinctions between comparisons of the real music, why not order the comparisons by data-quality source as well? eg random < n-gram < char-RNN < GPT-2 < GPT-2-PPO-tuned. There might be mistaken comparisons (perhaps sometimes the n-grams really do beat the char-RNNs), but this is amenable to fixing by active learning on the persistently misclassified comparisons should it be an issue. This immediately provides an enormous corpus for the preference classifier, and then when it’s finished training on that, one can bring the human into the loop and start generating/comparing/retraining as in the preference learning.

More generally, you can see the pretraining + preference learning as a form of semi-supervised learning, with an initial unsupervised bootstrap phase followed by supervised learning as necessary:

  1. use unsupervised learning methods to create generative models based on a corpus
  2. sample from the generative models to create n fake datapoints
  3. create m comparisons by pairing random fake & real datapoints
  4. train a reward model
  5. begin regular preference learning
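The bootstrap phase above can be sketched in a few lines. This is a minimal illustration only, with hypothetical names (`make_synthetic_comparisons`, `TIERS`), assuming the quality ordering random < n-gram < char-RNN < real discussed earlier:

```python
import random

# Quality tiers for bootstrap comparisons; the ordering
# random < n-gram < char-RNN < real is the assumption from the text.
TIERS = ["random", "ngram", "char-rnn", "real"]

def make_synthetic_comparisons(samples_by_tier, m, rng=random):
    """Generate m (winner, loser) pairs by drawing two samples from
    different quality tiers; the higher tier is always the winner."""
    comparisons = []
    while len(comparisons) < m:
        t1, t2 = rng.sample(range(len(TIERS)), 2)  # two distinct tiers
        hi, lo = max(t1, t2), min(t1, t2)
        winner = rng.choice(samples_by_tier[TIERS[hi]])
        loser = rng.choice(samples_by_tier[TIERS[lo]])
        comparisons.append((winner, loser))
    return comparisons
```

The resulting (winner, loser) pairs can be fed straight into the reward model’s comparison loss before any human enters the loop.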

Architectural Improvements

I believe the current Christiano blackbox preference learning approach could be improved to make it more compute-efficient, sample-efficient, and simpler. There are two ways that seem particularly relevant for music/text generation:

  1. directly optimize reward by backprop: The optimization of the reward does not require taking a blackbox approach where the ‘environment’ is not modeled, requiring an agent like PPO; the ‘environment’ is simply the reward model, which is a neural network and can be queried, differentiated, and optimized over like any other.
  2. directly model quality score: The reward model can be improved in flexibility, interpretability, and efficiency by explicitly treating it as a Bradley-Terry model, and training the NN to predict the intrinsic ‘quality’ score (rather than raw comparisons), which can be easily estimated by standard statistical methods given a dataset of ratings.

The new architecture & training would go like this when combined:

  1. Data collection:

    1. do Pairwise Ratings on a corpus, with enough overlap to form a ranking
    2. B-T Ranking algorithm to infer the latent quality score for each datapoint
    3. Supervised (Re)Training of the reward model on data → score
  2. Policy improvement:

    1. for each datapoint (either randomly-generated or from a corpus)

      1. Encode into the text embedding
      2. run iterations of Gradient Ascent on the reward model to optimize the embedded text sequence, until a limit i is hit or until the average reward (quality) is higher than the average reward (quality) of the previous corpus
    2. replace the previous corpus with the new Improved Corpus

    3. [optional] (Re)Train a generator/agent-model by likelihood-training/imitation-learning on the new corpus (‘amortized inference’)

Optimization by Backprop, not Blackbox

Here I propose changing the agent/generator model architecture to explicitly optimize the reward model’s utility/reward score, by removing the agent/generator entirely and instead improving possible sequences by gradient ascent on the (differentiable) reward model. There is no need to build a redundant agent model when the reward model is differentiable and can be used to directly specify how an input sequence ought to change to improve it.

This simplifies the overall architecture greatly, avoids expensive & unstable & complex blackbox training of DRL agents, and enables easy generation of both high-scoring & highly-diverse (thus informative) sequences for an oracle to rate, which can then be fed back into the reward model for further training. To the extent an agent/generator is necessary to efficiently generate many sequences, it can be trained quickly & stably by imitation learning on a corpus of datapoints optimized by the model.

While running PPO against the reward model, I concluded that compared to other approaches I’ve seen to optimizing the outputs of a NN, the blackbox preference learning approach has 2 major flaws:

  1. Compute-Inefficient: it is slow and memory-hungry (I have to use GPT-2-117M to fit reasonable minibatches onto 2×1080ti, and even then iterations can take days)
  2. Single-Divergence-Prone: it ‘mode collapses’ into adversarial samples, typically highly repetitive, and typically eventually only one adversarial sample

Slow feedback: 1 day, 1 counter-example. This makes iteration slow because of the double-whammy: each run takes days before any score improvements or divergence, and when it diverges, it typically only yields a handful of usable adversarial datapoints to rate & retrain on. Thus, the frustrating experience of seeing each run end in just one adversarial sample, which may be only somewhat different from the previous run.

Mode collapse. Thinking about this, a blackbox RL approach doesn’t seem quite right. For an RL problem, it’s fine to find only a single path which leads to a high reward. To put it in GAN terms, this ‘mode collapse’ onto a single adversarial example is, as far as the agent/Generator is concerned, a 100% valid solution. The ‘environment’ has no memory, and cannot penalize the agent/Generator for repetition. If there exists any string “XYZ” which, on its own or appended to any other string, causes the reward model/Discriminator to always output the maximal reward, then why does the agent/Generator need to learn anything else? It won. But that’s not really what we want. We want it to sample from the full distribution of high-quality sequences. Unfortunately, mode collapse is not solved in GANs, and I can’t think of an easy way to fix it in this preference learning either.

Ask reward model how to edit samples. One approach to avoid both those issues is to drop the blackbox optimizer approach entirely—which incentivizes wasting a ton of compute to find a single adversarial example—and instead optimize datapoints directly. It seems like a waste to go to all this effort to build a differentiable surrogate (reward) model of the environment (the human), and then treat it like just another blackbox. But it’s not, and that’s the whole point of preference learning! Since GPT-2 is differentiable, it ought to be possible to backprop through it to do planning and optimization. Typically, we hold the inputs and outputs fixed and use backprop to adjust the model, but one can instead hold the model fixed and adjust the inputs based on backprop to give a desired output: in this case, hold a GPT-2 reward model fixed, and adjust textual inputs to make the output, the reward, larger, by backpropping from the output back through the model to the current input. This approach has precedent: ‘optimizing images’ to maximize a CNN’s probability of classification as a ‘dog’ or ‘cat’ etc has long been done as a way of visualizing what a CNN has learned. For example, one could generate high-scoring music pieces by generating a random sequence, text-embedding it into the vector for the reward model, and then doing gradient ascent on the vector. (No PPO cluster required.) This is equivalent to doing planning/revising, as at each iteration, GPT-2 ‘considers’ the sequence as a whole and can make global changes rather than local changes to the final entry in the sequence; over many iterations, it can ‘edit’ a sequence repeatedly, rather than being forced to generate the entire sequence in a single shot like PPO is. This could be a lot faster, since it exploits the whitebox nature of a learned reward model instead of treating it as a high-variance blackbox.
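A minimal PyTorch sketch of this gradient-ascent-on-inputs idea. Assumptions: the real reward model would be GPT-2; here a toy linear network stands in, since the mechanics of freezing the model and optimizing the input embedding are the same:

```python
import torch

def ascend(reward_model, x0, steps=200, lr=0.1):
    """Hold the reward model fixed; adjust the input embedding by
    gradient ascent so the predicted reward increases."""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-reward_model(x).sum()).backward()  # ascent = descend on -reward
        opt.step()
    return x.detach()

# Toy stand-in for a GPT-2 reward model: embedded sequence -> scalar.
torch.manual_seed(0)
seq_len, embed_dim = 16, 32
reward_model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(seq_len * embed_dim, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)  # the 'environment' is frozen
```

The optimized embedding must then be mapped back to discrete tokens somehow (eg. nearest neighbor in the embedding matrix), which is the catch discussed under “Optimizing embeddings = nonsense?”.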

Example: PPLM. A similar approach to optimizing GPT-2 outputs has since been published by Uber as “PPLM”. PPLM uses the gradients from GPT-2 and a control NN to do simultaneous gradient ascent, trying to optimize an input to maximize both likelihoods, thereby maintaining sensible English text (due to GPT-2) which still maximizes the parallel target (such as a ‘positivity’ goal).

Another possibility would be to try to use beam search (although it has produced bad results in NNs, as discussed in the nucleus sampling paper, perhaps due to the log-likelihood training encouraging repetition) or the expert iteration/MCTS training from AlphaGo Zero. MCTS was originally introduced for planning in general MDPs and isn’t inherently limited to two-player games; the “rules” of generating sequence data are trivial (anything ASCII, in this case), and the discriminator provides a well-defined reward. So instead of a NN which directly generates a next character, it could instead (given a particular prefix/history) output values for the 128 ASCII values, run MCTS search for a while, produce a refined value for each character, and retrain the NN towards the refined values; every minibatch of the generator, one generates a bunch of examples for the human to judge, providing a new minibatch for the discriminator. Hence, tree iteration learning-from-preferences deep RL. With music we don’t necessarily need the stable self-play that tree iteration provides, since I’m not too clear conceptually what one would expect self-play to deliver (it is inherently a human-defined problem, as opposed to Go, where the criterion is external and human preferences are not the criteria), but given the AlphaZero & Anthony’s Hex results, this could be considerably more computation-efficient by providing much more supervision at each timestep instead of providing just a little bit of supervision from the end result of win/lose with REINFORCE. Possibly also more human-sample-efficient?

Bootstrap. Ratings can be done pairwise on the various optimized sequences (random pairs of high-scoring sequences, although before/after comparisons might be more informative), and then the reward model trained.

Amortized inference. If gradient ascent is too slow for routine use, then one can just distill the reward model via training the GPT-2 on successively better corpuses in the usual efficient quick likelihood-training (imitation learning) way, for a sort of ‘expert iteration’: generate improved versions of a corpus by generating & selecting new datapoints above a threshold (perhaps using a corpus of human datapoints as starting points), and train to generate that.
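One round of this distillation can be sketched abstractly; `generate` and `reward` are hypothetical stand-ins for the GPT-2 sampler and the learned reward model:

```python
def expert_iteration_step(corpus, generate, reward, k=500):
    """One 'expert iteration' round: sample k candidates, keep those
    scoring above the current corpus's mean reward, and return the
    improved corpus to likelihood-train the generator on."""
    threshold = sum(map(reward, corpus)) / len(corpus)
    candidates = [generate() for _ in range(k)]
    improved = [c for c in candidates if reward(c) > threshold]
    return improved or corpus  # fall back if nothing clears the bar
```

Iterating this raises the corpus’s mean reward each round (so long as the generator can still clear the rising threshold), giving cheap amortized inference without per-sample gradient ascent.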

Automatic editing via gradient ascent. Gradient ascent can be used to control and optimize text in various ways. For example, fiction could be edited in one region (say, to change a character’s name from “Miriam” to “Mary”) and then the edited region could be held fixed during gradient ascent, while the rest of the unedited text is free to vary; this would propagate the edits, because self-contradictory text is unlikely while self-consistent text is more likely. (Because it is a more likely text for the protagonist to be consistently named “Mary” throughout the entire text rather than named “Mary” in one place and “Miriam” everywhere else.) One could produce multiple versions of a text by speculative edits—what if this character was a pirate? what if the protagonist lost this battle instead of winning? what if a banquet scene was deleted entirely?—and select the best one. One could also do this on heterogeneous sets of text, such as collaboratively-edited works like the SCP Foundation wiki: there is no linear structure like a novel, but one could take an edited entry and concatenate it with an arbitrary second entry, do ascent on the second entry to make it more consistent, and save the modified version; iterated repeatedly over the entire set of entries, one would witness the wiki ‘growing’ organically—changes in one entry, like a new SCP or character, will automatically pop up elsewhere in logical ways, with all seams gradually smoothed over into one evolved whole.

Parallel generation of counter-examples. If nothing else, I think it would help with the adversarial instances. Part of the problem with them is that each PPO run seems to collapse into a single specific adversarial instance. I can do a bunch of ratings which penalize all variants of that instance, which fixes it, but then I must wait another day or two, and then that run collapses into a new single adversarial instance. The reward model seems to gradually get better and the adversarial instances seem to gradually increase in complexity, but the process is slow and serial. The gradient ascent approach may also run into the problem that it will find adversarial instances for the reward model, but at least it will do so in parallel: if I can run a minibatch n = 11 of GPT-2-117M reward models, each starting with a different random initial sequence being optimized, and do gradient ascent on each in parallel, they will probably find multiple adversarial instances in parallel, while the PPO would only find the one it collapses on. So one would get a lot more useful adversarial instances to rate per run.

Optimizing embeddings = nonsense? One of the most likely drawbacks to such gradient ascent approaches on the embedding is the possibility that the maximized embedding will not then convert back to any kind of sensible discrete symbol sequence, a failure mode which has caused trouble in similar attempts at textual latent-space optimization, such as on T5 models (requiring adding VAE autoencoder-like constraints to make the latent space ‘smooth’, and—at least with extremely low-dimensional latent spaces like n = 2–10—tokens can be too separated to be reached by gradient-following).

Bradley-Terry Preference Learning

Christiano et al 2017 introduced a deep reinforcement learning architecture for learning “I know it when I see it” subjectively-defined reward functions from human feedback: a human makes comparisons of actions/datapoints/episodes to select the ‘better’ one, a NN is trained to predict the better one based on these comparisons, and another NN is RL-trained based on the predicted comparisons interpreted as a reward. Since the human is unable to write down a conventional reward function in software, the predictor NN (analogous to a Discriminator in a GAN or a ‘critic’ in actor-critic RL) learns the reward function by example, and then the RL agent NN (analogous to a Generator in a GAN) learns by trial-and-error what sequences will optimize this complex reward function, and the human feedback provides additional guidance on new parts of the problem as the pair of NNs bootstrap into better performance. This is demonstrated on video game or robotic-style simulations, but appears equally applicable to other sequence problems where reward functions are impossible to write and existing losses like maximum likelihood are imperfect for generation (such as music or poetry composition).

As originally framed, the predictor merely does comparisons, receiving & providing binary feedback. This is justified as being implicitly equivalent to a standard pair-comparison/competition model, the Bradley-Terry model (akin to the famous ELO), where each datapoint has a latent variable on a common cardinal scale (often scaled for convenience), producing a total order which efficiently extracts all possible information from the comparisons.

I suggest that this is not necessarily the case, as examples from GANs indicate that such a preference-learning architecture may be learning something odder (such as memorizing comparisons), and that the architecture could be improved by removing the implicitness of the B-T ranking, computing the B-T rankings directly (which can be done even with non-overlapping comparisons by using a Bayesian model with priors and using covariates such as the predictor’s own estimates), thereby providing absolute quality scores for correctness of comparisons, more efficient regression, RL rewards, and meaningful interpretable scores for downstream uses.

The motivation for the double-critic architecture in the current blackbox approach is that the data being collected from humans is pairwise, and so one trains the critic to predict comparisons. This outside training loop then has an inner G/agent training loop etc. The double training loop is necessary to collect ratings from new areas of statespace that the G/agent can now access, but also, GAN-style, to avoid the D/critic being too powerful and saturating the loss. (The original Christiano implementation avoids such problems by storing only the most recent n = 3,000 comparisons.)

But just because the input is pairwise doesn’t mean that the output must also be pairwise. (After all, many things, like tournaments, turn a series of pairwise comparisons into final scalar values like ‘rank’.) It could instead be a scalar indicating global rank, with the D/critic performing regression. GANs and DRL are closely connected (eg. GAN training is analogous to imitation-learning + RL-finetuning), and in both fields, a richer reward signal is always better, allowing for stabler, faster training to better final performance. And a global rank is more informative than a comparison.

Full Bradley-Terry Training

Extract the rankings fast. A Bradley-Terry (B-T) model is simple and easy to estimate on even large samples, and can easily produce cardinal rankings. Each datapoint gets an estimated cardinal ranking in standard deviations of a hypothetical latent Gaussian. The D/critic then is trained to do regression from a single input to the estimated latent variable of quality.
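As a concrete sketch (not the off-the-shelf estimator one would actually use, such as R’s BradleyTerry2 package or a Stan model): the latent scores can be recovered by simple gradient ascent on the B-T log-likelihood, with a weak Gaussian prior (an L2 penalty) so that sparsely-compared datapoints remain identified:

```python
import numpy as np

def fit_bt(n_items, comparisons, iters=500, lr=0.1, l2=0.01):
    """Estimate latent quality scores from (winner, loser) index pairs
    by maximizing the Bradley-Terry log-likelihood: the probability
    that i beats j is sigmoid(s[i] - s[j])."""
    s = np.zeros(n_items)
    for _ in range(iters):
        grad = -l2 * s  # Gaussian prior pulls unconstrained scores to 0
        for w, l in comparisons:
            # d/ds[w] of log sigmoid(s[w]-s[l]) = 1 - sigmoid(s[w]-s[l])
            p = 1.0 / (1.0 + np.exp(s[w] - s[l]))
            grad[w] += p
            grad[l] -= p
        s += lr * grad
    return s
```

The fitted scores are on an arbitrary logit scale; standardizing them gives the “standard deviations of a hypothetical latent Gaussian” regression target for the D/critic.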

So the new loop would look like this:

  1. run off-the-shelf B-T ranking over a dataset of comparisons of datapoints

  2. extract the estimated latent variables for each datapoint

  3. until convergence, supervised training of a D/critic NN to predict the latent for each datapoint

  4. until convergence, RL training of a G/agent NN with the D/critic NN

  5. sample n new datapoints from the trained G/agent NN and add to the dataset

  6. run B-T ranking over the augmented dataset

  7. ask the oracle for ratings of the m datapoints with the largest posterior uncertainty or some proxy thereof like standard error (which will usually be the new datapoints)

    • active sampling or bandit algorithms can be used to maximize the informativeness

Is Preference Learning a Bradley-Terry Model?

What’s the differ­ence?

Not extracting all information. By using only comparisons, each predictor training step is less meaningful; and even if the predicted variable is still the result of comparisons, not fitting a B-T model means that one can’t train on comparisons between all datapoints (since one needs the B-T model to predict, based on the global ranking, what the outcome of an unobserved comparison would be).

Pairwise yields incoherent ranking? It is also unclear that the preference learning architecture is implicitly estimating a B-T model. (I am not familiar with any paired-comparison approaches which optimize ML models to predict a fixed set of comparisons, or which work purely on disconnected comparisons.) Because no global ranking is ever constructed, no comparisons can be trained on other than the exact ones that the human made, and that may not be enough training signal to force inferring a global ranking, rather than merely learning locally-consistent pairwise comparisons which are nevertheless globally inconsistent (with cycles like rock-paper-scissors). The predictor may be learning something much simpler, such as superficial features which distinguish within each fixed pair, without learning what we thought it was learning—generalizable quality features which allow a meaningful global ranking across all pairs.

What does a D do? In a GAN, you have real and fake datapoints being compared; the D attempts to regress the probability of each point being a winner/loser, so to speak, producing a log probability (in the original formulation); does D learn generic features of quality or realism? Apparently not; when I use a well-trained StyleGAN D to rank real data, the rankings are strange, with outliers ranked both low & high, suggesting that garbage data can be ranked extremely confidently by the D simply because it could easily memorize them as outliers. So, we have a case where a D/critic is trained on comparison data from an oracle (real vs fake), is useful for training, outputs a variable which looks exactly like an ELO and even has an ELO-like theoretical interpretation—and is completely ungeneralizable and not learning anything remotely like a cardinal score or even a transformation thereof like an ELO. What is going on? Apparently the D is memorizing real datapoints, and pushing the G away from them and toward nearby potential datapoints.

Ds just memorize? Why can’t this be the same thing for the preference-learning D? It is given a small dataset consisting of fixed pairs of good/bad datapoints, and it memorizes bad datapoints within a fixed pair, latching on to some feature or other (possibly important features, but they could also be the ‘non-robust features’ involved in adversarial learning) in order to memorize just within that pair (if it can overfit…), and this then pushes the G away from trajectories that look like bad datapoints, producing useful training just like in a GAN.

Weak learning signal. This would be consistent with the paper’s reported success, but would have a different interpretation: the D is not learning any generic quality metric, is not implicitly ranking all datapoints on a common scale of reward, and is not equivalent to a B-T. It is merely memorizing some datapoints or some ungeneralizable non-robust features which happen to let it distinguish within the pairs. As such, it can’t provide a stable ranking within or across iterations or datasets, and its feedback is of limited value (since once the G/agent has moved sufficiently far away from the penalized memorized datapoints, that no longer provides a training signal for more improvement, and new relatively-bad datapoints must be learned and penalized).

Active Learning

Estimated rankings can prioritize comparisons. As implemented, preference learning is (potentially, assuming it’s equivalent to B-T) more sample-efficient than a naive B-T: each datapoint appears once in a unique comparison (rather than in multiple comparisons with multiple other datapoints), and so each comparison is potentially maximally efficient (in the sense that each additional comparison involving a datapoint provides the predictor less information than the first one did). A naive B-T, like the usual frequentist implementation, requires multiple comparisons to connect all datapoints via a chain of comparisons, and may be undefined if any datapoints are ‘unconnected’.

A Bayesian B-T model mitigates this by having priors on any new datapoint, which provides a meaningful estimate with few or no comparisons. (With no comparisons, the posterior mean is simply the prior mean, presumably something like 0.) The estimates aren’t informative, but they are well-defined and can be used for sampling strategies.

The lack of comparisons can be fixed partly by using covariates. There are two particularly relevant covariates which could be used:

  1. the predictor’s own ratings of each datapoint:

    Since the predictor should be able to reach high accuracy, its estimate before any comparisons should be quite accurate and reduce the posterior uncertainty considerably (despite having no comparisons). This can be particularly useful for a sampling strategy because it can help discard samples which are estimated as low quality and not informative about the best samples that we want to reach.

  2. the current epoch/iteration:

    Since we hope the generator/agent is also improving, the iteration a datapoint was generated from is relevant: early datapoints should be bad, intermediate datapoints should be medium, and recent datapoints should be the best. The first few comparisons inside a batch give a strong indication how good the batch is overall, and the quality can also be extrapolated from earlier iterations by fitting a progress curve (like a log or spline).

An example of a sampling algorithm would be best-arm racing algorithms. Since in this scenario we’re trying to teach the NN to generate the best datapoints, we don’t value variance reduction elsewhere; we want certainty about the best datapoints in order to penalize the NN for generating any inferior datapoints. A simple posterior sampling racing algorithm for B-T might go like this:

  1. take the arm/datapoint with the highest posterior mean ranking, which is estimated to be the best;
  2. sample from the posterior a possible ranking for every other datapoint;
  3. compare the best known datapoint with the highest posterior-sample;
  4. update.
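Steps 1–3 can be sketched with NumPy, assuming Gaussian posterior approximations (means & standard deviations) over each datapoint’s latent B-T score:

```python
import numpy as np

def racing_pair(post_means, post_sds, rng):
    """Pick the next comparison: the incumbent best (highest posterior
    mean) vs a Thompson-style challenger (highest posterior *sample*),
    so exploration follows the probability of being the true best."""
    best = int(np.argmax(post_means))
    draws = rng.normal(post_means, post_sds)  # one draw per datapoint
    draws[best] = -np.inf  # the incumbent cannot challenge itself
    challenger = int(np.argmax(draws))
    return best, challenger
```

After the human judges the chosen pair, the B-T posterior is refit (step 4) and the loop repeats.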

Best-arm bandit. This explores datapoints based on their remaining posterior probability of being the best. This can be applied to the k best datapoints for batch evaluation etc.

So a training loop could go like this: begin iteration #11 by generating 1000 new samples from iteration #10’s G/agent model; score each with the D/critic; insert the 1000 into the dataset with their estimated score and iteration=10 covariate; do the B-T regression with Comparison[NA1][NA2] ~ (iteration1 + criticEstimate1) − (iteration2 + criticEstimate2) (pseudocode) to estimate posterior distributions of estimates for all datapoints (missingness of comparisons doesn’t matter, the model can still be fit); run the racing algorithm, finding that new sample #551 has a critic score of +5SD, giving a posterior estimate exceeding all other datapoints (despite not having ever been compared yet), and that new sample #998 gets picked by posterior sampling; ask the user to compare #551 and #998; record the result; refit the B-T for an updated ranking; retrain the D/critic; retrain the G/agent; begin iteration #12; etc.

We efficiently home in on the best datapoints without necessarily requiring any ‘redundant’ comparisons, while providing informative stable cardinal rankings for the D/critic based on an ordering of the entire dataset, enabling it to provide more meaningful rewards to the G/agent. To the extent that we engage in ‘redundant’ comparisons, unlike the preference learning approach, those comparisons must have been necessary.

It’s an adaptive procedure, so it’s hard to say exactly how it would differ from preference learning. Depending on how much the G improves each iteration, and how accurate the D is, and thus how much posterior overlap there is between different batches and different datapoints within each batch, it could look a lot like the current heuristic approach of doing only unique comparisons once within a batch and throwing away + never-comparing with prior batches, or it could look quite different, and change with each iteration as necessary:

  • If the G improves relatively slowly, so there’s a great deal of overlap between successive batches, and/or the D is only weakly correlated with measured rankings, then the procedure might need to sample a lot of comparisons between old/new batches in order to improve estimates of the progress curve and all datapoints within the new batch, and it might want many comparisons toward the tail of highest-ranked datapoints (which is not a bad thing, because that’s where we should prioritize improvements, since that’s where the G is moving towards, and it’s less important to estimate accurately less-highly-ranked datapoints).
  • If the G or Ds are intermediate, I think the dynamics might look more like sampling mostly pairs within the new batch, mostly unique comparisons, and a few comparisons with old batches to finetune the mean of the new batch.
  • If the D + G progress so rapidly that rankings don’t overlap at all a priori, then few or no comparisons with the old batches are necessary: the D covariate predicted-rankings eliminate most of the posterior uncertainty despite no comparisons being available, and the G progress means that the old datapoints (while still useful for G training in teaching it the full spectrum of datapoints) are unlikely to be anywhere near the best datapoints and so aren’t worth measuring more accurately, so comparisons focus on the most uncertain pairs in the new batch.

Advantages & Disadvantages

This could have a lot of advantages:

  1. simplified: the D/critic NN is conceptually simplified—instead of 3-way classification on a double input corresponding to an implicit global ranking, it is just a single input for regression on a quality score

  2. memory-efficient: before, a double input takes up memory, even with tied weights, only to yield a single comparison; in the same space, 2 regression models could be run, each with a different input + target quality rating. If, to save memory (critical with GPT-2), a single input is used instead, there must now be a separate pass for each of the two inputs, and each pass merely trains one half of the comparison.

    This could be particularly useful if one tries to use a large Transformer model like GPT-2-345M, where memory consumption becomes a serious barrier to running it at all… (At 345M with the Siamese architecture, we’re down to minibatches of n = 1!)

  3. sample-efficient: many comparisons will be useless, or, for a given pair, they will quickly cease to be informative; a quality rating is informative regardless of what might’ve been used as a comparison, providing richer feedback on each input (analogous to AlphaZero switching to a regression target)

    • possibly better ‘off-policy’ learning: related to saturating, a D/critic trained from a corpus (eg initializing a D/critic by taking a dataset of real and GPT-2-generated poems, and labeling all comparisons as a victory for the human poem) might destroy G/agent training if it provides only comparison feedback
    • better value function/reward signal for any other approach leveraging the NNs (like over a tree of sequences), too
    • humans or other datasets can supply cardinal ratings directly when those are available
  4. compute-efficient:

    • possibly more D compute-efficient: by training comparisons, the D/critic must, implicitly, be learning an equivalent to a quality rating, in order to provide accurate predictions of a human comparison of all possible pairs—but it does so indirectly
    • G gets more data & becomes compute-efficient: a richer reward signal for each sample will of course be quite useful for the G, instead of saturating (there is intrinsically not much information in comparisons, and moving from, say, 99.99% to 99.999% is not helpful, regardless of whether these scores are log-transformed)
  5. interpretable: an absolute cardinal quality variable provides an objective loss for understanding training progress (useful for tasks which don’t have them, like poetry generation!), which is also interpretable and could be useful outside of the task (eg ranking poems for recommendation or data-cleaning)

    • for example, one could get insight into a trained G/agent by generating a number of samples, and ranking them
    • one could also test out various rating approaches, like how much conditioning is necessary. Given a dataset of pure comparisons, one cannot experiment with trying out unconditional generation because one doesn’t know what the outcome of any comparison would be; once one extracts the latent variables from a total ranking, though, one knows the distribution of outcomes and can simulate arbitrary comparison datasets.
  6. principled uncertainty: enables active learning via B-T posterior uncertainty, without any need to extract uncertainty estimates of any kind from the D/critic NN; human ratings can be acquired more efficiently, or datapoints selectively pulled from a large dataset (eg imagine a huge dump of poems from Project Gutenberg or elsewhere, of wildly varying quality—with a regression-style D/critic NN, you can do a single pass over it to select the k% highest-rated poems, use the estimates as pseudo-datapoints, insert them into the B-T model, and ask humans for the most informative comparisons; with a comparison D/critic NN, it is harder to see how to usefully import a large unlabeled corpus)
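The corpus-mining idea in advantage #6 is simple enough to sketch (the critic scores below are made up; in practice they would come from a single pass of the regression D/critic over the dump):

```python
def select_top_fraction(scores, frac=0.05):
    """Keep the indices of the top `frac` of critic-scored datapoints,
    which could then enter the B-T model as pseudo-datapoints."""
    n_keep = max(1, int(len(scores) * frac))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]

# Hypothetical regression-critic scores for 10 unlabeled poems:
critic_scores = [0.1, 2.3, -0.5, 1.7, 0.9, 3.1, -1.2, 0.4, 2.0, 1.1]
keep = select_top_fraction(critic_scores, frac=0.3)
print(keep)  # [5, 1, 8]: the 3 highest-scored poems
```

Human comparisons would then be requested only among these survivors, which is exactly the kind of triage a pure comparison D cannot do in one pass.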

The main downsides I can see:

  • the latent variables are not necessarily 100% stable, as the distribution can drift, yielding anything from ‘rating inflation’ to ‘rating deflation’

    The B-T estimates a distribution arbitrarily defined as N(0, 1); if the B-T sees only selected datapoints at the beginning, it might be that after the G/agent trains enough, the B-T step would be looking at datapoints which are much better than a mean of 0, so there might be new datapoints all the way out at (what used to be) +100 SDs, say. This then leads to the B-T estimate the next cycle shifting the mean/SD to restore the conventional N(0, 1). So the regression target for the D/critic’s predictions of old datapoints may gradually shift over time, precisely because the richer latent variables don’t saturate the way simple pairwise comparisons would.

    I believe this would be a minor problem, easily solved by retraining the D/critic NN each iteration, which is necessary just to handle novel datapoints anyway; since improvements will be small each iteration, the retraining should easily be able to keep up. If not, one could define particular datapoints as a ‘zero point’, which provides a fixed point of reference for future players, even if they are far better; for example, the anchor could be random outputs.

  • (frequentist) B-T might require more comparisons for a total order: a datapoint has to be compared with other datapoints which themselves have comparisons if it is to be globally ranked at all, while a comparison D/critic can work with two entirely disjoint sets of comparisons which don’t overlap. (This can be avoided by using priors & covariates in a Bayesian B-T model.)
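The ‘zero point’ fix for drift can be sketched in a few lines: rather than re-standardizing latent scores to mean 0 each iteration (which lets the scale drift as the G improves), pin a designated anchor datapoint (eg a random-output sample) at score 0 and express everything relative to it. The particular numbers here are invented for illustration:

```python
def anchor_scores(scores, anchor_index):
    """Shift latent quality scores so the anchor datapoint is always 0,
    keeping the scale comparable across training iterations."""
    offset = scores[anchor_index]
    return [s - offset for s in scores]

iter1 = [0.0, 1.2, 2.5]   # latent scores; datapoint 0 is the anchor
iter2 = [-3.0, 0.4, 1.1]  # after G improves, the anchor drifts to -3.0
print(anchor_scores(iter2, 0))  # anchor pinned back at 0.0; others shift up
```

Under this convention ‘rating inflation’ shows up honestly, as all non-anchor scores climbing over iterations, instead of being silently renormalized away.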

All in all, I think this version of preference learning could be simpler, easier to implement, and train faster. The potentially better sampling is nice, but my guess is that the D providing richer feedback (for both the G and downstream users) is the biggest advantage of this approach—a comparison is a bit, and a bit is worth only a little bit.
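The architectural contrast behind advantages #1–2 can be made concrete with a toy sketch (a stand-in encoder and invented dimensions, not the actual GPT-2 reward model): the Siamese comparison head must encode both sequences to produce one 3-way label, while the regression head encodes a single sequence per scalar rating—halving activation memory per training signal.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                           # toy embedding size
W_enc = rng.normal(size=(d, d))  # shared 'encoder' weights (tied in both heads)

def encode(x):
    """Stand-in for the GPT-2 body producing a sequence embedding."""
    return np.tanh(x @ W_enc)

def comparison_head(x_a, x_b, W_cmp):
    # Siamese/3-way: both inputs must be encoded for a single comparison label
    h = np.concatenate([encode(x_a), encode(x_b)])
    return h @ W_cmp             # logits: A wins / B wins / tie

def regression_head(x, w_reg):
    # single input -> scalar quality score
    return float(encode(x) @ w_reg)

W_cmp = rng.normal(size=(2 * d, 3))
w_reg = rng.normal(size=d)
x_a, x_b = rng.normal(size=d), rng.normal(size=d)
print(comparison_head(x_a, x_b, W_cmp).shape)  # (3,)
print(regression_head(x_a, w_reg))             # one scalar per encoded input
```

In the same activation-memory budget that one comparison consumes, the regression head processes two independently-rated inputs, which is where the claimed memory & sample efficiency comes from.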

  1. I don’t know why they used 124M instead of 117M.↩︎

  2. Such a shortcut is rather against the spirit of DRL, where as much as possible should be learned from data. If such shortcuts are necessary—as I do find with ABC—the domain knowledge ought to be localized in the rating code (where it achieves the stated goals of being easy to implement & easing human raters’ burden), and not the training code.↩︎

  3. KL penalties are commonly used in RL. To try to explain it informally: think of each GPT-2 model, like the original GPT-2 model vs its RL-trained version, as emitting probabilities over the ~50k possible BPE outputs, and graph that as a bar chart (probability vs BPE). For any particular input, the 50k possible BPEs will form a spiky bar chart: the model predicts some BPEs are far more likely than others. The KL distance between those two models, then, is like if you subtract the original GPT-2’s bar chart from the new RL GPT-2’s bar chart, and add up the differences at each possible bar (not just the biggest bar). So the KL distance measures the difference of opinion about every possible BPE, from the most to the least likely (including the ones you would never select while sampling). You want there to be some difference between the bar charts—otherwise what’s the point?—but also not too big a difference, because the original GPT-2 model was usually right already, fairly small KL distances can change the final outputs quite a bit qualitatively, and if the new model is making too many changes, it’s probably gone wrong. So trying to keep the KL distance small is a good way to maintain high performance, but still finetune it based on experience.↩︎
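    The bar-chart intuition corresponds to the standard formula KL(p‖q) = Σ p·log(p/q): each token’s log-probability difference is weighted by how much the new model actually uses that token. A toy sketch with a hypothetical 4-token vocabulary (real GPT-2 would have ~50k bars):

    ```python
    import math

    def kl_divergence(p, q):
        """KL(p || q) in nats over a discrete vocabulary."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    q = [0.70, 0.20, 0.05, 0.05]  # original GPT-2's 'bar chart' for one context
    p = [0.60, 0.30, 0.05, 0.05]  # RL-tuned model's bar chart
    print(kl_divergence(p, q))    # ≈0.029 nats: a mild difference of opinion
    ```

    The penalty term added to the reward is proportional to this quantity, so the RL-tuned model is pushed to disagree with the original only where the reward makes disagreement worthwhile.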

  4. I use pairwise rather than best-of-4 because it simplifies the rating process & maximizes the information per sample read/listened to; if I understand the implementation, however, it also reduces the memory requirements, because the reward-model training unrolls n models to train them all simultaneously on an n-rating.↩︎

  5. Assuming a labeler could get through 2 ratings per minute (optimistic, given how exhausting I found even an hour), a million ratings would require >8,333 man-hours, or, at a rock-bottom total-cost-per-hour of $10, >$83,333. And that might not be enough.↩︎

  6. !Margin: Weak supervision: implicit quality ratings.↩︎

  7. eg my sim­ple , just re-es­ti­mates the en­tire B-T model each in­ter­ac­tion, rather than at­tempt any caching or in­cre­men­tal up­dat­ing to a stored mod­el, be­cause it takes a frac­tion of a sec­ond to fit. A fully Bayesian model can be fit via in a few sec­onds, which is neg­li­gi­ble in a DRL con­text.↩︎