GPT-2 Folk Music

Generating Irish/folk/classical music in ABC format using GPT-2-117M, with good results.
statistics, NN, shell, GPT, music
2019-11-01–2020-04-25 finished certainty: likely importance: 6

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a dataset from ‘The Session’. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & a second large ABC database. The resulting music samples are in my opinion quite pleasant. (A similar model was later retrained elsewhere.)

The ABC folk model & dataset are available for download, and I provide selected music samples for listening, as well as medleys of random samples from throughout training.

We followed the ABC folk model with an ABC-MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples, but when sampling succeeds, it can generate even better musical samples.

Back in 2015–2016, Bob L. Sturm experimented with generating Irish folk music using a char-RNN trained on a corpus of folk music written in a high-level musical format called “ABC notation”. Compact text—perfect for NNs. While ABC notation is written in ASCII, it supports many complex features, and it has been adopted widely by folk musicians, with hundreds of thousands of pieces written or transcribed in it.
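To give a concrete sense of how compact the format is, here is a minimal, made-up tune in ABC notation (an illustrative sketch, not a piece from The Session): a few header fields (X: index, T: title, M: meter, L: default note length, K: key) followed by the melody as ASCII note letters and bar lines.

```shell
# Print a minimal illustrative ABC tune (hypothetical example, not from any dataset);
# piping this output through `abc2midi` would yield a playable MIDI file.
cat <<'EOF'
X: 1
T: Example Jig
M: 6/8
L: 1/8
K: Dmaj
ABA AFD|ABA ABd|=cBc ABc|dcB A3|
EOF
```

An entire tune fits in a couple hundred bytes, which is why ABC pieces fit so comfortably into a small NN context window.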


Background: folk-RNN

Sturm et al. scraped ~40k ABC files from The Session and trained a char-RNN called “folk-RNN”, putting the code & data online, and providing a web interface for generation. Prior success with char-RNNs. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces, such as how well they’re played by human musicians.


2015 was a long time ago, however, and DL has seen a paradigm shift in sequence modeling away from char-RNNs to CNNs and attention-based Transformer models—most famously, GPT-2. DL progress. Attention-based models such as OpenAI’s Sparse Transformer-based MuseNet have demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.

While messing around with GPT-2 in late October 2019, I became curious if folk-RNN could be improved by simply throwing one of the GPT-2 models at it. GPT-2: a perfect match. (Not the large ones, of course, which would overfit far too easily or hadn’t been released yet, but GPT-2-117M.) GPT-2 is unable to model raw audio or MIDI, because a meaningful musical piece as a WAV is a sequence of hundreds of thousands to millions of symbols long, and a MIDI piece is tens of thousands of symbols long, which far exceeds GPT-2’s small context window (but see later), and is why OpenAI used Sparse Transformers for its MIDI generation, as Sparse Transformers can scale to text with tens of thousands of characters. However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 window.

I had avoided doing anything music-related with GPT-2, focusing on other projects instead, because I assumed OpenAI would be doing a MuseNet followup, but months later, they’d done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?

As for why repeat Sturm’s project—there were two possible advantages to using GPT-2-117M:

  1. improved global coherency:

    I thought the Transformer might work particularly well on ABC format, because RNNs suffer from persistent ‘forgetting’ issues, where it’s difficult for the RNN to persist its memory of past generated sequences, making it hard for an RNN to repeat a theme with variants, while a GPT-2 Transformer has a context window of 1024 BPEs—much longer than almost every ABC piece—and so is able to ‘see’ the entire piece simultaneously while generating the next note

  2. English metadata understanding:

    The English pretraining could potentially help by providing semantic understanding of e.g. the ABC metadata, such as the difference between two pieces titled a ‘jig’ versus a ‘waltz’, or the pseudo-natural-language-ness of the ABC format as a whole.

ABC Data

The Session

So I did apt-get install abcmidi timidity to get the CLI tools to do ABC→MIDI & MIDI→WAV conversion (respectively), and downloaded the folk-RNN repo with its data files. Pipeline: ABC→MIDI→WAV.

The data comes in several formats, for their experiments in changing the notation. I used the original format, with n = 48,064 songs.

The data needed processing for GPT-2 as follows:

  1. there was stray HTML (</html>) which had to be removed.

    I used search-and-replace, and reported the issue.

  2. abc2midi requires every song to have an integer identifier, e.g. X: 48064, to be a valid ABC file which it can compile to MIDI.

    I used an Emacs macro (which can increment an integer 1–48,064) to insert an X: $N before each T: title line, but in retrospect, I could have simply used another search-and-replace to insert X: 1 in front of each piece—it’s not like the ID has to be unique; we’re just satisfying abc2midi, which is a bit picky.

  3. as usual for any neural model like char-RNN or GPT-2, it is important to insert <|endoftext|> markers where relevant, so it understands how to generate separate pieces and avoids ‘run-ons’.

    I used search-and-replace.

This yielded 14MB of text to train on, which was converted to NPZ as usual.
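Steps 2 & 3 above can be sketched as two GNU sed passes (a hedged sketch: the filenames are hypothetical, and it assumes each piece starts with its T: title line and pieces are separated by blank lines):

```shell
# Sketch of the preprocessing search-and-replaces (GNU sed; filenames hypothetical).
# Step 2: insert a dummy 'X: 1' ID line before every 'T:' title line
# (abc2midi only requires *an* integer ID, not a unique one).
sed 's/^T:/X: 1\nT:/' corpus.abc > corpus-ids.abc

# Step 3: mark piece boundaries for GPT-2, assuming pieces are separated by
# blank lines: replace each blank line with the end-of-text marker.
sed 's/^$/<|endoftext|>/' corpus-ids.abc > corpus-gpt2.txt
```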


First Model

Because the Session corpus was so small (just 14MB), I used the smallest available GPT-2, GPT-2-117M, to train on, with standard settings, on one of my Nvidia 1080tis:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src ./ --dataset thesessions-irishabc.txt.npz
    --batch_size 11 --model_name irish --save_every 4000 --sample_every 500 --learning_rate 0.0001
    --run_name irish --memory_saving_gradients --noise 0.01 --val_every 500

Training was fairly easy, taking just a few days at most to train down to a loss of 0.46 (9,742 steps at minibatch n = 5), and I killed it on 2019-10-21 and looked at the random samples. Straightforward success. They struck me as pretty good, aside from generated pieces often having the same title repeated, which apparently was due to The Session posting multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding: “So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven’t found much plagiarism in the tunes themselves.”

I was worried about plagiarism and thought ~0.40 would be safe, but it seemed the music itself was still far from being copied, so I considered further training. Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al. had made seemed like a target, but caused problems when I simply concatenated them in, and I soon discovered why abc2midi now thought all the samples were broken:

allabcwrepeats_parsed_wot: This is version 3 of the dataset from thesession.org. In this version, we transpose all tunes to have the root C, transpose them all to have the root C#, remove the titles, and make new mode tokens, K:maj, K:min, K:dor, and K:mix. There are over 46,000 transcriptions here.

This turns out to be a problem: K:maj, K:min, K:dor, & K:mix all break abc2midi! So I did additional search-and-replace to transform them into valid key signatures like K: Cmaj, K: Cmin, K: Cdor, or K: Cmix.
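That search-and-replace can be done in a single GNU sed pass (a sketch; the output filename is hypothetical), rewriting the four bare mode tokens into the abc2midi-acceptable forms given above:

```shell
# Rewrite bare mode tokens (K:maj, K:min, K:dor, K:mix) into key signatures
# abc2midi accepts (K: Cmaj etc.); output filename is hypothetical.
sed -e 's/^K:\(maj\|min\|dor\|mix\)/K: C\1/' allabcwrepeats_parsed_wot > fixed.abc
```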

Retraining, I discovered 0.40 was far from converged, and with another 13k steps, it could go down to <0.09. Balancing imitation & plagiarism. However, checking random samples by hand, the textual overlap with The Session became particularly large once the loss reached ~0.09 (note that it was not ‘overfitting’ in the standard sense, since the loss was still decreasing on the validation set), so I backed off to a model with ~0.13 loss. This seems to be high-quality without gross plagiarism.

Spaceless Model

I began using that model for the preference-learning work, where I found that preference learning seemed to improve music more than the poetry, so I began focusing on the music.

Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; abc2midi would often warn or even error out on a piece which could be easily fixed by hand by simply removing a few spaces. Anomaly: permanent space-related syntax errors. This was wasting my time during rating, since I couldn’t pick samples with syntax problems (even if they’d otherwise sound good) because I didn’t want to reinforce generation of invalid samples, and also while generating music.

Discussing it with Shawn Presser, whom I was working with simultaneously to train GPT-2-1.5b on poetry, he pointed out that some people, like nostalgebraist, had had some frustrating problems with the standard GPT-2 BPE encoding.

To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 doesn’t just feed in raw characters like a char-RNN does, because that makes every input extremely long. GPT-2 generates space-delimited word fragments. Instead, it tries to ‘chunk’ text into something in-between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple symbols rather than deleted entirely, as word-based encodings must; however, since the default model is trained on English text, chunking is done assuming normal English whitespace, like spaces between words.

Nostalgebraist notes that the actual BPE implementation used is weird and doesn’t act as you’d expect, especially when spaces are involved. So Presser wondered if GPT-2 couldn’t express the syntactically-correct text sans spaces, and that was why the errors were there and stubbornly persisted despite hundreds of ratings which told GPT-2 to stop doing that already.

Checking, the ABC format apparently does not require spaces. Workaround—spaces optional! They are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all spaces, the music should be the same. I was surprised, but this seemed to be true. (Presser did some experiments with creating a brand-new BPE tailored to ABC, and while this would have reduced the BPE size of ABC pieces by >33%, I figured that all the ABC pieces fit into the GPT-2 window anyway and it wasn’t worth the hassle now that we had diagnosed & worked around the problem. He also did some experiments in generating video-game-style music via ABC: he prompted it with chords, and then switched from the default piano instruments/sound samples used by TiMidity++ to instruments like harps for a fantasy feel.)

So, I deleted the spaces with a simple tr -d ' ', re-encoded to NPZ, and retrained the first model to create a new ‘spaceless’ model. This required another few days to re-converge to ~0.13.

The spaceless corpus fixed the invalid-ABC problem: the new model regularly generated ABC samples that triggered neither warnings nor errors, and Presser swore that the perceptual quality was much higher.

Combined Model: The Session +

Presser was interested in expanding the repertoire beyond The Session and began looking at ABC databases. More dakka (data). The biggest by far appeared to be one with n = 290,000 pieces. He scraped a random half of them, for n = 157,240 total, and I combined them with the duplicated The Session dataset, for a total n = 308,280 (n = 205,304 unique; 81MB). The new pieces are much more diverse in formatting & metadata than The Session. Simplifying to match The Session ABC. To homogenize them, I ran all the pieces through abc2abc, and then I deleted some metadata fields that struck me as excessive—commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing—which greatly inflated the loss of the combined dataset compared to the spaceless model. (In total, I filtered out abc2abc-generated warnings starting with %, and the B:/D:/F:/N:/O:/S:/Z:/w: metadata fields.) It would have been nice if the metadata had included genre tags for greater control of conditional pieces, akin to my author-based control for GPT-2 poetry, a technique demonstrated at scale using explicit Reddit metadata, and by Choi et al 2019 using autoencoders to do unsupervised learning of musical features which implicitly covers genre, but alas! We’ll have to stick with the basics like title/key/meter.
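The metadata-stripping step can be sketched as a single grep filter (filenames hypothetical; the field letters are the ones listed above: B: book, D: discography, F: file, N: notes, O: origin, S: source, Z: transcriber, w: lyrics):

```shell
# Drop abc2abc warning/comment lines (starting with '%') and the excessive
# metadata fields B:/D:/F:/N:/O:/S:/Z:/w: before concatenating the corpus.
# Filenames are hypothetical; this folk corpus contains no '%%' MIDI directives,
# so dropping all '%' lines is safe here.
grep -Ev '^(%|[BDFNOSZ]:|w:)' combined.abc > combined-trimmed.abc
```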

This required a full week of training, or 168,810 steps (Dec 1–7), down to a higher loss (as expected) but still on the edge of plagiarism:

Examples of generated ABC (note the lack of spaces):


Last example rendered as a score:

Score for “PolkaEbBbAb(5letras)cf.CGF5-Parts” (an ABC music sample generated by GPT-2-117M trained on a combined ABC dataset)


An ABC sample is not playable on its own; it must be converted to MIDI, and then the MIDI can be played. If one is looking at individual samples being generated by the model, a quick CLI way to play, and then dump to an OGG Vorbis file, might be:

xclip -o | abc2midi - -o /dev/stdout | \
    timidity -A110 -
TARGET="`today`-thelonelyfireside.wav"; xclip -o | abc2midi - -o /dev/stdout | \
    timidity -A110 - -Ow -o "$TARGET" && oggenc -q0 "$TARGET"

Extracting multiple ABC samples, converting, and merging into a single long piece of music is somewhat more challenging, and I reused parts of my preference-learning rating script for that.
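The extraction step can be sketched in awk (a hedged sketch: the dump and output filenames are hypothetical): split a generated text dump on the <|endoftext|> separator lines into individual .abc files, each of which can then be run through abc2midi & timidity and concatenated:

```shell
# Split a GPT-2 sample dump into one ABC file per piece, breaking on the
# '<|endoftext|>' separator lines (filenames hypothetical).
awk '/<\|endoftext\|>/ { n++; next } { print > (sprintf("sample-%04d.abc", n)) }' samples.txt
```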

First Model Samples

  • GPT-2-117M random samples, first model trained on The Session (2019-10-21):
  • “Paddywhack” generated title & sample (2019-10-22):
  • “The Bank of Turf” sample (2019-10-22):
  • “Hickey’s Tune” sample (2019-10-23):
  • “The Loon and his Quine” sample (2019-10-29):
  • “The Atlantic Roar” sample (2019-10-30):
  • “The Lonely Fireside” sample (2019-10-30):
  • “The Marine” sample (2019-10-31):
  • “Whiskey Before Breakfast” sample (2019-10-31):
  • “The Flogging” sample (2019-11-01):
  • “Banks of the Allan” sample (2019-11-03):
  • “A Short Journey” sample (2019-11-04):

Spaceless Model Samples

I enjoyed the model’s renditions of the “Yn Bollan Bane” jig when I came across it, and so I used conditional generation to generate 50 variations on it:

“50 Variants on ‘Yn Bollan Bane’”:

Combined Model Samples

  • 100 random samples from the combined GPT-2-117M:
  • “Invereshie’s House” sample (2019-12-04):
  • “FaroeRum” sample (2020-01-25):


I am pleased with the final generated music; the spaceless & combined-dataset changes definitely improved over the original.

In retrospect, the use of GPT-2-117M was not necessary. Smaller = better (for now). It was so large that overfitting/plagiarism was a concern even with the combined dataset, and the English pretraining was largely useless—all of the generated titles I checked were copied from the training data, and I didn’t observe any interesting new ones. The GPT-2 BPE encoding also proved to be a problem in the end; generating a BPE specifically for an ABC corpus would have avoided that and probably also have improved learning. A smaller GPT-2 with a customized BPE (fewer parameters & attention heads, but more layers, I think, would be better) would have trained much faster & probably given similar or better results.

Transformers = better copying + coherence? Qualitatively, I feel like the pieces have a different feel from char-RNN pieces, in their ability to repeat themes & motifs, and they also seem to have a much better ability to come to an ending, instead of meandering on indefinitely as char-RNN pieces have a tendency to do (even when trained on corpuses with clear end-of-text delimiters), perhaps because it’s easier for a Transformer to ‘count’ and know when a piece has reached a reasonable length, while a char-RNN forgets where it is. Overall, I’d call it a success.

Generating MIDI with 10k–30k Context Windows

To expand the ABC GPT-2 model to cover a wider variety of musical genres, I turn to the next-most compact widespread music encoding format: MIDI. There are hundreds of thousands of MIDIs which can be decompiled to ABC format, averaging ~10k BPEs—within GPT-2-117M’s feasible context window when trained on TPUs (which permit training of context windows up to 30k wide).

We combined the ABC from before with 2 large MIDI datasets converted to ABC, yielding ~453k usable ABC-MIDI musical files (~5.1GB of text). We trained January–April 2020 on our TPU swarm (with many interruptions), achieving a final loss of ~0.2 (underfit).

Sampling from the final model is hit-or-miss, as it is prone to the likelihood repetition trap, and it generates instruments one-by-one, so it is common for instruments to be cut off or otherwise broken during sampling (indicating that sampling is increasingly a bigger problem than training for long-range sequence modeling). However, successful pieces are possible, and are musically far more diverse than the folk ABC corpus, with many pleasingly complex samples.

The logical next step from generating short ABC folk tunes is to generate music in general. MIDI: hard, but not too hard? Since ABC is not much used outside folk music, there are no good large ABC corpuses appropriate for this. Raw music is infeasible at the moment: models like WaveNet can generate excellent raw audio WAVs, but the model sizes prohibit ‘seeing’ more than a few seconds at most, making them capable of playing musical scores fed to them, but not of higher-level composing—struggling to pass more than ~10s even with further tricks. So you could have a GPT-2 generating ABC which is fed into a WaveNet CNN, creating what sounds like real instruments playing real music, but you couldn’t have a single NN doing it all. Intermediate—more powerful than ABC but also not as demanding as raw audio generation—would be generating MIDI.

First: turn binary MIDI into text ABC. We can’t generate raw MIDI since that’s a binary file format which doesn’t play well with most GPT-2 implementations (although there is nothing in principle stopping operation on full binary bytes rather than text bytes). What if we create a text encoding for MIDI and convert? The most straightforward representation, the piano roll, represents MIDI as 128 distinct instruments per timestep (almost all always silent), which leads to a hopelessly enormous sparse matrix. There are several available tools for encoding MIDI more directly than piano rolls, and one of them turns out to be ABC itself!

ABC-MIDI: still too long for GPT-2. The ABC-MIDI encodings are also relatively short. Checking with wc -c, MIDI files typically range in size from 10–50k characters equivalent; even with BPEs potentially saving some space (it’s ~10.7k BPEs per MIDI), GPT-2 simply cannot handle this sort of sequence (with many important long-range dependencies) with its standard context window of 1024 tokens, because it would only be able to see less than a tenth of the music file at a time and it would be completely blind to the rest of it, since it has no memory. If you tried, it would not sound nearly as good as the ABC pieces above, because it would be unable to do all the nifty repetition-with-variation of melodies, overall thematic structure with beginnings & endings, etc; instead, it would probably ramble around, generating plausible music which, however, never goes anywhere and just sort of ends. And, as anyone knows, GPT-2 does not scale to larger attention windows, as self-attention is quadratic in the window length (𝒪(l² · d)), and 1024 is already pushing it for training, motivating the development of more efficient attention mechanisms.

More Dakka

Unless we use brute force. However, when training various GPT-2-1.5b models on our TPU swarm using the TPU research credits granted to us by Google’s TFRC program, we noticed that a TPU can fit a model which uses up to 300GB VRAM before crashing. 300GB—that’s quite a lot. What could you do with that? How bad exactly is GPT-2’s scaling…? Quadratic (𝒪(l²)) ≠ exponential (𝒪(2^l)): if you have some good hardware available, you may be able to push it quite far.

TPUs can train 30k context windows! We tried GPT-2-117M and it turned out a context window of 10,000 worked! Then we tried 12.5k, 15k, 20k, & 30k, and they all worked. (For comparison, Reformer generated great excitement for being able to scale up to 64k windows on single GPUs.) These are wide enough to train ABC-MIDI files, and we could even generate small text-encoded images (since 30k ≈ 173²). 30k is slow, but not hopelessly slow: with GPT-2-117M, we get 4 training steps in 2,700 seconds (n = 1). A swarm of TPUs, like >100, would be able to train on a large corpus in just a few wallclock days. The memory usages are considerable, but not beyond the TPU:

  1. 12.5k = 45GB backprop
  2. 15k = 75GB backprop
  3. 20k = 112GB backprop
  4. 30k = 280GB backprop
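These measurements are roughly consistent with the quadratic cost: scaling the 12.5k/45GB point by (30k/12.5k)² predicts ~259GB at a 30k context, close to the observed 280GB. A back-of-the-envelope check (assuming memory is dominated by the l² attention term):

```shell
# Back-of-the-envelope check that backprop memory scales ~quadratically in
# context length: predict the 30k-context memory from the 12.5k measurement.
awk 'BEGIN { printf "%.0f GB\n", 45 * (30000/12500)^2 }'   # prints "259 GB"
```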

When are 30k context windows useful? For reusing pretrained GPT-2. So, it can be done, but should it be done? Is this more than just a parlor trick reminding us that exponential ≠ quadratic? For the most part, it seems like anyone in need of big context windows is probably better off not renting all these TPUs to train a GPT-2 with 30k context windows, and using one of the many alternatives for long-range dependencies, like Reformer or Compressive Transformers or Sparse Transformers or dynamic convolutions or… But it might be useful in two cases: where one has a model already fully trained, and can do a final finetuning training phase on the original corpus to pick up long-range dependencies, since it’s still feasible to generate with wide context windows on commodity hardware; and where one needs the fully-trained model because the transfer learning is vital, and training one of the alternatives from scratch would nevertheless deliver inferior results. (At least as of 2020-01-30, I know of no publicly-released trained models for any of those alternative architectures which approach the power of GPT-2-1.5b—they are all trained on much smaller datasets.)

MIDI Dataset

To get enough MIDIs to be worth training on, I combined 3 MIDI datasets:

  1. The Session + our partial scrape of the second ABC database, described above

  2. The Lakh MIDI Dataset v0.1:

    The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset…LMD-full: The full collection of 176,581 deduped MIDI files.

  3. MIDI Dataset (Composing.ai)

    The entire dataset, gzipped, can be downloaded at

    The directory structure corresponds to the sources on the Internet from which we grabbed the MIDI files. We deduplicated the dataset and the resulting collection has 77,153 songs.

    • Big_Data_Set: “The Largest MIDI Collection on the Internet”
    • cariart: Ben Burgraff’s Midi Collection
    • download-midi: Pop and Rock genres from DOWNLOAD-MIDI.COM
    • Guitar_midkar.com_MIDIRip: midkar.com
    • ics: Dan Hirschberg at University of California, Irvine
    • lmd_matched: Straight from the LMD-matched dataset from the Lakh project
    • (Now defunct) TV Timewarp

Combined, this yields 537,594 MIDI files (~9GB).

Converting MIDI to ABC

Need compact text versions of MIDIs. The problem with raw MIDI files is that they are binary, not textual. The GPT-2 codebase does not support binary inputs as far as we know, and in any case, text is always much easier to inspect and work with. There are textual formats for MIDI, like hex encoding, and one can use generic binary → text encodings, but I was not too satisfied with them: when I tried them, they often blew up the character count greatly, and even a 30k context window would not be enough.

ABC works well enough. Ironically, ABC turns out to be the answer! The ABC tools ship with midi2abc, which is the inverse of abc2midi, and while it warns that it is a lossy translator, after listening to a few dozen conversions, I have to say that it does a good job overall. It also generates ABC files which are similar in size to the original MIDI files. (I figure that whatever midi2abc loses in having to potentially laboriously encode MIDI constructs, it makes up in being able to take shortcuts using ABC’s higher-level music constructs.) The weaknesses of midi2abc are that it: loses volume control, doesn’t work well on some rock tracks (drums/guitars are hollowed out), doesn’t successfully compile many MIDI files (leaving empty or extremely short ABC files), and produces occasional pathological examples (e.g. one 44kb MIDI decompiled to a 3.4GB ABC file—some exponential blowup?). However, the first two are relatively minor, and the third can be tamed by simply dropping ABC output past 300kb (as hardly any non-pathological MIDI files compiled to >300kb).

Data cleaning. In converting the files, we want to avoid excessive use of newlines (-bpl 999999), we want to delete various warning or error lines, we want to filter comments starting with '^% '—which take up a ton of space and often are just lyrics—and then delete spaces in non-MIDI-directive (^%%) lines, per the ‘spaceless’ model above. The shell pipeline:

## delete odd trash files in LMD/
find . -type f -name '\._*' -delete
## clean up conversions from any previous runs
find . -type f -name "*.abc" -delete

function convertMidi2Abc() { midi2abc -bpl 999999 -nogr "$@" |  \
 egrep -v -e '^% ' -e '^w:' -e 'Missing time signature meta' \
          -e 'All rights reserved' -e 'Error ' -e 'Copyright' | \
 ## ABC MIDI commands start with '%%', but regular comments start with '%', so drop any line with '%' but not '%%' in it:
 sed -e 's/^%[^%].*//g' | \
 ## delete spaces inside non-MIDI commands (MIDI, but not ABC, commands are space-sensitive):
 sed -e '/^%%/! s/ //g' | \
 ## a handful of files are pathological, so truncate:
 head --bytes=300000 > "$@".abc; }
export -f convertMidi2Abc
find . -type f -name "*.mid" | parallel convertMidi2Abc

## below 150 characters means an ABC file probably contains only metadata
find . -type f -name "*.abc" -size -150c -delete

## create training (19/20) and validation (1/20) file lists:
find . -type f -name "*.abc" | shuf > filelist.txt
split -n l/20  filelist.txt split

The file-lists can be used to read in individual files, intersperse <|endoftext|>\n, and write them out as a single GPT-2-formatted text file. A shell loop proved too slow, so I wrote a simple Haskell script:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString as B (concat, readFile, writeFile)
import Control.Monad (liftM)
import Data.List (intersperse)
main = do files <- liftM lines $ Prelude.readFile "training.txt" -- i.e. 'splitaa'
          contents <- mapM B.readFile files
          let contents' = Data.List.intersperse "<|endoftext|>\n" contents
          B.writeFile "foo" (B.concat contents')

5.2GB of ABC. The final results were 453,651 ABC files (~10k BPEs per ABC?):

  1. 2020-03-30-abc-combinedmidi-training.txt: 4.9GB (430,928 ABC pieces)
  2. 2020-03-30-abc-combinedmidi-validation.txt: 263MB (22,723 ABC pieces)

A sample of 50 randomly-chosen (excluding The Session) human-written MIDI files:

50 random MIDIs from the merged dataset

The full dataset (all MIDIs, ABCs, file-lists, and GPT-2-formatted text files) is available for download:

rsync -v rsync:// ./

MIDI Training

Curriculum training. We initially trained with a ~27k context window, but iterations took multiple seconds, and after we took a second look at the ABC conversions and saw that pieces averaged only 10.7k BPEs, we decided that was largely a waste, and switched over to 5k context windows. Due to repeated TPU preemptions, delays in restarting training, and some ill-timed learning-rate decreases, it wasn’t until late March 2020 that we got noticeably good samples, at a loss of ~0.22/10m steps on 2020-03-24. (Thank goodness for our TFRC credits! Even with TPUs covered, the final total cost of the VMs/bandwidth/storage was >$1,000.)

Swarm training loss curve, ~2020-02-20–2020-04-01, ~18.9m steps (repeated interruptions)

Converged—underfit‽ Around April 4, with the 5k context training apparently fully-converged, we switched back to ‘coreless’ mode and 10k context windows and trained to a loss of ~0.20. It was stable at that loss, without any further decrease, for many steps, so, interestingly, I suspect GPT-2-117M turns out to underfit our MIDI dataset while being able to overfit the earlier smaller ABC datasets. If I had known we’d be able to hit a loss of only 0.2, I would’ve started with a larger model than GPT-2-117M, and accepted any necessary limits on context window. Since likelihood loss seems to track quality so closely & the final few increments of loss make a big difference perceptually, it’s possible we could’ve done a lot better.

GPT-2-30k Download

The final model was trained for 30,588,051 steps, and is available for download:

  1. Rsync mirror: rsync -v rsync:// ./ (475M)
  2. Mega

MIDI Generation

Sampling still unsolved—divergence + gibberish. Generating samples from the 10m-step checkpoint proved unusually tricky, despite nucleus sampling, with high risk of repetition or degeneration into pseudo-English lyrics/text. A round of data-cleaning to remove that text fixed the degeneration into pseudo-English, but left the repetition problem: sampling remains prone to generating hundreds of lines of repeating notes like z8|z8|z8|z8| (which is especially slow once the context window fills up). These repetition traps (so common in likelihood-trained NN models) appear to go on indefinitely in my sampling, and high temperature/top_p parameters do not break loops. The sheer length of pieces appears to exacerbate the usual issues with repetition—a sampling process which is fine for a few hundred tokens may nevertheless frequently diverge as required lengths approach 10k. (Given that extremely long sequences of z notes are common in the original ABC-MIDI datasets, this suggests that a custom BPE encoding—unlike in my earlier poetry or The Session datasets—would be helpful in collapsing these common long sequences down to a few BPEs.)
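Since the z8| repetition trap is easy to spot textually, one cheap triage step (a sketch; the file layout and run-length threshold are arbitrary assumptions) is to flag sample files containing long runs of repeated rest bars before bothering to compile them:

```shell
# Flag generated samples stuck in the repetition trap: any file containing
# 8+ consecutive 'z8|' rest bars is almost certainly degenerate
# (threshold & 'samples/' layout are hypothetical).
grep -lE '(z8\|){8}' samples/*.abc
```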

Failure mode: dropping instruments. Strikingly, pieces tended either to be good or to fail entirely, possibly because instruments/tracks are generated one at a time, and syntax errors or early termination of sampling can disable entire instruments, leading to silent tracks and a final music piece which simply repeats endlessly (because the melody & variation would have been provided by a failed or nonexistent later track). To reduce the impact of divergence while trying to avoid truncating potentially-good pieces early, I use the repetition penalty from Nick Walton’s (itself borrowed from ), and set a 10k length limit on samples (corresponding to the average length of an ABC-encoded MIDI), and generate 10 BPEs at a time to speed things up (1 BPE per step would theoretically give better results but is extremely slow):

python3 src/ --top_p 0.95 --penalty 0.65 --prompt "\nX:" --model_name midi \
    --length 10000 --n_ctx 10000 --nsamples 10 --batch_size 1 --step 10 --maxlen 10000
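The repetition penalty itself is simple: discount the logits of tokens that already appear in the generated context before sampling the next token. A minimal NumPy sketch of the CTRL-style version, assuming the conventional penalty > 1 form (the function name is mine; the sampler invoked above uses its own --penalty convention, which may differ):

```python
import numpy as np

def penalize_repeats(logits, generated_ids, penalty=1.3):
    """CTRL-style repetition penalty: discount the logits of tokens
    already present in the generated context, so loops like
    `z8|z8|...` become progressively less likely."""
    logits = logits.copy()
    for i in set(generated_ids):
        # Dividing a positive logit (or multiplying a negative one)
        # by penalty > 1 always lowers that token's probability.
        if logits[i] > 0:
            logits[i] /= penalty
        else:
            logits[i] *= penalty
    return logits
```

In practice this helps suppress short loops but, as noted above, does not reliably break out of the very long repetition traps.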

MIDI Samples

Because of the sampling fragility, there is little point in providing a dump of random samples to listen to (although I provide a text dump); instead, here are hand-selected samples representing roughly the top 5% of samples. All samples are generated using the default TiMidity++ settings, timings, and soundbanks (e.g. no use of harps as in Shawn Presser’s modified samples before).
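Rendering a generated ABC sample to audio takes two standard tools: abc2midi (from the abcMIDI package) compiles the ABC to MIDI, and TiMidity++ synthesizes the MIDI to WAV. A sketch of the pipeline, assuming both tools are installed (the helper name and file paths are illustrative):

```python
import shlex

def render_commands(abc_path, mid_path="out.mid", wav_path="out.wav"):
    """Build the two shell commands to render an ABC sample to audio:
    abc2midi compiles ABC to MIDI, then TiMidity++ synthesizes a WAV
    with its default soundbank (as used for the samples here)."""
    return [
        ["abc2midi", abc_path, "-o", mid_path],
        ["timidity", mid_path, "-Ow", "-o", wav_path],
    ]

for cmd in render_commands("sample.abc"):
    print(shlex.join(cmd))
```

(Pass each command list to subprocess.run to actually execute it.)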

MIDI sample (2020-01-28; from iteration #364086, loss ~0.35); (folk/classical?)

Samples from iteration #10,974,811, loss ~0.22:

  • MIDI sample (2020-03-24); electronica/instrumental
  • MIDI sample (2020-03-24); electronica
  • MIDI sample (2020-03-24); jazz (particularly nice use of a saxophone track)
  • MIDI sample (2020-03-24); classical piano piece
  • MIDI sample (2020-03-25); classical fugue-esque piece (harpsichord or organ?)

From the final models:

  • Big Dataset (notable for its drum+bagpipe-like duet)
  • (2020-04-11); fast jazz saxophone piece?
  • LMD folk-ish piece (heavy on a woodwind/organ?)
  • LMD: catchy ambient-esque piano piece
  • Pop MIDI, rapid jazz piano?
  • Pop MIDI, almost video-game orchestral-like in parts
  • (2020-04-15); LMD piano piece
  • LMD piano piece
  • LMD slow classical
  • Pop MIDI, guitar rock
  • Pop MIDI piano
  • The Sessions jig
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions (Steve Martland-like?)
  • The Sessions
  • The Sessions
  • 2020-04-18 samples: ABC
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Pop
  • Pop
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
Better but frailer. Quality-wise, the MIDI model is capable of much more diverse pieces while still generating good folk music, illustrating the benefits of switching to a much larger dataset. However, the difficulty we had training it means I do not recommend using GPT-2 for such long sequences! (Fortunately, there are many technical solutions to the immediate problem of self-attention’s quadratic cost.) But even after the task of training with wide windows is solved, there is not yet any solution to the problem of sampling. It’s hard not to notice that the good MIDI samples are more repetitive than the small ABC samples; this could be due in part to issues in sampling breaking additional voices & leaving only the backing tracks (which are supposed to be repetitive), and could also reflect greater repetition in MIDI pieces intended to run several minutes as opposed to curated ABC folk music pieces, which tend to be quite short (as performers will vary & repeat as necessary while performing), but one must strongly suspect that the sampling process itself is to blame and that nucleus sampling is far from a solution. Were we to train a larger or longer model which can model tens of thousands of tokens, how would we sample tens of thousands of tokens from it?
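For reference, nucleus (top-p) sampling itself is straightforward: keep the smallest set of most-probable tokens whose cumulative probability exceeds p, renormalize, and sample from that set. A minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def nucleus_sample(logits, top_p=0.95, rng=np.random.default_rng()):
    """Nucleus (top-p) sampling over a 1-D logit vector."""
    # Softmax with the usual max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # size of the nucleus
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()       # renormalize the nucleus
    return int(rng.choice(keep, p=p))
```

The trouble described above is that this per-token truncation says nothing about global structure: nothing stops thousands of individually-plausible tokens from forming one long loop.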

  1. TiMidity++ is a strange tool. The man page boasts of builtin support! Just in case you ever needed to play all MIDI files in a particular newsgroup.↩︎

  2. Using the standard GPT-2 BPE encoding of the ABC-compiled versions—although we did discuss making either MIDI- or ABC-specific BPE encodings for saving perhaps a quarter of space, it would add complexity for users and we weren’t sure how to create a new BPE encoding.↩︎

  3. For example, this motivated OpenAI to develop Sparse Transformers, which tame the scaling by trimming the Transformer attention windows to much smaller than the full window, thereby avoiding the full self-attention quadratic scaling, which enables windows of tens of thousands easily (30k is more than enough to handle most MIDIs), and enables MuseNet to generate MIDIs without a problem. avoid the window problem by preprocessing MIDI into a custom encoding specialized to single piano tracks which is easier to understand, and switching to , which has a limited window but adds on recurrency/memory to maintain coherence.↩︎

  4. This is surprising because you would estimate that TPUs have 16GB per core and 8 cores, so only 128GB VRAM total. But if you avoid that, and use what we call ‘coreless mode’, both TPUv2 and TPUv3 apparently die at around 300GB. This does not appear to be documented anywhere by Google, even though it’s interesting & surprising & useful.↩︎

  5. This is part of how “accidentally quadratic” algorithms can sneak into production systems—quadratic growth is slow enough that for many inputs, fast hardware can still handle it and the software just ‘feels slow’, while for truly exponential growth, it’d immediately hit the wall on realistic inputs and fail.↩︎

  6. Amusingly, OpenAI would later take a similar brute-force approach in training : .↩︎