GPT-2 Folk Music

Generating Irish/folk/classical music in ABC format using GPT-2-117M, with good results.
statistics, NN, shell, GPT, music
2019-11-01–2020-04-25 finished certainty: likely importance: 6

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a ‘The Session’ dataset. A GPT-2 can hypothetically improve on an RNN through better global coherence & copying of patterns, without the problems of the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after working around the encoding issue, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & a second large ABC database. The resulting music samples are in my opinion quite pleasant.

The ABC folk model & dataset are available for download, and I provide selected music samples for listening, as well as medleys of random samples from throughout training.

We followed the ABC folk model with a MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples; but when sampling succeeds, it can generate even better musical samples.

Back in 2015–2016, Bob L. Sturm experimented with generating Irish folk music using a char-RNN trained on a corpus of folk music written in a high-level musical format called ‘ABC notation’. Compact text: perfect for NNs. While ABC notation is written in ASCII, it supports many complex features, and it has been widely adopted by folk musicians, with hundreds of thousands of pieces written or transcribed in it.


Background: folk-RNN

Sturm et al. scraped ~40k ABC files from The Session and trained a char-RNN called “folk-RNN”, putting the code & data online, and providing a web interface for generation. Prior success with char-RNNs. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces, such as how well they are played by human musicians.


2015 was a long time ago, however, and DL has seen a paradigm shift in sequence modeling away from char-RNNs to CNNs and attention-based Transformer models, most famously GPT-2. DL progress. Transformer-based music models, such as OpenAI’s Sparse-Transformer-based MuseNet, have demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.

While messing around with GPT-2 in late October 2019, I became curious whether folk-RNN could be improved by simply throwing one of the GPT-2 models at it. GPT-2: a perfect match. (Not the large ones, of course, which would overfit far too easily or hadn’t been released yet, but GPT-2-117M.) GPT-2 is unable to model raw audio (WAV) or MIDI, because a meaningful musical piece is a WAV sequence of hundreds of thousands to millions of symbols long, and a MIDI piece is tens of thousands of symbols long, which far exceeds GPT-2’s small context window (but see later); this is why OpenAI used Sparse Transformers for its MIDI generation, as Sparse Transformers can scale to text with tens of thousands of characters. However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 window.

I had avoided doing anything music-related with GPT-2, focusing on poetry instead, because I assumed OpenAI would be doing a MuseNet followup; but months later, they’d done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?

As for why repeat Sturm’s project: there were two possible advantages to using GPT-2-117M:

  1. improved global coherency:

    I thought the Transformer might work particularly well on ABC format, because RNNs suffer from persistent ‘forgetting’ issues, where it is difficult for the RNN to persist its memory of past generated sequences, making it hard for an RNN to repeat a theme with variants; a GPT-2 Transformer, by contrast, has a context window of 1024 BPEs, much longer than almost every ABC piece, and so is able to ‘see’ the entire piece simultaneously while generating the next note

  2. English metadata understanding:

    The English pretraining could potentially help by providing semantic understanding of, e.g., the ABC metadata, such as the difference between two pieces titled a ‘jig’ versus a ‘waltz’, or the pseudo-natural-language-ness of the ABC format as a whole.

ABC Data

The Session

So I did apt-get install abcmidi timidity to get the CLI tools to do ABC→MIDI and MIDI→WAV conversion (respectively) and downloaded the folk-RNN repo with its data files. Pipeline: ABC→MIDI→WAV.

The data comes in several formats, from their experiments with changing the notation. I used the original format, with n = 48,064 songs.

The data needed pro­cess­ing for GPT-2 as fol­lows:

  1. there was stray HTML (</html>) which had to be removed.

    I used search-and-replace, and reported the issue.

  2. abc2midi requires every song to have an integer identifier, e.g. X: 48064, to be a valid ABC file which it can compile to MIDI.

    I used an Emacs macro (which can increment an integer 1–48,064) to insert an X: $N before each T: title line; but in retrospect, I could have simply used another search-and-replace to insert X: 1 in front of each piece, as the ID does not have to be unique—we are just satisfying abc2midi, which is a bit picky.

  3. as usual for any neural model like char-RNN or GPT-2, it is important to insert <|endoftext|> markers where relevant, so it understands how to generate separate pieces and avoids ‘run-ons’.

    I used search-and-re­place.

This yielded 14MB of text to train on, which was converted to NPZ as usual.
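
The three preprocessing steps can be sketched in Python (a sketch only; the actual cleanup used search-and-replace & an Emacs macro, and the blank-line piece separator here is an assumption about the file layout):

```python
def preprocess(corpus: str) -> str:
    """Sketch of the three cleanup steps; assumes pieces are separated
    by blank lines, which may not match the real folk-RNN file layout."""
    corpus = corpus.replace("</html>", "")           # 1. strip stray HTML
    pieces = [p.strip() for p in corpus.split("\n\n") if p.strip()]
    numbered = [f"X: {i}\n{p}"                       # 2. insert integer IDs
                for i, p in enumerate(pieces, start=1)]
    return "\n<|endoftext|>\n".join(numbered) + "\n" # 3. separator tokens

print(preprocess("T: A Jig\nM: 6/8\nK: Gmaj\nGAB|\n\nT: A Reel\nK: Dmaj\nDEF|</html>"))
```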


First Model

Because The Session corpus was so small (just 14MB), I used the smallest available GPT-2, GPT-2-117M, and standard settings to train on one of my Nvidia 1080tis:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src ./ --dataset thesessions-irishabc.txt.npz
    --batch_size 11 --model_name irish --save_every 4000 --sample_every 500 --learning_rate 0.0001
    --run_name irish --memory_saving_gradients --noise 0.01 --val_every 500

Training was fairly easy, taking just a few days at most to train down to a loss of 0.46 (9,742 steps at minibatch n = 5), and I killed it on 2019-10-21 and looked at the random samples. Straightforward success. They struck me as pretty good, aside from generated pieces often having the same title repeatedly; apparently The Session posts multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding: “So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven’t found much plagiarism in the tunes themselves.”

I was worried about plagiarism and thought ~0.40 would be safe, but the music itself seemed still far from being copied, so I considered further training. Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al. had made seemed like a target, but they caused problems when I simply concatenated them in, and I soon discovered why abc2midi now thought all the samples were broken:

allabcwrepeats_parsed_wot: This is version 3 of the dataset from thesession.org. In this version, we transpose all tunes to have the root C, transpose them all to have the root C#, remove the titles, and make new mode tokens, K:maj, K:min, K:dor, and K:mix. There are over 46,000 transcriptions here.

This turns out to be a problem: K:maj, K:min, K:dor, and K:mix completely break abc2midi! So I did an additional search-and-replace to transform them into valid key signatures like K: Cmaj, K: Cmin, K: Cdor, and K: Cmix.
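
That fix can be sketched as a small Python filter (a sketch; the actual fix was an editor search-and-replace):

```python
import re

def fix_keys(abc: str) -> str:
    # Rewrite the bare mode tokens that break abc2midi into full key
    # signatures, pinning the root to C as in the transposed dataset.
    return re.sub(r"^K:(maj|min|dor|mix)$", r"K: C\1", abc, flags=re.M)

print(fix_keys("K:min"))  # → "K: Cmin"
```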

Retraining, I discovered 0.40 was far from converged; with another 13k steps, it could go down to <0.09. Balancing imitation & plagiarism. However, checking random samples by hand, the textual overlap with The Session became particularly large once the loss reached ~0.09 (note that it was not ‘overfitting’ in the standard sense, since the loss was still decreasing on the validation set), so I backed off to a model with ~0.13 loss. This seems to be high-quality without gross plagiarism.

Spaceless Model

I began using that model for the preference-learning work, where I found that preference learning seemed to improve music more than poetry, so I began focusing on the music.

Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; abc2midi would often warn or even error out on a piece which could be easily fixed by hand simply by removing a few spaces. Anomaly: permanent space-related syntax errors. This wasted my time both during rating, since I couldn’t pick samples with syntax problems (even if they’d otherwise sound good) because I didn’t want to reinforce generation of invalid samples, and also while generating music.

Discussing it with Shawn Presser, with whom I was simultaneously working to train GPT-2-1.5b on poetry, he pointed out that some people, like nostalgebraist, had had frustrating problems with the standard GPT-2 BPE encoding.

To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 doesn’t feed in raw characters like a char-RNN does, because that makes every input extremely long. GPT-2 generates space-delimited word fragments. Instead, it tries to ‘chunk’ characters into something in-between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple of symbols rather than deleted entirely, as word-based encodings must do; however, since the default model was trained on English text, the chunking assumes normal English whitespace, like spaces between words.

Nostalgebraist notes that the actual BPE implementation used is weird and doesn’t act as you’d expect, especially when spaces are involved. So Presser wondered if GPT-2 couldn’t express the syntactically-correct text sans spaces, and that was why the errors were there and stubbornly persisted despite hundreds of ratings telling GPT-2 to stop doing that already.
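
To illustrate the whitespace assumption, here is a simplified stand-in for GPT-2’s pre-tokenizer (the real pattern uses Unicode character classes via the third-party regex module; this ASCII version is only illustrative of how spaces glue onto the following chunk):

```python
import re

# ASCII approximation of GPT-2's pre-tokenization pattern: letters,
# digits, and punctuation runs, each optionally absorbing a leading space.
PAT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    return PAT.findall(text)

print(pretokenize("de fa [A,2D2] |"))  # spaces attach to the next chunk
print(pretokenize("defa[A,2D2]|"))     # spaceless ABC chunks differently
```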

Checking, the ABC format apparently does not require spaces. Workaround: spaces optional! They are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all spaces, the music should be the same. I was surprised, but this seemed to be true. (Presser did some experiments with creating a brand-new BPE tailored to ABC, and while this would have reduced the BPE size of ABC pieces by >33%, I figured that all the ABC pieces fit into the GPT-2 window anyway and it wasn’t worth the hassle now that we had diagnosed & worked around the problem. He also did some experiments in generating video-game-style music via ABC: he prompted it with chords, and then switched from the default piano instruments/sound samples used by TiMidity++ to instruments like harps for a fantasy feel.)

So, I deleted the spaces with a simple tr -d ' ', re-encoded to NPZ, and retrained the first model to create a new ‘spaceless’ model. This required another few days to re-converge to ~0.13.
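
In practice a blunt tr -d ' ' sufficed, but a more careful variant would preserve spaces in the metadata fields (a hypothetical sketch, not what was actually run):

```python
import re

def strip_spaces(piece: str) -> str:
    # Keep spaces in header/metadata lines (T:, K:, M:, ...), but delete
    # them from the music body, where ABC treats them as purely cosmetic.
    out = []
    for line in piece.splitlines():
        if re.match(r"^[A-Za-z]:", line):  # metadata field line
            out.append(line)
        else:
            out.append(line.replace(" ", ""))
    return "\n".join(out)

print(strip_spaces("T: The Lonely Fireside\nK: Gmaj\nGA B|c d e|"))
```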

The spaceless corpus fixed the invalid-ABC problem: the new model regularly generated ABC samples that triggered neither warnings nor errors, and Presser swore that the perceptual quality was much higher.

Combined Model: The Session + ABC Database

Presser was interested in expanding the repertoire beyond The Session and began looking at ABC databases. More dakka (data). The biggest by far appeared to be an online ABC database with n = 290,000 pieces. He scraped a random half of them, for n = 157,240 total, and I combined them with the duplicated Session dataset, for a total n = 308,280 (n = 205,304 unique; 81MB). The new pieces are much more diverse in formatting & metadata than The Session. Simplifying to match The Session ABC. To homogenize them, I ran all the pieces through abc2abc, and then deleted some metadata fields that struck me as excessive (commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing), which had greatly inflated the loss of the combined dataset compared to the spaceless model. (In total, I filtered out abc2abc-generated warnings starting with %, and B:/D:/F:/N:/O:/S:/Z:/w: metadata.) It would have been nice if the metadata had included genre tags for greater control of conditional pieces, akin to my author-based control for GPT-2 poetry—a technique demonstrated at scale using explicit Reddit metadata, and by Choi et al 2019 using autoencoders to do unsupervised learning of musical features which implicitly covers genre—but alas! We’ll have to stick with the basics like title/key/meter.

This required a full week of training, or 168,810 steps (1–7 Dec), to reach a higher loss (as expected) but still on the edge of plagiarism:

Examples of generated ABC (note the lack of spaces):


Last example rendered as a score:

Score for “PolkaEbBbAb(5letras)cf.CGF5-Parts” (an ABC music sample generated by GPT-2-117M trained on a combined ABC dataset)


An ABC sample is not playable on its own; it must be converted to MIDI, and then the MIDI can be played. If one is looking at individual samples generated by the model, a quick CLI way to play one, and then dump it to an OGG Vorbis file, might be:

## play ABC from the X clipboard directly:
xclip -o | abc2midi - -o /dev/stdout | \
    timidity -A110 -
## or render it to WAV and then compress to OGG Vorbis:
TARGET="`today`-thelonelyfireside.wav"; xclip -o | abc2midi - -o /dev/stdout | \
    timidity -A110 - -Ow -o "$TARGET" && oggenc -q0 "$TARGET"

Extracting multiple ABC samples, converting, and merging them into a single long piece of music is somewhat more challenging, and I reused parts of my preference-learning rating script for that.
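
A minimal sketch of that workflow (assuming abc2midi & timidity on $PATH; the helper names are hypothetical, not the actual rating script, and merging the WAVs is left out):

```python
import subprocess, tempfile

def split_samples(dump: str):
    """Split a GPT-2 sample dump on its <|endoftext|> separators."""
    return [s.strip() for s in dump.split("<|endoftext|>") if s.strip()]

def render(piece: str, wav_path: str):
    """ABC → MIDI → WAV via abc2midi & timidity."""
    with tempfile.NamedTemporaryFile("w", suffix=".abc", delete=False) as f:
        f.write(piece)
        abc_path = f.name
    subprocess.run(["abc2midi", abc_path, "-o", abc_path + ".mid"], check=True)
    subprocess.run(["timidity", "-A110", abc_path + ".mid", "-Ow", "-o", wav_path],
                   check=True)
```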

First Model Samples

  • GPT-2-117M random samples, first model trained on The Session (2019-10-21):
  • “Paddywhack” generated title & sample (2019-10-22):
  • “The Bank of Turf” sample (2019-10-22):
  • “Hickey’s Tune” sample (2019-10-23):
  • “The Loon and his Quine” sample (2019-10-29):
  • “The Atlantic Roar” sample (2019-10-30):
  • “The Lonely Fireside” sample (2019-10-30):
  • “The Marine” sample (2019-10-31):
  • “Whiskey Before Breakfast” sample (2019-10-31):
  • “The Flogging” sample (2019-11-01):
  • “Banks of the Allan” sample (2019-11-03):
  • “A Short Journey” sample (2019-11-04):

Spaceless Model Samples

I enjoyed the model’s renditions of the “Yn Bollan Bane” jig when I came across it, and so I used conditional generation to generate 50 variations on it:

“50 Variants on ‘Yn Bollan Bane’”:

Combined Model Samples

  • 100 random samples from the combined GPT-2-117M model:
  • “Invereshie’s House” sample (2019-12-04):
  • “FaroeRum” sample (2020-01-25):


I am pleased with the final generated music; the spaceless & combined-dataset changes definitely improved over the original.

In retrospect, the use of GPT-2-117M was not necessary. Smaller = better (for now). It was so large that overfitting/plagiarism was a concern even with the combined dataset, and the English pretraining was largely useless: all of the generated titles I checked were copied from the training data, and I didn’t observe any interesting new ones. The GPT-2 BPE encoding also proved to be a problem in the end; generating a BPE specifically for an ABC corpus would have avoided that and probably also improved learning. A smaller GPT-2 with a customized BPE (fewer parameters & attention heads, but more layers, I think, would be better) would have trained much faster & probably given similar or better results.

Transformers = better copying + coherence? Qualitatively, I feel like the pieces have a different feel from char-RNN pieces, in their ability to repeat themes & motifs, and they also seem to have a much better ability to come to an ending, instead of meandering on indefinitely as char-RNN pieces have a tendency to do (even when trained on corpuses with clear end-of-text delimiters), perhaps because it’s easier for a Transformer to ‘count’ and know when a piece has reached a reasonable length, while a char-RNN forgets where it is. Overall, I’d call it a success.

Generating MIDI with 10k–30k Context Windows

To expand the ABC GPT-2 model to cover a wider variety of musical genres, I turned to the next-most compact widespread music encoding format: MIDI. There are hundreds of thousands of MIDIs which can be decompiled to ABC format, averaging ~10k BPEs: within GPT-2-117M’s feasible context window when trained on TPUs (which permit training of context windows up to 30k wide).

We compiled the ABC from before and 2 large MIDI datasets, converted them to ABC, and yielded ~453k usable ABC-MIDI musical files (~5.1GB of text). We trained January–April 2020 on our TPU swarm (with many interruptions), achieving a final loss of ~0.2 (underfit).

Sampling from the final model is hit-or-miss, as it is prone to the likelihood repetition trap; and it generates instruments one by one, so it is common for instruments to be cut off or otherwise broken during sampling (indicating that sampling is increasingly a bigger problem than training for long-range sequence modeling). However, successful pieces are possible, and are musically far more diverse than the folk ABC corpus, with many pleasingly complex samples.

The logical next step from generating short ABC folk tunes is to generate music in general. MIDI: hard, but not too hard? Since ABC is not much used outside folk music, there are no good large ABC corpuses appropriate for this. Raw music is infeasible at the moment: models like WaveNet can generate excellent raw audio WAVs, but the model sizes prohibit ‘seeing’ more than a few seconds at most, making them capable of playing musical scores fed to them, but not of higher-level composing, struggling to get past more than ~10s of coherence. So you could have a GPT-2 generating ABC which is fed into a WaveNet CNN, creating what sounds like real instruments playing real music, but you couldn’t have a single NN doing it all. Intermediate, more powerful than ABC but not as demanding as raw audio generation, would be generating MIDI.

First: turn binary MIDI into text ABC. We can’t generate raw MIDI since that’s a binary file format which doesn’t play well with most GPT-2 implementations (although there is nothing in principle stopping operation on full binary bytes rather than text bytes). What if we create a text encoding for MIDI and convert? The most straightforward representation, the piano roll, represents MIDI as 128 distinct instruments per timestep (almost all always silent), which leads to a hopelessly enormous sparse matrix. There are several available tools for encoding MIDI more directly than piano rolls, and one of them turns out to be ABC itself!

ABC-MIDI: still too long for GPT-2. The ABC encodings of MIDI are relatively short, but still too long. Checking with wc -c, MIDI files typically range in size from 10–50k characters equivalent; even with BPEs potentially saving some space (it’s ~10.7k BPEs per MIDI), GPT-2 simply cannot handle this sort of sequence (with many important long-range dependencies) with its standard context window of 1024 tokens: it would only be able to see less than a tenth of the music file at a time, and it would be completely blind to the rest, since it has no memory. If you tried, it would not sound nearly as good as the ABC pieces above, because it would be unable to do all the nifty repetition-with-variation of melodies, or the overall thematic structure with beginnings & endings; instead, it would probably ramble around, generating plausible music which, however, never goes anywhere and just sort of ends. And GPT-2 does not scale to larger attention windows, as self-attention is quadratic in the window length, and 1024 is already pushing it for training, motivating the many more-efficient attention mechanisms.

More Dakka

Unless we use brute force. However, when training various GPT-2-1.5b models on our TPU swarm using the TPU research credits granted to us by Google’s TFRC program, we noticed that a TPU can fit a model which uses up to 300GB VRAM before crashing. 300GB: that’s quite a lot. What could you do with that? How bad exactly is GPT-2’s scaling? Quadratic ≠ exponential: if you have some good hardware available, you may be able to push it quite far.

TPUs can train 30k context windows! We tried GPT-2-117M, and it turned out a context window of 10,000 worked! Then we tried 12.5k, 15k, 20k, & 30k, and they all worked. (For comparison, Reformer generated great excitement for being able to scale up to 64k windows on single GPUs.) These are wide enough to train ABC-MIDI files on, and we could even generate small text-encoded images (since 30k ≈ 173²). 30k is slow, but not hopelessly slow: with GPT-2-117M, we get 4 training steps in 2,700 seconds (n = 1). A swarm of TPUs, like >100, would be able to train on a large corpus in just a few wallclock days. The memory usages are considerable, but not beyond the TPU:

  1. 12.5k = 45GB back­prop
  2. 15k = 75GB back­prop
  3. 20k = 112GB back­prop
  4. 30k = 280GB back­prop
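
The quadratic cost is easy to see concretely with a back-of-the-envelope sketch (GPT-2-117M’s 12 layers × 12 heads; fp32 attention matrices only, so the absolute numbers are illustrative, and real usage, as the backprop figures above show, depends heavily on the implementation):

```python
def attn_matrices_gib(n_ctx, n_layer=12, n_head=12, bytes_per=4):
    # Memory for the raw n×n attention matrices alone, one per head per
    # layer, in GiB; ignores activations, gradients, rematerialization.
    return n_ctx ** 2 * n_layer * n_head * bytes_per / 2 ** 30

print(attn_matrices_gib(1024))   # baseline GPT-2 window: ~0.56 GiB
print(attn_matrices_gib(30000))  # 30k window: hundreds of times larger
```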

When are 30k context windows useful? For reusing pretrained GPT-2. So, it can be done, but should it be done? Is this more than just a parlor trick reminding us that exponential ≠ quadratic? For the most part, anyone in need of big context windows is probably better off not renting all these TPUs to train a GPT-2 with 30k context windows, and instead using one of the many alternatives for long-range dependencies, like Reformer or Compressive Transformers or Sparse Transformers or dynamic convolutions. But it might be useful in two cases: where one has a model already fully trained, and can do a final finetuning training phase on the original corpus to pick up long-range dependencies, since it is still feasible to generate with wide context windows on commodity hardware; and where one needs the fully-trained model because the transfer learning is vital, and training one of the alternatives from scratch would nevertheless deliver inferior results. (At least as of 2020-01-30, I know of no publicly-released trained models for any of those alternative architectures which approach the power of GPT-2-1.5b; they are all trained on much smaller datasets.)

MIDI Dataset

To get enough MIDIs to be worth training on, I combined 3 MIDI datasets:

  1. The Sessions + our partial scrape of the ABC database, described above

  2. The Lakh MIDI Dataset v0.1:

    The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset… LMD-full: The full collection of 176,581 deduped MIDI files.

  3. MIDI Dataset (Compos…)

    The entire dataset, gzipped, can be downloaded at

    The directory structure corresponds to the sources on the Internet from which we grabbed the MIDI files. We deduplicated the dataset and the resulting collection has 77,153 songs.

    • Big_Data_Set: “The Largest MIDI Collection on the Internet”
    • cariart: Ben Burgraff’s Midi Collection
    • download-midi: Pop and Rock genres from DOWNLOAD-MIDI.COM
    • Guitar_midkar.com_MIDIRip: midkar.com
    • ics: Dan Hirschberg at University of California, Irvine
    • lmd_matched: Straight from the LMD-matched dataset from the Lakh project
    • (Now defunct) TV Timewarp

Combined, this yields 537,594 MIDI files (~9GB).

Converting MIDI to ABC

Need compact text versions of MIDIs. The problem with raw MIDI files is that they are binary, not textual. The GPT-2 codebase does not support binary inputs as far as we know, and in any case, text is always much easier to inspect and work with. There are textual formats for MIDI, like hex encoding, and one can use generic binary→text encodings, but I was not too satisfied with them: when I tried them, they often blew up the character count greatly, and even a 30k context would not be enough.

ABC works well enough. Ironically, ABC turns out to be the answer! The ABC tools ship with midi2abc, the inverse of abc2midi, and while it warns that it is a lossy translator, after listening to a few dozen conversions, I would say that it does a good job overall. It also generates ABC files which are similar in size to the original MIDI files. (I figure that whatever midi2abc loses in having to potentially laboriously encode MIDI constructs, it makes up in being able to take shortcuts using ABC’s higher-level music constructs.) The weaknesses of midi2abc are that it loses volume control, doesn’t work well on some rock tracks (drums/guitars are hollowed out), doesn’t successfully compile many MIDI files (leaving empty or extremely short ABC files), and there are occasional pathological examples (e.g. one 44kb MIDI decompiled to a 3.4GB ABC file; some exponential blowup?). However, the first two are relatively minor, and the third can be tamed by simply dropping ABC output past 300kb (as hardly any non-pathological MIDI files compiled to >300kb).

Data cleaning. In converting the files, we want to avoid excessive use of newlines (-bpl 999999), delete various warning or error lines, filter comments starting with '^% ' (which take up a ton of space and often are just lyrics), and then delete spaces in non-MIDI-directive (^%%) lines, per the ‘spaceless’ model above. The shell pipeline:

## delete odd trash files in LMD/
find . -type f -name '\._*' -delete
## clean up conversions from any previous runs
find . -type f -name "*.abc" -delete

function convertMidi2Abc() { midi2abc -bpl 999999 -nogr "$@" |  \
 egrep -v -e '^% ' -e '^w:' -e 'Missing time signature meta' \
          -e 'All rights reserved' -e 'Error ' -e 'Copyright' | \
 ## ABC MIDI commands start with '%%', but regular comments start with '%', so drop any line with '%' but not '%%' in it:
 sed -e 's/^%[^%].*//g' | \
 ## delete spaces inside non-MIDI commands (MIDI, but not ABC, commands are space-sensitive):
 sed -e '/^%%/! s/ //g' | \
 ## a handful of files are pathological, so truncate:
 head --bytes=300000 > "$@".abc; }
export -f convertMidi2Abc
find . -type f -name "*.mid" | parallel convertMidi2Abc

## below 150 characters means an ABC file probably contains only metadata
find . -type f -name "*.abc" -size -150c -delete

## create training (19/20) and validation (1/20) file lists:
find . -type f -name "*.abc" | shuf > filelist.txt
split -n l/20  filelist.txt split

The file-lists can be used to read in individual files, intersperse <|endoftext|>\n, and write them out as a single GPT-2-formatted text file. A shell loop proved too slow, so I wrote a simple Haskell script:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString as B (concat, readFile, writeFile)
import Control.Monad (liftM)
import Data.List (intersperse)
main = do files <- liftM lines $ Prelude.readFile "training.txt" -- ie 'splitaa'
          contents <- mapM B.readFile files
          let contents' = Data.List.intersperse "<|endoftext|>\n" contents
          B.writeFile "foo" (B.concat contents')

5.2GB of ABCs. The final results were 453,651 ABC files (~10k BPE/ABC?):

  1. 2020-03-30-abc-combinedmidi-training.txt: 4.9GB (430,928 ABC pieces)
  2. 2020-03-30-abc-combinedmidi-validation.txt: 263MB (22,723 ABC pieces)

A sample of 50 randomly-chosen (excluding The Sessions) human-written MIDI files:

50 random MIDIs from the merged dataset

The full dataset (all MIDIs, ABCs, file-lists, and GPT-2-formatted text files) is available for download:

rsync -v rsync:// ./

MIDI Training

Curriculum training. We initially trained with a ~27k context window, but iterations took multiple seconds; after we took a second look at the ABC conversions and saw that pieces averaged only 10.7k BPEs, we decided that was largely a waste, and switched over to 5k context windows. Due to repeated TPU preemptions, delays in restarting training, and some ill-timed learning-rate decreases, it wasn’t until late March 2020 that we got noticeably good samples, at a loss of ~0.22/10m steps on 2020-03-24. (Thank goodness for our TFRC credits! Even with the TPUs covered, the final total cost of the VMs/bandwidth/storage was >$1000.)

Swarm training loss curve, ~2020-02-20–2020-04-01, ~18.9m steps (repeated interruptions)

Converged, yet underfit‽ Around 4 April, with the 5k-context training apparently fully-converged, we switched back to ‘coreless’ mode and 10k context windows and trained to a loss of ~0.20. It was stable at that loss, without any further decrease, for many steps; so, interestingly, I suspect GPT-2-117M turns out to underfit our MIDI dataset while being able to overfit the earlier smaller ABC datasets. If I had known we’d be able to hit a loss of only 0.2, I would’ve started with a larger model than GPT-2-117M, and accepted any necessary limits on context window. Since likelihood loss seems to track quality so closely & the final few increments of loss make a big difference perceptually, it’s possible we could’ve done a lot better.

GPT-2-30k Download

The final model was trained for 30,588,051 steps, and is available for download:

  1. Rsync mirror: rsync -v rsync:// ./ (475M)
  2. Mega

MIDI Generation

Sampling still unsolved: divergence + gibberish. Generating samples from the 10m-step checkpoint proved unusually tricky, despite nucleus sampling, with a high risk of repetition or degeneration into pseudo-English lyrics/text. A further round of data-cleaning fixed the degeneration into pseudo-English, but left the repetition problem: sampling remains prone to generating hundreds of lines of repeating notes like z8|z8|z8|z8| (which is especially slow once the context window fills up). These repetition traps (so common in likelihood-trained NN models) appear to go on indefinitely in my sampling, and high temperature/top_p parameters do not break the loops. The sheer length of pieces appears to exacerbate the usual issues with repetition: a sampling process which is fine for a few hundred tokens may nevertheless frequently diverge as required lengths approach 10k. (Given that extremely long sequences of z notes are common in the original ABC-MIDI datasets, a custom BPE encoding—unlike in my earlier poetry or The Sessions datasets—would be helpful in collapsing these common long sequences down to a few BPEs.)

Failure mode: dropping instruments. Strikingly, pieces tended to be either good or to fail entirely, possibly because instruments/tracks are generated one at a time, and syntax errors or early termination of sampling can disable entire instruments, leading to silent tracks and a final music piece which simply repeats endlessly (because the melody & variation would have been provided by a failed or nonexistent later track). To reduce the impact of divergence while trying to avoid truncating potentially-good pieces early, I use the repetition penalty from Nick Walton’s AI Dungeon 2 (itself borrowed from CTRL), set a 10k length limit on samples (corresponding to the average length of an ABC-encoded MIDI), and generate 10 BPEs at a time to speed things up (1 BPE per step would theoretically give better results but is extremely slow):

python3 src/ --top_p 0.95 --penalty 0.65 --prompt "\nX:" --model_name midi \
    --length 10000 --n_ctx 10000 --nsamples 10 --batch_size 1 --step 10 --maxlen 10000
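The CTRL-style repetition penalty being invoked by --penalty can be sketched in a few lines (a minimal stand-alone version; the function name and exact arithmetic are illustrative assumptions, as the AI Dungeon variant differs in detail from the original CTRL formulation):

```python
import math

def penalized_softmax(logits, generated, theta=1.3):
    """CTRL-style repetition penalty: before the softmax, discount
    the logit of every previously-generated token by theta
    (divide positive logits, multiply negative ones, so the
    penalty always lowers that token's probability)."""
    adjusted = [(x / theta if x > 0 else x * theta) if tok in generated else x
                for tok, x in enumerate(logits)]
    m = max(adjusted)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# A token already emitted (id 0 here) becomes less likely to repeat:
plain     = penalized_softmax([2.0, 1.0, 0.5], generated=set())
penalized = penalized_softmax([2.0, 1.0, 0.5], generated={0})
```

As the z8| loops show, a static penalty is a blunt instrument: it cannot distinguish musically-intended repetition (backing tracks, refrains) from degenerate loops.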

MIDI Samples

Because of the sampling fragility, there is little point in providing a dump of random samples to listen to (although I provide a text dump); instead, here are hand-selected samples representing roughly the top 5% of samples. All samples are generated using the default TiMidity settings, timings, and soundbanks (eg no use of harps like Shawn Presser's modified samples before).

MIDI sample (2020-01-28; from iteration #364086, loss ~0.35); (folk/classical?)

Samples from iteration #10,974,811, loss ~0.22:

  • MIDI sample (2020-03-24); electronica/instrumental
  • MIDI sample (2020-03-24); electronica
  • MIDI sample (2020-03-24); jazz (particularly nice use of a saxophone track)
  • MIDI sample (2020-03-24); classical piano piece
  • MIDI sample (2020-03-25); classical fugue-esque piece (harpsichord or organ?)

From the final models:

  • Big Dataset (notable for its drum+bagpipe-like duet)
  • (2020-04-11); fast jazz saxophone piece?
  • LMD folk-ish piece (heavy on a woodwind/organ?)
  • LMD: catchy ambient-esque piano piece
  • Pop MIDI, rapid jazz piano?
  • Pop MIDI, almost video-game orchestral-like in parts
  • (2020-04-15); LMD piano piece
  • LMD piano piece
  • LMD slow classical
  • Pop MIDI, guitar rock
  • Pop MIDI piano
  • The Sessions jig
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions (-like?)
  • The Sessions
  • The Sessions
  • 2020-04-18 samples: ABC
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Big Dataset
  • Pop
  • Pop
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions
  • The Sessions

Better but frailer. Quality-wise, the MIDI model is capable of much more diverse pieces while still generating good folk music, illustrating the benefits of switching to a much larger dataset, although the difficulty we had training it means I do not recommend using GPT-2 for such long sequences! (Fortunately, there are many technical solutions to the immediate problem of self-attention's quadratic cost.) But even after the task of training with wide windows is solved, there is not yet any solution to the problem of sampling. It is hard not to notice that the good MIDI samples are more repetitive than the small ABC samples; this could be due in part to issues in sampling breaking additional voices & leaving only the backing tracks (which are supposed to be repetitive), and could also reflect greater repetition in MIDI pieces intended to run several minutes, as opposed to curated ABC folk music pieces, which tend to be quite short (as performers will vary & repeat as necessary while performing), but one must strongly suspect that the sampling process itself is to blame and that nucleus sampling is far from a solution. Were we to train a larger or longer model which could handle tens of thousands of tokens, how would we sample tens of thousands of tokens from it?
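To make the quadratic cost concrete (a back-of-the-envelope sketch, assuming GPT-2-117M's 12 attention heads and 4-byte floats, and counting only the attention-score matrices): one layer's dense attention scores balloon from ~48MiB at the standard 1,024-token window to ~40GiB at the 30k needed for long MIDIs.

```python
def attn_matrix_bytes(n_ctx, n_heads=12, bytes_per_float=4):
    """Memory for one layer's dense self-attention scores:
    one n_ctx x n_ctx float per head (ignoring activations,
    gradients, and every other layer of the network)."""
    return n_heads * n_ctx * n_ctx * bytes_per_float

for n in (1024, 10_000, 30_000):
    print(f"n_ctx={n:>6}: {attn_matrix_bytes(n) / 2**30:.2f} GiB per layer")
```

Doubling the window quadruples this cost, which is why sparse or recurrent attention variants are needed at these lengths.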

  1. TiMidity++ is a strange tool. The man page boasts of builtin newsgroup support! Just in case you ever needed to play all MIDI files in a particular newsgroup.↩︎

  2. Using the standard GPT-2 BPE encoding of the ABC-compiled versions; although we did discuss making either a MIDI- or ABC-specific BPE encoding to save perhaps a quarter of space, it would add complexity for users and we weren't sure how to create a new BPE encoding.↩︎

  3. For example, this motivated OpenAI to develop Sparse Transformers, which tame the scaling by trimming the Transformer attention windows to much smaller than the full window, thereby avoiding the full self-attention quadratic scaling; this enables windows of tens of thousands easily (30k is more than enough to handle most MIDIs), and enables MuseNet to generate MIDIs without a problem. Other approaches avoid the window problem by preprocessing MIDI into a custom encoding specialized to single piano tracks, which is easier to understand, and by switching to Transformer-XL, which has a limited window but adds on recurrency/memory to maintain coherence.↩︎

  4. This is surprising because you would estimate that TPUs have 16GB per core and 8 cores, so only 128GB VRAM total. But if you avoid that, and use what we call ‘coreless mode’, both TPUv2 and TPUv3 apparently die at around 300GB. This does not appear to be documented anywhere by Google, even though it's interesting & surprising & useful.↩︎

  5. This is part of how “accidentally quadratic” algorithms can sneak into production systems: quadratic growth is slow enough that for many inputs, fast hardware can still handle it and the software just ‘feels slow’, while truly exponential growth would immediately hit the wall on realistic inputs and fail.↩︎
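The gap the footnote describes is easy to quantify (illustrative step counts only, not tied to any particular algorithm here): at a realistic input size a quadratic algorithm is merely sluggish, while an exponential one has already hit the wall at tiny inputs.

```python
n = 10_000
quadratic = n * n          # 1e8 steps: noticeable, but survivable on modern hardware
exponential_60 = 2 ** 60   # at just n = 60, already >1e18 steps: an immediate wall

print(f"quadratic   at n = {n:,}: {quadratic:,} steps")
print(f"exponential at n = 60:     {exponential_60:,} steps")
```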

  6. Amusingly, OpenAI would later take a similar brute force approach in training : .↩︎