GPT-2 Folk Music

Generating Irish and folk music in ABC format using GPT-2-117M, with good results.
statistics, NN, shell, GPT, music
1 Nov 2019–17 Jan 2020 · finished · certainty: likely · importance: 6

In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a dataset from ‘The Session’. A GPT-2 can hypothetically improve on an RNN via better global coherence & copying of patterns, without the problems of the hidden-state bottleneck.

I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that, I successfully trained it on n=205,304 ABC music pieces taken from The Session & a second scraped ABC collection. The resulting music samples are, in my opinion, quite pleasant.

The model & dataset are available for download, and I provide for listening selected music samples as well as medleys of random samples from throughout training.

Back in 2015–2016, Bob L. Sturm experimented with generating Irish folk music using a char-RNN trained on a corpus of folk music written in a high-level musical format called ABC notation. Compact text—perfect for NNs. While ABC notation is written in ASCII, it supports many complex features, and it has been adopted widely by folk musicians, with hundreds of thousands of pieces written/transcribed in it.


Background: folk-RNN

Sturm et al scraped ~40k ABC files from The Session and trained a char-RNN called “folk-RNN”, putting the code & data online, and providing a web interface for generation. Prior success with char-RNNs. In addition to the various research publications, Sturm has also written many blog posts evaluating folk-RNN pieces, such as how well they’re played by human musicians.


2015 was a long time ago, however, and DL has seen a paradigm shift in sequence modeling away from char-RNNs to CNNs and attention-based Transformer models—most famously, GPT-2. DL progress. Transformer-based music models, most notably OpenAI’s Sparse Transformer-based MuseNet, have demonstrated excellent results in music composition at various timescales/formats, and interesting features like mixing genres.

While messing around with GPT-2 poetry in late October 2019, I became curious whether folk-RNN could be improved by simply throwing one of the GPT-2 models at it. GPT-2: a perfect match. (Not the large ones, of course, which would overfit far too easily or hadn’t been released yet, but GPT-2-117M.) GPT-2 is unable to model raw WAV audio or MIDI, because a meaningful musical piece is a WAV sequence of hundreds of thousands to millions of symbols long, and a MIDI piece is tens of thousands of symbols long, both of which far exceed GPT-2’s small context window; this is why OpenAI used Sparse Transformers for its MIDI generation, as Sparse Transformers can scale to sequences of tens of thousands of tokens. However, the high-level notation of ABC pieces means they fit just fine into the GPT-2 window.

I had avoided doing anything musical with GPT-2, focusing instead on poetry, because I assumed OpenAI would be doing a MuseNet followup; but months later, they’d done nothing further, and when I inquired, I got the impression that their music projects were over. So why not?

As for why repeat Sturm’s project—there were two possible advantages to using GPT-2-117M:

  1. improved global coherency:

    I thought the Transformer might work particularly well on ABC format, because RNNs suffer from persistent ‘forgetting’ issues, where it is difficult for the RNN to persist its memory of past generated sequences, making it hard for an RNN to repeat a theme with variants, while a GPT-2 Transformer has a context window of 1024 BPEs—much longer than almost every ABC piece—and so is able to ‘see’ the entire piece simultaneously while generating the next note.

  2. English metadata understanding:

    The English pretraining could potentially help by providing semantic understanding of eg the ABC metadata, such as the difference between two pieces titled a ‘jig’ versus a ‘waltz’, or the pseudo-natural-language-ness of the ABC format as a whole.

ABC Data

The Session

So I did apt-get install abcmidi timidity1 to get the CLI tools to do ABC→MIDI & MIDI→WAV conversion (respectively) and downloaded the folk-RNN repo with its data files. Pipeline: ABC→MIDI→WAV.

The data comes in several formats, for their experiments in changing the notation. I used the original format, with n=48,064 songs.

The data needed processing for GPT-2 as follows:

  1. there was stray HTML (</html>) which had to be removed.

    I used search-and-replace, and reported the issue.

  2. abc2midi requires every song to have an integer identifier, eg X: 48064, to be a valid ABC file which it can compile to MIDI.

    I used an Emacs macro (which can increment an integer 1–48,064) to insert a X: $N before each T: title line, but in retrospect, I could have simply used another search-and-replace to insert X: 1 in front of each piece—it’s not like the ID has to be unique, we’re just satisfying abc2midi which is a bit picky.

  3. as usual for any neural model like char-RNN or GPT-2, it is important to insert <|endoftext|> markers where relevant, so it understands how to generate separate pieces and avoids ‘run on’.

    I used search-and-replace.
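The three steps can be sketched as a single shell pipeline (a minimal sketch: the actual edits were done with interactive search-and-replace & an Emacs macro, and, per the retrospect above, a constant X: 1 suffices since the IDs need not be unique):

```shell
# Preprocess a raw Session ABC dump for GPT-2 training:
# 1. strip stray HTML; 2. insert a dummy X: header before each T: title
# (abc2midi only needs *an* integer ID); 3. append an <|endoftext|>
# separator after each blank-line-delimited piece.
preprocess_abc() {
    sed 's|</html>||g' |                  # 1. remove stray HTML tags
    sed 's|^T:|X: 1\nT:|' |               # 2. dummy X: header before titles
    awk 'BEGIN{RS=""; ORS="\n<|endoftext|>\n"} {print}'  # 3. piece separators
}
```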

This yielded 14MB of text to train on, which was then converted to NPZ as usual.


First Model

Because the Session corpus was so small (just 14MB), I used the smallest available GPT-2, GPT-2-117M, with standard settings, training on one of my Nvidia 1080tis:
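The exact invocation is not reproduced here; assuming the nshepperd GPT-2 finetuning fork, it would have looked something like the following (the dataset filename & hyperparameters other than the minibatch of n=5 are illustrative guesses):

```shell
# Hypothetical finetuning command (nshepperd GPT-2 fork assumed);
# echoed so the sketch can be inspected without the codebase present:
CMD="python train.py --dataset thesession.npz --model_name 117M --batch_size 5"
echo "$CMD"
if [ -f train.py ]; then
    $CMD    # only actually runs inside a checkout of the training codebase
fi
```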

Training was fairly easy, taking just a few days at most to train down to a loss of 0.46 (9742 steps at minibatch n=5), and I killed it on 21 October 2019 and looked at the random samples. Straightforward success. They struck me as pretty good, aside from generated pieces often having the same title repeatedly, which apparently was due to The Session posting multiple transcriptions of the same piece, so the model picked up on that and would generate variants on the same theme. Sturm highlighted a few and did some more in-depth commentary on them, with a mixed evaluation, concluding “So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven’t found much plagiarism in the tunes themselves.”

I was worried about plagiarism and thought ~0.40 would be safe, but it seemed the music itself was still far from being copied, so I considered further training. Some datasets are invalid ABC. The additional processed versions of The Session that Sturm et al had made seemed like a target, but caused problems when I simply concatenated them in, and I soon discovered why abc2midi now thought all the samples were broken:

allabcwrepeats_parsed_wot: This is version 3 of the dataset. In this version, we transpose all tunes to have the root C, transpose them all to have the root C#, remove the titles, and make new mode tokens, K:maj, K:min, K:dor, and K:mix. There are over 46,000 transcriptions here.

This turns out to be a problem: K:maj, K:min, K:dor, and K:mix completely break abc2midi! So I did additional search-and-replace to transform them into valid key signatures like K: Cmaj, K: Cmin, K: Cdor, and K: Cmix.
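That search-and-replace can be expressed as a one-liner (a sketch of the by-hand edit):

```shell
# Rewrite bare mode tokens like 'K:maj' into valid key signatures like
# 'K: Cmaj' so abc2midi will accept them; already-valid keys are untouched:
fix_keys() {
    sed -E 's/^K:(maj|min|dor|mix)$/K: C\1/'
}
```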

Retraining, I discovered 0.40 was far from converged, and with another 13k steps, it could go down to <0.09. Balancing imitation & plagiarism. However, checking random samples by hand, the textual overlap with The Session became particularly large once the loss reached ~0.09 (note that it was not ‘overfitting’ in the standard sense, since the loss was still decreasing on the validation set), so I backed off to a model with ~0.13 loss, which seemed to be high-quality without gross plagiarism.

Spaceless Model

I began using that model for the preference learning work, where I found that preference learning seemed to improve music more than the poetry, so I began focusing on the music.

Puzzlingly, no matter how many ratings I added, and despite the low loss, the generated samples would persistently have basic, blatant syntax errors involving spaces; abc2midi would often warn or even error out on a piece which could be easily fixed by hand by simply removing a few spaces. Anomaly: permanent space-related syntax errors. This wasted my time both during rating, since I couldn’t pick samples with syntax problems (even if they’d otherwise sound good) without reinforcing generation of invalid samples, and while generating music.

Discussing it with Shawn Presser, who I was working with simultaneously to train GPT-2-1.5b on poetry, he pointed out that some people, like Nostalgebraist, had run into frustrating problems with the standard GPT-2 BPE encoding.

To explain what BPE is and why it might be a bad thing for ABC notation: GPT-2 doesn’t just feed in raw characters like a char-RNN does, because that makes every input extremely long. GPT-2 generates space-delimited word fragments. Instead, it tries to ‘chunk’ characters into something in-between character-sized and word-sized, to get the best of both worlds: a way of writing text where common words are a single symbol but rare words can still be expressed as a couple of symbols rather than being deleted entirely, as word-based encodings must; however, since the default model is trained on English text, the chunking is done assuming normal English whitespace, like spaces between words.

Nostalgebraist notes that the actual BPE implementation used is weird and doesn’t act as you’d expect, especially when spaces are involved. So Presser wondered if GPT-2 couldn’t express the syntactically-correct text sans spaces, and that is why the errors were there and stubbornly persisted despite hundreds of ratings which told GPT-2 to stop doing that already.

Checking, the ABC format apparently does not require spaces. Workaround—spaces optional! They are only there for the convenience of humans reading & writing ABC. Aside from the metadata fields, if you delete all spaces, the music should be the same. I was surprised, but this seemed to be true. (Presser did some experiments with creating a brand-new BPE tailored to ABC, and while this would have reduced the BPE size of ABC pieces by >33%, I figured that all the ABC pieces fit into the GPT-2 window anyway and it wasn’t worth the hassle now that we had diagnosed & worked around the problem. He also did some experiments in generating video game style music via ABC: he prompted it with chords, and then switched from the default piano instruments/sound samples used by TiMidity++ to instruments like harps for a fantasy feel.)

So, I deleted the spaces with a simple tr -d ' ', re-encoded to NPZ, and retrained the first model to create a new ‘spaceless’ model. This required another few days to re-converge to ~0.13.
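tr -d ' ' also deletes the spaces inside metadata fields like titles (harmless for training, and perhaps why generated titles later run words together); had that mattered, a more selective variant could strip spaces only from non-header lines, eg:

```shell
# Delete spaces only on music-body lines, leaving metadata header lines
# (fields like 'T: Some Title' or 'K: Cmaj', ie 'letter-colon') untouched:
despace_body() {
    sed '/^[A-Za-z]:/!s/ //g'
}
```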

The spaceless corpus fixed the invalid-ABC problem, and the new model regularly generated ABC samples that triggered neither warnings nor errors, and Presser swore that the perceptual quality was much higher.

Combined Model

Presser was interested in expanding the repertoire beyond The Session and began looking at ABC databases. More dakka (data). The biggest by far appeared to be one with n=290,000 pieces. He scraped a random half of them, for n=157,240 total, and I combined them with the duplicated Session dataset, for a total n=308,280 (n=205,304 unique; 81MB). The scraped pieces are much more diverse in formatting & metadata than The Session. Simplifying to match The Session ABC. To homogenize them, I ran all the pieces through abc2abc, and then I deleted some metadata fields that struck me as excessive—commentary, discussions about how to play a piece, sources, authors of the transcription, that sort of thing, which greatly inflated the loss of the combined dataset compared to the spaceless model. (In total, I filtered out abc2abc-generated warnings starting with %, and B:/D:/F:/N:/O:/S:/Z:/w: metadata fields.) It would have been nice if the metadata had included genre tags for greater control of conditional pieces, akin to my author-based control for GPT-2 poetry, a technique demonstrated at scale using explicit Reddit metadata, and by Choi et al 2019 using autoencoders to do unsupervised learning of musical features which implicitly cover genre, but alas! We’ll have to stick with the basics like title/key/meter.
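A sketch of that filtering step (assuming one field per line, as abc2abc emits):

```shell
# Drop abc2abc warning lines (starting with '%', but keep '%%' directives)
# and the verbose metadata fields B:/D:/F:/N:/O:/S:/Z:/w:
strip_metadata() {
    awk '!/^%[^%]/ && !/^%$/ && !/^[BDFNOSZw]:/'
}
```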

This required a full week of training, or 168,810 steps (1–7 Dec), down to a higher loss (as expected) but still on the edge of plagiarism.

Examples of generated ABC (note the lack of spaces):


Last example rendered as a score:

Score for “PolkaEbBbAb(5letras)cf.CGF5-Parts” (an ABC music sample generated by GPT-2-117M trained on a combined ABC dataset)


An ABC sample is not playable on its own; it must be converted to MIDI, and then the MIDI can be played. If one is looking at individual samples being generated by the model, a quick CLI way to play and then dump to an OGG Vorbis file might be:
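For example (a sketch, with placeholder filenames, using abc2midi’s -o flag and TiMidity++’s Ogg Vorbis output mode -Ov):

```shell
# Compile an ABC sample to MIDI, then render the MIDI to an OGG Vorbis file;
# guarded so this is a no-op where abcmidi/TiMidity++ are not installed:
abc2ogg() {
    abc2midi "$1" -o /tmp/sample.mid &&
    timidity -Ov -o "${1%.abc}.ogg" /tmp/sample.mid
}
if command -v abc2midi >/dev/null 2>&1 && command -v timidity >/dev/null 2>&1; then
    abc2ogg sample.abc   # 'sample.abc' is a placeholder filename
fi
```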

Extracting multiple ABC samples, converting, and merging into a single long piece of music is somewhat more challenging, and I reused parts of my preference-learning rating script for that.

First Model Samples

GPT-2-117M random samples, first model trained on Session (2019-10-21):
“Paddywhack” generated title & sample (2019-10-22):
“The Bank of Turf” sample (2019-10-22):
“Hickey’s Tune” sample (2019-10-23):
“The Loon and his Quine” sample (2019-10-29):
“The Atlantic Roar” sample (2019-10-30):
“The Lonely Fireside” sample (2019-10-30):
“The Marine” sample (2019-10-31):
“Whiskey Before Breakfast” sample (2019-10-31):
“The Flogging” sample (2019-11-01):
“Banks of the Allan” sample (2019-11-03):
“A Short Journey” sample (2019-11-04):

Spaceless Model Samples

100 random samples from spaceless Session GPT-2-117M (2019-11-08):
50 random samples (2019-11-09):
50 random samples, but with sampling settings adjusted to maximize diversity of samples (2019-11-09):
“#127512” sample (2019-11-14):

I enjoyed the model’s renditions of the “Yn Bollan Bane” jig when I came across it, and so I used conditional generation to generate 50 variations on it:

“50 Variants on ‘Yn Bollan Bane’”:

Combined Model Samples

100 random samples from combined GPT-2-117M:
“Invereshie’s House” sample (2019-12-04):
“FaroeRum” sample (2020-01-25):


I am pleased with the final generated music; the spaceless & combined-dataset changes definitely improved over the original.

In retrospect, the use of GPT-2-117M was not necessary. Smaller = better (for now). It was so large that overfitting/plagiarism was a concern even with the combined dataset, and the English pretraining was largely useless—all of the generated titles I checked were copied from the training data, and I didn’t observe any interesting new ones. The GPT-2 BPE encoding also proved to be a problem in the end; generating a BPE specifically for an ABC corpus would have avoided that and probably also have improved learning. A smaller GPT-2 with a customized BPE (fewer parameters & attention heads but more layers would, I think, be better) would have trained much faster & probably given similar or better results.

Transformers = better copying + coherence? Qualitatively, I feel like the pieces have a different feel from char-RNN pieces, in their ability to repeat themes & motifs, and they also seem to have a much better ability to come to an ending, instead of meandering on indefinitely as char-RNN things have a tendency to do (even when trained on corpuses with clear end-of-text delimiters), perhaps because it’s easier for a Transformer to ‘count’ and know when a piece has reached a reasonable length, while a char-RNN forgets where it is. Overall, I’d call it a success.

Generating MIDI with 30k Context Windows

MIDI: long, but not too long? The logical next step from generating short ABC folk tunes is to generate music in general. Since ABC is not much used outside folk music, there are no good large ABC corpuses appropriate for this. Raw audio is infeasible at the moment: models like WaveNet can generate excellent raw audio WAVs, but the model sizes prohibit ‘seeing’ more than a few seconds at most, making them capable of playing musical scores fed to them, but not of higher-level composing—struggling to compose coherently past more than ~10s. So you could have a GPT-2 generating ABC which is fed into a WaveNet CNN, creating what sounds like real instruments playing real music, but you couldn’t have a single NN doing it all. Intermediate, more powerful than ABC but not as demanding as raw audio generation, would be generating MIDI.

Too long for GPT-2… Checking with wc -w, MIDI files typically range in size from 10–50k characters equivalent; even with BPEs potentially saving some space (it’s ~10.7k BPEs2 per MIDI), GPT-2 simply cannot handle this sort of sequence (with many important long-range dependencies) with its standard context window of 1024 tokens: it would only be able to see less than a tenth of the music file at a time, and it would be completely blind to the rest, since it has no memory. If you tried, it would not sound nearly as good as the ABC pieces above, because it would be unable to do all the nifty repetition-with-variation of melodies, overall thematic structure with beginnings & endings etc; instead, it would probably ramble around, generating plausible music which, however, never goes anywhere and just sort of ends. And, as anyone knows, GPT-2 does not scale to larger attention windows, as self-attention is quadratic in the window length, and 1024 is already pushing it for training. This motivated OpenAI to develop Sparse Transformers, which tame the scaling by trimming the Transformer attention windows to much smaller than the full window, thereby avoiding the full quadratic self-attention scaling, which enables windows of tens of thousands easily (30k is more than enough to handle most MIDIs), and enables MuseNet to generate MIDIs without a problem. Others avoid the window problem by preprocessing MIDI into a custom encoding specialized to single piano tracks, which is easier to understand, and switching to an architecture which has a limited window but adds on recurrency/memory to maintain coherence.

More Dakka

Unless we use brute force. However, when training various GPT-2-1.5b models on our TPU swarm using the TPU research credits granted to us by Google, we noticed that a TPU can fit a model which uses up to 300GB VRAM before crashing. (This is surprising because you would estimate that TPUs have 16GB per core and 8 cores, so only 128GB VRAM total. But if you avoid that, and use what we call ‘coreless mode’, both TPUv2 and TPUv3 apparently die at around 300GB. This does not appear to be documented anywhere by Google, even though it’s interesting & surprising & useful.) 300GB—that’s quite a lot. What could you do with that? How bad exactly is GPT-2’s scaling…? Quadratic (O(n²)) ≠ exponential (O(2ⁿ)): if you have some good hardware available, you may be able to push it quite far.3

TPUs can train 30k context windows! We tried GPT-2-117M and it turned out a context window of 10,000 worked! Then we tried 12.5k, 15k, 20k, & 30k, and they all worked. (For comparison, the Reformer generated great excitement for being able to scale up to 64k windows on single GPUs.) These are wide enough to train MIDI-length files, and we could even generate small text-encoded images (since 30k ≈ 173²). 30k is slow, but not hopelessly slow: with GPT-2-117M, we get 4 training steps in 2700 seconds (n=1). A swarm of TPUs, like >100, would be able to train on a large corpus in just a few wallclock days. The memory usages are considerable, but not beyond the TPU:

  1. 12.5k = 45GB backprop
  2. 15k = 75GB backprop
  3. 20k = 112GB backprop
  4. 30k = 280GB backprop
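As a sanity check, those figures are roughly consistent with quadratic scaling in window length: relative to the 12.5k/45GB baseline, a 30k window is 2.4× wider, predicting 5.76× the memory, vs the measured 280/45 ≈ 6.2×. In shell:

```shell
# Compare measured backprop memory ratios against the quadratic prediction
# (window ratio squared), relative to the 12.5k-window/45GB baseline:
awk 'BEGIN {
    base_w = 12.5; base_gb = 45
    split("15 20 30", w); split("75 112 280", gb)
    for (i = 1; i <= 3; i++)
        printf "%gk: predicted %.2fx, measured %.2fx\n",
               w[i], (w[i]/base_w)^2, gb[i]/base_gb
}'
```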

When are 30k context windows useful? So, it can be done, but should it be done? Is this more than just a parlor trick reminding us that exponential ≠ quadratic? For the most part, it seems like anyone in need of big context windows is probably better off not renting all these TPUs to train a GPT-2 with 30k context windows, and using one of the many alternatives for long-range dependencies, like Reformer or Compressive Transformers or Sparse Transformers or dynamic convolutions or… But it might be useful in two cases: where one has a model already fully trained, and can do a final finetuning training phase on the original corpus to pick up long-range dependencies, since it’s still feasible to generate with wide context windows on commodity hardware; and where one needs the fully-trained model because the transfer learning is vital, and training one of the alternatives from scratch would nevertheless deliver inferior results. (At least as of 30 January 2020, I know of no publicly-released trained models for any of those alternative architectures which approach the power of GPT-2 or T5 etc—they are all trained on much smaller datasets.)

MIDI Dataset

To get enough MIDIs to be worth training on, I combined 3 MIDI datasets:

  1. The Session + our partial scrape of a second ABC database, described above

  2. The Lakh MIDI Dataset v0.1:

    The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. LMD-full

  3. MIDI Dataset:

    The entire dataset, gzipped, can be downloaded at

    The directory structure corresponds to the sources on the Internet from which we grabbed the MIDI files. We deduplicated the dataset and the resulting collection has 77,153 songs.

    • Big_Data_Set: “The Largest MIDI Collection on the Internet”
    • cariart: Ben Burgraff’s Midi Collection
    • download-midi: Pop and Rock genres from DOWNLOAD-MIDI.COM
    • Guitar_midkar.com_MIDIRip:
    • ics: Dan Hirschberg at University of California, Irvine
    • lmd_matched: Straight from the LMD-matched dataset from the Lakh project
    • (Now defunct) TV Timewarp

Combined, this yields 537,594 MIDI files (~9GB).

Converting MIDI to ABC

Need compact text versions of MIDIs. The problem with raw MIDI files is that they are binary, not textual. The GPT-2 codebase does not support binary inputs as far as we know, and in any case, text is always much easier to inspect and work with. There are textual formats for MIDI, like hex encoding, and one can use generic binary→text encodings, but I was not too satisfied with them: when I tried them, they often blew up the character count greatly, and even a 30k context would not be enough.

ABC works well enough. Ironically, ABC turns out to be the answer! The ABC tools ship with midi2abc, which is the inverse of abc2midi, and while it warns that it is a lossy translator, after listening to a few dozen conversions, I had to say that it does a good job overall. It also generates ABC files which are similar in size to the original MIDI files. (I figure that whatever midi2abc loses in having to potentially laboriously encode MIDI constructs, it makes up in being able to take shortcuts using ABC’s higher-level music constructs.) The weaknesses of midi2abc are that it: loses volume control, doesn’t work well on some rock tracks (drums/guitars are hollowed out), doesn’t successfully compile many MIDI files (leaving empty or extremely short ABC files), and produces occasional pathological examples (eg sometimes a 44kb MIDI will compile to a 3.4GB ABC file—some sort of exponential blowup of some MIDI construct?). However, the first two are relatively minor, and the third can be tamed by simply dropping ABC output past 300kb (as hardly any non-pathological MIDI files compiled to >300kb).

Data cleaning. In converting the files, we want to avoid excessive use of newlines (-bpol 999999), we want to delete various warnings or error lines, we want to filter comments starting with '^% '—which take up a ton of space and often are just lyrics—and then delete spaces in non-MIDI-directive (^%%) lines per the ‘spaceless’ model above. The shell pipeline:
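A sketch of the cleaning stages (applied to the output of midi2abc -bpol 999999; the exact filters I used may have differed slightly):

```shell
# Clean midi2abc output for GPT-2 training:
# drop '% '-prefixed warning/comment/lyric lines (keeping '%%MIDI' directives),
# then delete spaces on all non-directive lines per the 'spaceless' model:
clean_abc() {
    awk '!/^% /' |              # remove comment/warning/lyric lines
    sed '/^%%/!s/ //g'          # spaceless transform, sparing %% directives
}
```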

The file-lists can be used to read in individual files, intersperse <|endoftext|>\n, and write them out as a single GPT-2-formatted text. A shell loop proved too slow so I wrote a simple Haskell script:

5.2GB of ABCs. The final results were 453,651 ABC files (~10k BPE/ABC?):

  1. 2020-03-30-abc-combinedmidi-training.txt: 4.9GB (430,928 ABC pieces)
  2. 2020-03-30-abc-combinedmidi-validation.txt: 263MB (22,723 ABC pieces)

The full dataset (all MIDIs, ABCs, file-lists, and GPT-2-formatted text files) is available for download:

30k Training

We initially trained with a ~27k context window, but iterations took multiple seconds, and after we took a second look at the ABC conversions and saw that pieces averaged only 10.7k BPEs, we decided that was largely a waste, and switched over to 5k context windows. Due to repeated TPU pre-emptions, delays in restarting training, and some ill-timed learning-rate decreases, it wasn’t until 24 March 2020 that we got noticeably good samples, at a loss of ~0.22 after 10m steps.

Generating samples from the 10m checkpoint proved unusually tricky, with high risk of repetition or degeneration into pseudo-English-lyrics/text; pieces tended to be either quite good or fail entirely, possibly because instrument/tracks are generated one at a time, and syntax errors can disable entire instruments, leading to silent tracks.

30k Samples

MIDI sample (2020-01-28; from iteration #364086, loss ~0.35); (folk/classical?)

Samples from iteration #10,974,811, loss ~0.22:

MIDI sample (2020-03-24); electronica/instrumental
MIDI sample (2020-03-24); electronica
MIDI sample (2020-03-24); jazz (particularly nice use of a saxophone track)
MIDI sample (2020-03-24); classical piano piece
MIDI sample (2020-03-25); classical fugue-esque piece (harpsichord or organ?)

  1. TiMidity++ is a strange tool. The man page boasts of builtin newsgroup support! Just in case you ever needed to play all MIDI files in a particular newsgroup.↩︎

  2. Using the standard GPT-2 BPE encoding of the ABC-compiled versions—although we did discuss making either MIDI or ABC-specific BPE encodings for saving perhaps a quarter of space, it would add complexity for users and we weren’t sure how to create a new BPE encoding.↩︎

  3. This is part of how “accidentally quadratic” algorithms can sneak into production systems—quadratic growth is slow enough that for many inputs, fast hardware can still handle it and the software just ‘feels slow’, while for truly exponential growth, it’d immediately hit the wall on realistic inputs and fail.↩︎