In this post, I want to quickly talk about the technical and organizational questions around my recent replication of GPT2–1.5B. Please read my main post for the full story. I will try to keep this post brief.

The important facts

Code: https://github.com/ConnorJL/GPT2

Samples: https://github.com/ConnorJL/GPT2/tree/master/samples

The code should run out of the box on GPUs and TPUs (and CPUs, if you’re really desperate). I used the parameters specified in 1.5B.json and trained the model on a preemptible v3–512 TPU pod (which is actually more powerful than the machine OpenAI used) for around a week (with interruptions). Code and instructions for generating the dataset are also included in the repo.
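
If you just want to kick off a training run yourself, it boils down to pointing the main script at a model definition and, optionally, a TPU, roughly like this (the README in the repo has the authoritative flags and options, so treat this line as illustrative):

    # illustrative invocation; check the repo README for the exact flags
    python3 main.py --model 1.5B.json --tpu your-tpu-name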

You can download my models with the script in the repo. Currently I have a weaker version of 117M, and a model I call PrettyBig, which is slightly larger than OpenAI’s 345M, making it technically the largest publicly available GPT2 model at the moment.

I will be releasing 1.5B to the public on July 1st, if, and only if, no one shows me a convincing reason not to. When I do, it will be downloadable just like my other models.

The story in brief

I am just a humble CS undergrad with no funding or anything else really to speak of (other than parental support. Love you, mom). I study at the Technical University of Munich and learned everything I know about machine learning through self-study in my free time (I probably spend most of my free time on stuff like this).

The biggest question people will have is how the hell I got access to so much computing power. In short: the Tensorflow Research Cloud (TFRC) program. It’s a bit of a funny story overall, but basically, whenever I ran into the roadblock of not having enough compute, I just casually mentioned it to the TFRC team and, with incredible generosity and to my utter surprise, they usually just gave me what I asked for. This way I ended up, on two separate occasions, with a full v2–256 and a preemptible v3–512 TPU pod, plus access to some smaller TPUs I’m using for other projects. I am also affiliated with the Max Planck Institute of Psychiatry, where I’m doing a project that uses these TPUs to develop a system for recognizing certain features in neuron images. So I am also using the TFRC for “real” research.

I really can’t express how grateful I am to Google and the TFRC team for their support in enabling this. They were incredibly gracious and open to allowing me access, without requiring any kind of rigorous, formal qualifications, applications or similar. I can really only hope they are happy with what I’ve made of what they gave me.

Going into this project, I had never used TPUs, Transformer models or Tensorflow (well, I had used Tensorflow some years back, but I remembered essentially none of it). So this was quite the learning experience for me.

I estimate I spent around 200 hours working on this project. I don’t have a frame of reference for how long projects like this usually take, but my guess is that my time investment is rather high, since I had to learn so much from scratch and kind of fumble my way through a lot of the time. I ended up spending around 600–800€ on cloud resources for creating the dataset, testing the code and running the experiments (and other people spend their money on going out with friends or traveling. What a bunch of losers, am I right?). Most of the work happened over the course of April and May, just me locked in the computer lab at university with my beat-up old Thinkpad. I’d like to personally thank my noise-cancelling headphones for preserving my sanity.

Issues

Now let me be clear: Creating and maintaining a huge, complex system like Tensorflow is incredibly difficult, and I have nothing but the utmost respect for Google’s engineers. And if I’ve learned one thing, it’s that TPUs, Tensorflow and the Google Cloud are incredibly powerful and useful…but there are some usability issues. It hurts me to have to criticize the very people who allowed this to happen, but it has to be said: Google documentation is a god damn Kafkaesque nightmare. And so is Tensorflow error reporting, for that matter. Documentation (and code, if you have the misfortune of having to look into it) for Tensorflow, TPUs and co is a sprawling mess of overlapping and confusing articles, tutorials and examples, and all too often it is impossible to tell whether they are out of date or not. Especially with TPUs, I ran into a lot of issues that are plainly not documented anywhere and that I had to figure out through trial and error.

To illustrate, I would like to tell my favorite little anecdote from this story. When I first started the project, I was given access to a bunch of single v2 TPUs, so I thought, fine, I can just chain a bunch of them together to get enough computing power. And look, Tensorflow even provides exactly what I need in the TPUClusterResolver! Its documentation (as of early June 2019) clearly stated that the tpu argument could be either a string or a list of strings, each one representing a TPU to use.

Cool, easy! I extensively tested my code on one TPU until everything was perfect. Great! So I expanded to two TPUs, passed the list of strings…crash. Of course, given Tensorflow’s incredibly opaque error reporting, it wasn’t immediately clear that this was even the problem rather than a mistake somewhere else on my end (maybe I had set a wrong batch size or something). After hours and hours of bug hunting, I just could not find the issue, until I finally took a look into the code of TPUClusterResolver and found this absolute gem:
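
(Paraphrased from memory rather than quoted verbatim from the TF 1.13 source, but this was the gist of it:)

    # inside TPUClusterResolver's constructor, paraphrased, not the verbatim source
    if isinstance(tpu, list):
        if len(tpu) != 1:
            raise NotImplementedError(
                "Using multiple TPUs in a single session is not yet implemented")
        tpu = tpu[0]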

It’s not even implemented! The documentation literally says you can do a thing that you just can’t! This was definitely the most egregious example; I didn’t encounter any other plainly nonexistent functionality, but it was just too funny not to share.

There is an obvious and intended way around this, by the way: you just rent a TPU “pod” instead of multiple single TPUs. This works wonderfully (usually…).

Where my model differs from the original GPT2–1.5B

My model differs in a few minor ways from the original GPT2 model. I list them here for completeness’ sake.

  • Since we haven’t seen the original training code and parameters, it is entirely possible that OpenAI used tweaks or settings that they didn’t include in their paper. In particular, I am not sure how they used dropout or what their learning rate schedule was.
  • I trained my model using Adafactor rather than Adam. I have no idea how they fit this model with Adam onto a v3 TPU; since Adam needs a lot more memory, I was unable to fit 1.5B + Adam onto a v3 TPU even with 16-bit precision (see the back-of-the-envelope arithmetic right after this list).
  • My training data is obviously self-collected and probably differs from the OpenAI data in various ways. I used only newspaper, not dragnet, for content extraction, and I scraped newer links from reddit than OpenAI did.
  • My training data input pipeline is probably different from OpenAI’s. Since their paper gives no explanation of it, I made an educated guess that the model was trained on texts glued together with a "<|endoftext|>" token connecting them. As a bit of "artistic freedom" I created an input mode I call "longbiased", which biases sampling so that 70% of the time only texts at least 512 tokens long are taken. I cleaned all text with the "NFKC" setting in ftfy and threw out all texts with fewer than 20 or so tokens. Also, because I didn’t see much reason to make a train/eval split, I trained on all of the long (>512 token) data, but forgot to feed it the eval portion of the shorter texts. (A rough sketch of the cleaning and packing logic also follows this list.)
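
To make the memory point about Adam concrete, here is the back-of-the-envelope arithmetic that convinced me it wouldn’t fit (this assumes fp32 optimizer state and ignores activations, gradients and framework overhead entirely, so the real situation is even tighter):

    # rough memory estimate for 1.5B parameters trained with Adam
    params = 1.5e9
    weights_gb = params * 4 / 1e9            # ~6 GB of fp32 weights
    adam_moments_gb = params * 2 * 4 / 1e9   # ~12 GB for Adam's two moment tensors
    print(weights_gb + adam_moments_gb)      # ~18 GB, already over the 16 GB of HBM per v3 core
    # Adafactor factors the second moment into per-row/per-column statistics,
    # which shrinks the optimizer state to a tiny fraction of this.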

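For the curious, here is a minimal sketch of the kind of cleaning and packing logic described in the last bullet above. It is not the actual code from the repo; encode is a stand-in for whatever tokenizer you use (the GPT2 BPE in my case), and the "longbiased" sampling is a separate step not shown here:

    import ftfy

    MIN_TOKENS = 20          # texts shorter than this get thrown out
    EOT = "<|endoftext|>"    # documents are glued together with this token

    def clean_and_pack(texts, encode):
        """Clean raw texts and join them into one long token stream."""
        packed = []
        for text in texts:
            text = ftfy.fix_text(text, normalization="NFKC")  # NFKC cleanup via ftfy
            tokens = encode(text)
            if len(tokens) < MIN_TOKENS:
                continue                                      # drop very short texts
            packed.extend(tokens + encode(EOT))
        return packed
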
A few thoughts on my experiences

  • TPUs are amazingly cool and powerful, but have some finicky limitations and edge cases (I won’t go into all the details; I’ve already reported them to Google).
  • Developing ML experiments in the Google Cloud, once I figured it all out, was an extremely positive experience.
  • Tensorflow (I was using version 1.13) is…not perfect. The error reporting is a nightmare, the documentation is inconsistent in quality and some features just don’t work (TPUClusterResolver). But overall it is very powerful and, at times, even elegant.

Some Pro Tips

  • VS Code Insiders has a new remote code editing functionality that makes working on remote machines beautifully easy. I used it all the time to develop directly on my cloud VMs.
  • The worst problems I had with TPUs were with getting my data pipeline to work. Invest the time into understanding tf.data Datasets and, if necessary, TFRecords (see the first sketch after this list).
  • TPUs require all data to be of fixed size. If your data pipeline outputs something that you know is a fixed shape, but TF doesn’t recognize as such (this happened to me. I subsampled a large piece of data to a fixed size, but TF didn’t understand that), literally just wrapping it in a final tf.reshape() solves your problem.
  • If you accidentally try to load a much too large model (or data batch) onto a TPU, you may not get an Out Of Memory error, but instead a completely non-descriptive "Socket Closed" error. Keep this in mind; try lowering the memory use and see if that fixes the problem. (This is unfortunately not the only way a Socket Closed error can be raised, but I did run into it, and it drove me nuts for a while.)
  • TPU Pods work the same as single TPUs in theory, but not necessarily in practice. I ran into several utterly arcane issues when switching a model from a single TPU to a pod. For the most part, I can’t help you here beyond recommending trial and error; it’s just worth being warned.
  • If you need to get fancy with your data input on TPUs, you are limited, but there is more you can do than just reading from files. The actual data ingestion runs on the host CPU, but it does not have access to Python, meaning you can only use (some) TF functionality to generate your data. You still have access to a lot of useful features like random numbers and tf.map_fn (which you can use to hackily implement loops); the second sketch after this list shows the kind of thing I mean. Don’t forget to wrap the result in a tf.reshape!
  • There are some problems that aren’t well suited to TPUs. If you require dynamic shapes, or generate data in complex ways that you can’t implement in pure TF (such as often in RL), you probably will still have to rely on GPUs.
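
To tie the dataset and reshape tips together, here is a minimal sketch of the kind of TPU input function I mean (TF 1.13-style code; the bucket path, feature name and sequence length are placeholders):

    import tensorflow as tf

    SEQ_LEN = 1024  # placeholder sequence length

    def parse_fn(record):
        features = tf.parse_single_example(
            record, {"tokens": tf.FixedLenFeature([SEQ_LEN], tf.int64)})
        # Here the shape is already static, but a final tf.reshape like this
        # is what saves you when TF can't infer the static shape on its own.
        return tf.reshape(features["tokens"], [SEQ_LEN])

    def input_fn(params):
        batch_size = params["batch_size"]  # TPUEstimator fills this in per replica
        files = tf.data.Dataset.list_files("gs://your-bucket/data-*.tfrecords")
        dataset = tf.data.TFRecordDataset(files)
        dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.repeat().batch(batch_size, drop_remainder=True)
        return dataset.prefetch(tf.data.experimental.AUTOTUNE)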

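And as an example of the pure-TF trickery the last bullet is about, here is a rough sketch that samples a few random windows from a document inside the input pipeline, using TF’s own random numbers and tf.map_fn instead of Python (the window size and count are arbitrary, and it assumes each document has at least window tokens):

    import tensorflow as tf

    def random_windows(tokens, window=512, n=4):
        # TF's RNG stands in for Python's random module, which isn't available here
        max_start = tf.maximum(tf.shape(tokens)[0] - window, 1)
        starts = tf.random_uniform([n], minval=0, maxval=max_start, dtype=tf.int32)
        # tf.map_fn stands in for a Python loop over the sampled offsets
        crops = tf.map_fn(lambda s: tokens[s:s + window], starts, dtype=tokens.dtype)
        # don't forget the final reshape so the TPU sees a static shape
        return tf.reshape(crops, [n, window])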