OpenGPT-2: We Replicated GPT-2 Because You Can Too

Aaron Gokaslan*, Vanya Cohen*, Ellie Pavlick, Stefanie Tellex | Brown University

Introduction

Recently, large language models like BERT¹, Transformer-XL², GPT-2³, and Grover⁴ have demonstrated impressive results in generating text and on multiple NLP tasks. Since OpenAI has not released its largest model at this time (though it has released its 774M-parameter model), we seek to replicate the 1.5B-parameter model to allow others to build on our pretrained model and improve it further.

You can access the model and generate text using our Google Colab.

We’ve also made the model weights available separately.

Replication

Radford et al.'s³ strategy of delaying the release of the model for security reasons relies on these models being difficult to replicate and requiring a high degree of specialized domain knowledge. We demonstrate that many of the results of the paper can be replicated by two master's students with no prior experience in language modeling. Because this model is relatively easy to replicate, a great many interested parties could replicate GPT-2. Further, Zellers et al.⁴ show that large language models like GPT-2 are an invaluable tool for countering the use of the same models as text generators.

Because our replication effort is not unique, and large language models are currently the most effective means of countering generated text, we believe that releasing our model is a reasonable first step toward countering the potential future abuse of these kinds of models.

We base our implementation on the Grover model⁴ and modify its codebase to match the language modeling training objective of GPT-2. Since Grover was trained on a similarly large corpus, much of the code and many of the hyperparameters proved readily reusable; we did not substantially change the hyperparameters from Grover.
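
For reference, the training objective is the standard left-to-right next-token prediction loss. The sketch below illustrates it in PyTorch for clarity; it is not the TensorFlow code from the Grover repository that we actually modified.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: predict token t+1 from positions <= t.

    logits:    [batch, seq_len, vocab_size] model outputs
    token_ids: [batch, seq_len] ground-truth token ids
    """
    shifted_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    targets = token_ids[:, 1:]           # the tokens those positions should predict
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )
```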

The cost of training the model from scratch using our code is about $50K. It is important to note that this figure is the estimated value of the cloud compute and does not reflect the much smaller intrinsic costs involved: training the model costs less on other compute resources that are less time-efficient and less user-friendly.

There is a significant time-cost tradeoff, and slower training methods have considerably smaller costs, thus reducing the barrier to entry.

Dataset

The original paper provided minimal details on how the dataset was cleaned.

As in WebText³, we begin by parsing out all outbound links from Reddit with more than 3 up-votes. We started with the Pushshift Reddit scrape⁵, a dataset containing a continuously updated collection of Reddit posts, comments, and related metadata. These links are then filtered to remove direct links to file types unlikely to contain usable text or HTML (e.g., video files, PDFs, and CSS stylesheets).
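
For concreteness, a minimal sketch of this link-extraction step is shown below. It assumes the Pushshift submission dumps are newline-delimited JSON with `url` and `score` fields, and the blocked-extension list is illustrative rather than our exact filter.

```python
import json

# Direct links to these file types are unlikely to contain usable text or HTML.
BLOCKED_EXTENSIONS = (".mp4", ".avi", ".mkv", ".webm", ".pdf", ".css",
                      ".js", ".png", ".jpg", ".jpeg", ".gif", ".zip")

def extract_links(pushshift_jsonl_path, min_upvotes=3):
    """Yield outbound URLs from Reddit submissions with more than `min_upvotes` up-votes."""
    with open(pushshift_jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            post = json.loads(line)
            url = post.get("url", "")
            if post.get("score", 0) <= min_upvotes:
                continue
            if not url.startswith("http"):
                continue
            if url.lower().endswith(BLOCKED_EXTENSIONS):
                continue
            yield url
```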

We also filter out Wikipedia pages, since Wikipedia is used by various evaluation benchmarks and datasets. We were not able to determine whether our filtering criteria match OpenAI's, since this information was never released. Text was extracted from HTML pages using the Newspaper Python library and then filtered to keep only English text using fastText⁶; specifically, we use the WhatTheLang Python wrapper⁷. We deduplicate documents using locality-sensitive hashing (LSH)⁸ ⁹ ¹⁰: we hash each document into a set of 5-grams and remove all documents whose similarity to another document exceeds a threshold of 0.5.
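
A minimal illustration of the deduplication step is shown below, here using the datasketch MinHash/LSH implementation (shown for illustration; any LSH library that estimates Jaccard similarity works the same way): each document is shingled into word 5-grams, and a document is dropped if its estimated similarity to an already-kept document exceeds 0.5.

```python
from datasketch import MinHash, MinHashLSH

def five_gram_shingles(text):
    """Return the set of word 5-grams in a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + 5]) for i in range(max(len(tokens) - 4, 1))}

def deduplicate(documents, threshold=0.5, num_perm=128):
    """Keep one representative of each cluster of near-duplicate documents."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in enumerate(documents):
        m = MinHash(num_perm=num_perm)
        for shingle in five_gram_shingles(text):
            m.update(shingle.encode("utf-8"))
        if lsh.query(m):          # an already-kept document is this similar
            continue
        lsh.insert(str(doc_id), m)
        kept.append(text)
    return kept
```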

As a cleaning heuristic, documents with fewer than 128 tokens were removed from the dataset. These shorter documents tended to be lower quality, as determined by text coherence. We release this dataset as the OpenWebTextCorpus¹¹.
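
This heuristic amounts to a one-line check; whitespace tokenization is shown below for illustration, since the exact count depends on the tokenizer used.

```python
def long_enough(text, min_tokens=128):
    """Keep only documents with at least `min_tokens` (whitespace-separated) tokens."""
    return len(text.split()) >= min_tokens
```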

For encoding the dataset, we used the byte pair encoder (BPE)¹² released with the small GPT-2 models from Radford et al.³
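
For illustration, the same BPE vocabulary can also be loaded through Hugging Face's transformers package; we used the encoder files released with the GPT-2 code, and the snippet below is simply a convenient equivalent.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # GPT-2's released BPE vocabulary

def encode_document(text):
    """Return the BPE token ids for one document."""
    return tokenizer.encode(text)

print(encode_document("Recycling is good for the world."))
```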

We used a modified version of the OpenWebText web-scraping codebase¹³ as a starting point for our dataset collection.

Errata

From the publicly released collection of 260k documents from WebText³, we find that all have a minimum byte pair encoding (BPE)¹² length of 40 and a maximum of 1024. OpenWebText differs in that we set a lower bound for document length at 128 tokens (instead of BPE codes) and do not restrict the maximum document length. The original OpenWebTextCorpus was released before these samples became available and therefore did not use this information when generating cleaning heuristics.
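
These length statistics can be reproduced with a short script over the released samples; the file path and the use of the Hugging Face tokenizer below are illustrative assumptions, not the exact script we ran.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Hypothetical path: one released WebText document per line.
with open("webtext_samples.txt", "r", encoding="utf-8") as f:
    documents = [line.strip() for line in f if line.strip()]

lengths = [len(tokenizer.encode(doc)) for doc in documents]
print("min BPE length:", min(lengths), "max BPE length:", max(lengths))
```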

We made multiple attempts to contact Radford et al.³ to clarify evaluation and model details, but were ultimately unsuccessful.

Results

Despite the differences in our training distribution, we report similar perplexities on most datasets.
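
For reference, perplexity is the exponential of the average per-token negative log-likelihood (the same cross-entropy loss sketched in the Replication section):

```python
import math

def perplexity(mean_negative_log_likelihood):
    """Convert an average per-token negative log-likelihood (in nats) into perplexity."""
    return math.exp(mean_negative_log_likelihood)
```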

Samples

Prompt: “Recycling is good for the world. NO! YOU COULD NOT BE MORE WRONG!!”

Output:

Citations

*. Equal contribution.

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  2. Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
  3. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  4. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. arXiv preprint arXiv:1905.12616, 2019.
  5. Jason Baumgartner. Reddit posts dataset.
  6. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017.
  7. WhatTheLang. https://github.com/indix/whatthelang, 2019.
  8. Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271–280. ACM, 2007.
  9. Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web, pages 141–150. ACM, 2007.
  10. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
  11. Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
  12. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  13. OpenWebText. https://github.com/eukaryote31/openwebtext, 2019.

We would like to thank Google (TensorFlow Research Cloud) for providing the compute for this and related projects (and state that their providing compute is in no way an endorsement of our views).
