Data Efficient Language Transfer with GPT-J

We release Genji-JP 6B, a model finetuned on our Japanese storytelling dataset. We explore language transfer on large LMs without any changes to the architecture, including the tokenizer, and compare the results with models trained from scratch on the target language with better tokenization.

We greatly improve the model's capabilities in the target language without any architectural change. Additionally, we provide samples generated from common prompts by the original model, our trained models, and existing generative language models in the target language.

How to Install & Use

The model has been released on the Hugging Face hub for ease of use and can be loaded with the transformers library. Check the Model Description for example code on how to use the model.

Link: https://huggingface.co/NovelAI/genji-jp
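As a quick start, the snippet below is a minimal sketch of loading the model through the standard transformers AutoModel API and sampling a continuation. The sampling parameters and prompt are illustrative only; the Model Description on the hub remains the authoritative reference, since the release originally targeted a specific transformers version.

```python
# Minimal sketch: load Genji-JP 6B from the Hugging Face hub and sample a continuation.
# The sampling parameters here are illustrative, not the settings used for the samples below.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NovelAI/genji-jp")
model = AutoModelForCausalLM.from_pretrained(
    "NovelAI/genji-jp", torch_dtype=torch.float16
).eval().cuda()  # 6B parameters: half precision on GPU is assumed here

prompt = "あらすじ:あなたは異世界に転生してしまいました。"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    max_new_tokens=200,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```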

Model

We use GPT-J 6B as our base model, an autoregressive GPT model pre-trained by EleutherAI on The Pile, their large-scale curated dataset. The dataset consists primarily of English material. The model architecture is left unchanged.

The model consists of 28 layers with a model dimension of 4096, and a feedforward dimension of 16384.

The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head.

The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
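For reference, the architectural hyperparameters above can be summarized as a plain configuration dictionary. This is only an illustrative summary of the numbers stated here, not the actual training config file.

```python
# Illustrative summary of the GPT-J 6B architecture described above
# (not the actual mesh-transformer-jax config file).
GPTJ_6B_CONFIG = {
    "n_layers": 28,        # transformer layers
    "d_model": 4096,       # model (hidden) dimension
    "d_ff": 16384,         # feedforward dimension
    "n_heads": 16,         # attention heads
    "d_head": 256,         # per-head dimension (16 * 256 = 4096)
    "rotary_dims": 64,     # dimensions per head with RoPE applied
    "vocab_size": 50257,   # GPT-2/GPT-3 BPE vocabulary
}
```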

Training Dataset

We collect two separate datasets. The first consists of Japanese/English pairs from the Tanaka Corpus and EDICT dictionary entries. During training, each context is filled with these entries, and the rest of the context is padded with the endoftext token. The dataset totals 97,563,135 tokens.
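The packing step might look roughly like the sketch below, which packs tokenized entries into fixed-length contexts and pads the unused tail with the endoftext token. This is a simplified illustration of the described procedure, not the exact preprocessing code; the context length and helper name are assumptions.

```python
# Simplified sketch: pack dictionary entries into fixed-length contexts,
# padding the unused tail of each context with the endoftext token.
# Assumes individual entries fit within context_len (dictionary entries are short).
def pack_entries(entries, tokenizer, context_len=2048):
    eot = tokenizer.eos_token_id  # <|endoftext|>
    contexts, current = [], []
    for entry in entries:
        tokens = tokenizer(entry).input_ids + [eot]
        if len(current) + len(tokens) > context_len:
            contexts.append(current + [eot] * (context_len - len(current)))
            current = []
        current.extend(tokens)
    if current:
        contexts.append(current + [eot] * (context_len - len(current)))
    return contexts
```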

The second dataset consists of 6,668 web novels scraped from syosetu.com; no story with fewer than 500 points is included, and the highest rated web novels are retrieved first.

It consists of 3,536,080,191 tokens, 7,463,307,114 bytes and 2,573,673,175 characters. This works out to 0.73 characters per token, showing that the GPT-2 BPE tokenizer is quite inefficient for Japanese text.
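The characters-per-token figure can be reproduced for any piece of text with the stock GPT-2 tokenizer, as in the sketch below; the example sentence is arbitrary.

```python
# Measure characters per token for Japanese text under the GPT-2 BPE tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "転生すると、ある能力を手に入れていた。"
n_tokens = len(tokenizer(text).input_ids)
print(f"{len(text)} characters / {n_tokens} tokens = {len(text) / n_tokens:.2f} chars per token")
```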

Training Process

We train two different models: the first on our Japanese dictionary dataset and web novel dataset together, and the second only on the web novel dataset.

The Dictionary+Novel model is trained by going through the dictionary dataset first, then the web novels. The motivation for this is to let the model learn the Japanese words first to improve the token embeddings.

Training is done with the mesh-transformer-jax codebase running on TPU v3-8s provided by the Google TPU Research Cloud.

The model is trained with the AdamW optimizer. We train the models with 2,000 warmup steps and a starting learning rate of 5e-6, which decays to 1e-6 by 90% of the total steps. A weight decay of 0.1 is used with a batch size of 16.
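A rough equivalent of this schedule in optax might look like the sketch below. The linear decay shape and the optax framing are assumptions made for illustration; this is not the exact mesh-transformer-jax configuration.

```python
# Rough optax sketch of the described setup: AdamW with weight decay 0.1,
# 2,000 warmup steps up to 5e-6, then decay to 1e-6 by 90% of total steps.
# The linear decay shape is an assumption for illustration.
import optax

def make_optimizer(total_steps, warmup_steps=2000, peak_lr=5e-6, end_lr=1e-6):
    anneal_steps = int(0.9 * total_steps) - warmup_steps
    schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, peak_lr, warmup_steps),
            optax.linear_schedule(peak_lr, end_lr, anneal_steps),
        ],
        boundaries=[warmup_steps],
    )
    return optax.adamw(learning_rate=schedule, weight_decay=0.1)
```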

Results

[Figure: training loss curve]

We find that adding EDICT entries to the dataset helps with the loss. We evaluate our finetuned model against vanilla GPT-J and GPT2JA-Large (a GPT-2 model trained from scratch on Japanese data with a Japanese tokenizer), and show that finetuning alone, with no additional changes to the model, greatly improves its Japanese capabilities on various tasks.

Tasks

Freeform generation

In this task, the GPT models continue the given prompt without any constraints. Our finetuned model shows a large improvement over the vanilla model, with quality comparable to a model trained from scratch on Japanese with a decent tokenizer. The generated text is translated with DeepL so that readers without Japanese knowledge can judge the outputs. Because the model is asked to generate a fixed amount of text, generations may be cut off abruptly. Prompts also often end with a partial sentence to guide the initial direction of the generated text.

Prompt

This specific prompt is the default prompt used in BunCho.

“あらすじ:あなたは異世界に転生してしまいました。勇者となって、仲間を作り、異世界を冒険しよう!

***

転生すると、ある能力を手に入れていた。それは、”

English Translation

“Summary: You were reincarnated to another world. Become a hero, make comrades and adventure in another world!

***

When I was reincarnated, I received an ability. It was ”

GPT-J Finetuned Generation

“転生すると、ある能力を手に入れていた。それは、新しく創造したアイテムに設定できるのだ。
魔法を創造する。それは簡単だが、一度も見たことのないもの、未知のものを創造するのは、なかなかに難しい。そんなときに使えるのが、俺の能力だ。
また、同じく新しい創造する能力が付与されている指輪もあった。これは、この世界で物理攻撃に対して極めて高い耐性を持つ。しかし、それは『無敵』というわけではない。この指輪の効果が切れると、どうしても攻撃を受けてしまう。
つまり、例えば石のナイフを使って戦うな”

English Translation

“When I was reincarnated, I had gained a certain ability. It could be set on newly created items.
Create magic. That’s easy enough, but creating something you’ve never seen before, something unknown, is quite difficult. That’s when my ability comes in handy.
I also found a ring that also gave me the ability to create new things. It has an extremely high resistance to physical attacks in this world. However, it is not ‘invincible’. When the effect of this ring wears off, you will inevitably be attacked.
So, for example, don’t fight with a stone knife.”

GPT-J Vanilla Generation

“あらすじ:あなたは異世界に転生してしまいました。勇者となって、仲間を作り、異世界を冒険しよう!
***
転生すると、ある能力を手に入れていた。それは、私のカッコイイ組み合わせである、スピアーズの「太陽の炎」を操作することができる組み合わせだった。実は、私が取得した能力、ある旅で操作してみる機会があることを思い出した。
晴れて昼には雲散とんでかい旅を続けるが、夜遅くまで電波に飲み込まれることに気付く。すると、天井の窓から何者かが高くて跳ね返ってくる。
彼らは「魔術師」つまりは謎の者である。その仕業のようだが、私はどうしようもなく調子に乗って渡り歩くが、そのとき晴れた春の庭に”

English Translation

“Synopsis: You’ve been reincarnated into another world. Become a brave warrior, make friends, and explore the other world!
***
When I was reincarnated, I got a certain ability. It was a combination that allowed me to manipulate my cool combination, Spears’ Flame of the Sun. In fact, I remembered the ability I had acquired, and the opportunity to try to manipulate it on one of my trips.
I continue my journey in a cloudburst of sunshine and daylight, but find myself swallowed up by radio waves until late at night. Then, something is high and bouncing off the ceiling windows.
They are “sorcerers,” in other words, mysterious ones. It seems to be their work, but I can’t help but get carried away and cross over, but then I see a sunny spring garden.”

GPT2JA-Large Generation

“転生すると、ある能力を手に入れていた。それは、アイテムボックスと呼ばれる能力だ。
『変身能力』と呼ばれるアイテムボックスは、すぐに職業や技能レベルの差が生まれる。人から自分を変えることができるようになる。 武器を生成する、衣類を作り出す、金品を盗むなど、自分にどういった行動をとるかを想定して自動的に技能レベルが上がる。
武具に掛けることもできるようになり、他にも、特殊なスキルと呼ばれる能力も覚えることができる。
とはいえ、今でもそんな性能のアイテムボックスは存在していない。ただし、スキルに魔法が加わるまで、そこそこのスキルを覚えることができる。
だから、スキルで特技を覚えるには、どんな才能が必要だろうか?
もしかしたら、異世界で戦う時に『”

English Translation

“When I was reincarnated, I had gained a certain ability. It’s an ability called an item box.
The item box, called “Transformation Ability,” immediately creates a difference in occupation and skill level. It will allow you to change yourself from a person. Your skill level will automatically increase based on what actions are expected of you, such as generating weapons, creating clothing, or stealing money and goods.
You will also be able to hang on to your armor and learn other abilities, called special skills.
However, even now there is no item box with such performance. However, you can learn a lot of skills there until magic is added to your skills.
So what kind of talent would it take to learn a special skill in a skill?
Maybe, when you fight in the other world, you will be able to use the ‘“

Zero-Shot Question Answering

In this task, we prompt the GPT models to answer a given question. The question and answers are translated by DeepL so that readers without Japanese knowledge can judge the outputs.

Question

石田さんはA組の先生です。山田さんはA組の生徒です。山田さんの先生はだれですか。

English Translation

Ishida-san is the teacher of class A. Yamada-san is a student of class A. Who is Yamada-san's teacher?

Answers

GPT-J Finetuned

石田恵美、いや、石田先生。

English Translation: Ishida Emi, no, Ishida-sensei.

GPT-J Vanilla

山田先生です。

English Translation: Yamada-sensei.

GPT2JA-Large

山田さんはA組の担任です。山田さんの担任は、担任の先生です。

English Translation: Yamada-san is the homeroom teacher of Class A. Mr. Yamada’s homeroom teacher is the homeroom teacher.

Zero-Shot Translation

In this task, we prompt the GPT models to do Japanese to English translation zero-shot.
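The exact prompt template is available in the samples repository linked below; a plausible zero-shot format would look something like the sketch here, where the template wording is an assumption for illustration.

```python
# Hypothetical zero-shot translation prompt format (the exact template used is in
# the linked samples repository; this wording is an assumption).
source = "生命、宇宙、そして万物についての究極の疑問の答えは何ですか。"
prompt = f"Japanese: {source}\nEnglish:"
```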

Japanese: 生命、宇宙、そして万物についての究極の疑問の答えは何ですか。

Correct English Translation: What is the answer to the ultimate question of life, the universe, and everything?

GPT-J Finetuned: What is the answer to the ultimate question of life, the universe, and everything?

GPT-J Vanilla: What is the ultimate answer to life, the universe and everything?

GPT2JA-Large: 宇宙の答え。 Interview: 万物の答え。 Image: 宇宙の答え。 Time: 万物の答え。 Double-manifest: 万物の答え。 Manual: 万物の答え。 Password

You can view all of the samples we selected from at: https://github.com/finetuneanon/genji-jp-samples

Note: DeepL is used for translations of the generated samples.

Limitations & Future Work

A significant limitation of the current approach is that the tokenizer, trained on the GPT-2 dataset, is quite inefficient for Japanese (0.73 characters per token on our dataset). Tokenizer customization is also not explored in this work. Lastly, our dataset consists mostly of web novels, which could be expanded upon to better capture the language.

Acknowledgments

Main contributors:

Thanks to TPU Research Cloud for providing TPU compute.

Thanks to EleutherAI, Ben Wang, and Aran Komatsuzaki for pre-training GPT-J 6B.
