RNN metadata for mimicking individual author style

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.
statistics, NN, fiction, shell, R, GPT, tutorial, poetry
2015-09-12–2019-03-26 · finished · certainty: likely · importance: 8

Char-RNNs are unsu­per­vised gen­er­a­tive mod­els which learn to mimic text sequences. I sug­gest extend­ing char-RNNs with inline meta­data such as genre or author pre­fixed to each line of input, allow­ing for bet­ter & more effi­cient meta­data, and more con­trol­lable sam­pling of gen­er­ated out­put by feed­ing in desired meta­da­ta. A 2015 exper­i­ment using torch-rnn on a set of ~30 Project Guten­berg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remem­ber meta­data such as authors, learn asso­ci­ated prose styles, and often gen­er­ate text vis­i­bly sim­i­lar to that of a spec­i­fied author.

I fur­ther try & fail to train a char-RNN on Geoc­i­ties HTML for unclear rea­sons.

More successfully, I experiment with an alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And is better still.

A character-level recurrent neural network (“char-RNN”) trained on a text corpus can produce amusing textual output mimicking it. Music can also be generated by a char-RNN if it is trained on textual scores or transcriptions, and some effective music has been produced this way (I particularly liked Sturm’s).

A char-RNN is simple: during training, it takes a binary blob (its memory or “hidden state”) and tries to predict a character based on it and a new binary blob; that binary blob gets fed back in to a second copy of the RNN which tries to predict the second character using the second binary blob, and this gets fed into a third copy of the RNN and so on (“unrolling through time”). Whether each character is correct is the training error, which gets backpropagated to the previous RNNs; since they are still hanging around in RAM, blame can be assigned appropriately, and eventually gibberish hopefully evolves into a powerful sequence modeler which learns how to compactly encode relevant memories into the hidden state, and what characters can be predicted from the hidden state. This doesn’t require us to have labels or complex loss functions or a big apparatus; the RNN gets trained character by character.
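To make the prediction task concrete, here is a minimal sketch of the training targets only (it is not an RNN): for one line of text it lists each context and the next character the model would be scored on. The function name next_char_pairs is hypothetical, and the brackets are only there to make spaces visible.

```shell
# List the (context -> next character) pairs a character-level model is
# trained to predict, for each line on stdin. Illustration only.
next_char_pairs() {
  awk '{
    for (i = 1; i < length($0); i++)
      printf "[%s] -> [%s]\n", substr($0, 1, i), substr($0, i + 1, 1)
  }'
}

printf 'To be\n' | next_char_pairs
```

A real char-RNN never sees the full context string at each step; it sees only the current character plus its hidden state, which is why the hidden state has to learn to summarize everything relevant that came before.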

Handling multiple corpuses

A problem with this approach is that a char-RNN has to be trained for each corpus: if you want Shakespearean gibberish, you must train it only on Shakespeare, and if you want Irish music, you must train only on Irish music. If you don’t, and you create a corpus which is Shakespeare concatenated with the Bible, you will probably get something halfway between the two, which might be somewhat interesting, but is not a step forward to generating better & more interesting gibberish. Or if you have a few hundred songs of Irish music written in ABC format and a few dozen rock or classical pieces written in MIDI, training an RNN on them all mixed together will simply yield gibberish output, because you will get an ‘average syntax’ of ABC & MIDI and an ‘average music’ of Irish & rock. This is in part because the training is unsupervised in the sense that the char-RNN is only attempting to predict the next character given the previous characters, and it has no reason to give you just Shakespeare or just Bible output; it is bouncing between them.

However, it seems like it should be possible to do this. An RNN is a powerful neural network, and we can see in examples using Karpathy’s char-rnn that such RNNs have learned ‘sublanguages’: in the Linux C source code examples, the RNN has learned to switch appropriately between comments, source code, and string literals; in the CSS examples, it’s learned to switch between comments, CSS source code, string literals, URLs, and data-URIs. If the RNN can decide on its own while generating C or CSS to switch from “source code mode” to “comment mode”, then it should be able to also learn to switch between Shakespeare and Bible mode, or even more authors.

If we could get the RNN to do such switching on demand, there are several possible benefits. Human-authored textual output is always more similar than different: a text file of Shakespeare is much more similar to a text file of the Bible than it is to an equivalent length of ASCII generated at random such as $M@Spc&kl?,U.(rUB)x9U0gd6G; a baroque classical music score is more similar to a transcript of a traditional Irish music jam than either is to random notes. Since they share such mutual information, an RNN trained to produce both Shakespeare and the Bible will be smaller than the sum of 2 RNNs trained for Shakespeare & the Bible separately; this makes it easier to share trained RNNs, since you can distribute 1 RNN covering many genres or authors for people to play with, rather than having to train & host a dozen different RNNs. Such an RNN may also generate better output for all cases, since less of the corpuses’ information is spent on learning the basics of English shared by both corpuses and more is available for learning the finer details of each kind of writing; this may help in cases like music where large datasets of textual transcriptions of a desired genre may not be available (by training on a large corpus of classical music, a smaller corpus of Irish music may go further than it would’ve on its own). More speculatively, the metadata itself may dynamically improve generation by making it easier for the RNN to not ‘wander’: since the RNN is keeping a memory of the metadata in its hidden state, output may be more thematically coherent, as the RNN can periodically refer back to the hidden state to remember what it was talking about.

How can we do that? The RNN in the C or CSS exam­ples is able to mod­e-switch like this because, I think, there are clear tran­si­tion mark­ers inside the CSS or C which ‘tell’ the RNN that it needs to switch modes now; a com­ment begins /* ... or a data-URI in CSS begins url('data:image/png;base64,...). In con­trast, the most straight­for­ward way of com­bin­ing music or books and feed­ing them into a char-RNN is to sim­ply con­cate­nate them; but then the RNN has no syn­tac­tic or seman­tic mark­ers which tell it where ‘Bible’ begins and ‘Shake­speare’ ends. Per­haps we can fix that by pro­vid­ing meta­data such as author/genre and turn­ing it into a semi­-su­per­vised task, some­how, along the lines of the source code: dis­tin­guish the text of one author from anoth­er, and then let the RNN learn the dis­tinc­tions on its own, just like the CSS/C.


There are two approaches for how to encode the meta­data into the RNN:

  1. in band: sys­tem­at­i­cally encode the meta­data into the cor­pus itself, such as by a pre­fixed or suffixed string, and hope that the RNN will be able to learn the rel­e­vance of the meta­data and use it dur­ing train­ing to improve its pre­dic­tions (which it should, as LSTM/GRU units are sup­posed to help prop­a­gate long-term depen­den­cies like this); then spe­cific gen­res or authors or styles can be elicited dur­ing sam­pling by pro­vid­ing that meta­data as a seed.

    So for example, a Shakespeare corpus might be transformed by prefixing & suffixing each line with a unique string which doesn’t appear in the corpus itself, eg “SHAKESPEARE|To be or not to be,|SHAKESPEARE”. Then during sampling, Shakespearean prose will be triggered like th sample.lua rnn.t7 -primetext "SHAKESPEARE|". (Why the pipe character? Because it’s rarely used in prose but isn’t hard to type or work with.) To add in more metadata, one adds in more prefixes; for example, perhaps the specific work might be thought relevant and so the corpus is transformed to “SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE”. Then one can sample with the specific work, author, or both. For musical generation, relevant metadata might be musical genre, author, tempo, instruments, type of work, tags provided by music listeners (“energetic”, “sad”, “for_running” etc), so one could ask for energetic Irish music for two fiddles.

    This has the advan­tage of being easy to set up (some regexes to add meta­data) and easy to extend (take an exist­ing trained RNN and use it on the mod­i­fied cor­pus); the dis­ad­van­tage is that it may not work as the RNN may be unable to jointly learn to recall and use the meta­data—it may instead learn to for­get the meta­data imme­di­ate­ly, or spend all its learn­ing capac­ity on mod­el­ing an ‘aver­age’ input because that yields bet­ter log-loss error. This in band approach can also eas­ily be extended to cover clas­si­fi­ca­tion; in clas­si­fi­ca­tion, the meta­data is put at the end of each line, so instead of learn­ing to pre­dict text con­di­tional on meta­data & pre­vi­ous text, the RNN is learn­ing to pre­dict meta­data con­di­tional on pre­vi­ous text, and clas­si­fi­ca­tions can be extracted by low-tem­per­a­ture sam­pling with the input as the prime text fol­lowed by the sep­a­ra­tor char­ac­ter and see­ing what meta­data is pre­dicted (eg th sample.lua classification.t7 -temperature 0.1 -primetext "...text...|" → "SHAKESPEARE\n").

    As far as I know, no one has done this except per­haps inad­ver­tently or implic­it­ly.

  2. out of band: instead of depending on the RNN to learn the value of the metadata and preserve it in its hidden state, one can change the RNN architecture to inject the metadata at each timestep. So if one has an RNN of 500 neurons, 5 of them will be hardwired at each timestep to the metadata value for the sequence being worked on.

    The down­side is that all meta­data inputs will require mod­i­fi­ca­tion of the RNN archi­tec­ture to map them onto a par­tic­u­lar hid­den neu­ron. The advan­tage is that the meta­data value will always be pre­sent, there is no need to hope that the RNN will learn to hold onto the meta­data, and it only has to learn the asso­ci­ated differ­ences; so it will learn more reli­ably and faster. Vari­ants of this turn out to have been done before:

    1. Mikolov & Zweig 2012, “Context dependent recurrent neural network language model”: RNN augmented with topic information from LDA, achieving better prediction on the Penn Treebank & WSJ transcription task

    2. Aransa et al 2013/2015, “Improv­ing Con­tin­u­ous Space Lan­guage Mod­els using Aux­il­iary Fea­tures”: a feed­for­ward NN given n char­ac­ters at a time, with the inputs at each sequence includ­ing embed­dings of the pre­vi­ous lines and, par­tic­u­lar­ly, 5 ‘gen­res’ (in this case, Egypt­ian Ara­bic SMS/chat, mod­ern stan­dard Ara­bic, Egypt­ian Ara­bic forum dis­cus­sions, Lev­an­tine forum dis­cus­sions, for­mal MSA from UN trans­la­tions, Egypt­ian Ara­bic tele­phone call­s), hard­wired into the input lay­er; find­ing that genre par­tic­u­larly helped BLEU scores. (In­clud­ing meta­data like genre to assist train­ing appears to have been used fairly reg­u­larly in ear­lier text top­ic-mod­el­ing work, but not so much neural net­works or for increas­ing real­ism of gen­er­ated tex­t.)

    3. Chen et al 2015, “Recur­rent Neural Net­work Lan­guage Model Adap­ta­tion for mul­ti­-Genre Broad­cast Speech Recog­ni­tion”: an RNN aug­mented with the text input being fed into stan­dard text top­ic-mod­el­ing algo­rithms like LDA, par­tially trained on BBC gen­res (advice/children/comedy/competition/documentary/drama/events/news), and the total out­puts from the topic algo­rithms hard­wired into the input layer along with the text; giv­ing mod­er­ate improve­ments on audio→­text tran­scrip­tion.

    4. Sennrich et al 2016, “Controlling Politeness in Neural Machine Translation via Side Constraints”: a standard neural machine translation system using RNNs in the encoder-decoder framework, here for translating English→German movie subtitles, but the German corpus’s sentences are annotated with politeness metadata describing the pronouns/verb conjugations; they obtain both better BLEU scores on translation and the ability to change the politeness of the generated German.

    5. This has also been done in Lipton et al 2015: they model beer reviews with a character-level RNN which is given metadata (beer types: “American IPA”, “Russian Imperial Stout”, “American Porter”, “Fruit/Vegetable Beer”, and “American Adjunct Lager”) as a hardwired input to the RNN at each timestep, noting that

      It might seem redun­dant to repli­cate xaux at each sequence step, but by pro­vid­ing it, we elim­i­nate pres­sure on the model to mem­o­rize it. Instead, all com­pu­ta­tion can focus on mod­el­ing the text and its inter­ac­tion with the aux­il­iary input…­Such mod­els have suc­cess­fully pro­duced (short) image cap­tions, but seem imprac­ti­cal for gen­er­at­ing full reviews at the char­ac­ter level because sig­nal from xaux must sur­vive for hun­dreds of sequence steps. We take inspi­ra­tion from an anal­ogy to human text gen­er­a­tion. Con­sider that given a topic and told to speak at length, a human might be apt to mean­der and ram­ble. But given a sub­ject to stare at, it is far eas­ier to remain focused.

      They expe­ri­enced trou­ble train­ing their beer char-RNN, and they adopt a strat­egy of train­ing nor­mally with­out the hard­wired meta­data down to a loss of <1.0/character and then train­ing with meta­data to a final loss of 0.7–0.8. This is rea­son­able because at a loss of 1.1 on Eng­lish text, sam­pled out­put has many clear errors, but at <0.9 the out­put becomes uncan­ny; it stands to rea­son that sub­tle differ­ences of style & vocab­u­lary will only begin to emerge once the RNN has the basics of Eng­lish down pat (the differ­ences between skilled authors’ Eng­lishes are, unsur­pris­ing­ly, smaller than the differ­ences between reg­u­lar Eng­lish & gib­ber­ish).

      Pre­train­ing+meta­data works well for Lip­ton et al 2015, but they don’t com­pare it to inlined meta­data or show that the pre­train­ing is nec­es­sary. I am also a lit­tle skep­ti­cal about the ratio­nale that out of band sig­nal­ing is use­ful because it puts less pres­sure on the hid­den state: while it may reduce pres­sure on the RNN’s LSTMs to mem­o­rize the meta­data, one is still los­ing RAM to rein­ject­ing the meta­data into the RNN at every timestep. Either way, the meta­data must be stored some­where in RAM and it does­n’t make much differ­ence if it’s 495 effec­tive neu­rons (with 5 hard­wired to meta­data) or if it’s 500 effec­tive neu­rons (of which 5 even­tu­ally get trained to hold meta­data, yield­ing 495 effec­tive neu­ron­s). Pre­train­ing also won’t work with torch-rnn as the word-em­bed­ding it com­putes is differ­ent on each dataset, so it’s cur­rently impos­si­ble to train on an unla­beled dataset, change the data to labeled, and resume train­ing.

    6. after my experiments here, DeepMind published a CNN for generating raw audio: WaveNet, van den Oord et al 2016. They noted similar phenomena: the WaveNet could imitate specific speakers if provided speaker labels along with the raw audio, and specifying metadata like instruments allowed control of generated musical output. Another later Google paper, Johnson et al 2016, applies in-band metadata to generalize an RNN translator by specifying the target language in-band and having the RNN learn how to exploit this metadata for better natural language generation and the ability to translate between language pairs with no available corpuses.

Given the attrac­tive sim­plic­i­ty, I am going to try in band meta­da­ta.
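The two in-band encodings described above (metadata wrapped around the text for generation, metadata after the text for classification) can be sketched with sed. The function names label_for_generation and label_for_classification are hypothetical, and the sketch assumes the label contains no characters special to sed (plain uppercase names like SHAKESPEARE are fine).

```shell
# Generation format: wrap every stdin line as LABEL|text|LABEL
label_for_generation() {
  sed -e "s/^/$1|/" -e "s/\$/|$1/"
}

# Classification format: text first, label last (text|LABEL), so the
# model learns to predict the label from the preceding text
label_for_classification() {
  sed -e "s/\$/|$1/"
}

printf 'To be or not to be,\n' | label_for_generation SHAKESPEARE
# SHAKESPEARE|To be or not to be,|SHAKESPEARE
printf 'To be or not to be,\n' | label_for_classification SHAKESPEARE
# To be or not to be,|SHAKESPEARE
```

Nesting the first helper adds further levels of metadata: piping a line through label_for_generation HAMLET and then label_for_generation SHAKESPEARE yields the “SHAKESPEARE|HAMLET|…|HAMLET|SHAKESPEARE” form from the example above.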


The eas­i­est kind of data to test with is Eng­lish prose: I can rec­og­nize prose differ­ences eas­i­ly, and there are count­less nov­els or fic­tional works which can be con­verted into labeled prose.

If we just download some complete works off Project Gutenberg (googling ‘Project Gutenberg “complete works of”’), prefix each line with “$AUTHOR|”, concatenate the complete works, and throw them into char-rnn, we should not expect good results. The author metadata will now make up something like 5% of the entire character count (because PG wraps them to short lines). By training on 5M of exclusively Austen and then 5M of exclusively Churchill, we might run into overfitting problems, and due to the lack of proximity of different styles, the RNN might not ‘realize’ that the author metadata isn’t just some easily predicted & then ignored noise but can be used to predict far into the future. We also don’t want the PG headers explaining what PG is, and we want to make sure the files are all converted to ASCII.

So to deal with these 4 issues I’m going to process the PG col­lected works thus­ly:

  1. delete the first 80 lines and last ~300 lines, and fil­ter out any line men­tion­ing “Guten­berg”

  2. con­vert to ASCII

  3. delete all new­lines and then rewrap to make lines which are 10000 bytes—­long enough to have a great deal of inter­nal struc­ture and form a good batch to learn from, and thus can be ran­domly sorted with the oth­ers.

    But new­lines do carry seman­tic infor­ma­tion—­think about dia­logues—and does delet­ing them carry a cost? Per­haps we should map new­lines to some rare char­ac­ter like tilde, or use the poetry con­ven­tion of denot­ing new­lines with for­ward-s­lash­es?

  4. pre­fix each long line with the author it was sam­pled from
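Step 1 of this list can be sketched as a small filter. This is an illustration under assumptions, not necessarily the exact pipeline used in the runs: the function name trim_boilerplate is hypothetical, the line counts are parameters, and the negative count for head relies on GNU coreutils.

```shell
# Drop the first HEAD_LINES and last TAIL_LINES lines of stdin, then
# filter out any remaining line mentioning "Gutenberg" (case-insensitive).
# Usage: trim_boilerplate HEAD_LINES TAIL_LINES < book.txt
trim_boilerplate() {
  tail -n +"$(( $1 + 1 ))" | head -n -"$2" | grep -v -i 'Gutenberg'
}

# Tiny demonstration with 2 header lines and 1 footer line:
printf '%s\n' header1 header2 body1 'Project Gutenberg note' body2 footer1 |
  trim_boilerplate 2 1
# body1
# body2
```

For the real files one would call it as trim_boilerplate 80 300, matching the counts given in the list.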


As a baseline, a char-RNN with 2×2500 neurons, trained with 50% dropout, batch-size 55, and BPTT length 200, on the PG dataset without any author prefixes or suffixes, converges to a validation loss of ~1.08 after ~20 epochs.

Training with prefixes

Small RNN

For my first try, I grabbed 7 authors, giv­ing a good final dataset of 46M, and fed it into char-rnn, choos­ing a fairly small 2-layer RNN and using up the rest of my GPU RAM by doing unrolling far more than the default 50 timesteps to encour­age it to learn the long-range depen­den­cies of style:

cd ~/src/char-rnn/data/
mkdir ./styles/ ; cd ./styles/

## "The Complete Project Gutenberg Works of Jane Austen" http://www.gutenberg.org/ebooks/31100
wget 'https://www.gutenberg.org/ebooks/31100.txt.utf-8' -O austen.txt
## "The Complete Works of Josh Billings" https://www.gutenberg.org/ebooks/36556
wget 'https://www.gutenberg.org/files/36556/36556-0.txt' -O billings.txt
## "Project Gutenberg Complete Works of Winston Churchill" http://www.gutenberg.org/ebooks/5400
wget 'https://www.gutenberg.org/ebooks/5400.txt.utf-8' -O churchill.txt
## "The Project Gutenberg Complete Works of Gilbert Parker" https://www.gutenberg.org/ebooks/6300
wget 'https://www.gutenberg.org/ebooks/6300.txt.utf-8' -O parker.txt
## "The Complete Works of William Shakespeare" http://www.gutenberg.org/ebooks/100
wget 'https://www.gutenberg.org/ebooks/100.txt.utf-8' -O shakespeare.txt
## "The Entire Project Gutenberg Works of Mark Twain" http://www.gutenberg.org/ebooks/3200
wget 'https://www.gutenberg.org/ebooks/3200.txt.utf-8' -O twain.txt
## "The Complete Works of Artemus Ward" https://www.gutenberg.org/ebooks/6946
wget 'https://www.gutenberg.org/ebooks/6946.txt.utf-8' -O ward.txt
du -ch *.txt; wc --char *.txt
# 4.2M  austen.txt
# 836K  billings.txt
# 9.0M  churchill.txt
# 34M   input.txt
# 12M   parker.txt
# 5.3M  shakespeare.txt
# 15M   twain.txt
# 12K   ward.txt
# 80M   total
#  4373566 austen.txt
#   849872 billings.txt
#  9350541 churchill.txt
# 34883356 input.txt
# 12288956 parker.txt
#  5465099 shakespeare.txt
# 15711658 twain.txt
#     9694 ward.txt
# 82932742 total
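rm -f input.txt ## remove any stale combined corpus first, so the loop does not label it as an author
for FILE in *.txt; do
  dos2unix "$FILE"
  AUTHOR=$(echo "$FILE" | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  ## drop the header, filter boilerplate, force ASCII, join lines, rewrap to 10000 bytes, prefix the author
  cat "$FILE" | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' ' ' | \
   fold --spaces --bytes --width=10000 | sed -e "s/^/$AUTHOR\|/" > "$FILE.transformed"
done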
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 747 -num_layers 2 -seq_length 187
# using CUDA on GPU 0...
# loading data files...
# cutting off end of data so that the batches/sequences divide evenly
# reshaping tensor...
# data load done. Number of data batches in train: 4852, val: 256, test: 0
# vocab size: 96
# creating an LSTM with 2 layers
# number of parameters in the model: 7066716
# cloning rnn
# cloning criterion
# 1/242600 (epoch 0.000), train_loss = 4.57489208, grad/param norm = 9.6573e-01, time/batch = 2.03s
# ...
# 15979/242600 (epoch 3.293), train_loss = 1.01393854, grad/param norm = 1.8754e-02, time/batch = 1.40s

This gets us a cor­pus in which every line spec­i­fies its author and then switches authors, while still being long enough to have read­able mean­ing. After about 22 hours of train­ing yield­ing a val­i­da­tion loss of 1.0402 (with lit­tle improve­ment evi­dent after the first 7 hours), we can try out our best can­di­date and see if it knows Shake­speare ver­sus Austen:

BEST=`ls cv/*.t7 | sort --field-separator="_" --key=4 --numeric-sort --reverse | tail -1`
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# SHAKESPEARE|is of no regular complexion.  The action of the plain chatter--"  "Alas, they
# have discovered what was to be afforded since then?"  "We can believe--for the signature of
# the Church."  "So they do, dear lord, do they their home?  Oh, no, to the devil which we
# have not written, the Church is not in the world; but not in this harmless way then to the
# captain of man--therefore while the praise of it was allurious he would not reflect on the
# curious man's hatch deemed that his life should be very con

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|" -seed 105
# now looked at him a sharp pleasure in passing southward and again in portion of his mother's
# reach of it. Suddenly the thing was said.  "We'll sit down and find out," he inquired, with a
# pity to see Mr. Carvel driving beside the bedroom, which was almost as much as he could bear
# the potion.  "You say you're strong," said Mrs. Holy, indignantly, "you won't have to go
# away, about the doctor. What is it?"  "Why, we are"

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|business, and the gout--a constant and foolish figure in which Fellowes' ring is
# nearer to distemper than meek and steady interest and clean iron. The episode for the future
# and the war, and the seedy and effective sun-elogs and the others ventured its remote room,
# whose hair was a suffering man--that the work of the circumstance interested him. It had no
# long served to open the papers to answer up a quiet road, free from the long row of white
# to the lash called No. 14,000 to a sweet conversatio

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|quarrelling with a little book, and so on, considering its sensations as to whether
# it were not possible to eat it.  He thought that the leader of the conference with his own
# death would be recognized as a common expression.  The men that mounted from motive powers,
# how big the calf, commander of the rights of the new economic steamer, the English, a lass
# of manhood, will exhibit no praise or increase out of a sort of meaning in the senses, and
# send them back to such a winter as we can go into t

We can see that while the RNN is pro­duc­ing very Eng­lish-sound­ing nov­el­is­tic prose and pro­duces its usual mix of flaw­less syn­tax and hilar­i­ous seman­tics (I par­tic­u­larly like the phrase “Oh, no, to the devil which we have not writ­ten, the Church is not in the world”), it has failed to learn the styles I was hop­ing for. The Austen and Twain sam­ples sound some­what like them­selves, but the Shake­speare sam­ples are totally wrong and sound like a Vic­to­rian Eng­lish nov­el. And given the lack of improve­ments on the val­i­da­tion set, it seems unlikely that another 10 epochs will rem­edy the sit­u­a­tion: the RNN should quickly learn how to use the very use­ful meta­da­ta.

Since the style varies so lit­tle between the sam­ples, I won­der if mim­ic­k­ing Eng­lish uses up all the capac­ity in the RNN? I gave it only 747 neu­rons, but I could’ve given it much more.

Larger RNN

So to try again:

  • to bet­ter pre­serve the seman­tics, instead of delet­ing new­li­nes, replace them with a slash
  • try much shorter lines of 1000 bytes (in­creas­ing the rel­a­tive den­sity of the meta­data)
  • back off on the very long back­prop­a­ga­tion through time, and instead, devote the GPU RAM to many more neu­rons.
  • the default set­ting for the val­i­da­tion set is a bit exces­sive here and I’d rather use some of that text for train­ing
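As a rough check on the ‘relative density of the metadata’ mentioned in the list above, the share of characters taken up by the prefix can be estimated from the prefix length and the wrap width. This is a back-of-the-envelope sketch: metadata_pct is a hypothetical helper, it ignores the trailing newline, and it treats the fold width as the actual line length (fold --spaces can produce slightly shorter lines).

```shell
# Percentage of each line occupied by the metadata prefix.
# Usage: metadata_pct PREFIX_LENGTH LINE_WIDTH
metadata_pct() {
  awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 100 * p / (p + w) }'
}

prefix='SHAKESPEARE|'
for width in 10000 1000; do
  echo "$width: $(metadata_pct "${#prefix}" "$width")%"
done
# 10000: 0.12%
# 1000: 1.19%
```

So moving from 10000-byte to 1000-byte lines raises the metadata share roughly tenfold, though it remains a small fraction of the corpus.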
rm input.txt *.transformed
for FILE in *.txt; do
  dos2unix "$FILE"
  AUTHOR=$(echo "$FILE" | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  ## as before, but newlines become '/' and lines are rewrapped to 1000 bytes
  cat "$FILE" | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' '/' | \
   fold --spaces --bytes --width=1000 | sed -e "s/^/$AUTHOR\|/" > "$FILE.transformed"
done
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2600 -num_layers 2 -val_frac 0.01
# ...data load done. Number of data batches in train: 18294, val: 192, test: 771
# vocab size: 96
# creating an LSTM with 2 layers
# number of parameters in the model: 82409696
# cloning rnn
# cloning criterion
# 1/914700 (epoch 0.000), train_loss = 4.80300702, grad/param norm = 1.1946e+00, time/batch = 2.78s
# 2/914700 (epoch 0.000), train_loss = 13.66862074, grad/param norm = 1.5432e+00, time/batch = 2.63s
# ...

Errored out of mem­ory early the next day; the val­i­da­tion loss is still pretty meh, but at 1.1705, can’t expect much, and indeed, the style is not impres­sive when I check sev­eral pre­fix­es:

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# seeding with SHAKESPEARE|
# --------------------------
# SHAKESPEARE|jung's own,/which is on the house again.  There is no endeavour to be dressed in the midst of the/present of
# Belle, who persuades himself to know to have a condition of/the half, but "The garnal she was necessary, but it was high,
# consecrets, and/excursions of the worst and thing and different honor to flew himself.  But/since the building closed the
# mass of inspiration of the children of French wind,/hurried down--but he was in the second farmer of the Cald endless figures,
# Mary/Maeaches, and t

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|mill./And now the good deal now be alone, there is no endeavour to be dreaming./In fact, what was the story of his
# state, must be a steady carriages of pointing out/both till he has walked at a long time, and not convinced that he
# remembers/her in this story of a purpose of this captain in stock. There was/no doubt of interest, that Mr. Crewe's
# mother could not be got the/loss of first poor sister, and who looked warm enough by a/great hay below and making a
# leaver and with laid with a murder to

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|nor contributed/she has filled on behind him.  He had been satisfied by little just as to/deliver that the inclination
# of the possession of a thousand expenses in the group of feeling had destroyed/him to descend.  The physical had he darted
# before him that he was worth a
# PARKER|George Pasha, for instance?"//"Then it is not the marvel of laws upon Sam and the Sellers."  She said/he would ask
# himself to, one day standing from the floor, as he/stood for the capital.  He was no good of conversation

Larger author count

Next, I decided to increase diver­sity of styles: ramp­ing up to 38 authors, includ­ing mod­ern SF/F fic­tion authors (Robert Jor­dan’s Wheel of Time, Gene Wolfe, R.A. Laffer­ty, Ryuk­ishi07’s Umineko no naku koro ni, Kafka), poetry ancient and mod­ern (Iliad, Beowulf, Dan­te, Keats, Coleridge, Poe, Whit­man, Gilbert & Sul­li­van), ancient fic­tion (the Bible), mis­cel­la­neous non­fic­tion (Aris­totle, Machi­avel­li, Paine) etc. By adding in many more authors from many differ­ent gen­res and time peri­ods, this may force the RNN to real­ize that it needs to take seri­ously the meta­data pre­fix.

wget 'https://dl.dropboxusercontent.com/u/182368464/umineko-compress.tar.xz'
tar xf umineko-compress.tar.xz && rm umineko-compress.tar.xz
mv umineko/umineko.txt  ryukishi07.txt; mv  umineko/wot.txt jordan.txt; rm -rf ./umineko/

cat /home/gwern/doc-misc/fiction/lafferty/*.txt > lafferty.txt
cat /home/gwern/doc-misc/fiction/wolfe/fiction/*.txt > wolfe.txt

wget 'https://www.gutenberg.org/ebooks/10031.txt.utf-8'  -O poe.txt && sleep 5s ## avoid anti-crawl defenses
wget 'https://www.gutenberg.org/ebooks/11.txt.utf-8'     -O carroll.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1232.txt.utf-8'   -O machiavelli.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/12699.txt.utf-8'  -O aristotle.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1322.txt.utf-8'   -O whitman.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/16328.txt.utf-8'  -O beowulf.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1661.txt.utf-8'   -O doyle.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/23684.txt.utf-8'  -O keats.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/2383.txt.utf-8'   -O chaucer.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/2701.txt.utf-8'   -O melville.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/30.txt.utf-8'     -O bible.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3090.txt.utf-8'   -O maupassant.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/31270.txt.utf-8'  -O paine.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3253.txt.utf-8'   -O lincoln.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/345.txt.utf-8'    -O stoker.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3567.txt.utf-8'   -O bonaparte.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3600.txt.utf-8'   -O montaigne.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4200.txt.utf-8'   -O pepys.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4361.txt.utf-8'   -O sherman.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4367.txt.utf-8'   -O grant.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/6130.txt.utf-8'   -O homer.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/7849.txt.utf-8'   -O kafka.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/808.txt.utf-8'    -O gilbertsullivan.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/8800.txt.utf-8'   -O dante.txt && sleep 5s
wget 'https://www.gutenberg.org/files/28289/28289-0.txt' -O eliot.txt && sleep 5s
wget 'https://www.gutenberg.org/files/29090/29090-0.txt' -O coleridge.txt && sleep 5s
wget 'https://www.gutenberg.org/files/5000/5000-8.txt'   -O davinci.txt && sleep 5s

Due to the OOM crash, I decreased the neuron count. With a much bigger model, it is also necessary to enable dropout (with the default of 0, progress halts around a loss of 3.5 and there is no discernible improvement for hours):

rm input.txt *.transformed *.t7
wc --char *.txt
# 100972224 total
for FILE in *.txt; do
  dos2unix "$FILE"
  AUTHOR=$(echo "$FILE" | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  ## filter more boilerplate, join lines, squeeze spaces, rewrap to 3000 bytes, cap each author at 1M
  cat "$FILE" | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' -e 'ISBN' \
   | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
   fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > "$FILE.transformed"
done
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2400 -num_layers 2 -val_frac 0.01 -dropout 0.5
# ...data load done. Number of data batches in train: 39862, val: 419, test: 1679
# vocab size: 98
# creating an LSTM with 2 layers
# number of parameters in the model: 70334498
# cloning rnn
# cloning criterion
# 1/1993100 (epoch 0.000), train_loss = 4.68234798, grad/param norm = 7.4220e-01, time/batch = 2.53s
# 2/1993100 (epoch 0.000), train_loss = 13.00693768, grad/param norm = 1.7191e+00, time/batch = 2.35s
# ...

This did OK but seemed to have difficulty improving past a loss of 1.14, had issues with exploding error (one exploding error up to a loss of 59 terminated an overnight training run), and then began erroring out every time I tried to resume. So I began a third try, this time experimenting with deeper layers and adding data preprocessing steps to catch various control-characters and copyright/boilerplate which snuck in:

nice th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 1000 -num_layers 3 -val_frac 0.005 -seq_length 75 -dropout 0.7

This one even­tu­ally exploded too, hav­ing maxed out at a loss of 1.185.

After delet­ing even more con­trol char­ac­ters and con­stantly restart­ing after explo­sions (which had become a reg­u­lar thing as the val­i­da­tion loss began bounc­ing around a range of 1.09–1.2, the RNN seem­ing to have severe trou­ble doing any bet­ter) I did some sam­pling. The results are curi­ous: the RNN has mem­o­rized the pre­fix­es, of course, and at higher tem­per­a­tures will spon­ta­neously end with a new­line and begin with a new pre­fix; many of the pre­fixes like “BIBLE|” look noth­ing like the orig­i­nal source, but the “JORDAN|” pre­fix per­forms extremely well in mim­ic­k­ing the Wheel of Time, drop­ping in many char­ac­ter names and WoT neol­o­gisms like “Aiel” or (of course) “Aes Sedai”. This isn’t too sur­pris­ing since the WoT cor­pus makes up 20M or a sixth of the input; it’s also not too sur­pris­ing when WoT terms pop up with other pre­fix­es, but they do so at a far lower rate. So at least to some extent, the RNN has learned to use Jor­dan ver­sus non-Jor­dan pre­fixes to decide whether to drop in WoT vocab. The next largest author in the cor­pus is Mark Twain, and here too we see some­thing sim­i­lar: when gen­er­at­ing Twain text, we see a lot of words that sound like Twain vocab­u­lary (river­boats, “Amer­ica”, “the Con­sti­tu­tion” etc), and while these some­times pop up in the smaller pre­fix sam­ples it’s at a much lower rate. So the RNN is learn­ing that differ­ent pre­fixes indi­cate differ­ent vocab­u­lar­ies, but it’s only doing this well on the largest authors.
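The claim that WoT terms appear under other prefixes "at a far lower rate" can be checked mechanically. The following is a minimal sketch, assuming the sampled text has been saved one `AUTHOR|text` line per row; the function name `term_rate_by_prefix` and the toy sample lines are hypothetical stand-ins, not output from the trained RNN.

```shell
# term_rate_by_prefix TERM FILE
# For each author prefix, print: prefix, number of lines containing TERM, total lines.
term_rate_by_prefix() {
  awk -F'|' -v term="$1" '
    { total[$1]++; if (index($0, term) > 0) hits[$1]++ }
    END { for (p in total) printf "%s %d %d\n", p, hits[p], total[p] }
  ' "$2" | sort
}

# Toy stand-in for sampled text (hypothetical lines, for illustration only).
SAMPLES=$(mktemp)
cat > "$SAMPLES" <<'EOF'
JORDAN|The Aes Sedai rode north toward Tar Valon.
JORDAN|Rand watched the Aiel from the ridge.
BIBLE|And the LORD spake unto the Aes Sedai, saying,
BIBLE|And it came to pass at the end of three days.
TWAIN|The riverboat drifted past the landing.
EOF

term_rate_by_prefix 'Aes Sedai' "$SAMPLES"
rm -f "$SAMPLES"
```

Comparing the hit count against the line total for each prefix gives a rough per-author rate; on real samples one would want far more lines per prefix before reading anything into the differences.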

Class imbalance fix

Does this reflect that <2M of text from an author is too lit­tle to learn from and so the bet­ter-learned authors’ mate­r­ial inher­ently pulls the weaker sam­ples towards them (bor­row­ing strength), that the other authors’ differ­ences are too sub­tle com­pared to the dis­tinctly differ­ent vocab of Jor­dan & Twain (so the RNN focuses on the more pre­dic­tive­ly-valu­able differ­ences in neol­o­gisms etc), or that the RNN is too small to store the differ­ences between so many authors?

For comparison, a one-layer RNN trained solely on the Robert Jordan corpus (but still formatted with prefixes etc) got down to a loss of 0.9638, and one trained on just the Bible, 0.9420. So the penalty for the Bible for having to learn Jordan is 0.9763 − 0.9420 = 0.0343, and vice-versa is 0.9763 − 0.9638 = 0.0125. Presumably the reason the Bible RNN is hurt 2.7× more is that the Jordan corpus is 4.3× larger, so more learning capacity goes to its vocabulary & style, since a bias towards Jordan style pays off more in reduced loss: a classic class-imbalance problem.
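The subtraction above is easy to reproduce. This sketch only redoes the arithmetic with the three losses quoted in the text; the variable and function names are my own, and I am assuming 0.9763 is the loss of the RNN trained on both corpuses together.

```shell
export LC_ALL=C   # keep a decimal point in awk's output regardless of locale

COMBINED=0.9763   # assumed: loss when trained on Bible + Jordan together
JORDAN=0.9638     # loss when trained on Jordan alone
BIBLE=0.9420      # loss when trained on the Bible alone

# penalty A B: print A - B to 4 decimal places
penalty() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a - b }'; }

BIBLE_PENALTY=$(penalty "$COMBINED" "$BIBLE")
JORDAN_PENALTY=$(penalty "$COMBINED" "$JORDAN")
RATIO=$(awk -v x="$BIBLE_PENALTY" -v y="$JORDAN_PENALTY" 'BEGIN { printf "%.1f\n", x / y }')

echo "Bible penalty:  $BIBLE_PENALTY"    # 0.0343
echo "Jordan penalty: $JORDAN_PENALTY"   # 0.0125
echo "Bible is hurt ${RATIO}x more"      # 2.7x
```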

Class-im­bal­ance prob­lems can some­times be fixed by chang­ing the loss func­tion to bet­ter match what one wants (such as by penal­iz­ing more errors on the smaller class), reduc­ing the too-big class, or increas­ing the too-s­mall class (by col­lect­ing more data or fak­ing that with data aug­men­ta­tion). I tried bal­anc­ing the cor­puses bet­ter by lim­it­ing how much was taken from the biggest.
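As a rough illustration of "reducing the too-big class", the sketch below caps each author's contribution with `head --bytes` and then tallies how much text each prefix ends up with. The toy files, the 200-byte cap, and the helper name `bytes_per_author` are all hypothetical, chosen only so the effect is visible at a glance.

```shell
# bytes_per_author FILE: characters of text per "AUTHOR|" prefix, largest first.
bytes_per_author() {
  awk -F'|' '{ n[$1] += length($0) - length($1) - 1 } END { for (a in n) print n[a], a }' "$1" \
    | sort -k1,1nr -k2,2
}

WORK=$(mktemp -d)
# Toy corpus: one "author" has 10x as much text as the other.
printf 'a%.0s' $(seq 1 1000) > "$WORK/big.txt"
printf 'b%.0s' $(seq 1 100)  > "$WORK/small.txt"

CAP=200   # hypothetical cap in bytes, standing in for a much larger real limit
for FILE in "$WORK"/*.txt; do
  AUTHOR=$(basename "$FILE" .txt | tr '[:lower:]' '[:upper:]')
  # cap first, add a final newline, fold into fixed-width lines, then label each line
  { head --bytes="$CAP" "$FILE"; echo; } | fold --bytes --width=50 | sed -e "s/^/$AUTHOR|/"
done > "$WORK/combined.labeled"

bytes_per_author "$WORK/combined.labeled"
# 200 BIG
# 100 SMALL
rm -rf "$WORK"
```

After capping, the 10:1 imbalance shrinks to 2:1; authors already under the cap are untouched, so a cap alone cannot help the smallest classes.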

Also at this time, torch-rnn was released by Justin John­son, with claims of much greater mem­ory effi­ciency & bet­ter per­for­mance com­pared to char-rnn, so I tried it out. torch-rnn was capa­ble of train­ing larger RNNs, and I expe­ri­enced many fewer prob­lems with explod­ing loss or OOM errors, so I switched to using it. The pre­pro­cess­ing step remains much the same, with the excep­tion of a | head --bytes=1M call added to the pipeline to limit each of the 31 authors to 1MB:

rm *.transformed
for FILE in *.txt; do
  dos2unix "$FILE"
  AUTHOR=$(echo "$FILE" | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  cat "$FILE" | tail -n +80 | head -n -362 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
    -e 'ISBN' | tr -d '[:cntrl:]' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
    fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > "$FILE.transformed"
done
cat *.transformed | shuf > input.txt

## with limiting:
findhog *.transformed
# 8   coleridge.txt.transformed
# 8   dante.txt.transformed
# 8   davinci.txt.transformed
# 8   eliot.txt.transformed
# 8   gilbertsullivan.txt.transformed
# 8   grant.txt.transformed
# 8   homer.txt.transformed
# 8   kafka.txt.transformed
# 8   pepys.txt.transformed
# 8   sherman.txt.transformed
# 152 carroll.txt.transformed
# 240 keats.txt.transformed
# 244 beowulf.txt.transformed
# 284 machiavelli.txt.transformed
# 356 poe.txt.transformed
# 560 doyle.txt.transformed
# 596 aristotle.txt.transformed
# 692 whitman.txt.transformed
# 832 stoker.txt.transformed
# 1028    bible.txt.transformed
# 1028    bonaparte.txt.transformed
# 1028    chaucer.txt.transformed
# 1028    jordan.txt.transformed
# 1028    lafferty.txt.transformed
# 1028    lincoln.txt.transformed
# 1028    maupassant.txt.transformed
# 1028    melville.txt.transformed
# 1028    montaigne.txt.transformed
# 1028    paine.txt.transformed
# 1028    ryukishi07.txt.transformed
# 1028    wolfe.txt.transformed

cd ../../
python scripts/preprocess.py --input_txt data/multi/input.txt --output_h5 multi.h5 --output_json multi.json --val_frac 0.005 --test_frac 0.005
nice th train.lua -input_h5 multi.h5 -input_json multi.json -batch_size 100 -seq_length 70 -dropout 0.5 -rnn_size 2500 -num_layers 2
# ...
# Epoch 28.52 / 50, i = 65000 / 118100, loss = 0.901009
# val_loss =      1.028011712161

This trained to convergence with a loss of ~1.03 after ~30 epochs taking a week or two, yielding 2016-03-27-metadata.t7 (583MB). This is ~0.05 better than the unlabeled baseline.

Did it suc­ceed in learn­ing to use the meta­data and mim­ic­k­ing style?


Yes. Sam­pling 80K char­ac­ters of text on CPU and set­ting the tem­per­a­ture high enough that the RNN will peri­od­i­cally emit a new­line and jump to a new mode with the invo­ca­tion th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 80000 -temperature 0.8 -start_text 'JORDAN|', there are 13 tran­si­tions:

  1. Jor­dan: short but fail. Men­tions “Lon­don”, “Jacques”, “Nan­tucket”, etc

  2. Mau­pas­sant: suc­cess. Poi­son, mur­der, city etc

  3. Lafferty: mixed suc­cess. Clubs, girls, Chicago, heavy on dia­logue, and Amer­i­can names, but also some vocab­u­lary creep­ing in from other authors such as “Tar Valon” (Jor­dan)

  4. Chaucer: suc­cess. Clearly old-timey with invo­ca­tions of Jesus. Sam­ple:

    “…through­out this world, and shall thereby be called in trust, as now O first cause of this world we have no dan­ger; That women were with you and the mes­sage, As I loved them they that should pray: No more of this so lit­tle wicked­ness.” When she saw him that there was no wight to see, For in h is cursed peace, his Chris­te’s hand, And cried his daugh­ter many a long time For he took her out of the world so dear. And she was not holy and more jol­ly, Had wed­ded her no sooth and blithe sore; The lady is this mar­riage and her wife. Come to the priest, what woe we have to do, And thanke him to make a dream, and I can Thomas, with that he saide, may I not stand: And the time went him all out of the town, And with the corpse, and set­tled him like As Jesus Christ, as he was thought, They would have been a full con­fused grace.

  5. Whit­man: short but suc­cess?

    WHITMAN|but lusty, clos­ing the walls, Who are the clauses of cav­alry with

  6. Chaucer: suc­cess

  7. Lin­coln: suc­cess. Sam­ple:

    LINCOLN|of his con­sti­tu­tional affairs, is bet­ter put down by their own things than above the extent of the major­ity of the peo­ple or of the Repub­li­cans of the United States which in the extremes may be said to be one of those who will obtain bad negro as ill-de­manded and sim­ple means as they have belonged. r. Pitt in the same man­ner in Par­lia­ment I have not seen him in the other uncom­mon per­sonal expe­di­tion to the British court, and that his thirst was the object, or in which he wrote lib­erty for sup­port­ing him in the present day with an extreme res­o­lu­tion of the sov­er­eign­ty…

  8. Bible: suc­cess. Sam­ple:

    BIBLE|with him two cities which I com­manded them; he shall not die: for the LORD is among us. And the LORD was come unto his son that sent him to seek the way to Adon. 02:019:019 And it came to pass at the end of three days after the peo­ple of Israel, that they had to touch their voice, and give him a south, and be cut before Pharaoh: 04:030:028 And the LORD spake unto oses, say­ing, 03:022:002 There shall not a man be found out of the house of the LORD. 03:013:028 And the priest shall have one lot and the length of the bul­lock, and shall put the blood upon the altar, and put the altar of gold to his feet, and set his fin­ger in water, and shall come into the plain. 03:011:027 And the priest shall take the but­ler and the head of the ser­vant shall sprin­kle it out, and the priest shall burn it into a ring, and cover the fat that is upon the altar, and shall pitch it out. 03:001:004 And he shall put the lamps in water, even a tres­pass offer­ing, and the hang­ing for the robe of the burnt offer­ing, and put the altar of shit­tim wood, and burn the altar of burnt offer­ing unto the LORD.

  9. Stoker: suc­cess. Vic­to­rian Eng­lish, men­tion of ceme­ter­ies, dis­emvow­el­ing, Van Hels­ing.

  10. Lafferty: mixed suc­cess. More Chicago and Laffer­ty-like vocab­u­lary, but what is “Ren­field” doing there—that’s Stok­er!

  11. Ryuk­ishi07: suc­cess. Sam­ple:

    RYUKISHI07|of some­thing like that. You can stop too long, a lit­tle bit more spin­ning stuff. You could put away the first side of your way out on the study at the end of the ‘Sea From Bat­tler’. “I see, isn’t it‽ Ooooooohh­hh…” In other words, if the seag­ulls had been known to have been over there already, the Shan­non would­n’t have accepted a ser­vant. …And when George-aniki sud­denly put his head over and spat on his shoul­ders, Rand said, show­ing some rela­tion­ship to her. He was calm and was jeal­ous of his nearly much image or expe­ri­ence. “………………Ha­ha­ha­ha­ha……….” Nat­suhi noticed that tune from the warm block, and it was quite a small part of it… “I’m not gonna be out of the main way. Where’s the witch‽” Nat­suhi oba-san said some­thing about forty… The fork of gold was­n’t like whis­per­ing every day. “…You’re still unable to make me. Now if you stay back to the back of the world part of my heart, that’s wrong. …………But I really have here a mag­a­zine.” “Ah, ………­don’t worry about it. I would­n’t call a lot one.” “That’s right. …If it was a metal bird, I would also stay here. I’m sor­ry, but it’s a fan­tas­tic per­son who is still liv­ing in your speed… If you could­n’t think of it, that’s right. If you want to call me a bed, I’d be swept by your duty and you may be fine.” “…………………” “……W, ………what are you going to do with the cul­prit? Did you say some­thing like that…?” Nat­suhi returned the rose gar­den. As the announce­ment had fin­ished look­ing over his, he heard the over­whelm­ing sound of the falling hair, on the win­dows, his eyes slic­ing around the sound of a pair of hold of holes in one hand. …

  12. Doyle: mixed suc­cess. There appears to be infil­tra­tion from Lin­coln.

  13. Mon­taigne: mixed suc­cess. Dis­cusses France, but also Melville’s Nan­tuck­et.

So of the 13 samples, 8 were definitely in the style of the right author, 4 were mixed successes as they mostly resembled their author but not entirely, and only 1 was a clear failure. With 31 authors to choose from, that’s not an accident.
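Counting the transitions by eye is tedious for long samples. A minimal sketch of how the prefixes could be tallied, assuming the sampled text is saved to a file and each mode switch starts a new line with an upper-case `AUTHOR|` tag; the helper name `count_transitions` and the toy input are hypothetical.

```shell
# count_transitions FILE: tally the "AUTHOR|" prefixes that begin lines of sampled text.
count_transitions() {
  grep -oE '^[A-Z0-9]+\|' "$1" | tr -d '|' | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}

# Toy stand-in for saved sampler output (hypothetical lines).
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
JORDAN|short passage
CHAUCER|And with the corpse, and settled him
this continuation line carries no prefix
CHAUCER|Come to the priest
RYUKISHI07|of something like that.
EOF

count_transitions "$SAMPLE"
# 2 CHAUCER
# 1 JORDAN
# 1 RYUKISHI07
rm -f "$SAMPLE"
```

This only counts how often each prefix was emitted; judging whether the text after each prefix matches the author still has to be done by reading it.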

One Walt Whit­man pas­tiche sam­ple I gen­er­ated while test­ing struck me as quite poet­ic; with line breaks inserted where indi­cated by cap­i­tal­iza­tion:

And shes my brothers to be put upon me, intense and sound,
All are me. Sounds purified, O sound of the streets!
O landscapes! O still the fierce and the scraping of beauty!
The murderous twinkle of the sky and basement,
How the beasts at first began to bite and the waves near the floor.
The walls of lands discover'd passions,
Earth, sword-ships, enders, storms, pools, limailes, shapes of violent,
Rooters, alarms, the light-starring mail, untold arms, patients, portals, the well-managed number, the bravest farms,
The effect of doubts, the bad ways, the deeds of true signs, the curious things, the sound of the world,
It is of figure and anthem, the common battle rais'd,
The beautiful lips of the world that child in them can chase it

For a more sys­tem­atic look, I gen­er­ated sam­ples from all included authors:

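# one sample per author prefix in the corpus
(for AUTHOR in ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI DOYLE ELIOT \
               GILBERTSULLIVAN GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN MACHIAVELLI MAUPASSANT \
               MELVILLE MONTAIGNE PAINE PEPYS POE RYUKISHI07 SHERMAN STOKER WHITMAN WOLFE; do
    th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 5000 -temperature 0.8 -start_text "$AUTHOR|"
done) > 2016-03-27-rnn-metadata-samples-all.txt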

The Eliot out­put was per­plex­ingly bad, con­sist­ing mostly of num­bers, so I looked at the orig­i­nal. It turned out that in this par­tic­u­lar cor­pus, 10 of the text files had failed to down­load, and instead, Project Guten­berg served up some HTML CAPTCHAs (not cool, guys)! This affect­ed: Coleridge, Dan­te, Da Vin­ci, Eliot, Gilbert & Sul­li­van, Grant, Homer, Kafka, Pepys, & Sher­man. (Check­ing the out­put, I also noticed that a num­ber of words start­ing with cap­i­tal ‘M’ were miss­ing the ‘M’, which I traced to the tr call try­ing to strip out con­trol char­ac­ters that did not do what I thought it did.) Exclud­ing the cor­rupted authors, I’d infor­mally rank the out­put sub­jec­tively as:

  • bad: Aris­totle, Beowulf, Bible, Chaucer, Jor­dan, Keats
  • uncer­tain: Car­roll, Wolfe
  • good: Stoker, Paine, Bonaparte, Lafferty, Melville, Doyle, Ryukishi07, Whitman, Machiavelli, Aristotle, Bible

The RNN is some­what incon­sis­tent: some­times it’ll gen­er­ate spot-on prose and other times fail. In this case, good and bad Bible sam­ples were pre­sent, and pre­vi­ous Chaucer was fine but the Chaucer in this sam­ple was bad. (This might be due to the high tem­per­a­ture set­ting, or the messed-up texts.) But over­all, it does­n’t change my con­clu­sion that the RNN has indeed learned to use meta­data and suc­cess­fully mimic differ­ent authors.

Training with prefixes+suffixes

The RNN seems to learn the con­nec­tion of the pre­fix meta­data to the vocab­u­lary & style of the fol­low­ing text only at the very end of train­ing, as sam­ples gen­er­ated before then tend to have dis­con­nected metadata/text. This might be due to the RNN ini­tially learn­ing to for­get the meta­data to focus on lan­guage mod­el­ing, and only after devel­op­ing an implicit model of the differ­ent kinds of text, ‘notice’ the con­nec­tion between the meta­data and kinds of text. (Or, to put it another way, it does­n’t learn to remem­ber the meta­data imme­di­ate­ly, as the meta­data tag is too dis­tant from the rel­e­vant text and the meta­data is only use­ful for too-sub­tle dis­tinc­tions which it has­n’t learned yet.) What if we tried to force the RNN to mem­o­rize the meta­data into the hid­den state, thereby mak­ing it eas­ier to draw on it for pre­dic­tions? One way of forc­ing the mem­o­riza­tion is to force it to pre­dict the meta­data later on; a sim­ple way to do this is to append the meta­data as well, so the RNN can improve pre­dic­tions at the end of a sam­ple (pre­dict­ing poorly if it has for­got­ten the orig­i­nal con­tex­t); so text would look some­thing like SHAKESPEARE|...to be or not to be...|SHAKESPEARE.

I mod­i­fied the data pre­pro­cess­ing script slightly to append the author as well, but oth­er­wise used the same dataset (in­clud­ing the cor­rupt authors) and train­ing set­tings.
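The change to the labeling step is small. A sketch of what the prefix+suffix labeling might look like, assuming the same `sed` approach as the earlier pipeline; the function name `label_both_ends` is hypothetical, and the author tag must not contain characters special to `sed` such as `/` or `&`.

```shell
# label_both_ends AUTHOR: wrap each input line as AUTHOR|...text...|AUTHOR
label_both_ends() {
  sed -e "s/^/$1|/" -e "s/\$/|$1/"
}

printf 'to be or not to be\n' | label_both_ends SHAKESPEARE
# SHAKESPEARE|to be or not to be|SHAKESPEARE
```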

My first try at append­ing resulted in a fail­ure, as it con­verged to a loss of 1.129 after a week or two of train­ing, much worse than the 1.03 achieved with pre­fix-on­ly. Sam­pling text indi­cated that it had learned to gen­er­ate ran­dom author meta­data at the end of each line, and had learned to mimic some differ­ent prose styles (eg Bib­li­cal prose vs non-Bib­li­cal) but it had not learned to mem­o­rize the pre­fix nor even the use of the pre­fix (!).

A sec­ond try with the same set­tings con­verged to 1.1227 after 25 epochs, with the same sam­pling per­for­mance.

In a third try, I resumed from that checkpoint but increased the BPTT unrolling seq_length from 50 to 210 to see if that would help it. It converged to 1.114 with suffixes still random. For a fourth try, I reduced dropout from 0.5 to 0.1, which did not make a difference and converged to 1.117 after 8 epochs.

So in this case, train­ing with suffixes did not speed up train­ing, and impeded learn­ing.

While I am not too sur­prised that suffixes did not speed up train­ing, I am sur­prised how it barred learn­ing pre­fixes at all and I don’t know why. This should have been, if any­thing, an eas­ier task.


I won­dered if the same meta­data approach could be used to trick the char-RNN into learn­ing clas­si­fi­ca­tion as well—per­haps if the RNN learns lan­guage mod­el­ing by try­ing to pre­dict sub­se­quent char­ac­ters, it acquires a greater nat­ural lan­guage under­stand­ing than if it was trained directly on pre­dict­ing the author?

I fixed the cor­rupted HTML files and the tr bug, and mod­i­fied the script to read fold --spaces --bytes --width=3000 (so each line is 3000 char­ac­ters long) and the author is now placed at the end: sed -e "s/$/\|$AUTHOR/". So the char-RNN is trained to pre­dict each sub­se­quent char­ac­ter, and at the end of 3000 char­ac­ters, it sees a | and (in the­o­ry) will then pre­dict the author. To test the results, one can feed in a short stereo­typ­i­cal piece of text end­ing in a pipe, and see if it is able to respond by gen­er­at­ing the author.

This turned out to be a total fail­ure. After over a week of train­ing, the val­i­da­tion loss had fallen to 1.02, yet when I sam­pled it, it was unable to clas­sify text, eg:

th sample.lua -gpu -1 -checkpoint `ls -t cv/*.t7|head -1` -length 44 -temperature 0.1 -start_text "Thou shalt not tempt the Lord thy God|B"
# Thou shalt not tempt the Lord thy God|Becaus

At best, it some­times would add ran­dom upcased text fol­low­ing the pipe (“|CHAPTER” was com­mon), or ran­dom authors (never the right one).

I thought per­haps the penalty for miss­ing the final char­ac­ters in a line was too small as it rep­re­sented no more than 0.3% of each line, and so I reduced the line-length down to 500 char­ac­ters (so the author was now ~2% of each line). This did­n’t work either (val­i­da­tion loss of ~1.12, prob­a­bly due to shorter lines with less con­text to work with), so I dis­abled dropout, added batch­norm, and increased the BPTT enough to back­prop­a­gate over the entire line.
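The percentages here follow from the label length divided by the line length. A quick check, using `MELVILLE` as a hypothetical mid-length author tag and ignoring the few characters the label itself adds to the line; `label_share` is my own helper name.

```shell
export LC_ALL=C   # keep a decimal point in awk's output regardless of locale

# label_share LABEL LINE_LENGTH: percent of a line taken up by "|LABEL"
label_share() {
  awk -v label="|$1" -v width="$2" 'BEGIN { printf "%.2f\n", 100 * length(label) / width }'
}

label_share MELVILLE 3000   # 0.30
label_share MELVILLE 500    # 1.80
```

Shorter tags such as `BIBLE` take up even less of the line, so the share of the loss attributable to the label stays small at either line length.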

After another week or two, the val­i­da­tion loss asymp­toted at ~1.09, but still no clas­si­fi­ca­tion per­for­mance. Here is a sam­ple (adding line-breaks for read­abil­ity at cap­i­tal­ized words which cor­re­spond to line­breaks in the orig­i­nal):

41 Book 40 With patient ones of the seas, the form of the sea which was gained the streets of the moon.
Yet more all contest in the place, See
the stream and constant spirit, that is of a material spirit,
The live of the storm of forms and the first stretch
Of the complexion of the mountains;
The sea fell at the tree, twenty feet wide,
And the taste of a scarlet spot where the captain bears,
She shook the sound the same that was white,
Where the permanent eye of the sea had scarce assembled,
The many such, the beauteous of a subject of such spectacles.
If thou be too sure that thou the second shall not last,
Thou canst not be the exceeding strength of all.
Thou wert as far off as thou goest, the sea Of the bands and the streams of the bloody stars
Of the world are the mountains of the sun,
And so the sun and the sand strike the light,
But each through the sea dead the sun and spire
And the beams of the mountain shed the spirits half so long,
That of the which we throw them all in air.
Think of thy seas, and come thee from that for him,
That thou hast slain in dreams, as they do not see
The horses; but the world beholds me; and behold
The same the dark shadows to the sand,
And stream and slipping of the darkness from the flood.
He that I shall be seen the flying strain,
That pierces with the wind, and the storm of many a thousand rays
Were seen from the act of love to the course.
There was a stream, and all the land and bare
Ereth shall thy spirit be suppos'd
To fall in water, and the wind should go home on all the parts
That stood and meet the world, that with the strong the place
Of thy prayer, or the continual rose,
So that the shape of the brand broke the face,
And to the band of the ring which erewhile
Is turn'd the merchant bride.
I am thine only then such as thou seest,
That the spirits stood in those ancient courses,
And in their spirit to be seen, as in the hard form
Of their laws the people in the land,
That they are between, that thou dost hear a strong shadow,
And then, nor war in all their powers, who purposes hanging to the road,
And to the living sorrow shall make thy days
Behold the strains of the fair streets, and burn,
And the shepherd for the day of the secret tear,
That thou seest so high shall be so many a man.
What can ye see, as sinking on the part
Of this reminiscence of the pursuit?
Behold the martial spirits of men of the rock,
From the flowers of the touch of the land with the sea and the blow
The steamer and the bust of the fair cloud.
The steps behind them still advanc'd, and drew,
As prepared they were alone all now
The sharp stick and all their shapes that winds,
And the trembling streams with silver the showering fires
The same resort; they stood there from the plain,
And shook their arms, sad and strong, and speaks the stars,
Or pointed and his head in the blood,
In light and blue he went, as the contrary came and beat his hands.
The stars, that heard what she approach'd, and drew
The shore, and thus her breast retraced the rushing throng:
"And more with every man the sun
Proclaims the force of future tongues
That this of all the streams are crack'd."
"The thought of me, alas!" said he,
"Now that the thirst of life your country's father sang,
That in the realms of this beast the prince
The victor from the true betray beginnings of the day."

The gen­er­ated text is semi­-in­ter­est­ing, so it’s not that the RNN was bro­ken. It was focused on learn­ing to model the aver­age text.

So it would seem that the clas­si­fi­ca­tion sig­nal was not strong enough to cause learn­ing of it. The wors­ened val­i­da­tion score sug­gests that this approach sim­ply won’t work: the longer the lines, the less incen­tive there is for clas­si­fi­ca­tion, but the shorter the lines, the worse it learns to model the reg­u­lar text.


Can we learn mul­ti­ple meta­data pre­fix­es? Like an author and then a trans­form of some sort—in music, a use­ful trans­form might be time sig­na­ture or instru­ment set.

A simple transform we could apply here is upcasing and downcasing every character, so we might have a set of 6 prefixes like Bible+upcase, Bible+downcase, Bible+mix, etc, written as BIBLE|U|, BIBLE|D|, BIBLE|M|, and to help enforce abstraction, also reverse ordering like U|BIBLE|, giving 12 total prefixes (3×2×2). The interesting question here is whether the RNN would be able to factor out the transformations and learn the up/mix/downcase transformation separately from the Bible/Jordan difference in styles. (If it thought that Jordan upcased was a different author, and to be learned differently, from Jordan downcased, then we would have to conclude that it was not seeing two pieces of metadata, Jordan+upcase, but seeing it as one JORDANUPCASE, and a failure of both learning and abstraction.) But if we included each of the 12 prefixes, then we wouldn’t know if it had managed to do this, since it could have learned each of the 12 separately, which might or might not show up as much worse performance. So we should leave out two prefixes: one to test out generalization of casing, and one to test out swapping (dropping 1 from Bible and 1 from Jordan to be fair). At the end, we should get an RNN with a validation loss slightly worse than 0.9763 (the extra transformation & keyword must cost something), and one which will hopefully be able to yield the correct output for the held-out prefixes JORDAN|U| and M|BIBLE|:

rm *.t7 *.transformed input.txt
for FILE in *.txt; do
  AUTHOR=$(echo "$FILE" | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  TEXT=$(cat "$FILE" | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
    -e 'ISBN' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g')
  # author-first prefixes: mixed, upcased, downcased
  echo "$TEXT" | fold --spaces --width=3000 |                              sed -e "s/^/$AUTHOR\|M\|/" >> "$FILE.transformed"
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/$AUTHOR\|U\|/" >> "$FILE.transformed"
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/$AUTHOR\|D\|/" >> "$FILE.transformed"
  # case-first prefixes: the same three transforms with the order swapped
  echo "$TEXT" | fold --spaces --width=3000 |                              sed -e "s/^/M\|$AUTHOR\|/" >> "$FILE.transformed"
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/U\|$AUTHOR\|/" >> "$FILE.transformed"
  echo "$TEXT" | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/D\|$AUTHOR\|/" >> "$FILE.transformed"
done
# hold out one casing combination and one swapped ordering for testing generalization
cat *.transformed | grep -v -e "JORDAN|U|" -e "M|BIBLE|" | shuf > input.txt
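To make the prefix design concrete, here is a small sketch that enumerates the 12 author/case prefixes for two authors and removes the two held-out ones, mirroring the `grep -v` above. The helper name `all_prefixes` is hypothetical; it prints only the prefixes, not any text.

```shell
# all_prefixes AUTHOR...: print every author/case prefix in both orderings.
all_prefixes() {
  for AUTHOR in "$@"; do
    for CASE in M U D; do
      echo "$AUTHOR|$CASE|"
      echo "$CASE|$AUTHOR|"
    done
  done
}

# All 12 combinations (2 authors x 3 casings x 2 orderings).
all_prefixes BIBLE JORDAN | sort

# The 10 that remain in training after holding out one casing and one swap.
all_prefixes BIBLE JORDAN | grep -v -F -x -e 'JORDAN|U|' -e 'M|BIBLE|' | sort
```

Note that the swapped form `U|JORDAN|` and the unswapped form `BIBLE|M|` both stay in training, so each held-out prefix has its nearest relatives available for the RNN to generalize from.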

The first version, without dropout, got to a loss of 0.7969 (!). Is this contamination or leakage of the validation set? But since the versions in the validation set could only be differently-cased versions of training text, the RNN would have had to learn the case transformation to exploit them, so is it really leakage at all? After it hit a limit at 0.79 and started turning in losses of 0.8+ for hours, I tried retraining it with some dropout, and the loss exploded, not shrinking even after training all night; so I restarted with a fresh RNN and some dropout, getting a more stable training result.

Unfor­tu­nate­ly, it did not work. Using the unob­served pairs showed it had not learned to gen­er­al­ize.


So some lessons here are:

  1. use a suffi­ciently large RNN; 500 neu­rons may be ade­quate to model a sin­gle author like the Bible or Shake­speare but is too small to learn many authors despite the sav­ings
  2. train to convergence; the differences between authors are smaller than those between the average of authors & random noise, and the metadata will only show its worth at the end when it has reached ~1 loss
  3. keep data rel­a­tively bal­anced, or the RNN will spend all its effort try­ing to learn pat­terns & vocab­u­lary of the most com­mon kind of input

Fur­ther work:

  • mul­ti­ple meta­data: author/genre/work, per­haps. The RNN might learn to dis­en­tan­gle the var­i­ous fac­tors, so one could gen­er­ate sam­ples from BIBLE|RELIGION|RAYMOND_CHANDLER|. Music in ABC nota­tion would be another tar­get as ABC sup­ports genre meta­data and there might be use­ful ABC data­bas­es.

  • visualize the RNN hidden state to look for ‘grandmother neurons’; could such neurons be used to create a textual equivalent of image style transfer and ‘transfer’ the style of, say, Biblical prose to hard-boiled detective stories?

    My belief is that a genre/author-classification+unsupervised-prediction char-RNN may be able to do style trans­fer. This is because such a char-RNN should learn a clean sep­a­ra­tion between the meta­data (style) and the seman­tics (con­tent).

    In genre/author clas­si­fi­ca­tion, the hid­den state incre­men­tally builds up an inferred genre/author as it processes the text sequence; in unsu­per­vised pre­dic­tion, the hid­den state incre­men­tally builds up a sum­mary of past seman­tic­s+syn­tax as it tries to pre­dict the next char­ac­ter. The hid­den state rep­re­sent­ing the best cur­rent guess for clas­si­fi­ca­tion will be mostly sta­tic because it will quickly reach high con­fi­dence as to the genre/author and then the neu­rons encod­ing that infor­ma­tion must be pro­tected long-term from being mod­i­fied; in con­trast, the seman­tic­s+syn­tax hid­den state is chang­ing every time-step and if its dis­trib­uted encod­ing over­lapped with the genre/author dis­trib­uted encod­ing, it would quickly for­get its orig­i­nal con­clu­sions about genre/author.

    This oppo­si­tion should yield a trained char-RNN with a few neu­rons devoted solely to genre/author and the rest devoted to seman­tic­s+syn­tax encod­ing.

    Given such a clean split, some­thing anal­o­gous to the style trans­fer CNN should be pos­si­ble. First, fig­ure out which neu­rons are which; then feed in texts from differ­ent genre/authors and extract the hid­den state cor­re­spond­ing to each genre/author, eg Bible vs Wheel of Time. To con­vert a piece of Wheel of Time prose into Bib­li­cal prose or vice ver­sa, feed in a desired piece of text to pro­duce the genre/author and seman­tic­s+syn­tax hid­den state vec­tors; now, hard­wire the seman­tic­s+syn­tax vec­tor and do gra­di­ent ascent on the input text to grad­u­ally turn the orig­i­nal genre/author hid­den state into the tar­get genre/author hid­den state; once the trans­formed text yields both the tar­get genre/author hid­den state but also the same seman­tic­s+syn­tax hid­den state, it has been con­vert­ed. Hypo­thet­i­cal­ly, to the extent that the char-RNN has learned Eng­lish seman­tics and prose styles, this would con­vert text into differ­ent styles while pre­serv­ing the seman­tics.

    This might not work with a char-RNN doing char­ac­ter-level pre­dic­tion if the learned seman­tic­s+syn­tax turns out to be weak enough that a con­verted piece of text only bears a faint resem­blance to the orig­i­nal. (Per­haps the seman­tics don’t add enough pre­dic­tive pow­er, or the char-RNN is small enough that it must use all its capac­ity learn­ing vocab­u­lary etc.) If it does­n’t, some other approaches might be to train a clas­si­fi­ca­tion char-RNN, pro­vid­ing the style met­ric, and also a sequence-to-se­quence autoen­cod­ing RNN to pro­vide a seman­tics encod­ing; then set the style tar­get to be the desired style, hard­wire the autoen­coder, and use them jointly as a loss to do gra­di­ent descent on. RNNs can also be com­bined with CNNs, and this may allow a more direct bor­row­ing of the orig­i­nal style trans­fer algo­rithm.


Geocities char-RNN

Geocities (1994–2009) was an Internet service for hosting personal webpages which featured a wide range of idiosyncratic and unusual content. Geocities Forever is a website created by Aanand which features text generated by a small CPU-trained 3×512 char-RNN on a small 50MB sample of the raw HTML from the ArchiveTeam Geocities corpus. The generated HTML is amusing but also shows some weaknesses in generating interleaved English/HTML, which I thought was connected to undertraining on a small corpus: based on my earlier experiments with char-RNN models of CSS and multiple English authors, I know that char-RNNs are capable of switching languages smoothly. During October-November 2016, I attempted to train a larger 2×3000 RNN with a 1GB+ sample using torch-rnn, and ran into issues:

  • the larger cor­pus had qual­ity issues related to some files being present many times, includ­ing 1 file which was present in sev­eral thou­sand copies
  • train­ing repeat­edly “bounced” in that after quickly reach­ing low train­ing & val­i­da­tion losses and gen­er­at­ing high­-qual­ity text sam­ples, error would sky­rocket & text sam­ples plum­met in qual­ity (or not be gen­er­ated at all due to mal­formed prob­a­bil­i­ties)

Cleaning and shuffling the corpus reduced the quality issue, and lowering the learning rate mitigated the bouncing problem, but ultimately the goal of high-quality text samples was not reached before my laptop died and I was forced to stop GPU training. Training a char-RNN on very large text corpuses is more difficult than I thought, perhaps because the variety of content overloads the RNN model capacity and can create catastrophic forgetting unless the RNN is trained for many epochs at low learning rates.

Having downloaded the torrent, the 7z-compressed files are laid out according to the original Geocities ‘neighborhood’ structure and must be extracted.

Data extraction

The bulk of the torrent is image files and other media content, while we only want the HTML, so we extract just the HTML files; to keep the content easily read and avoid any possible binary corruption or weird characters, we convert everything to ASCII before writing to disk:

cd ~/torrent/geocities.archiveteam.torrent/
## 'shuf' call added to randomize order of HTML files and make minibatches more i.i.d.
## due to training problems
for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`; do
    7z x -so "$ARCHIVE" | tar x --wildcards "*.html" --to-stdout | iconv -c -tascii >> geocities-corpus.txt
done

wc --chars data/geocities-corpus.txt
# 984248725
du data/geocities-corpus.txt
# 961188 geocities-corpus.txt

The total HTML con­tent is ~9GB, more than ade­quate.

A quick inspec­tion shows that the HTML is excep­tion­ally ver­bose and repet­i­tive due to injected Geoc­i­ties HTML and copy­-paste. What sort of train­ing loss could we expect from the con­tent? We can look at the bit­s-per-char­ac­ter per­for­mance of a com­pres­sion util­i­ty:

LZMA/xz base­line:

cat data/geocities-corpus.txt  | xz -9 --stdout | wc --bytes
# 146915476

(146915476*8) / 984248725
# 1.194132924

xz man­ages 1.194bpc; in terms of a neg­a­tive log loss, xz man­aged a loss of 0.69:

1 - exp(-1.194132924)
# [1] 0.6970334647

RNNs can model very nonlinear and complicated phenomena, but they also have tiny hidden-state/memories and so suffer in comparison to a compression utility which can keep long literals in RAM (xz -9 uses a 64MiB dictionary and several hundred MB of RAM while compressing). So if the RNN can reach 0.69, that would be acceptable.
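The compression figure can also be expressed in natural-log units for comparison with a training loss. A minimal sketch, assuming (this is not established above) that the loss torch-rnn reports is the mean cross-entropy in nats per character, in which case bits per character convert by multiplying by ln 2; under that assumption the xz target comes out somewhat looser than 0.69. `bpc` and `bpc_to_nats` are illustrative helper names:

```shell
# bits per character of a compressed corpus: (compressed bytes * 8) / characters
bpc() {
    awk -v c="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (c * 8) / n }'
}

# convert bits/char to nats/char by multiplying by ln 2
# (assumes the RNN loss being compared against is measured in nats)
bpc_to_nats() {
    awk -v b="$1" 'BEGIN { printf "%.4f\n", b * log(2) }'
}

bpc 146915476 984248725     # xz -9 output size and corpus size from above: 1.1941
bpc_to_nats 1.194132924     # 0.8277
```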

Another way to put it: how many lines are repeat­ed? A com­par­i­son of wc --lines and sort --unique | wc --lines shows that a sur­pris­ingly low num­ber of lines are unique, sug­gest­ing even more rep­e­ti­tion in the HTML parts than I expect­ed.
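The line-level comparison can be wrapped up as a small helper. A minimal sketch, assuming a POSIX shell with `sort`, `wc`, and `awk`; `line_uniqueness` is an illustrative name and the path in the usage comment is the one used above:

```shell
# Print total lines, unique lines, and the unique fraction for a text file.
line_uniqueness() {
    total=$(wc -l < "$1")
    # LC_ALL=C makes sort compare raw bytes, which is faster and locale-independent
    unique=$(LC_ALL=C sort -u "$1" | wc -l)
    awk -v t="$total" -v u="$unique" \
        'BEGIN { printf "total=%d unique=%d fraction=%.3f\n", t, u, (t ? u / t : 0) }'
}

# usage:
# line_uniqueness data/geocities-corpus.txt
```

A low fraction points at boilerplate or repeated content; it does not by itself say which.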

torch-rn­n’s preprocess.py script, and its train­ing, store all data in RAM, so using all 9GB turns out to be infea­si­ble. 1GB turns out to use an accept­able aver­age ~34% of my lap­top’s 16GB RAM for pre­pro­cess­ing & train­ing.


My ini­tial set of train­ing hyper­pa­ra­me­ters:

  • check­point­ing: 1s per mini­batch, want to check­point every few hours, so 20,000

  • batch size: 2, to reduce VRAM use as much as pos­si­ble (RNN train­ing will be less sta­ble with such tiny batches but will still work)

  • lay­ers: 3 for com­pa­ra­bil­ity with the orig­i­nal

  • neu­ron count: as large as will fit, which turns out to be ~5× or 2600

  • dropout: since we have a lot of data to fight overfitting, dropout does not need to be high; 0.1

  • BPTT sequence length: 20 (re­duced from default 50 to again reduce VRAM use at some cost to final model qual­ity in terms of mod­el­ing long-term depen­den­cies)

  • batch­norm: usu­ally helps, so turned on

  • learn­ing rate, decay, word­vec size, clip­ping: torch-rnn defaults

  • total:

    th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
                      -checkpoint_every 20000 -batch_size 2 -seq_length 20 -rnn_size 2600 -num_layers 3 -learning_rate 2e-3 \
                      -dropout 0.2 -batchnorm 1 -init_from `ls -t ./cv/*.t7 | head -1`
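For a sense of scale, these settings imply a very long epoch. A rough sketch of the arithmetic, assuming each minibatch consumes batch_size × seq_length characters (an assumption about how torch-rnn slices the corpus, not something measured here) and ignoring the validation/test split; `minibatches_per_epoch` is an illustrative helper, and the character count is the `wc --chars` figure from above:

```shell
# minibatches per epoch ~= corpus characters / (batch_size * seq_length)
minibatches_per_epoch() {
    echo $(( $1 / ($2 * $3) ))
}

minibatches_per_epoch 984248725 2 20    # prints 24606218
```

At roughly 1s per minibatch that is on the order of 280 days per epoch, so a checkpoint every 20,000 minibatches (about 5.5 hours) covers well under 0.1% of the corpus.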

Performance was bad: training loss ~3.5, and validation loss after 2 days of 4.61/4.69/4.49. Not good! Is 3 layers too unstable? A minibatch size of 2 too unstable? (Increasing the minibatch requires decreasing RNN size because there’s nothing left to cut.) Not enough BPTT? Let’s try switching to 2 layers, which frees up a ton of memory for the minibatch & BPTT:

th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
                  -checkpoint_every 20000 -batch_size 5 -seq_length 90 -rnn_size 3300 -num_layers 2 \
                  -learning_rate 2e-3 -dropout 0.2 -batchnorm 1

This trains within 1000 batches to ~0.6 training loss, often with training loss below the xz bound, but validation loss explodes! There’s also odd training loss behavior: it seems to bounce from the low training loss regime past 1 to as high as the 3s for long periods.

If this is not overfitting in general, it could be non-stationarity of the input and overfitting on specific parts; preprocess.py doesn’t do any shuffling. Shuffling can be forced by going back and shuffling the extracted files, or in 1000-line chunks by re-preprocessing the corpus:

split -l 1000 geocities-corpus.txt tmp
cat $(ls tmp* | shuf) > geocities-corpus-shuffled.txt
rm tmp*
## preprocess the shuffled file rather than the original
python scripts/preprocess.py --val_frac 0.000012 --test_frac 0.000012 --input_txt geocities-corpus-shuffled.txt \
                             --output_h5 geocities-corpus.h5 --output_json geocities-corpus.json

And by increas­ing BPTT & dropout:

th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
    -checkpoint_every 15000 -batch_size 5 -seq_length 100 -rnn_size 3300 -num_layers 2 \
    -learning_rate 2e-3 -dropout 0.5 -batchnorm 1 -init_from cv/geocities_60000.t7

Still we see the same ‘bounce’ from bet­ter-than-xz pre­dic­tive per­for­mance to 2–3 train­ing loss. To check if it was size that was the prob­lem, I went back to Aanand’s orig­i­nal 3×512 archi­tec­ture:

th train.lua -input_h5 data/geocities-corpus.h5 -input_json data/geocities-corpus.json -checkpoint_name cv/geocities \
             -checkpoint_every 10000 -batch_size 130 -seq_length 225 -rnn_size 512 -num_layers 3 -learning_rate 2e-3 \
             -dropout 0.5 -batchnorm 1

After ~9 hours, it had reached a validation loss of 1.05 and generated output looked pretty good[1], but then it bounced overnight and output became garbage again. (For 1GB and 3×512 RNN, 1 epoch takes somewhat over 1 day.) It is still acting like it’s overfitting. Why?

Data cleaning

I took a closer look at the data and noticed something odd while skimming through it: it’s not just the HTML boilerplate that’s repeated, but many parts of the content as well (eg searching for the word “rude” turns up the same lengthy complaint repeated hundreds of times in the sample). Are the excellent xz compression and occasional excellent RNN training loss, and then the ‘bounce’, due to content being repeated many times, leading to severe overfitting and then extremely high error when the RNN finally runs into some of the unrepeated content?

There are two possible sources of repetition. First, the original find command ran on all 7z archives including the multipart archives in the torrent, so possibly some archives got decompressed multiple times (perhaps 7z, given an archive like “archive.7z.8”, goes back and tries to decompress starting with “archive.7z.1”). If so, then rerunning it but writing all files to disk will make the duplicates go away (the duplicates will simply get decompressed & overwritten repeatedly). Second, if the repetition is due to multiple identical files with different names/paths, then there will still be a lot of duplication, but a file-level duplication tool like fdupes should detect and delete them.
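The identical-files hypothesis can be checked before deleting anything by counting files whose content is the same under different paths. A minimal sketch, assuming GNU findutils/coreutils (`xargs -r`, `md5sum`, `uniq --check-chars`) and using MD5 only as a cheap content fingerprint; `count_duplicate_html` is an illustrative name, and this is a separate check, not the fdupes invocation:

```shell
# For each distinct file content under directory $1 that occurs more than once,
# print "<copies> <md5>", most-duplicated first.
count_duplicate_html() {
    find "$1" -type f -name '*.html' -print0 \
        | xargs -0 -r md5sum \
        | sort \
        | uniq --check-chars=32 --count \
        | awk '$1 > 1 { print $1, $2 }' \
        | sort -k1,1nr
}

# usage:
# count_duplicate_html ./geocities | head
```

A handful of checksums with counts in the tens of thousands would point at a few pages copied many times rather than at the extraction step.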

For file-level dupli­cate dele­tion and recre­at­ing the cor­pus:

for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`; do
    nice 7z x -so "$ARCHIVE" | tar x --verbose --wildcards "*.html"
done
fdupes . --recurse --omitfirst --sameline --size --summarize --delete --noprompt

find . -type f -name "*.html" -print0 | shuf --zero-terminated | xargs --null cat | \
    iconv -c -tascii | fold --spaces --width=150 | \
    head --bytes=1GB > geocities-corpus.txt

After extracting to disk to eliminate redundant writes, and checking/deleting duplicated files, I restarted training. After 20k minibatches, training loss was steady in the 2–3 range, validation loss continued to explode, and I could not even sample because the output was so ill-behaved (the multinomial probability problem). So the problem was still not solved, and a grep for “rude” indicated the redundancy problem was still present.

I went back into the orig­i­nal extracted Geoc­i­ties HTML files look­ing for that weird ‘rude’ page which appears thou­sands of times; an ag search indi­cated that it shows up ~31k times in two direc­to­ries:

  • ./geocities/YAHOOIDS/m/i/mitzrah_cl/ (5.2GB, 334595 HTML files)
  • ./geocities/YAHOOIDS/T/o/Tokyo/6140/myself/sailormars/karen/site_features/hints_n_tips/site_features/www_stuff/www_resources.html (0.527GB, 33715 files)
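A search like this can be approximated with plain grep. A minimal sketch, assuming GNU grep and coreutils and substituting `grep` for the `ag` search mentioned above; `phrase_dirs` is an illustrative name, and it groups by each file's immediate directory, so heavily nested copies still need to be rolled up by eye:

```shell
# Count, per directory, the HTML files under $2 that contain the fixed string $1.
phrase_dirs() {
    grep --recursive --files-with-matches --fixed-strings --include='*.html' -- "$1" "$2" \
        | sed 's|/[^/]*$||' \
        | sort | uniq --count | sort --numeric-sort --reverse
}

# usage:
# phrase_dirs 'rude' ./geocities | head
```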

Look­ing at file­names, there are also many pos­si­bly dupli­cated pages:

find . -type f -name "*.html" | parallel basename | sort | uniq --count | sort --numeric-sort | tac | less
#  612978 index.html
#  114691 sb.html
#   72080 links.html
#   37688 awards.html
#   36558 pics.html
#   34700 music.html
#   32987 geobook.html
#   32010 myaward.html
#   31216 hints.html
#   31053 sailormoon_rei.html
#   30953 www_resources.html
#   30670 myself.html
#   30522 intro.html
#   30184 banner_xchange.html
#   30126 tutorial_intro.html
#   13885 main.html
#   11642 disclaimer.html
#   10051 index2.html
#    7732 live.html
#    7490 tmmb.html
#    7472 everclear.html
#    7325 sublime.html
#    7264 sugarray.html
#    7065 gallery.html
#    6637 news.html
#    6566 menu.html
#    6344 home.html
#    5924 page2.html
#    5426 me.html
#    5224 friends.html
#    4986 pictures.html
#    4435 page3.html
#    4186 pictures2.html
#    4105 addbook.html
#    4076 contact.html
#    4008 profile.html
#    3935 bio.html
#    3822 history.html
#    3778 about.html
#    3769 Links.html
#    3728 photos.html
#    3682 page4.html
#    3549 webrings.html
#    3468 index1.html
#    3378 family.html
#    3297 chat.html
#    3136 link.html
#    3058 aboutme.html
#    3021 page5.html
#    2980 baking.html
#    2937 info.html
#    2855 film.html
#    2816 talents.html
#    2800 balloon.html
#    2793 quotes.html

I could delete everything except one random “bio.html” or “myaward.html” etc, but first I tried deleting everything in mitzrah/ and myself/. This makes the filenames look much more diverse; spot checks of the files named “sb.html” & “everclear.html” suggest that the duplicated file names now represent legitimate, non-repeated content which happens to have similar filenames due to serving similar roles in people’s personal webpages.

#  612967 index.html
#  114691 sb.html
#   40122 links.html
#   32986 geobook.html
#   13885 main.html
#   11642 disclaimer.html
#   10051 index2.html
#    7732 live.html
#    7490 tmmb.html
#    7472 everclear.html
#    7325 sublime.html
#    7264 sugarray.html
#    7065 gallery.html
#    6637 news.html
#    6605 awards.html
#    6566 menu.html
#    6344 home.html
#    5924 page2.html
#    5426 me.html
#    5224 friends.html
#    4986 pictures.html
#    4605 music.html
#    4598 pics.html
#    4435 page3.html
#    4186 pictures2.html
#    4105 addbook.html
#    4074 contact.html
#    4008 profile.html
#    3935 bio.html
#    3822 history.html
#    3778 about.html
#    3769 Links.html
#    3728 photos.html
#    3682 page4.html
#    3549 webrings.html
#    3467 index1.html
#    3378 family.html
#    3297 chat.html
#    3136 link.html
#    3058 aboutme.html
#    3021 page5.html
#    2980 baking.html
#    2937 info.html
#    2855 film.html
#    2816 talents.html
#    2800 balloon.html
#    2793 quotes.html
#    2681 intro.html
#    2621 lyrics.html
#    2597 top.html
#    2587 banjo.html
#    2577 webmaster.html
#    2529 roleplay.html
#    2494 garden.html
#    2474 index3.html

Skim­ming the final cor­pus also does­n’t show any bla­tant rep­e­ti­tion.

The bounce continues

After this data cleaning, I restarted training from the last checkpoint with the same settings. 100,000 minibatches/4 epochs later, sampling still failed and validation loss was in the 100s! Restarting with higher dropout (0.8) didn’t help. Restarting with 0 dropout didn’t help either: after 50,000 minibatches, validation loss was 55.

I thought that the 512×3 may simply lack model capacity, and that Aanand’s original worked because he used a small corpus which was not too diverse.

Trying something intermediate between 512×3 and 3000×1, a 2000×1 RNN: after 30k minibatches / 0.7 epochs, validation loss is ~0.98 and generated samples look good. So the larger flatter RNN is handling it better than the smaller deeper one.

Unfortunately, the bounce is still present: initially a bounce around epoch 0.84, with generated samples much worse. After another 65k minibatches, samples were very high quality, but then training bounced at a different place in the dataset, epoch 0.04 (after a restart due to a crash). In previous training, the data located at ~4% is perfectly well behaved and easily modeled, so it’s not the data’s fault but the RNN’s, suggesting it’s still overfitting. If so, the learning rate may be too high; I reduced the learning rate 4×, from 2e-3 to 5e-4.

The lower learn­ing rate RNN still bounced, but not quite as badly as usu­al, with steady val­i­da­tion loss ~3 after a week.

Unfor­tu­nate­ly, fur­ther progress by the RNN or the per­for­mance in restart­ing from scratch with a much smaller learn­ing rate is unknown, as on 26 Novem­ber my Acer lap­top died (ap­par­ent moth­er­board fail­ure, I sus­pect pos­si­bly due to the stress of all the months of GPU train­ing var­i­ous char-RNN and other deep learn­ing mod­els) and due to prob­lems with my back­ups, I lost data back to 14 Novem­ber, includ­ing the train­ing records & lat­est check­points.

Since the Geocities char-RNN wasn’t going anywhere & I worried it may’ve contributed to my laptop failure, I stopped there. My guess is that good results could be obtained with a smaller corpus (perhaps 500MB) and a large char-RNN like 2×3000 trained with very low learning rates, but it would require at least GPU-weeks on a top-end GPU with more than 4GB RAM (to allow larger minibatches) and isn’t sufficiently amusing as to be worthwhile.

Finetuning the GPT-2-117M Transformer for English Poetry Generation

In Feb­ru­ary 2019, fol­low­ing up on my 2015–2016 tex­t-gen­er­a­tion exper­i­ments with char-RNNs, I exper­i­ment with the cut­ting-edge Trans­former NN archi­tec­ture for lan­guage mod­el­ing & text gen­er­a­tion. Using Ope­nAI’s GPT-2-117M model pre-trained on a large Inter­net cor­pus and nshep­perd’s fine­tun­ing code, I retrain GPT-2-117M on a large (117MB) Project Guten­berg poetry cor­pus. I demon­strate how to train 2 vari­ants: “GPT-2-poetry”, trained on the poems as a con­tin­u­ous stream of text, and “GPT-2-poetry-prefix”, with each line pre­fixed with the meta­data of the PG book it came from.

With just a few GPU-days on 1080ti GPUs, GPT-2-117M fine­tun­ing can pro­duce high qual­ity poetry which is more con­sis­tent than my char-RNN poems & capa­ble of mod­el­ing sub­tle fea­tures like rhyming.

Split out to a separate article.

  1. I couldn’t compare the quality to Aanand’s original 3×512 because he didn’t provide his final validation score or the exact 50MB corpus to retrain on.