RNN metadata for mimicking individual author style

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.
statistics, NN, fiction, shell, R, GPT, tutorial, poetry
2015-09-12–2019-03-26 · finished · certainty: likely · importance: 8


Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient use of metadata, and more controllable sampling of generated output by feeding in the desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.

I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.

More successfully, I later turned to the Transformer NN architecture, finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better, and later, larger models do better still.

A character-level recurrent neural network (“char-RNN”) trained on a natural-language corpus can produce amusing textual output mimicking it. Music can also be generated by a char-RNN if it is trained on textual scores or transcriptions, and some effective music has been produced this way (I particularly liked Sturm’s results).

A char-RNN is simple: during training, it takes a binary blob (its memory or “hidden state”) and tries to predict a character based on it and a new binary blob; that binary blob gets fed back into a second copy of the RNN which tries to predict the second character using the second binary blob, and this gets fed into a third copy of the RNN and so on (“unrolling through time”). Whether each character is correct is the training error, which gets backpropagated to the previous RNNs; since they are still hanging around in RAM, blame can be assigned appropriately, and eventually the gibberish hopefully evolves into a powerful sequence modeler which learns how to compactly encode relevant memories into the hidden state, and what characters can be predicted from the hidden state. This doesn’t require us to have labels or complex loss functions or a big apparatus—the RNN gets trained character by character.
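
To make the unrolling concrete, here is a minimal character-level training loop; this is an illustrative sketch in PyTorch (assuming an input.txt corpus), not the Lua char-rnn/torch-rnn code actually used for the experiments below:

import torch
import torch.nn as nn
import torch.nn.functional as F

text = open("input.txt").read()
vocab = sorted(set(text))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in text])

class CharRNN(nn.Module):
    def __init__(self, V, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(V, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, V)
    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state           # logits over the next character

model = CharRNN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=2e-3)
seq_len, state = 50, None                   # 50 = unrolling length (BPTT window)
for i in range(0, len(data) - seq_len - 1, seq_len):
    x = data[i:i+seq_len].unsqueeze(0)      # characters t .. t+49
    y = data[i+1:i+seq_len+1].unsqueeze(0)  # characters t+1 .. t+50 = targets
    logits, state = model(x, state)
    loss = F.cross_entropy(logits.view(-1, len(vocab)), y.view(-1))
    opt.zero_grad()
    loss.backward()                         # blame propagates back through the unrolled copies
    opt.step()
    state = tuple(s.detach() for s in state)  # keep the memory, truncate the gradient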

Handling multiple corpuses

A problem with this approach is that a char-RNN has to be trained for each corpus: if you want Shakespearean gibberish, you must train it only on Shakespeare, and if you want Irish music, you must train only on Irish—if you don’t, and you create a corpus which is Shakespeare concatenated with the Bible, you will probably get something halfway between the two, which might be somewhat interesting, but is not a step forward to generating better & more interesting gibberish; or if you have a few hundred songs of Irish music written in ABC format and then a few dozen rock or classical pieces written in MIDI, training an RNN on them all mixed together will simply yield gibberish output, because you will get an ‘average syntax’ of ABC & MIDI and an ‘average music’ of Irish & rock. This is in part because the training is unsupervised in the sense that the char-RNN is only attempting to predict the next character given the previous characters, and it has no reason to give you just Shakespeare or just Bible output; it is bouncing between them.

However, it seems like it should be possible to do this. An RNN is a powerful neural network, and we can see in examples using Karpathy’s char-rnn that such RNNs have learned ‘sublanguages’: in the Linux C source code examples, the RNN has learned to switch appropriately between comments, source code, and string literals; in the CSS examples, it’s learned to switch between comments, CSS source code, string literals, URLs, and data-URIs. If the RNN can decide on its own while generating C or CSS to switch from “source code mode” to “comment mode”, then it should be able to also learn to switch between Shakespeare and Bible mode, or even more authors.

If we could get the RNN to do such switching on demand, there are several possible benefits. Human-authored textual output is always more similar than different: a text file of Shakespeare is much more similar to a text file of the Bible than it is to an equivalent length of ASCII generated at random such as $M@Spc&kl?,U.(rUB)x9U0gd6G; a baroque classical music score is more similar to a transcript of a traditional Irish music jam than to random noise. Since they share such mutual information, an RNN trained to produce both Shakespeare and the Bible will be smaller than the sum of 2 RNNs for Shakespeare & the Bible separately; this makes it easier to share trained RNNs, since you can distribute 1 RNN covering many genres or authors for people to play with, rather than having to train & host a dozen different RNNs. Such an RNN may also generate better output for all cases, since less of the corpuses’ information is spent on learning the basics of English shared by both corpuses and more is available for learning the finer details of each kind of writing, which may help in cases like music where large datasets of textual transcriptions of a desired genre may not be available (by training on a large corpus of classical music, a smaller corpus of Irish music may go further than it would’ve on its own). More speculatively, the metadata itself may dynamically improve generation by making it easier for the RNN not to ‘wander’: since the RNN is keeping a memory of the metadata in its hidden state, output may be more thematically coherent, as the RNN can periodically refer back to the hidden state to remember what it was talking about.

How can we do that? The RNN in the C or CSS examples is able to mode-switch like this because, I think, there are clear transition markers inside the CSS or C which ‘tell’ the RNN that it needs to switch modes now; a comment begins /* ... or a data-URI in CSS begins url('data:image/png;base64,...). In contrast, the most straightforward way of combining music or books and feeding them into a char-RNN is to simply concatenate them; but then the RNN has no syntactic or semantic markers which tell it where ‘Bible’ begins and ‘Shakespeare’ ends. Perhaps we can fix that by providing metadata such as author/genre and turning it into a semi-supervised task, somehow, along the lines of the source code: distinguish the text of one author from another, and then let the RNN learn the distinctions on its own, just like the CSS/C.

Implementation

There are two approaches for how to encode the metadata into the RNN:

  1. in band: systematically encode the metadata into the corpus itself, such as by a prefixed or suffixed string, and hope that the RNN will be able to learn the relevance of the metadata and use it during training to improve its predictions (which it should, as LSTM/GRU units are supposed to help propagate long-term dependencies like this); then specific genres or authors or styles can be elicited during sampling by providing that metadata as a seed.

    So for example, a Shakespeare corpus might be transformed by prefixing each line with a unique string which doesn’t appear in the corpus itself, eg “SHAKESPEARE|To be or not to be,|SHAKESPEARE”. Then during sampling, Shakespearean prose will be triggered like so: th sample.lua rnn.t7 -primetext "SHAKESPEARE|". (Why the pipe character? Because it’s rarely used in prose but isn’t hard to type or work with.) To add in more metadata, one adds in more prefixes; for example, perhaps the specific work might be thought relevant and so the corpus is transformed to “SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE”. Then one can sample with the specific work, author, or both. For musical generation, relevant metadata might be musical genre, author, tempo, instruments, type of work, tags provided by music listeners (“energetic”, “sad”, “for_running” etc), so one could ask for energetic Irish music for two fiddles.

    This has the advantage of being easy to set up (some regexes to add metadata) and easy to extend (take an existing trained RNN and use it on the modified corpus); the disadvantage is that it may not work, as the RNN may be unable to jointly learn to recall and use the metadata—it may instead learn to forget the metadata immediately, or spend all its learning capacity on modeling an ‘average’ input because that yields better log-loss error. This in band approach can also easily be extended to cover classification; in classification, the metadata is put at the end of each line, so instead of learning to predict text conditional on metadata & previous text, the RNN is learning to predict metadata conditional on previous text, and classifications can be extracted by low-temperature sampling with the input as the prime text followed by the separator character and seeing what metadata is predicted (eg th sample.lua classification.t7 -temperature 0.1 -primetext "...text...|" → "SHAKESPEARE\n").

    As far as I know, no one has done this except perhaps inadvertently or implicitly.

  2. out of band: instead of depending on the RNN to learn the value of the metadata and preserving it in its hidden state, one can change the RNN architecture to inject the metadata at each timestep. So if one has an RNN of 500 neurons, 5 of them will be hardwired at each timestep to the metadata value for the sequence being worked on.

    The downside is that all metadata inputs will require modification of the RNN architecture to map them onto a particular hidden neuron. The advantage is that the metadata value will always be present, there is no need to hope that the RNN will learn to hold onto the metadata, and it only has to learn the associated differences; so it will learn more reliably and faster. Variants of this turn out to have been done before:

    1. Mikolov & Zweig 2012, “Context dependent recurrent neural network language model”: an RNN augmented with topic information from LDA (latent Dirichlet allocation), achieving better prediction on the Penn Treebank & WSJ transcription task

    2. Aransa et al 2013/2015, “Improving Continuous Space Language Models using Auxiliary Features”: a feedforward NN given n characters at a time, with the inputs at each sequence including embeddings of the previous lines and, particularly, 5 ‘genres’ (in this case, Egyptian Arabic SMS/chat, modern standard Arabic, Egyptian Arabic forum discussions, Levantine forum discussions, formal MSA from UN translations, Egyptian Arabic telephone calls), hardwired into the input layer; finding that genre particularly helped BLEU scores. (Including metadata like genre to assist training appears to have been used fairly regularly in earlier text topic-modeling work, but not so much in neural networks or for increasing the realism of generated text.)

    3. Chen et al 2015, “Recurrent Neural Network Language Model Adaptation for multi-Genre Broadcast Speech Recognition”: an RNN augmented with the text input being fed into standard text topic-modeling algorithms like LDA, partially trained on BBC genres (advice/children/comedy/competition/documentary/drama/events/news), and the total outputs from the topic algorithms hardwired into the input layer along with the text; giving moderate improvements on audio→text transcription.

    4. Sennrich et al 2016, “Controlling Politeness in Neural Machine Translation via Side Constraints”: a standard neural machine translation system using RNNs in the encoder-decoder framework, here for translating English→German movie subtitles, but the German corpus’s sentences are annotated by politeness metadata describing the pronouns/verb conjugations; they obtain both better BLEU scores on translation as well as the ability to control the politeness of the generated German.

    5. This has also been done in Lipton et al 2015: they model beer reviews with a character-level RNN which is given metadata (beer types: “American IPA”, “Russian Imperial Stout”, “American Porter”, “Fruit/Vegetable Beer”, and “American Adjunct Lager”) as a hardwired input to the RNN at each timestep, noting that

      It might seem redundant to replicate xaux at each sequence step, but by providing it, we eliminate pressure on the model to memorize it. Instead, all computation can focus on modeling the text and its interaction with the auxiliary input…Such models have successfully produced (short) image captions, but seem impractical for generating full reviews at the character level because signal from xaux must survive for hundreds of sequence steps. We take inspiration from an analogy to human text generation. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.

      They experienced trouble training their beer char-RNN, and they adopt a strategy of training normally without the hardwired metadata down to a loss of <1.0/character and then training with metadata to a final loss of 0.7–0.8. This is reasonable because at a loss of 1.1 on English text, sampled output has many clear errors, but at <0.9 the output becomes uncanny; it stands to reason that subtle differences of style & vocabulary will only begin to emerge once the RNN has the basics of English down pat (the differences between skilled authors’ Englishes are, unsurprisingly, smaller than the differences between regular English & gibberish).

      Pretraining+metadata works well for Lipton et al 2015, but they don’t compare it to inlined metadata or show that the pretraining is necessary. I am also a little skeptical about the rationale that out of band signaling is useful because it puts less pressure on the hidden state: while it may reduce pressure on the RNN’s LSTMs to memorize the metadata, one is still losing RAM to reinjecting the metadata into the RNN at every timestep. Either way, the metadata must be stored somewhere in RAM and it doesn’t make much difference if it’s 495 effective neurons (with 5 hardwired to metadata) or if it’s 500 effective neurons (of which 5 eventually get trained to hold metadata, yielding 495 effective neurons). Pretraining also won’t work with torch-rnn, as the character vocabulary & embedding it computes is different on each dataset, so it’s currently impossible to train on an unlabeled dataset, change the data to labeled, and resume training.

    6. After my experiments here, DeepMind published a CNN for generating raw audio: WaveNet, van den Oord et al 2016. They noted similar phenomena: the WaveNet could imitate specific speakers if provided speaker labels along with the raw audio, and specifying metadata like instruments allowed control of generated musical output. Another later Google paper, Johnson et al 2016’s multilingual neural machine translation system, applies in-band metadata to generalize an RNN translator by specifying the target language in-band and having the RNN learn how to exploit this metadata for better natural language generation and the ability to translate between language pairs with no available corpuses.
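
To make the out of band approach concrete, here is a minimal sketch of such an architecture (in PyTorch, with arbitrary sizes; a paraphrase of the idea in Lipton et al 2015, not their code): the input at every timestep is the character embedding concatenated with a replicated metadata vector, such as a one-hot author encoding.

import torch
import torch.nn as nn

class AuxCharRNN(nn.Module):
    # character embedding + a fixed metadata vector concatenated at every timestep
    def __init__(self, vocab_size, n_meta, embed=64, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed + n_meta, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
    def forward(self, chars, meta, state=None):
        # chars: (batch, time) character indices; meta: (batch, n_meta), eg a one-hot author vector
        x = self.embed(chars)
        aux = meta.unsqueeze(1).expand(-1, chars.size(1), -1)  # replicate the metadata at each step
        h, state = self.lstm(torch.cat([x, aux], dim=-1), state)
        return self.out(h), state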

Given the attractive simplicity, I am going to try in band metadata.
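
As an illustration of the in band format (a sketch only; the real preprocessing below is done with shell tools like fold & sed, and the filenames here are just examples):

import textwrap

def label_corpus(author, text, width=10000):
    # collapse newlines (as the shell pipeline below does with tr), rewrap into long
    # chunks, and prefix each chunk with "AUTHOR|" so the metadata is in-band
    text = " ".join(text.split())
    return "\n".join(f"{author.upper()}|{chunk}"
                     for chunk in textwrap.wrap(text, width=width))

# eg: open("input.txt", "w").write("\n".join(
#         label_corpus(name, open(f"{name}.txt").read())
#         for name in ["austen", "shakespeare", "twain"]))
# Sampling is then seeded with the same prefix, eg -primetext "SHAKESPEARE|".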

Data

The easiest kind of data to test with is English prose: I can recognize prose differences easily, and there are countless novels or fictional works which can be converted into labeled prose.

If we just download some complete works off Project Gutenberg (googling ‘Project Gutenberg “complete works of”’), prefix each line with “$AUTHOR|”, concatenate the complete works, and throw them into char-rnn, we should not expect good results: the author metadata will now make up something like 5% of the entire character count (because PG wraps them to short lines), and by training on 5M of exclusively Austen and then 5M of exclusively Churchill, we might run into overfitting problems, and due to the lack of proximity of different styles, the RNN might not ‘realize’ that the author metadata isn’t just some easily predicted & then ignored noise but can be used to predict far into the future. We also don’t want the PG headers explaining what PG is, and we need to make sure the files are all converted to ASCII.

So to deal with these 4 issues, I’m going to process the PG collected works thusly:

  1. delete the first 80 lines and last ~300 lines, and filter out any line mentioning “Gutenberg”

  2. convert to ASCII

  3. delete all newlines and then rewrap to make lines which are 10000 bytes—long enough to have a great deal of internal structure and form a good batch to learn from, and thus able to be randomly shuffled with the others.

    But newlines do carry semantic information—think about dialogues—and does deleting them carry a cost? Perhaps we should map newlines to some rare character like tilde, or use the poetry convention of denoting newlines with forward-slashes?

  4. prefix each long line with the author it was sampled from

Unlabeled

As a baseline, a char-RNN with 2×2500 neurons, trained with 50% dropout, batch-size 55, and BPTT length 200, on the PG dataset without any author prefixes or suffixes, converges to a validation loss of ~1.08 after ~20 epochs.

Training with prefixes

Small RNN

For my first try, I grabbed 7 authors, giving a good final dataset of 46M, and fed it into char-rnn, choosing a fairly small 2-layer RNN and using up the rest of my GPU RAM by unrolling far more than the default 50 timesteps, to encourage it to learn the long-range dependencies of style:

cd ~/src/char-rnn/data/
mkdir ./styles/ ; cd ./styles/

## "The Complete Project Gutenberg Works of Jane Austen" http://www.gutenberg.org/ebooks/31100
wget 'https://www.gutenberg.org/ebooks/31100.txt.utf-8' -O austen.txt
## "The Complete Works of Josh Billings" https://www.gutenberg.org/ebooks/36556
wget 'https://www.gutenberg.org/files/36556/36556-0.txt' -O billings.txt
## "Project Gutenberg Complete Works of Winston Churchill" http://www.gutenberg.org/ebooks/5400
wget 'https://www.gutenberg.org/ebooks/5400.txt.utf-8' -O churchill.txt
## "The Project Gutenberg Complete Works of Gilbert Parker" https://www.gutenberg.org/ebooks/6300
wget 'https://www.gutenberg.org/ebooks/6300.txt.utf-8' -O parker.txt
## "The Complete Works of William Shakespeare" http://www.gutenberg.org/ebooks/100
wget 'https://www.gutenberg.org/ebooks/100.txt.utf-8' -O shakespeare.txt
## "The Entire Project Gutenberg Works of Mark Twain" http://www.gutenberg.org/ebooks/3200
wget 'https://www.gutenberg.org/ebooks/3200.txt.utf-8' -O twain.txt
## "The Complete Works of Artemus Ward" https://www.gutenberg.org/ebooks/6946
wget 'https://www.gutenberg.org/ebooks/6946.txt.utf-8' -O ward.txt
du -ch *.txt; wc --char *.txt
# 4.2M  austen.txt
# 836K  billings.txt
# 9.0M  churchill.txt
# 34M   input.txt
# 12M   parker.txt
# 5.3M  shakespeare.txt
# 15M   twain.txt
# 12K   ward.txt
# 80M   total
#  4373566 austen.txt
#   849872 billings.txt
#  9350541 churchill.txt
# 34883356 input.txt
# 12288956 parker.txt
#  5465099 shakespeare.txt
# 15711658 twain.txt
#     9694 ward.txt
# 82932742 total
for FILE in *.txt; do
  dos2unix $FILE
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' ' ' | \
   fold --spaces --bytes --width=10000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
done
rm input.txt
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 747 -num_layers 2 -seq_length 187
# using CUDA on GPU 0...
# loading data files...
# cutting off end of data so that the batches/sequences divide evenly
# reshaping tensor...
# data load done. Number of data batches in train: 4852, val: 256, test: 0
# vocab size: 96
# creating an LSTM with 2 layers
# number of parameters in the model: 7066716
# cloning rnn
# cloning criterion
# 1⁄242600 (epoch 0.000), train_loss = 4.57489208, grad/param norm = 9.6573e-01, time/batch = 2.03s
# ...
# 15979⁄242600 (epoch 3.293), train_loss = 1.01393854, grad/param norm = 1.8754e-02, time/batch = 1.40s

This gets us a corpus in which every line specifies its author, with authors switching from line to line, while each line is still long enough to have readable meaning. After about 22 hours of training yielding a validation loss of 1.0402 (with little improvement evident after the first 7 hours), we can try out our best candidate and see if it knows Shakespeare versus Austen:

BEST=`ls cv/*.t7 | sort --field-separator="_" --key=4 --numeric-sort --reverse | tail -1`
th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# SHAKESPEARE|is of no regular complexion.  The action of the plain chatter--"  "Alas, they
# have discovered what was to be afforded since then?"  "We can believe--for the signature of
# the Church."  "So they do, dear lord, do they their home?  Oh, no, to the devil which we
# have not written, the Church is not in the world; but not in this harmless way then to the
# captain of man--therefore while the praise of it was allurious he would not reflect on the
# curious man's hatch deemed that his life should be very con

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|" -seed 105
# SHAKESPEARE|   CHAPTER VII  FROM A WESPERON IN STORY  "MOST INGURIFELLOWSELLES," Antoinette
# now looked at him a sharp pleasure in passing southward and again in portion of his mother's
# reach of it. Suddenly the thing was said.  "We'll sit down and find out," he inquired, with a
# pity to see Mr. Carvel driving beside the bedroom, which was almost as much as he could bear
# the potion.  "You say you're strong," said Mrs. Holy, indignantly, "you won't have to go
# away, about the doctor. What is it?"  "Why, we are"

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|business, and the gout--a constant and foolish figure in which Fellowes' ring is
# nearer to distemper than meek and steady interest and clean iron. The episode for the future
# and the war, and the seedy and effective sun-elogs and the others ventured its remote room,
# whose hair was a suffering man--that the work of the circumstance interested him. It had no
# long served to open the papers to answer up a quiet road, free from the long row of white
# to the lash called No. 14,000 to a sweet conversatio

th sample.lua $BEST -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|quarrelling with a little book, and so on, considering its sensations as to whether
# it were not possible to eat it.  He thought that the leader of the conference with his own
# death would be recognized as a common expression.  The men that mounted from motive powers,
# how big the calf, commander of the rights of the new economic steamer, the English, a lass
# of manhood, will exhibit no praise or increase out of a sort of meaning in the senses, and
# send them back to such a winter as we can go into t

We can see that while the RNN is producing very English-sounding novelistic prose and produces its usual mix of flawless syntax and hilarious semantics (I particularly like the phrase “Oh, no, to the devil which we have not written, the Church is not in the world”), it has failed to learn the styles I was hoping for. The Austen and Twain samples sound somewhat like themselves, but the Shakespeare samples are totally wrong and sound like a Victorian English novel. And given the lack of improvement on the validation set, it seems unlikely that another 10 epochs will remedy the situation: if the RNN were going to exploit the very useful metadata, it should have learned to do so quickly.

Since the style varies so little between the samples, I wonder if mimicking English uses up all the capacity in the RNN? I gave it only 747 neurons, but I could’ve given it many more.

Larger RNN

So to try again:

  • to better preserve the semantics, instead of deleting newlines, replace them with a slash
  • try much shorter lines of 1000 bytes (increasing the relative density of the metadata)
  • back off on the very long backpropagation through time, and instead devote the GPU RAM to many more neurons.
  • the default setting for the validation set is a bit excessive here and I’d rather use some of that text for training
rm input.txt *.transformed
for FILE in *.txt; do
  dos2unix $FILE
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' '/' | \
   fold --spaces --bytes --width=1000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
done
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2600 -num_layers 2 -val_frac 0.01
# ...data load done. Number of data batches in train: 18294, val: 192, test: 771
# vocab size: 96
# creating an LSTM with 2 layers
# number of parameters in the model: 82409696
# cloning rnn
# cloning criterion
# 1⁄914700 (epoch 0.000), train_loss = 4.80300702, grad/param norm = 1.1946e+00, time/batch = 2.78s
# 2⁄914700 (epoch 0.000), train_loss = 13.66862074, grad/param norm = 1.5432e+00, time/batch = 2.63s
# ...

It errored out of memory early the next day; the validation loss, at 1.1705, is still pretty mediocre, so we can’t expect much, and indeed the style is not impressive when I check several prefixes:

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
# seeding with SHAKESPEARE|
# --------------------------
# SHAKESPEARE|jung's own,/which is on the house again.  There is no endeavour to be dressed in the midst of the/present of
# Belle, who persuades himself to know to have a condition of/the half, but "The garnal she was necessary, but it was high,
# consecrets, and/excursions of the worst and thing and different honor to flew himself.  But/since the building closed the
# mass of inspiration of the children of French wind,/hurried down--but he was in the second farmer of the Cald endless figures,
# Mary/Maeaches, and t

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "AUSTEN|"
# AUSTEN|mill./And now the good deal now be alone, there is no endeavour to be dreaming./In fact, what was the story of his
# state, must be a steady carriages of pointing out/both till he has walked at a long time, and not convinced that he
# remembers/her in this story of a purpose of this captain in stock. There was/no doubt of interest, that Mr. Crewe's
# mother could not be got the/loss of first poor sister, and who looked warm enough by a/great hay below and making a
# leaver and with laid with a murder to

th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "TWAIN|"
# TWAIN|nor contributed/she has filled on behind him.  He had been satisfied by little just as to/deliver that the inclination
# of the possession of a thousand expenses in the group of feeling had destroyed/him to descend.  The physical had he darted
# before him that he was worth a
# PARKER|George Pasha, for instance?"//"Then it is not the marvel of laws upon Sam and the Sellers."  She said/he would ask
# himself to, one day standing from the floor, as he/stood for the capital.  He was no good of conversation

Larger author count

Next, I decided to increase the diversity of styles: ramping up to 38 authors, including modern SF/F fiction authors (Robert Jordan’s Wheel of Time, Gene Wolfe, R.A. Lafferty, Ryukishi07’s Umineko no naku koro ni, Kafka), poetry ancient and modern (Iliad, Beowulf, Dante, Keats, Coleridge, Poe, Whitman, Gilbert & Sullivan), ancient fiction (the Bible), miscellaneous nonfiction (Aristotle, Machiavelli, Paine) etc. By adding in many more authors from many different genres and time periods, this may force the RNN to realize that it needs to take seriously the metadata prefix.

wget 'https://dl.dropboxusercontent.com/u/182368464/umineko-compress.tar.xz'
tar xJf umineko-compress.tar.xz && rm umineko-compress.tar.xz
mv umineko/umineko.txt  ryukishi07.txt; mv  umineko/wot.txt jordan.txt; rm -rf ./umineko/

cat /home/gwern/doc-misc/fiction/lafferty/*.txt > lafferty.txt
cat /home/gwern/doc-misc/fiction/wolfe/fiction/*.txt > wolfe.txt

wget 'https://www.gutenberg.org/ebooks/10031.txt.utf-8'  -O poe.txt && sleep 5s ## avoid anti-crawl defenses
wget 'https://www.gutenberg.org/ebooks/11.txt.utf-8'     -O carroll.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1232.txt.utf-8'   -O machiavelli.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/12699.txt.utf-8'  -O aristotle.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1322.txt.utf-8'   -O whitman.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/16328.txt.utf-8'  -O beowulf.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/1661.txt.utf-8'   -O doyle.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/23684.txt.utf-8'  -O keats.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/2383.txt.utf-8'   -O chaucer.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/2701.txt.utf-8'   -O melville.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/30.txt.utf-8'     -O bible.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3090.txt.utf-8'   -O maupassant.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/31270.txt.utf-8'  -O paine.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3253.txt.utf-8'   -O lincoln.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/345.txt.utf-8'    -O stoker.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3567.txt.utf-8'   -O bonaparte.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/3600.txt.utf-8'   -O montaigne.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4200.txt.utf-8'   -O pepys.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4361.txt.utf-8'   -O sherman.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/4367.txt.utf-8'   -O grant.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/6130.txt.utf-8'   -O homer.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/7849.txt.utf-8'   -O kafka.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/808.txt.utf-8'    -O gilbertsullivan.txt && sleep 5s
wget 'https://www.gutenberg.org/ebooks/8800.txt.utf-8'   -O dante.txt && sleep 5s
wget 'https://www.gutenberg.org/files/28289/28289-0.txt' -O eliot.txt && sleep 5s
wget 'https://www.gutenberg.org/files/29090/29090-0.txt' -O coleridge.txt && sleep 5s
wget 'https://www.gutenberg.org/files/5000/5000-8.txt'   -O davinci.txt && sleep 5s

Due to an OOM crash, I decreased the neuron count. With a much bigger model, it is also necessary to have dropout enabled (with the default of 0, progress seems to halt around a loss of 3.5, making no discernible progress for hours).

rm input.txt *.transformed *.t7
wc --char *.txt
# 100972224 total
for FILE in *.txt; do
  dos2unix $FILE;
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  cat $FILE | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' -e 'ISBN' \
   | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
   fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
done
cat *.transformed | shuf > input.txt
cd ../../
th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2400 -num_layers 2 -val_frac 0.01 -dropout 0.5
# ...data load done. Number of data batches in train: 39862, val: 419, test: 1679
# vocab size: 98
# creating an LSTM with 2 layers
# number of parameters in the model: 70334498
# cloning rnn
# cloning criterion
# 1⁄1993100 (epoch 0.000), train_loss = 4.68234798, grad/param norm = 7.4220e-01, time/batch = 2.53s
# 2⁄1993100 (epoch 0.000), train_loss = 13.00693768, grad/param norm = 1.7191e+00, time/batch = 2.35s
# ...

It did OK but seemed to have difficulty improving past a loss of 1.14, had issues with exploding error (one exploding error up to a loss of 59 terminated an overnight training run), and then began erroring out every time I tried to resume, so I began a third try, this time experimenting with deeper layers and increasing the data preprocessing steps to catch various control-characters and copyright/boilerplate which had snuck in:

nice th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 1000 -num_layers 3 -val_frac 0.005 -seq_length 75 -dropout 0.7

This one eventually exploded too, having maxed out at a loss of 1.185.

After deleting even more control characters and constantly restarting after explosions (which had become a regular thing as the validation loss began bouncing around a range of 1.09–1.2, the RNN seeming to have severe trouble doing any better), I did some sampling. The results are curious: the RNN has memorized the prefixes, of course, and at higher temperatures will spontaneously end with a newline and begin with a new prefix; many of the prefixes like “BIBLE|” look nothing like the original source, but the “JORDAN|” prefix performs extremely well in mimicking the Wheel of Time, dropping in many character names and WoT neologisms like “Aiel” or (of course) “Aes Sedai”. This isn’t too surprising since the WoT corpus makes up 20M or a sixth of the input; it’s also not too surprising when WoT terms pop up with other prefixes, but they do so at a far lower rate. So at least to some extent, the RNN has learned to use Jordan versus non-Jordan prefixes to decide whether to drop in WoT vocab. The next largest author in the corpus is Mark Twain, and here too we see something similar: when generating Twain text, we see a lot of words that sound like Twain vocabulary (riverboats, “America”, “the Constitution” etc), and while these sometimes pop up in the smaller prefix samples, it’s at a much lower rate. So the RNN is learning that different prefixes indicate different vocabularies, but it’s only doing this well on the largest authors.

Class imbalance fix

Does this reflect that <2M of text from an author is too little to learn from, and so the better-learned authors’ material inherently pulls the weaker samples towards them (borrowing strength); that the other authors’ differences are too subtle compared to the distinctly different vocab of Jordan & Twain (so the RNN focuses on the more predictively-valuable differences in neologisms etc); or that the RNN is too small to store the differences between so many authors?

For comparison, a one-layer RNN trained on solely the Robert Jordan corpus (but still formatted with prefixes etc) got down to a loss of 0.9638, and just the Bible, 0.9420, while training on both together reached 0.9763. So the penalty to the Bible for having to learn Jordan is 0.9763 − 0.9420 = 0.0343, and vice-versa it is 0.9763 − 0.9638 = 0.0125. Presumably the reason the Bible RNN is hurt 2.7× more is because the Jordan corpus is 4.3× larger and more learning capacity goes to its vocabulary & style, since a bias towards Jordan style will pay off more in reduced loss, a classic class-imbalance problem.

Class-imbalance problems can sometimes be fixed by changing the loss function to better match what one wants (such as by penalizing errors on the smaller class more heavily), by reducing the too-big class, or by increasing the too-small class (by collecting more data or faking that with data augmentation). I tried balancing the corpuses better by limiting how much was taken from the biggest.
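
For illustration, a sketch of the first option, re-weighting the character-level loss so that errors on a small corpus cost more (PyTorch, with rough corpus sizes taken from the figures above); I did not try this, and instead went with the simpler option of truncating the biggest corpuses:

import torch
import torch.nn.functional as F

# rough corpus sizes from above: Jordan ~20M characters, the Bible ~4.3x smaller
corpus_size = {"JORDAN": 20e6, "BIBLE": 20e6 / 4.3}
weight = {a: max(corpus_size.values()) / s for a, s in corpus_size.items()}

def weighted_char_loss(logits, targets, authors):
    # logits: (batch, time, vocab); targets: (batch, time); authors: one name per batch row
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # per-character loss
    w = torch.tensor([weight[a] for a in authors]).unsqueeze(1)              # (batch, 1)
    return (ce * w).mean()  # errors on Bible lines now count ~4.3x as much as Jordan lines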

Also at this time, torch-rnn was released by Justin Johnson, with claims of much greater memory efficiency & better performance compared to char-rnn, so I tried it out. torch-rnn was capable of training larger RNNs, and I experienced many fewer problems with exploding loss or OOM errors, so I switched to using it. The preprocessing step remains much the same, with the exception of a | head --bytes=1M call added to the pipeline to limit each of the 31 authors to 1MB:

rm *.transformed
for FILE in *.txt; do
  dos2unix $FILE;
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  cat $FILE | tail -n +80 | head -n -362 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
    -e 'ISBN' | tr -d '[:cntrl:]' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
    fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
done
cat *.transformed | shuf > input.txt

## with limiting:
findhog *.transformed
# 8   coleridge.txt.transformed
# 8   dante.txt.transformed
# 8   davinci.txt.transformed
# 8   eliot.txt.transformed
# 8   gilbertsullivan.txt.transformed
# 8   grant.txt.transformed
# 8   homer.txt.transformed
# 8   kafka.txt.transformed
# 8   pepys.txt.transformed
# 8   sherman.txt.transformed
# 152 carroll.txt.transformed
# 240 keats.txt.transformed
# 244 beowulf.txt.transformed
# 284 machiavelli.txt.transformed
# 356 poe.txt.transformed
# 560 doyle.txt.transformed
# 596 aristotle.txt.transformed
# 692 whitman.txt.transformed
# 832 stoker.txt.transformed
# 1028    bible.txt.transformed
# 1028    bonaparte.txt.transformed
# 1028    chaucer.txt.transformed
# 1028    jordan.txt.transformed
# 1028    lafferty.txt.transformed
# 1028    lincoln.txt.transformed
# 1028    maupassant.txt.transformed
# 1028    melville.txt.transformed
# 1028    montaigne.txt.transformed
# 1028    paine.txt.transformed
# 1028    ryukishi07.txt.transformed
# 1028    wolfe.txt.transformed

cd ../../
python scripts/preprocess.py --input_txt data/multi/input.txt --output_h5 multi.h5 --output_json multi.json --val_frac 0.005 --test_frac 0.005
nice th train.lua -input_h5 multi.h5 -input_json multi.json -batch_size 100 -seq_length 70 -dropout 0.5 -rnn_size 2500 -num_layers 2
# ...
# Epoch 28.52 / 50, i = 65000 / 118100, loss = 0.901009
# val_loss =      1.028011712161

This trained to convergence with a loss of ~1.03 after ~30 epochs taking a week or two, yielding 2016-03-27-metadata.t7 (583MB). This is ~0.05 better than the unlabeled baseline.

Did it succeed in learning to use the metadata and mimicking style?

Success

Yes. Sampling 80K characters of text on CPU, and setting the temperature high enough that the RNN will periodically emit a newline and jump to a new mode, with the invocation th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 80000 -temperature 0.8 -start_text 'JORDAN|', there are 13 transitions:

  1. Jordan: short but fail. Mentions “London”, “Jacques”, “Nantucket”, etc

  2. Maupassant: success. Poison, murder, city etc

  3. Lafferty: mixed success. Clubs, girls, Chicago, heavy on dialogue, and American names, but also some vocabulary creeping in from other authors such as “Tar Valon” (Jordan)

  4. Chaucer: success. Clearly old-timey with invocations of Jesus. Sample:

    “…throughout this world, and shall thereby be called in trust, as now O first cause of this world we have no danger; That women were with you and the message, As I loved them they that should pray: No more of this so little wickedness.” When she saw him that there was no wight to see, For in his cursed peace, his Christe’s hand, And cried his daughter many a long time For he took her out of the world so dear. And she was not holy and more jolly, Had wedded her no sooth and blithe sore; The lady is this marriage and her wife. Come to the priest, what woe we have to do, And thanke him to make a dream, and I can Thomas, with that he saide, may I not stand: And the time went him all out of the town, And with the corpse, and settled him like As Jesus Christ, as he was thought, They would have been a full confused grace.

  5. Whitman: short but success?

    WHITMAN|but lusty, closing the walls, Who are the clauses of cavalry with

  6. Chaucer: success

  7. Lincoln: success. Sample:

    LINCOLN|of his constitutional affairs, is better put down by their own things than above the extent of the majority of the people or of the Republicans of the United States which in the extremes may be said to be one of those who will obtain bad negro as ill-demanded and simple means as they have belonged. r. Pitt in the same manner in Parliament I have not seen him in the other uncommon personal expedition to the British court, and that his thirst was the object, or in which he wrote liberty for supporting him in the present day with an extreme resolution of the sovereignty…

  8. Bible: success. Sample:

    BIBLE|with him two cities which I commanded them; he shall not die: for the LORD is among us. And the LORD was come unto his son that sent him to seek the way to Adon. 02:019:019 And it came to pass at the end of three days after the people of Israel, that they had to touch their voice, and give him a south, and be cut before Pharaoh: 04:030:028 And the LORD spake unto oses, saying, 03:022:002 There shall not a man be found out of the house of the LORD. 03:013:028 And the priest shall have one lot and the length of the bullock, and shall put the blood upon the altar, and put the altar of gold to his feet, and set his finger in water, and shall come into the plain. 03:011:027 And the priest shall take the butler and the head of the servant shall sprinkle it out, and the priest shall burn it into a ring, and cover the fat that is upon the altar, and shall pitch it out. 03:001:004 And he shall put the lamps in water, even a trespass offering, and the hanging for the robe of the burnt offering, and put the altar of shittim wood, and burn the altar of burnt offering unto the LORD.

  9. Stoker: success. Victorian English, mention of cemeteries, disemvoweling, Van Helsing.

  10. Lafferty: mixed success. More Chicago and Lafferty-like vocabulary, but what is “Renfield” doing there—that’s Stoker!

  11. Ryukishi07: success. Sample:

    RYUKISHI07|of something like that. You can stop too long, a little bit more spinning stuff. You could put away the first side of your way out on the study at the end of the ‘Sea From Battler’. “I see, isn’t it‽ Ooooooohhhh…” In other words, if the seagulls had been known to have been over there already, the Shannon wouldn’t have accepted a servant. …And when George-aniki suddenly put his head over and spat on his shoulders, Rand said, showing some relationship to her. He was calm and was jealous of his nearly much image or experience. “………………Hahahahaha……….” Natsuhi noticed that tune from the warm block, and it was quite a small part of it… “I’m not gonna be out of the main way. Where’s the witch‽” Natsuhi oba-san said something about forty… The fork of gold wasn’t like whispering every day. “…You’re still unable to make me. Now if you stay back to the back of the world part of my heart, that’s wrong. …………But I really have here a magazine.” “Ah, ………don’t worry about it. I wouldn’t call a lot one.” “That’s right. …If it was a metal bird, I would also stay here. I’m sorry, but it’s a fantastic person who is still living in your speed… If you couldn’t think of it, that’s right. If you want to call me a bed, I’d be swept by your duty and you may be fine.” “…………………” “……W, ………what are you going to do with the culprit? Did you say something like that…?” Natsuhi returned the rose garden. As the announcement had finished looking over his, he heard the overwhelming sound of the falling hair, on the windows, his eyes slicing around the sound of a pair of hold of holes in one hand. …

  12. Doyle: mixed success. There appears to be infiltration from Lincoln.

  13. Montaigne: mixed success. Discusses France, but also Melville’s Nantucket.

So of the 13 samples, 8 were definitely in the style of the right author, 4 were mixed successes as they mostly resembled their author but not entirely, and only 1 was a clear failure. With 31 authors to choose from, that’s not an accident.

One Walt Whitman pastiche sample I generated while testing struck me as quite poetic; with line breaks inserted where indicated by capitalization:

"WITH THE QUEEN OF OTHER HOLY SAILOR"
And shes my brothers to be put upon me, intense and sound,
All are me. Sounds purified, O sound of the streets!
O landscapes! O still the fierce and the scraping of beauty!
The murderous twinkle of the sky and basement,
How the beasts at first began to bite and the waves near the floor.
The walls of lands discover'd passions,
Earth, sword-ships, enders, storms, pools, limailes, shapes of violent,
Rooters, alarms, the light-starring mail, untold arms, patients, portals, the well-managed number, the bravest farms,
The effect of doubts, the bad ways, the deeds of true signs, the curious things, the sound of the world,
It is of figure and anthem, the common battle rais'd,
The beautiful lips of the world that child in them can chase it
...

For a more systematic look, I generated samples from all included authors:

(for AUTHOR in `echo "ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI DOYLE ELIOT GILBERTSULLIVAN \
                      GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN MACHIAVELLI MAUPASSANT MELVILLE MONTAIGNE PAINE PEPYS \
                      POE RYUKISHI07 SHERMAN STOKER WHITMAN WOLFE"`; do
    th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 5000 -temperature 0.8 -start_text "$AUTHOR|"
done) > 2016-03-27-rnn-metadata-samples-all.txt

The Eliot output was perplexingly bad, consisting mostly of numbers, so I looked at the original. It turned out that in this particular corpus, 10 of the text files had failed to download, and instead, Project Gutenberg served up some HTML CAPTCHAs (not cool, guys)! This affected: Coleridge, Dante, Da Vinci, Eliot, Gilbert & Sullivan, Grant, Homer, Kafka, Pepys, & Sherman. (Checking the output, I also noticed that a number of words starting with capital ‘M’ were missing the ‘M’, which I traced to the tr call trying to strip out control characters that did not do what I thought it did.) Excluding the corrupted authors, I’d informally rank the output subjectively as:

  • bad: Aristotle, Beowulf, Bible, Chaucer, Jordan, Keats
  • uncertain: Carroll, Wolfe
  • good: Stoker, Paine, Bonaparte, Lafferty, Melville, Doyle, Ryukishi07, Whitman, Lafferty, Machiavelli, Aristotle, Bible

The RNN is somewhat inconsistent: sometimes it’ll generate spot-on prose and other times fail. In this case, good and bad Bible samples were present, and the previous Chaucer was fine but the Chaucer in this sample was bad. (This might be due to the high temperature setting, or the messed-up texts.) But overall, it doesn’t change my conclusion that the RNN has indeed learned to use metadata and successfully mimic different authors.

Training with prefixes+suffixes

The RNN seems to learn the connection of the prefix metadata to the vocabulary & style of the following text only at the very end of training, as samples generated before then tend to have disconnected metadata/text. This might be due to the RNN initially learning to forget the metadata in order to focus on language modeling, and only ‘noticing’ the connection between the metadata and the kinds of text after developing an implicit model of the different kinds of text. (Or, to put it another way, it doesn’t learn to remember the metadata immediately, as the metadata tag is too distant from the relevant text and the metadata is only useful for too-subtle distinctions which it hasn’t learned yet.) What if we tried to force the RNN to memorize the metadata into the hidden state, thereby making it easier to draw on it for predictions? One way of forcing the memorization is to force it to predict the metadata later on; a simple way to do this is to append the metadata as well, so the RNN can improve predictions at the end of a sample (predicting poorly if it has forgotten the original context); so text would look something like SHAKESPEARE|...to be or not to be...|SHAKESPEARE.
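
A sketch of the intended line format (the actual change was just a small modification to the sed call in the preprocessing pipeline):

def label_line(author, chunk):
    # metadata both before and after the text, so the RNN is also scored on
    # reproducing the author tag at the end of the line
    return f"{author}|{chunk}|{author}"

label_line("SHAKESPEARE", "...to be or not to be...")
# => 'SHAKESPEARE|...to be or not to be...|SHAKESPEARE'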

I modified the data preprocessing script slightly to append the author as well, but otherwise used the same dataset (including the corrupt authors) and training settings.

My first try at appending resulted in a failure, as it converged to a loss of 1.129 after a week or two of training, much worse than the 1.03 achieved with prefix-only. Sampling text indicated that it had learned to generate random author metadata at the end of each line, and had learned to mimic some different prose styles (eg Biblical prose vs non-Biblical), but it had not learned to memorize the prefix nor even the use of the prefix (!).

A second try with the same settings converged to 1.1227 after 25 epochs, with the same sampling performance.

In a third try, I resumed from that checkpoint but increased the BPTT unrolling seq_length from 50 to 210 to see if that would help it. It converged to 1.114, with suffixes still random. For a fourth try, I reduced dropout from 0.5 to 0.1, which did not make a difference and converged to 1.117 after 8 epochs.

So in this case, training with suffixes did not speed up training, and impeded learning.

While I am not too surprised that suffixes did not speed up training, I am surprised that it barred learning the prefixes at all, and I don’t know why. This should have been, if anything, an easier task.

Classification

I wondered if the same metadata approach could be used to trick the char-RNN into learning classification as well—perhaps if the RNN learns language modeling by trying to predict subsequent characters, it acquires a greater natural language understanding than if it were trained directly on predicting the author?

I fixed the corrupted HTML files and the tr bug, and modified the script to read fold --spaces --bytes --width=3000 (so each line is 3000 characters long), with the author now placed at the end: sed -e "s/$/\|$AUTHOR/". So the char-RNN is trained to predict each subsequent character, and at the end of 3000 characters, it sees a | and (in theory) will then predict the author. To test the results, one can feed in a short stereotypical piece of text ending in a pipe, and see if it is able to respond by generating the author.
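
The test can be automated by priming the sampler with a text plus a trailing pipe and reading off whatever it generates next; a sketch calling the torch-rnn sampler from Python (the checkpoint path is hypothetical and the output parsing is naive):

import subprocess

def classify(text, checkpoint="cv/classifier.t7"):
    # prime the sampler with the text plus a trailing pipe at low temperature and
    # read off what it generates next, which should be the author tag if training worked
    out = subprocess.run(
        ["th", "sample.lua", "-gpu", "-1", "-checkpoint", checkpoint,
         "-length", "20", "-temperature", "0.1", "-start_text", text + "|"],
        capture_output=True, text=True).stdout
    return out.split("|")[-1].strip()

# classify("Thou shalt not tempt the Lord thy God")  # ideally => 'BIBLE'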

This turned out to be a total failure. After over a week of training, the validation loss had fallen to 1.02, yet when I sampled it, it was unable to classify text, eg:

th sample.lua -gpu -1 -checkpoint `ls -t cv/*.t7|head -1` -length 44 -temperature 0.1 -start_text "Thou shalt not tempt the Lord thy God|B"
# Thou shalt not tempt the Lord thy God|Becaus

At best, it would sometimes add random upcased text following the pipe (“|CHAPTER” was common), or random authors (never the right one).

I thought perhaps the penalty for missing the final characters in a line was too small, as they represented no more than 0.3% of each line, and so I reduced the line-length down to 500 characters (so the author was now ~2% of each line). This didn’t work either (validation loss of ~1.12, probably due to shorter lines with less context to work with), so I disabled dropout, added batchnorm, and increased the BPTT enough to backpropagate over the entire line.

After another week or two, the validation loss asymptoted at ~1.09, but still with no classification performance. Here is a sample (adding line-breaks for readability at capitalized words, which correspond to linebreaks in the original):

41 Book 40 With patient ones of the seas, the form of the sea which was gained the streets of the moon.
Yet more all contest in the place, See
the stream and constant spirit, that is of a material spirit,
The live of the storm of forms and the first stretch
Of the complexion of the mountains;
The sea fell at the tree, twenty feet wide,
And the taste of a scarlet spot where the captain bears,
She shook the sound the same that was white,
Where the permanent eye of the sea had scarce assembled,
The many such, the beauteous of a subject of such spectacles.
If thou be too sure that thou the second shall not last,
Thou canst not be the exceeding strength of all.
Thou wert as far off as thou goest, the sea Of the bands and the streams of the bloody stars
Of the world are the mountains of the sun,
And so the sun and the sand strike the light,
But each through the sea dead the sun and spire
And the beams of the mountain shed the spirits half so long,
That of the which we throw them all in air.
Think of thy seas, and come thee from that for him,
That thou hast slain in dreams, as they do not see
The horses; but the world beholds me; and behold
The same the dark shadows to the sand,
And stream and slipping of the darkness from the flood.
He that I shall be seen the flying strain,
That pierces with the wind, and the storm of many a thousand rays
Were seen from the act of love to the course.
There was a stream, and all the land and bare
Ereth shall thy spirit be suppos'd
To fall in water, and the wind should go home on all the parts
That stood and meet the world, that with the strong the place
Of thy prayer, or the continual rose,
So that the shape of the brand broke the face,
And to the band of the ring which erewhile
Is turn'd the merchant bride.
I am thine only then such as thou seest,
That the spirits stood in those ancient courses,
And in their spirit to be seen, as in the hard form
Of their laws the people in the land,
That they are between, that thou dost hear a strong shadow,
And then, nor war in all their powers, who purposes hanging to the road,
And to the living sorrow shall make thy days
Behold the strains of the fair streets, and burn,
And the shepherd for the day of the secret tear,
That thou seest so high shall be so many a man.
What can ye see, as sinking on the part
Of this reminiscence of the pursuit?
Behold the martial spirits of men of the rock,
From the flowers of the touch of the land with the sea and the blow
The steamer and the bust of the fair cloud.
The steps behind them still advanc'd, and drew,
As prepared they were alone all now
The sharp stick and all their shapes that winds,
And the trembling streams with silver the showering fires
The same resort; they stood there from the plain,
And shook their arms, sad and strong, and speaks the stars,
Or pointed and his head in the blood,
In light and blue he went, as the contrary came and beat his hands.
The stars, that heard what she approach'd, and drew
The shore, and thus her breast retraced the rushing throng:
"And more with every man the sun
Proclaims the force of future tongues
That this of all the streams are crack'd."
"The thought of me, alas!" said he,
"Now that the thirst of life your country's father sang,
That in the realms of this beast the prince
The victor from the true betray beginnings of the day."

The generated text is semi-interesting, so it’s not that the RNN was broken. It was focused on learning to model the average text.

So it would seem that the classification signal was not strong enough to cause learning of it. The worsened validation score suggests that this approach simply won’t work: the longer the lines, the less incentive there is for classification, but the shorter the lines, the worse it learns to model the regular text.

Transforms

Can we learn mul­ti­ple meta­data pre­fix­es? Like an au­thor and then a trans­form of some sort—in mu­sic, a use­ful trans­form might be time sig­na­ture or in­stru­ment set.

A simple transform we could apply here is upcasing and downcasing every character, so we might have a set of 6 prefixes like Bible+upcase, Bible+downcase, Bible+mix, etc, written as BIBLE|U|, BIBLE|D|, BIBLE|M|, and, to help enforce abstraction, also the reverse ordering like U|BIBLE|, giving 12 total prefixes (3 casings × 2 authors × 2 orderings). The interesting question here is whether the RNN would be able to factor out the transformations and learn the up/mix/downcase transformation separately from the Bible/Jordan difference in styles. (If it thought that Jordan upcased was a different author from Jordan downcased, to be learned separately, then we would have to conclude that it was not seeing two pieces of metadata, Jordan+upcase, but one monolithic JORDANUPCASE, a failure of both learning and abstraction.) But if we included each of the 12 prefixes, then we wouldn't know whether it had managed to do this, since it could have learned each of the 12 separately, which might or might not show up as much worse performance. So we should leave out two prefixes: one to test generalization of casing, and one to test swapping of the ordering (dropping 1 from Jordan and 1 from the Bible to be fair). At the end, we should get an RNN with a validation loss slightly worse than 0.9763 (the extra transformation & keyword must cost something), and one which will hopefully be able to yield the correct output for the held-out prefixes JORDAN|U| and M|BIBLE|.

rm *.t7 *.transformed input.txt
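## for each author's e-book: strip the Project Gutenberg header & boilerplate, flatten the text to
## one long line, refold it into 3000-character lines, and emit 6 labeled variants per author:
## 3 casings (M=mixed, U=upcased, D=downcased) × 2 prefix orderings (AUTHOR|CASE| vs CASE|AUTHOR|)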
for FILE in *.txt; do
  AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
  TEXT=$(cat $FILE | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
    -e 'ISBN' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g')
  echo $TEXT | fold --spaces --width=3000 |                              sed -e "s/^/$AUTHOR\|M\|/" >> $FILE.transformed
  echo $TEXT | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/$AUTHOR\|U\|/" >> $FILE.transformed
  echo $TEXT | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/$AUTHOR\|D\|/" >> $FILE.transformed
  echo $TEXT | fold --spaces --width=3000 |                              sed -e "s/^/M\|$AUTHOR\|/" >> $FILE.transformed
  echo $TEXT | fold --spaces --width=3000 | tr '[:lower:]' '[:upper:]' | sed -e "s/^/U\|$AUTHOR\|/" >> $FILE.transformed
  echo $TEXT | fold --spaces --width=3000 | tr '[:upper:]' '[:lower:]' | sed -e "s/^/D\|$AUTHOR\|/" >> $FILE.transformed
done
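## hold out 2 of the 12 prefix combinations (JORDAN|U| and M|BIBLE|) to later test generalization: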
cat *.transformed | grep -v -e "JORDAN|U|" -e "M|BIBLE|" | shuf > input.txt

The first version, sans dropout, got to a loss of 0.7969 (!); contamination or leakage of the validation set? But since any near-duplicates in the validation set could only be differently-cased versions of training text, wouldn't the RNN have had to learn the casing transformation to exploit them, in which case it's not really leakage at all? After it hit a limit at 0.79 and started turning in losses of 0.8+ for hours, I tried retraining it with some dropout, and the loss exploded, not shrinking even after training all night, so I restarted with a fresh RNN and some dropout, getting a more stable training result.

Unfortunately, it did not work: sampling with the two held-out prefixes showed that the RNN had not learned to generalize.

Conclusions

So some lessons here are:

  1. use a sufficiently large RNN; 500 neurons may be adequate to model a single author like the Bible or Shakespeare but is too small to learn many authors, despite the savings from sharing one model across authors
  2. train to convergence; the differences between authors are smaller than the difference between the average author & random noise, and the metadata will only show its worth at the end of training, once the loss has reached ~1
  3. keep data rel­a­tively bal­anced, or the RNN will spend all its effort try­ing to learn pat­terns & vo­cab­u­lary of the most com­mon kind of in­put

Fur­ther work:

  • mul­ti­ple meta­data: author/genre/work, per­haps. The RNN might learn to dis­en­tan­gle the var­i­ous fac­tors, so one could gen­er­ate sam­ples from BIBLE|RELIGION|RAYMOND_CHANDLER|. Mu­sic in ABC no­ta­tion would be an­other tar­get as ABC sup­ports genre meta­data and there might be use­ful ABC data­bas­es.

  • visualize the RNN hidden state to look for ‘grandmother neurons’; could such neurons be used to create a textual equivalent of CNN style transfer and ‘transfer’ the style of, say, Biblical prose to hard-boiled detective stories?

    My be­lief is that a genre/author-classification+unsupervised-prediction char-RNN may be able to do style trans­fer. This is be­cause such a char-RNN should learn a clean sep­a­ra­tion be­tween the meta­data (style) and the se­man­tics (con­tent).

    In genre/author clas­si­fi­ca­tion, the hid­den state in­cre­men­tally builds up an in­ferred genre/author as it processes the text se­quence; in un­su­per­vised pre­dic­tion, the hid­den state in­cre­men­tally builds up a sum­mary of past se­man­tic­s+syn­tax as it tries to pre­dict the next char­ac­ter. The hid­den state rep­re­sent­ing the best cur­rent guess for clas­si­fi­ca­tion will be mostly sta­tic be­cause it will quickly reach high con­fi­dence as to the genre/author and then the neu­rons en­cod­ing that in­for­ma­tion must be pro­tected long-term from be­ing mod­i­fied; in con­trast, the se­man­tic­s+syn­tax hid­den state is chang­ing every time-step and if its dis­trib­uted en­cod­ing over­lapped with the genre/author dis­trib­uted en­cod­ing, it would quickly for­get its orig­i­nal con­clu­sions about genre/author.

    This op­po­si­tion should yield a trained char-RNN with a few neu­rons de­voted solely to genre/author and the rest de­voted to se­man­tic­s+syn­tax en­cod­ing.

    Given such a clean split, some­thing anal­o­gous to the style trans­fer CNN should be pos­si­ble. First, fig­ure out which neu­rons are which; then feed in texts from differ­ent genre/authors and ex­tract the hid­den state cor­re­spond­ing to each genre/author, eg Bible vs Wheel of Time. To con­vert a piece of Wheel of Time prose into Bib­li­cal prose or vice ver­sa, feed in a de­sired piece of text to pro­duce the genre/author and se­man­tic­s+syn­tax hid­den state vec­tors; now, hard­wire the se­man­tic­s+syn­tax vec­tor and do gra­di­ent as­cent on the in­put text to grad­u­ally turn the orig­i­nal genre/author hid­den state into the tar­get genre/author hid­den state; once the trans­formed text yields both the tar­get genre/author hid­den state but also the same se­man­tic­s+syn­tax hid­den state, it has been con­vert­ed. Hy­po­thet­i­cal­ly, to the ex­tent that the char-RNN has learned Eng­lish se­man­tics and prose styles, this would con­vert text into differ­ent styles while pre­serv­ing the se­man­tics.

    This might not work with a char-RNN do­ing char­ac­ter-level pre­dic­tion if the learned se­man­tic­s+syn­tax turns out to be weak enough that a con­verted piece of text only bears a faint re­sem­blance to the orig­i­nal. (Per­haps the se­man­tics don’t add enough pre­dic­tive pow­er, or the char-RNN is small enough that it must use all its ca­pac­ity learn­ing vo­cab­u­lary etc.) If it does­n’t, some other ap­proaches might be to train a clas­si­fi­ca­tion char-RNN, pro­vid­ing the style met­ric, and also a se­quence-to-se­quence au­toen­cod­ing RNN to pro­vide a se­man­tics en­cod­ing; then set the style tar­get to be the de­sired style, hard­wire the au­toen­coder, and use them jointly as a loss to do gra­di­ent de­scent on. RNNs can also be com­bined with CNNs, and this may al­low a more di­rect bor­row­ing of the orig­i­nal style trans­fer al­go­rithm.

Appendix

Geocities char-RNN

Geocities (1994–2009) was an Internet service for hosting personal webpages which featured a wide range of idiosyncratic and unusual content. Geocities Forever is a website created by Aanand which features text generated by a small CPU-trained 3×512 char-RNN on a small 50MB sample of the raw HTML from the ArchiveTeam Geocities corpus. The generated HTML is amusing but also shows some weaknesses in generating interleaved English/HTML, which I thought was connected to undertraining on a small corpus—based on my earlier experiments with char-RNN models of CSS and multiple English authors, I know that char-RNNs are capable of switching languages smoothly. During October–November 2016, I attempted to train a larger 2×3000 RNN on a 1GB+ sample using torch-rnn, and ran into issues:

  • the larger cor­pus had qual­ity is­sues re­lated to some files be­ing present many times, in­clud­ing 1 file which was present in sev­eral thou­sand copies
  • train­ing re­peat­edly “bounced” in that after quickly reach­ing low train­ing & val­i­da­tion losses and gen­er­at­ing high­-qual­ity text sam­ples, er­ror would sky­rocket & text sam­ples plum­met in qual­ity (or not be gen­er­ated at all due to mal­formed prob­a­bil­i­ties)

Cleaning and shuffling the corpus reduced the quality issue, and reducing the learning rate substantially helped avoid the bouncing problem, but ultimately the goal of high-quality text samples was not reached before my laptop died and I was forced to stop GPU training. Training a char-RNN on very large text corpuses is more difficult than I expected, perhaps because the variety of content overloads the RNN's model capacity and creates catastrophic forgetting unless it is trained for a very long time at low learning rates for many epochs.

Having downloaded the torrent, one finds the 7z-compressed files laid out according to the original Geocities ‘neighborhood’ structure; they must be extracted.

Data extraction

The bulk of the torrent is image files and other media content, while we only want the HTML, so we extract just those files; and to keep the content easily readable and avoid any possible binary corruption or weird characters, we convert everything to ASCII before writing to disk:

cd ~/torrent/geocities.archiveteam.torrent/
## 'shuf' call added to randomize order of HTML files and make minibatches more i.i.d.
## due to training problems
for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`;
do
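    ## extract only the HTML files from each archive to stdout, force the text to ASCII,
    ## and append everything to a single corpus file: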
    7z x -so $ARCHIVE | tar x --wildcards "*.html" --to-stdout | iconv -c -tascii >> geocities-corpus.txt
done

wc --chars data/geocities-corpus.txt
# 984248725
du data/geocities-corpus.txt
# 961188 geocities-corpus.txt

The to­tal HTML con­tent is ~9GB, more than ad­e­quate.

A quick in­spec­tion shows that the HTML is ex­cep­tion­ally ver­bose and repet­i­tive due to in­jected Geoc­i­ties HTML and copy­-paste. What sort of train­ing loss could we ex­pect from the con­tent? We can look at the bit­s-per-char­ac­ter per­for­mance of a com­pres­sion util­i­ty:

LZMA/xz base­line:

cat data/geocities-corpus.txt  | xz -9 --stdout | wc --bytes
# 146915476

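## compressed size in bits divided by the number of characters = bits per character: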
(146915476*8) / 984248725
# 1.194132924

xz manages 1.194bpc; converting bits to nats (the natural-log cross-entropy which torch-rnn reports as its loss), xz achieves the equivalent of a per-character loss of ~0.83:

1.194132924 * log(2)
# [1] 0.8277099

RNNs can model very nonlinear and complicated phenomena, but they also have tiny hidden-state/memories and so suffer in comparison to a compression utility which can refer back to long literals stored in RAM (xz -9 uses a 64MB dictionary and hundreds of megabytes of working memory while compressing). So if the RNN can reach ~0.83, that would be acceptable.

An­other way to put it: how many lines are re­peat­ed? A com­par­i­son of wc --lines and sort --unique | wc --lines shows that a sur­pris­ingly low num­ber of lines are unique, sug­gest­ing even more rep­e­ti­tion in the HTML parts than I ex­pect­ed.
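
Concretely (a sketch, using the corpus filename from above):

wc --lines data/geocities-corpus.txt
sort --unique data/geocities-corpus.txt | wc --lines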

torch-rn­n’s preprocess.py script, and its train­ing, store all data in RAM, so us­ing all 9GB turns out to be in­fea­si­ble. 1GB turns out to use an ac­cept­able av­er­age ~34% of my lap­top’s 16GB RAM for pre­pro­cess­ing & train­ing.
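
A minimal sketch of cutting the extracted corpus down to a 1GB sample before preprocessing (filenames assumed; the preprocess.py flags are the same ones used later):

head --bytes=1GB geocities-corpus.txt > geocities-corpus-1GB.txt
python scripts/preprocess.py --input_txt geocities-corpus-1GB.txt \
                             --output_h5 geocities-corpus.h5 --output_json geocities-corpus.json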

Training

My ini­tial set of train­ing hy­per­pa­ra­me­ters:

  • checkpointing: ~1s per minibatch, and I want to checkpoint every few hours, so every 20,000 minibatches (~5.5 hours)

  • batch size: 2, to re­duce VRAM use as much as pos­si­ble (RNN train­ing will be less sta­ble with such tiny batches but will still work)

  • lay­ers: 3 for com­pa­ra­bil­ity with the orig­i­nal

  • neuron count: as large as will fit in VRAM, which turns out to be ~5× the original 512, or 2600

  • dropout: since we have a lot of data, overfitting is less of a concern and dropout does not need to be high; 0.1

  • BPTT se­quence length: 20 (re­duced from de­fault 50 to again re­duce VRAM use at some cost to fi­nal model qual­ity in terms of mod­el­ing long-term de­pen­den­cies)

  • batch­norm: usu­ally helps, so turned on

  • learn­ing rate, de­cay, word­vec size, clip­ping: torch-rnn de­faults

  • to­tal:

    th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
                      -checkpoint_every 20000 -batch_size 2 -seq_length 20 -rnn_size 2600 -num_layers 3 -learning_rate 2e-3 \
                      -dropout 0.2 -batchnorm 1 -init_from `ls -t ./cv/*.t7 | head -1`

Performance was bad: training loss ~3.5, validation loss after 2 days: 4.61/4.69/4.49. Not good! Is 3 layers too unstable? A minibatch size of 2 too unstable? (Increasing the minibatch requires decreasing RNN size because there's nothing left to cut.) Not enough BPTT? Let's try switching to 2 layers, which frees up a ton of memory for the minibatch & BPTT:

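## 2 layers frees enough VRAM for a larger minibatch (5), longer BPTT (90), and a wider RNN (3300):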
th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
                  -checkpoint_every 20000 -batch_size 5 -seq_length 90 -rnn_size 3300 -num_layers 2 \
                  -learning_rate 2e-3 -dropout 0.2 -batchnorm 1

It trains within 1000 minibatches to ~0.6 training loss, often with training loss below the xz bound, but validation loss explodes! There's also odd training-loss behavior: it seems to bounce from the low training-loss regime past 1 to as high as the 3s for long periods.

If it is not overfitting in general, the problem could be non-stationarity of the input and overfitting on specific parts; preprocess.py doesn't do any shuffling. We can force shuffling by going back and shuffling the extracted files, or on a line level by re-preprocessing the corpus:

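## shuffle the corpus in 1000-line chunks so minibatches are closer to i.i.d.: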
split -l 1000 geocities-corpus.txt tmp
cat $(ls tmp* | shuf) > geocities-corpus-shuffled.txt
rm tmp*
python scripts/preprocess.py --val_frac 0.000012 --test_frac 0.000012 --input_txt geocities-corpus-shuffled.txt \
                             --output_h5 geocities-corpus.h5 --output_json geocities-corpus.json

And by in­creas­ing BPTT & dropout:

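## same 2×3300 RNN, but BPTT increased to 100 & dropout to 0.5, resuming from the last checkpoint: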
th train.lua -input_h5 geocities-corpus.h5 -input_json geocities-corpus.json -checkpoint_name cv/geocities \
    -checkpoint_every 15000 -batch_size 5 -seq_length 100 -rnn_size 3300 -num_layers 2 \
    -learning_rate 2e-3 -dropout 0.5 -batchnorm 1 -init_from cv/geocities_60000.t7

Still we see the same ‘bounce’ from bet­ter-than-xz pre­dic­tive per­for­mance to 2–3 train­ing loss. To check if it was size that was the prob­lem, I went back to Aanand’s orig­i­nal 3×512 ar­chi­tec­ture:

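## back to the small 3×512 architecture, which leaves VRAM for a large minibatch & long BPTT: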
th train.lua -input_h5 data/geocities-corpus.h5 -input_json data/geocities-corpus.json -checkpoint_name cv/geocities \
             -checkpoint_every 10000 -batch_size 130 -seq_length 225 -rnn_size 512 -num_layers 3 -learning_rate 2e-3 \
             -dropout 0.5 -batchnorm 1

After ~9 hours, it had reached a validation loss of 1.05 and generated output looks pretty good1 but then it bounced overnight and output became garbage again. (For 1GB and a 3×512 RNN, 1 epoch takes somewhat over 1 day.) It is still acting like it's overfitting. Why?

Data cleaning

I took a closer look at the data and noticed something odd while skimming through it: it's not just the HTML boilerplate that's repeated, but many parts of the content as well (eg searching for the word “rude” turns up the same lengthy complaint repeated hundreds of times in the sample). Are the excellent xz compression, the occasional excellent RNN training loss, and then the ‘bounce’ all due to content being repeated many times, leading to severe overfitting and then extremely high error when the RNN finally runs into some of the unrepeated content?

There are several possible sources of repetition: the original find command ran on all 7z archives, including the multipart archives in the torrent, so possibly some archives got decompressed multiple times (perhaps 7z, given an archive like “archive.7z.8”, goes back and decompresses starting from “archive.7z.1”?). If so, then rerunning the extraction but writing all files to disk will make those duplicates go away (the duplicates will simply get decompressed & overwritten repeatedly). And if the repetition is due to multiple identical files with different names/paths, then there will still be a lot of duplication, but a file-level deduplication tool like fdupes should detect and delete them.

For file-level du­pli­cate dele­tion and recre­at­ing the cor­pus:

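## re-extract to disk (so repeated decompressions just overwrite the same paths), delete
## byte-identical duplicates with fdupes, then rebuild a shuffled, line-wrapped 1GB corpus: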
for ARCHIVE in `find LOWERCASE/ UPPERCASE/ -type f -name "*.7z*" | shuf`
do
    nice 7z x -so $ARCHIVE | tar x --verbose --wildcards "*.html"
done
fdupes . --recurse --omitfirst --sameline --size --summarize --delete --noprompt

find . -type f -name "*.html" -print0 | shuf --zero-terminated | xargs --null cat | \
    iconv -c -tascii | fold --spaces --width=150 | \
    head --bytes=1GB > geocities-corpus.txt

After extracting to disk to eliminate the redundant decompression, and checking for & deleting duplicated files, I restarted training. After 20k minibatches, training loss was steady in the 2–3 range, validation loss continued to explode, and I could not even sample because the output was so ill-behaved (the multinomial probability problem). So the problem was still not solved, and a grep for “rude” indicated the redundancy problem was still present.

I went back into the orig­i­nal ex­tracted Geoc­i­ties HTML files look­ing for that weird ‘rude’ page which ap­pears thou­sands of times; an ag search in­di­cated that it shows up ~31k times in two di­rec­to­ries:

  • ./geocities/YAHOOIDS/m/i/mitzrah_cl/ (5.2GB, 334595 HTML files)
  • ./geocities/YAHOOIDS/T/o/Tokyo/6140/myself/sailormars/karen/site_features/hints_n_tips/site_features/www_stuff/www_resources.html (0.527GB, 33715 files)
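
The ag check was along these lines (a sketch; the exact invocation & path are assumed):

ag --count 'rude' ./geocities/ | sort --field-separator=: --key=2 --numeric-sort | tail --lines=20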

Look­ing at file­names, there are also many pos­si­bly du­pli­cated pages:

find . -type f -name "*.html" | parallel basename | sort | uniq --count | sort --numeric-sort | tac | less
#  612978 index.html
#  114691 sb.html
#   72080 links.html
#   37688 awards.html
#   36558 pics.html
#   34700 music.html
#   32987 geobook.html
#   32010 myaward.html
#   31216 hints.html
#   31053 sailormoon_rei.html
#   30953 www_resources.html
#   30670 myself.html
#   30522 intro.html
#   30184 banner_xchange.html
#   30126 tutorial_intro.html
#   13885 main.html
#   11642 disclaimer.html
#   10051 index2.html
#    7732 live.html
#    7490 tmmb.html
#    7472 everclear.html
#    7325 sublime.html
#    7264 sugarray.html
#    7065 gallery.html
#    6637 news.html
#    6566 menu.html
#    6344 home.html
#    5924 page2.html
#    5426 me.html
#    5224 friends.html
#    4986 pictures.html
#    4435 page3.html
#    4186 pictures2.html
#    4105 addbook.html
#    4076 contact.html
#    4008 profile.html
#    3935 bio.html
#    3822 history.html
#    3778 about.html
#    3769 Links.html
#    3728 photos.html
#    3682 page4.html
#    3549 webrings.html
#    3468 index1.html
#    3378 family.html
#    3297 chat.html
#    3136 link.html
#    3058 aboutme.html
#    3021 page5.html
#    2980 baking.html
#    2937 info.html
#    2855 film.html
#    2816 talents.html
#    2800 balloon.html
#    2793 quotes.html

I could delete everything except one random “bio.html” or “myaward.html” etc, but first I tried deleting everything in mitzrah/ and myself/. This makes the filenames look much more diverse; spot checks of the files named “sb.html” & “everclear.html” suggest that the duplicated filenames now represent legitimate, non-repeated content which happens to have similar filenames due to serving similar roles in people's personal webpages.

...
#  612967 index.html
#  114691 sb.html
#   40122 links.html
#   32986 geobook.html
#   13885 main.html
#   11642 disclaimer.html
#   10051 index2.html
#    7732 live.html
#    7490 tmmb.html
#    7472 everclear.html
#    7325 sublime.html
#    7264 sugarray.html
#    7065 gallery.html
#    6637 news.html
#    6605 awards.html
#    6566 menu.html
#    6344 home.html
#    5924 page2.html
#    5426 me.html
#    5224 friends.html
#    4986 pictures.html
#    4605 music.html
#    4598 pics.html
#    4435 page3.html
#    4186 pictures2.html
#    4105 addbook.html
#    4074 contact.html
#    4008 profile.html
#    3935 bio.html
#    3822 history.html
#    3778 about.html
#    3769 Links.html
#    3728 photos.html
#    3682 page4.html
#    3549 webrings.html
#    3467 index1.html
#    3378 family.html
#    3297 chat.html
#    3136 link.html
#    3058 aboutme.html
#    3021 page5.html
#    2980 baking.html
#    2937 info.html
#    2855 film.html
#    2816 talents.html
#    2800 balloon.html
#    2793 quotes.html
#    2681 intro.html
#    2621 lyrics.html
#    2597 top.html
#    2587 banjo.html
#    2577 webmaster.html
#    2529 roleplay.html
#    2494 garden.html
#    2474 index3.html

Skim­ming the fi­nal cor­pus also does­n’t show any bla­tant rep­e­ti­tion.

The bounce continues

After this data cleaning, I restarted training from the last checkpoint, same settings. 100,000 minibatches/4 epochs later, sampling still fails and validation loss is in the 100s! Restarting with higher dropout (0.8) didn't help. Restarting with 0 dropout didn't help either—after 50,000 minibatches, a validation loss of 55.

I suspected that the 3×512 RNN may simply lack model capacity, and that the original worked because Aanand used a small corpus which was not too diverse.

Trying something intermediate between 3×512 and 1×3000, a 1×2000 RNN: after 30k minibatches / 0.7 epochs, validation loss is ~0.98 and generated samples look good. So the larger, flatter RNN is handling it better than the smaller, deeper one.

Unfortunately, the bounce is still present—initially a bounce around epoch 0.84 with generated samples much worse. After another 65k minibatches, very high quality samples, but then it bounced in training at a different place in the dataset—epoch 0.04 (after a restart due to a crash). In previous training, the data located at ~4% was perfectly well behaved and easily modeled, so it's not the data's fault but the RNN's, suggesting it's still overfitting. If so, the learning rate may be too high, so I shrank the learning rate 4×.

The lower learn­ing rate RNN still bounced, but not quite as badly as usu­al, with steady val­i­da­tion loss ~3 after a week.

Unfortunately, whether the RNN would have made further progress, or how a restart from scratch with a much smaller learning rate would have performed, is unknown: on 26 November my Acer laptop died (apparent motherboard failure, I suspect possibly due to the stress of months of GPU-training various char-RNN and other deep learning models), and due to problems with my backups, I lost data back to 14 November, including the training records & latest checkpoints.

Since the Geocities char-RNN wasn't going anywhere & I worried the training may have contributed to my laptop's failure, I stopped there. My guess is that good results could be obtained with a smaller corpus (perhaps 500MB) and a large char-RNN like 2×3000 trained with very low learning rates, but it would require at least GPU-weeks on a top-end GPU with more than 4GB RAM (to allow larger minibatches), and isn't sufficiently amusing as to be worthwhile.

Finetuning the GPT-2-117M Transformer for English Poetry Generation

In Feb­ru­ary 2019, fol­low­ing up on my 2015–2016 tex­t-gen­er­a­tion ex­per­i­ments with char-RNNs, I ex­per­i­ment with the cut­ting-edge Trans­former NN ar­chi­tec­ture for lan­guage mod­el­ing & text gen­er­a­tion. Us­ing Ope­nAI’s GPT-2-117M model pre-trained on a large In­ter­net cor­pus and nshep­perd’s fine­tun­ing code, I re­train GPT-2-117M on a large (117MB) Project Guten­berg po­etry cor­pus. I demon­strate how to train 2 vari­ants: “GPT-2-poetry”, trained on the po­ems as a con­tin­u­ous stream of text, and “GPT-2-poetry-prefix”, with each line pre­fixed with the meta­data of the PG book it came from.

With just a few GPU-days on 1080ti GPUs, GPT-2-117M fine­tun­ing can pro­duce high qual­ity po­etry which is more con­sis­tent than my char-RNN po­ems & ca­pa­ble of mod­el­ing sub­tle fea­tures like rhyming.

Split out to a separate article.


  1. I couldn't compare the quality to Aanand's original 3×512 because he didn't provide the final validation score of his model or the exact 50MB corpus to retrain on.↩︎