Touhou music by the numbers

Collect music metadata and look for patterns
statistics, sociology, Haskell, shell, R
2013-02-28–2015-02-28 abandoned certainty: possible importance: 1


Idea: correlate Touhou music production against Japanese youth unemployment: does the total production of music as measured in seconds increase with unemployment?

Opposite view: recessions dent production (perhaps because people are working harder and so have less free time even if other people are unemployed?) http://www.gamesetwatch.com/2009/12/sound_current_yokohamas_mediam.php

While the turnout at M3 remains strong, at the same time an economic recession cannot help but touch a community whose activities rely on having free time. Furthermore, while previously many hobbyists dreamed of someday breaking into the industry, more recently many also fear that game companies will begin cracking down on unlicensed tributes.

Data

Unemployment data source

Used the FRED Adjusted Unemployment Rate for Youth [15-24yo] in Japan (JPNURYNAA); downloaded as CSV, annual percentages 2000-2011:

jpn <- read.table(stdin(),header=TRUE)
      DATE VALUE
2000-01-01   8.9
2001-01-01   9.1
2002-01-01   9.5
2003-01-01   9.6
2004-01-01   9.0
2005-01-01   8.1
2006-01-01   7.5
2007-01-01   7.5
2008-01-01   7.0
2009-01-01   8.9
2010-01-01   9.0
2011-01-01   8.1

Touhou data sources

TODO Suggestions: https://old.reddit.com/r/TOUHOUMUSIC/comments/19hh2m/touhou_music_databases_comprehensive_easily/ http://boards.4chan.org/jp/res/10559057

alternative sources

  • VGMdb Touhou entries, but it is substantially smaller, with metadata on <1389 albums
  • The open source alternative is ; looks like it has 190 albums, but ~90%+ link to VGMdb, so I'm not sure I want to include them (waste of effort, and if someone just copied over all of VGMdb a few years ago, it'll be badly misleading to any capture-recapture analysis of population size).
  • Touhouwiki.net is another source, with <1182 albums
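The capture-recapture concern can be made concrete with a Lincoln-Petersen sketch: with n1 albums on one list, n2 on another, and m appearing on both, the estimated total population is n1×n2/m. Using the two album counts above and a purely hypothetical overlap of m=600:

```shell
# Lincoln-Petersen estimate of the total Touhou album population from two overlapping lists.
# n1/n2 are the approximate counts above; m (the overlap) is a hypothetical placeholder.
awk -v n1=1389 -v n2=1182 -v m=600 'BEGIN { printf "%.0f\n", n1 * n2 / m }'
```

A copied-over database would inflate m toward min(n1, n2) and drag the estimate down toward the larger list's own size, which is exactly the misleading scenario worried about above.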

Arrange Circle Database

“東方音団録 ~ Arrange Circle Database ver.3.0”; homepage & release info (debate, chat), product page with j-subculture.com quoting a partial total of $15, PayPal and shipping outside Japan boosting it to ~$30! (Alternative proxies included Yokatta.)

An email on 2013-03-01 failed to elicit any reply by 27 May; I then requested a purchase on /r/TOUHOUMUSIC.

My request was filled on 2013-06-08 as an ISO and a ZIP file. The ISO file seemed to be broken: file just calls it ‘data’, and when I mount it as a loopback iso9660 filesystem, mount throws an error. I redownloaded and compared it, but the copies were identical. The good news is that the ZIP file seems to work fine. The data is a .accdb file in a subfolder, which turns out to be the latest Microsoft Access database format. Unfortunately, this format is almost entirely unsupported by anything on Linux (except for a Java library), but fortunately, an acquaintance with an Office 365 subscription re-exported the .accdb file as the older Access .mdb format (Microsoft Access database), which mdb-tools’s mdb-export successfully read and converted to a UTF-8 CSV.

The CSV contains ~1316 entries, corresponding to ~1316 albums with the respective circle name, URL, genre, possibly vocalist, and 2 more fields I cannot figure out due to lack of Japanese proficiency (but none seem to be release dates). The entries look like this:

$ mdb-tables Toho_arrange_circle_database-gwern.mdb
アレンジサークルリスト
$ mdb-export Toho_arrange_circle_database-gwern.mdb `mdb-tables Toho_arrange_circle_database-gwern.mdb` > Toho_arrange_circle_database-gwern.csv
$ head Toho_arrange_circle_database-gwern.csv
    ID,pin,サークル名,ふりがな,URL,ジャンル,Vocal,主な頒布CD,原曲アレンジの程度,一言,memo
    1,,"っ´Д`)っゼロ式の処刑場(っ´Д`)っゼロ式の家・DESTRUCTIVE ANGEL)","ぜろしきのしょけいじょう","http://shinzanzeroshiki.fc2web.com/","メタル","男","Crazy Trancy Ecstasy","原曲維持","重厚感あふれる荘厳なメタル。(Black)",
    2,,"α music","あるふぁみゅーじっく","http://www19.atpages.jp/tatu4/","オーケストラ、ピアノ",,"東方風水華月","原曲維持",,
    3,,"凸凹えんたーていめんとすたじお","でこぼこえんたーていめんとすたじお","http://3rd.geocities.jp/deko_boko_es/index/","リコーダー、ゲームミュージック",,"Electronic Magus","原曲維持","8ビットあり、リコーダー生演奏ありと多彩。(Black)",
    4,,"[4989]","しくはっく","http://4989mm.littlestar.jp/","ロック","男女","musick for me","原曲維持","スローなロック風アレンジにハイトーン男声ボーカル、エレクトロも混ざったインストも。ドラマCDあり。(Black)",
    5,,"[ kapparecords])","かっぱれこーず","http://www5f.biglobe.ne.jp/~kapparecords/","ハードロック","男","SCARLET FANTASIA","原曲維持","生演奏ギタードラムベースと男声ボーカルがんばれ。(Black)",
    6,,"<echo>PROJECT","えこぷろじぇくと","http://echoproject.3rin.net/","ダンス、ポップス、ロック","女","eclat:","原曲重視~維持","女性ボーカルを軸にしてジャンルは何でもアレンジ。(Black)",
    7,,"#039","しゃーぷさんきゅー","http://sharp039.web.fc2.com/",,,"EMERGENCE",,,
    8,,"#ゆうかりんちゃんねる","ゆうかりんちゃんねる","http://yuukach.web.fc2.com/","オーケストラ、クラシック、ピアノ、エレクトロ、ロック",,"ゴリラ人間のための華麗なる大幻想曲集","原曲維持","クラシカルな構成と管弦を散りばめたオーケストラ、クラシック風インスト。バイオリンの倍音の響きが印象的。(Black)",
    9,,"10-GALLON(Digit Smith)","てんがろん","http://10-gallon.net/","ロック、エレクトロ、ハードロック","女","悪魔城レミリア","原曲維持","原曲メロをミドルテンポなロックとエレクトロに乗せて。(Black)",
$ tail -1 Toho_arrange_circle_database-gwern.csv
    1316,,"侘助","わびすけ","http://ameblo.jp/wa-bi-su-ke/","エレクトロ、ロック","女","東方乙女椿","原曲維持","速めのエレクトロ、ロックアレンジが多め。女声ボーカル。(Black)",

I am a little surprised that there are only 1316 entries. Either I’ve overestimated their thoroughness or this is limited to a specific convention or something like that… Need to look into this more. This doesn’t include the track-level data I was hoping for, but a list of albums can still be useful for estimating completeness of capture.

Circle count

Crosscheck: number of circles at Comiket & Reitaisai? Circles self-classify by ‘genre code’, and this is recorded in the official Comiket catalogues, from which most (all?) of these numbers are drawn:

  • C87: 1756 https://media.8chan.co/2hu/src/1419976746668.png

  • C86: “Statistics taken from the C86 catalogue.” http://ascii.jp/elem/000/000/917/917701/002_976x637.jpg https://i.imgur.com/R0uXljZ.jpg ~1938 circles; later estimate, 1910 https://webcatalog-free.circle.ms/Circle?genreCode=241&day=2 & i2.kym-cdn.com; another estimate, 1918 https://media.8chan.co/2hu/src/1419976746668.png

  • C85: “Grey: C85 statistics” https://i.imgur.com/R0uXljZ.jpg 2258 / 2272 (http://www.crunchyroll.com/anime-news/2013/11/01-1/top-doujinshi-events-most-popular-by-the-numbers original: http://yaraon.blog109.fc2.com/blog-entry-19664.html) / 2246 (according to i2.kym-cdn.com)

  • C84: 2526 (according to i2.kym-cdn.com)

  • C83: 2492 (according to i2.kym-cdn.com)

  • C82: Touhou=2670 / 2694 (according to i2.kym-cdn.com)

  • C81: Touhou=2690 / 2656 (according to i2.kym-cdn.com)

  • C80: Touhou=2808 / 2774 (according to i2.kym-cdn.com) & http://d.hatena.ne.jp/myrmecoleon/20131101/1383333933

  • http://www.sankakucomplex.com/2008/08/02/touhou-takeover/ (https://www.google.com/search?num=100&q=touhou%20comiket%20circle%20genre%20site%3Asankakucomplex.com%20-site%3Awww.sankakucomplex.com%2Ftag%2F) original: https://web.archive.org/web/20110722080729/http://addb.jp/index.php?Diary%2F2008-07-30 :

  • C75-C79: (http://i2.kym-cdn.com/photos/images/original/000/866/132/075.jpg unknown source) C75: 1387+1356 (https://web.archive.org/web/20110812023434/http://addb.jp/index.php?Diary%2F2008-12-17) / C76: 1739 (http://d.hatena.ne.jp/myrmecoleon/20101230/1293717241) / C77: 2372 (d.hatena.ne.jp) / C78: 2416+2394 (d.hatena.ne.jp) / C79: 2774 (d.hatena.ne.jp)

  • C74: 885

  • C73: 793

  • C72: 558

  • C71: 574

  • C70: 366

  • C69: 232

  • C68: 229

  • C67: 98

  • C66: 50

  • C65: 7 / 39 (en.touhouwiki.net)

  • C64: 0 / 12 (en.touhouwiki.net)

  • C63: 0 / 1 (according to http://en.touhouwiki.net/wiki/Release_Timeline#2002; asked about the discrepancy: http://en.touhouwiki.net/index.php?title=Talk:Release_Timeline&curid=46413&diff=331144&oldid=262193)

  • Reitaisai (Hakurei Jinja Reitaisai / 博麗神社例大祭):

  • 2004-4-18 R1: 114

  • 2005-5-4 R2: 362

  • 2006-5-21 R3: 680

  • 2007-5-20 R4: 653

  • 2008-5-25 R5: 1086

  • http://thwiki.cc/%E4%BE%8B%E5%A4%A7%E7%A5%AD#.E5.8E.86.E5.B1.8A.E4.BF.A1.E6.81.AF :

  • 2009-3-8 R6: 2948

  • 2010-3-14 R7: 4050

  • 2011-5-8 R8: 4940 (see Wikipedia)

  • 2012-5-27 R9: 5058

  • 2013-5-26 R10: 5013

  • 2014-5-11 R11: 4312

  • 2015-5-10 R12: ?

  • http://www.comiket.co.jp/info-a/C77/C77CMKSymposiumPresentationEnglish.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080225.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080528.pdf

  • C76: all=>35000

  • C75: all=>35000

  • C73: all=>35000

  • C72: all=>35000

  • pg5 of CMKSymposiumPresentationEnglish: circles graph, C1-C76

  • pg6 of WhatIsEng080225: C1-C73

Torrent

Music source: Touhou lossy music collection v.15.2 (derived from the Touhou lossless music collection), 265.2GB of 44421 tracks from 4952 albums produced by <1,264 groups or “circles”.

$ find ~/torrent/Touhou\ lossy\ music\ collection/ -type f -name "*.mp3" | wc
  44421
$ ls torrent/Touhou\ lossy\ music\ collection/ | wc
   1264
$ ls torrent/Touhou\ lossy\ music\ collection/*/ | wc
   7477

File name, music length, and metadata year (if any) are extracted using exiftool:

Events seen in directory names:

  • 例大祭["", SP, SP2, 2-9]: annual Reitaisai
  • M3*: annual Media Mix Market, eg. http://polymetrica.wordpress.com/2009/10/09/things-i-am-excited-about-04-m324/
  • C[63-82]: semi-annual Comiket
  • サンクリ[28-50]: annual? Sunshine Creation http://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%AA%E3%82%A8%E3%82%A4%E3%82%B7%E3%83%A7%E3%83%B3_%28%E5%90%8C%E4%BA%BA%E5%8D%B3%E5%A3%B2%E4%BC%9A%29
  • 東方紅楼夢[?2-8]: annual Koromu
  • 月の宴?2-5: annual? Feast of the Month
  • 紅のひろば?2-6: semiannual Red Square
  • 東方不敗小町?2-6, SP, ぷちこまち: Komachi
  • 杜の奇跡[15-16]
  • 東方杜郷想[2-3]
  • 幺樂団カァニバル!?2-3
  • 東方幻楽祭[2]: semiannual
  • コミコミ[12-14]
  • FF[9-17]?
  • こみトレ[12-17]
  • COMIC1☆2-6
  • COMIC CITY大阪[63,73]
  • 恋魔理?2-3
  • 東方椰麟祭?2-3
  • 東方名華祭2

exiftool; json

Is exiftool’s length approximation trustworthy? Yes, it seems to be always within seconds of the full mp3info answer:

$ find "/home/gwern/torrent/Touhou lossy music collection/" -type f -name "*.mp3" \
        -exec mp3info -F -p "0:%02m:%02s " {} \; -exec exiftool -Duration {} \;
0:00:23 Duration                        : 23.12 s (approx)
0:04:28 Duration                        : 0:04:28 (approx)
0:02:44 Duration                        : 0:02:44 (approx)
0:04:52 Duration                        : 0:04:52 (approx)
0:03:56 Duration                        : 0:03:56 (approx)
0:01:44 Duration                        : 0:01:44 (approx)
0:04:34 Duration                        : 0:04:34 (approx)
0:02:30 Duration                        : 0:02:30 (approx)
0:03:02 Duration                        : 0:03:02 (approx)
0:03:43 Duration                        : 0:03:42 (approx)
0:03:23 Duration                        : 0:03:23 (approx)
0:03:11 Duration                        : 0:03:11 (approx)
0:04:22 Duration                        : 0:04:22 (approx)
0:03:13 Duration                        : 0:03:13 (approx)
0:04:04 Duration                        : 0:04:04 (approx)
0:03:58 Duration                        : 0:03:57 (approx)
0:05:24 Duration                        : 0:05:24 (approx)
0:04:17 Duration                        : 0:04:17 (approx)
0:03:14 Duration                        : 0:03:14 (approx)
0:01:59 Duration                        : 0:01:58 (approx)
0:03:21 Duration                        : 0:03:21 (approx)
0:02:35 Duration                        : 0:02:35 (approx)
0:04:20 Duration                        : 0:04:20 (approx)
...
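A mechanical version of that eyeball check, assuming the paired output above is piped in exactly the ‘H:MM:SS Duration : … (approx)’ layout shown, counts pairs differing by more than 1 second:

```shell
# Count lines where mp3info and exiftool durations differ by >1s.
# Field 1 is mp3info's H:MM:SS; field 4 is exiftool's value (either "23.12" seconds or H:MM:SS).
awk '{ split($1, a, ":"); s1 = a[1]*3600 + a[2]*60 + a[3]
       split($4, b, ":")
       s2 = (b[3] == "") ? b[1] : b[1]*3600 + b[2]*60 + b[3]
       d = s1 - s2; if (d < 0) d = -d
       if (d > 1) bad++ }
     END { print bad + 0 }'
```

On the sample above it prints 0, consistent with the discrepancies all being sub-second rounding.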
# generate and parse and cleanup data
#
# takes ~30m:
# R
# system("exiftool -extension mp3 -json -forcePrint
#           -Title -Year -Album -Artist -Duration -Genre -Track -Directory -FileName -FileSize -AudioBitrate
#           ~/torrent/Touhou\\ lossy\\ music\\ collection/*/* > ~/touhou.json")
library(rjson)
# download from https://www.gwern.net/docs/touhou/2013-torrent.json.xz and decompress with xz
json_data <- fromJSON(paste(readLines("2013-gwern-touhoutorrent.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=12, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
touhou$Directory <- sub("/home/gwern/torrent/Touhou lossy music collection/", "",
                         as.character(touhou$Directory))
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA
touhou$Year <- as.integer(as.character(touhou$Year))
# torrent doesn't cover 2013 music, and music predating the PC-98 games doesn't exist...
touhou$Year[touhou$Year<1990] <- NA
touhou$Year[touhou$Year>2012] <- NA
# Genre is "None" or " "? both useless and false (thanks, tagger); so it goes too:
touhou$Genre[touhou$Genre=="None"] <- NA
touhou$Genre[touhou$Genre==" "] <- NA
# turn the track lengths and bitrates into usable numbers on a common scale (seconds and MBs, respectively)
touhou$Length <- gsub(" \\(approx\\)","",as.character(touhou$Length))
touhou$AudioBitrate <- as.integer(sub(" kbps","",as.character(touhou$AudioBitrate)))
# exiftool leaves us "16 s"; if so, strip the " s" and turn it into an integer
# else, eg. "0:04:37"; split on colon,
# multiply hour by 3600 seconds, minutes=60 each, seconds=seconds; and sum it
interval <- function(x) { if (!is.na(x)) { if (grepl(" s",x)) as.integer(sub(" s","",x))
                                           else { y <- unlist(strsplit(x, ":"));
                                                  as.integer(y[[1]])*3600 + as.integer(y[[2]])*60 + as.integer(y[[3]]); }
                                                  }
                          else NA
                          }
touhou$Length <- sapply(touhou$Length,interval)
filesize <- function(x) { if (grepl(" kB",x)) (as.integer(sub(" kB","",x))/1000) else as.integer(sub(" MB","",x))}
touhou$FileSize <- sapply(touhou$FileSize, filesize)
# Serious work: turn the encoded information in Directory into usable columns. Not for the faint of heart.
#
# The Directory column looks like "[twith1450]/2009.03.08 TOHOMOHO [例大祭6]"
# The schema here is "[circle]/eventDate album [event]"
#
# "[Angelic Quasar]/2006.01.29 [AQSH-0003] Racial Ethnic Nation"
# "[Alstroemeria Records]/[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
# "[Aqua Style/ひえろぐらふ]/2010.05.24 [AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
brackets <- function(b) sub("\\]","", sub("\\[","",b))
# easy first step: parse out the leading group/circle (always there, terminated by forward-slash) as new column
touhou$Circle <- sapply(touhou$Directory, function(x) brackets(unlist(strsplit(as.character(x), "/"))[1]))
# destructively update by removing the group/circle, to make the next step easier
# this makes Directory looks like "2009.06.07 [PAER-0007] #01 -LILITH- [東方幻楽祭2]"
# or "[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
touhou$Directory <- sapply(touhou$Directory, function(x) unlist(strsplit(as.character(x), "/"))[2])
touhou$Date <- as.Date(sapply(touhou$Directory, function(x) substring(x, 1, 10)), format="%Y.%m.%d")
# and like before, we strip the event date that we've parsed out, leaving eg. "ピアノのための東方小品集 Op.1-1 [御射宮司祭]"
# or "月遊 [例大祭8]" or "[AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
touhou$Directory <- sapply(touhou$Directory, function(x) substring(x, 12))
# extracting the next parameter, the event the album was released at, is harder still
library(stringr)
# if the directory does not end in a right-bracket, there's no event info and we should bomb out
# else, grab w/regexp last pair of brackets with a space before (excludes any album numbering schemes) & trim
# that didn't work? then it must be one of the directories where there's no space before the bracketed event, retry without leading space
touhou$Event <- sapply(touhou$Directory, function(x) { if (str_sub(x,start=-1) == "]") { res <- brackets(unlist(str_split(x, " \\["))[2]); if (!is.na(res)) res else brackets(unlist(str_split(x, "\\["))[2]) } else x})
# if you examine the Event column, it's full of wrong entries. I have made a list of 19 event-prefixes (I hope), which
# we'll use as a whitelist by erasing anything which lacks all of the 19 prefixes.
isPrefix <- function(x,y) grepl(paste0("^",x), y)
events <- c("例大祭","M3","C","サンクリ","東方紅楼夢","月の宴","紅のひろば","東方不敗小町","杜の奇跡","東方杜郷想",
            "幺樂団カァニバル!","東方幻楽祭","コミコミ","FF","こみトレ","COMIC1","COMIC CITY大阪","恋魔理","東方椰麟祭","東方名華祭")
touhou$Event <- sapply(as.character(touhou$Event),
                       function(target) if (sum(sapply(events, function(e) isPrefix(e,target))) != 0) target else NA)
# this whitelist covers almost the entire sample, so I think it works well:
## sum(!is.na(touhou$Event))
## # [1] 39190
## length(touhou$Event)
## # [1] 41866
#
# one final thing, since (almost) all directories had Dates while not all files had Years; overwrite any missing Years
# based on the Date we just extracted
touhou$Year <- as.integer(format(touhou$Date, "%Y"))
touhou$Directory <- NULL # clean up (rm() cannot drop a data.frame column)
# escape with the loot:
write.csv(touhou, file="2013-gwern-touhoumusic-torrent.csv", row.names=FALSE)

VGMdb

The Touhou project page turns out to be incomplete: each entry had to be manually annotated as related to Touhou. I was pointed to a search query which turned up many more results by looking for any page with the string “Touhou” in the “games” field.

The VGMdb administrators kindly gave me read-only access to their MySQL databases. I grabbed the entirety of the tables vgmdb_albums and vgmdb_tracks from the main VGMdb database, and exported them as 2 comma-separated CSV files, renamed 2013-vgmdb-albums.csv and 2013-vgmdb-tracks.csv. Before loading the exports, I had to delete all escaped quotes, since the default R CSV parsing doesn’t handle them. Each track row is 1 track with an album ID, so to turn each track record/row into an equivalent of the torrent rows, I need to fill in the album-level fields from the album table.
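The quote-stripping step can be sketched as follows (filenames as above; this assumes the dump escapes quotes as \" and writes cleaned copies rather than editing in place):

```shell
# Strip backslash-escaped quotes that read.csv() chokes on, keeping the originals.
for f in 2013-vgmdb-albums.csv 2013-vgmdb-tracks.csv; do
    if [ -f "$f" ]; then sed 's/\\"//g' "$f" > "cleaned-$f"; fi
done
```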

albums <- read.csv("https://www.gwern.net/docs/touhou/2013-vgmdb.csv")
albums <- with(albums,
           data.frame(albumid,reldate,publisher,game,albumtitles))
albums <- albums[grepl("ouhou", albums$game),]; albums$game <- NULL

tracks <- read.csv("2013-vgmdb-tracks.csv")
tracks$tracklistid <- tracks$trackid <- tracks$disctitle <- tracks$disc <- NULL
tracks$length[tracks$length==0] <- NA # 0 is the default in the VGMdb schema
tracks <- tracks[tracks$albumid %in% albums$albumid,]
touhou <- merge(tracks, albums)
touhou$albumid <- NULL; rm(albums, tracks)
# deal with the 41 dates with the format "2005-09-00" (the 0th months or days are not real...)
touhou$date <- as.Date(sub("-00","-01",as.character(touhou$reldate)))
touhou$reldate <- NULL
touhou$year <- as.integer(format(touhou$date, "%Y"))
# upcase, rearrange to torrent's order
colnames(touhou) <- c("Track","Title","Length","Circle","Album","Date","Year")
touhou <- touhou[c(2,7,5,3,1,4,6)]
write.csv(touhou, file="2013-vgmdb-touhou.csv", row.names=FALSE)

touhouwiki.net

Personal downloads

4chan /jp/ C83 threads

A loose group of users on the /jp/ subforum collaborates each Comiket to upload and distribute doujin manga, games, and music released at that Comiket; some are uploaded by Comiket attendees, some are bought from resellers like , and many files are harvested from Japanese P2P filesharing networks like //. I compiled a list of ~400 files from the /r/TouhouMusic C83 thread (principally from the 4chan links) & the blog All Doujin Music and gradually downloaded them from January to March 2013. After dead links, I was left with 400-500 files. Many are not music, or even Touhou-related, so I hand-filtered albums, looking for signs of being Touhou doujin works (credits to ZUN, Touhou characters in the artwork, themes I recognized as Touhou, etc.); when I was not sure, I erred on the side of exclusion. The final compilation yielded 3503 files (evenly split: 1776 Touhou vs 1728 “other”), with 953 Touhou music files.

# exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory
# -FileName -FileSize -NominalBitrate -Date -recurse
# ~/c83/touhou/ ~/c83/touhou/*/** ~/c83/touhou/*/*/** ~/c83/touhou/*/*/*/** > ~/2013-c83-downloads.json

library(rjson)
# parse the JSON generated by the exiftool call above
json_data <- fromJSON(paste(readLines("2013-c83-downloads.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=13, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate", "Date")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
for (filter in c("/home/gwern/c83/touhou/", "\\[touhou.vnsharing.net\\]", " \\(320K\\+BK\\)", "/mp3",
                 " MP3v0", " v0", " \\(flac\\+scans\\)", " \\(128K\\)", " \\(V0\\)", " \\(320\\)",
                 " \\(mp3 320\\)", " \\(v0\\+jpg\\)"))
 { touhou$Directory <- sub(filter, "", as.character(touhou$Directory)) }
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA

Playback length:

find c83/touhou/ -name "*.ogg" -exec ogginfo {} \;|fgrep "Playback length"

4chan /jp/ C84 threads

591 Touhou music files:

/docs/touhou/2013-c84-downloads.json

4chan /jp/ C85 threads

449 Touhou music files:

/docs/touhou/2013-c85-download.json

4chan /jp/ Reitaisai 10 threads

Similarly to the above, drawing on the /r/TouhouMusic discussion and manually pruning duplicates & non-Touhou files, down to 491 music files.

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads.json

Reitaisai 10 torrent

In May & June 2013, an anonymous person compiled 67 albums and released two torrents of Reitaisai 10 albums (vol. 1, vol. 2).

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads-torrent.json

Analysis

touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
                   colClasses=c("character", "integer", "factor", "character", "integer", "factor",
                                "character", "character", "numeric", "integer", "factor", "Date"))

# do stuff with the data
# general correlations
t <- data.frame(touhou$Year, touhou$Length, touhou$FileSize, touhou$AudioBitrate)
cor(t,use="pairwise.complete.obs")
#                     touhou.Year touhou.Length touhou.FileSize touhou.AudioBitrate
# touhou.Year
# touhou.Length          -0.01091
# touhou.FileSize         0.04484       0.93915
# touhou.AudioBitrate     0.19188       0.11091         0.35499

# test the correlation between higher bitrate and larger files:
cor.test(touhou$FileSize, touhou$AudioBitrate)

# the genre metadata is useless!
sort(table(touhou$Genre), decreasing=TRUE)

# boxplot avg length per year
plot(touhou$Length ~ factor(touhou$Year))

Economics modeling:

jpn <- read.csv(stdin(),header=TRUE)
DATE,VALUE
2000-01-01,8.9
2001-01-01,9.1
2002-01-01,9.5
2003-01-01,9.6
2004-01-01,9.0
2005-01-01,8.1
2006-01-01,7.5
2007-01-01,7.5
2008-01-01,7.0
2009-01-01,8.9
2010-01-01,9.0
2011-01-01,8.1

# number of works per year does not correlate:
cor.test(jpn$VALUE[3:12], table(touhou$Year)[1:10])
    Pearson's product-moment correlation

data:  jpn$VALUE[3:12] and table(touhou$Year)[1:10]
t = -0.3053, df = 8, p-value = 0.768
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
 -0.6903  0.5602
sample estimates:
    cor
-0.1073

model <- lm(table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12]); summary(model)
Call:
lm(formula = table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
    Min      1Q  Median      3Q     Max
-2128.6  -716.4    52.4   632.6  2253.4

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -2775278     342171   -8.11  8.3e-05
c(2002:2011)        1379        170    8.13  8.2e-05
jpn$VALUE[3:12]     1450        567    2.56    0.038

Residual standard error: 1400 on 7 degrees of freedom
Multiple R-squared: 0.905,  Adjusted R-squared: 0.878
F-statistic: 33.5 on 2 and 7 DF,  p-value: 0.00026



logModel <- lm(log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])
summary(logModel)
Call:
lm(formula = log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
   Min     1Q Median     3Q    Max
-1.218 -0.632  0.108  0.551  0.982

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1295.060    207.535   -6.24  0.00043
c(2002:2011)        0.652      0.103    6.34  0.00039
jpn$VALUE[3:12]    -0.725      0.344   -2.11  0.07301

Residual standard error: 0.849 on 7 degrees of freedom
Multiple R-squared: 0.906,  Adjusted R-squared: 0.879
F-statistic: 33.8 on 2 and 7 DF,  p-value: 0.000253

plot(c(2002:2011),table(touhou$Year)[1:10])
points(c(2002:2011),exp(predict(logModel)),type='l',col='blue')

Growth over time

How fast is the corpus of Touhou music growing?

Constant growth model: the first game was released in 1996, no? So that gives 17 years to accumulate 1.26TB, or 1,260GB, or 74.1GB per year. The screenshot shows a download at 0kb/s, which is not useful, but it says 2640 days left, so we can estimate that he’s downloading at 0.47GB per day (1260/2640); over a year, 0.47GB/day comes to ~174GB, which is 2.35× faster than the 74GB per year. So at that annual increase, OP is not doomed and can in fact catch up.
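That back-of-the-envelope arithmetic, checkable in one line (all figures from the paragraph above):

```shell
# annual accumulation (GB/yr), download rate (GB/day), download (GB/yr), and the ratio
awk 'BEGIN { total = 1260; years = 17; days_left = 2640
             ann = total / years; rate = total / days_left; yearly = rate * 365
             printf "%.1f %.3f %.0f %.2f\n", ann, rate, yearly, yearly / ann }'
```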

Exponential growth model: a little trickier, since we can’t force a formula just from the cumulative total and elapsed time. I need more data. So using my 2012 Touhou Lossy Torrent data, I can try to regress an exponential against the annual count… but wait! The amount of music does not seem to be increasing exponentially!

touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
                    colClasses=c("character", "integer", "factor", "character",
                                         "integer", "factor", "character",
                                         "character", "numeric", "integer",
                                         "factor", "Date"))
summary(touhou$Year)
perYear <- table(touhou$Year); perYear; plot(perYear)
#
#  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
#    13    23   255  1241  2070  2599  5073 10278  9765  7494  2999

Graph: http://i.imgur.com/23fMA5c.png

Looks like Touhou music’s growth peaked in 2009; this might reflect the torrent’s incompleteness, except the torrent is from 2012, and you’d expect coverage of 2010 or 2011 to be pretty good by that point. So the growth of the torrent overall looks more like a sigmoid or log:

runningTotal <- cumsum(table(touhou$Year)); runningTotal; plot(runningTotal)
# 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
#   13    36   291  1532  3602  6201 11274 21552 31317 38811 41810

http://i.imgur.com/Lv9vHrZ.png
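The slowdown also shows up directly in year-over-year growth ratios of the per-year counts (2012’s count may be incomplete in the torrent); anything below 1.00 is outright shrinkage:

```shell
# Year-over-year growth ratios for the 2002-2012 track counts tabulated above.
printf '13 23 255 1241 2070 2599 5073 10278 9765 7494 2999\n' |
awk '{ for (i = 2; i <= NF; i++) printf "%s%.2f", (i > 2 ? " " : ""), $i / $(i-1)
       print "" }'
```

The ratios fall steadily from the early double-digit growth and drop below 1 after 2009.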

So probably it’d be better to ask: ‘if 2012-rate growth continues, what is the ratio to his download speed?’

2012 added 19.5GB to the torrent:

## FileSize is in megabytes, so gigabytes
sum(touhou[touhou$Year==2012,]$FileSize, na.rm=TRUE) / 1000
# [1] 19.5
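Combining this with the earlier download-rate estimate (still back-of-the-envelope): downloading ~174GB/year against ~19.5GB/year of new music gives a ratio of roughly 9, so under the 2012-rate assumption, catching up remains easy:

```shell
# ratio of estimated download per year (1260GB / 2640 days * 365) to 2012's ~19.5GB of additions
awk 'BEGIN { printf "%.1f\n", (1260 / 2640) * 365 / 19.5 }'
```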

Appendix

touhouwiki.net scraping code

Uses the TagSoup and split packages; emits CSV to standard output. It is a pile of kludges which I am a little ashamed to post publicly, and may not work for you.

import Text.HTML.TagSoup (fromAttrib, fromTagText, isTagOpen, isTagText,
                          parseTags, (~/=), Tag(TagComment, TagOpen,TagText))
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isInfixOf, isPrefixOf, nub, sort, stripPrefix)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)
import Data.Maybe (fromMaybe, isJust, listToMaybe)
import Data.List.Split (keepDelimsL, split, whenElt)
import Control.Monad (join, unless)

main :: IO ()
main = do albums1 <- getAlbums "http://en.touhouwiki.net/wiki/List_by_Groups"
          albums2 <- getAlbums "http://en.touhouwiki.net/wiki/List_of_Old_Touhou_Arrangement_Groups"
          let albumURLs = map ("http://en.touhouwiki.net"++) $ nub $ sort $ albums1 ++ albums2
          putStrLn header
          mapM_ getAlbum albumURLs

getAlbums :: String -> IO [String]
getAlbums index = do touhou <- openURL index
                     return $ drop 22 $ reverse [link | (TagOpen "a" ((ref, link):_)) <- parseTags touhou,
                                                        ref=="href",
                                                        "/wiki/" `isPrefixOf` link,
                                                        let fltr x = not (x `isInfixOf` link),
                                                        fltr "_Groups", fltr "Touhou_Wiki:", fltr "Special:",
                                                        fltr "Template:", fltr "Category:", fltr "Talk:"]

type Album = [Track]
data Track = Track { title :: String, year :: Int, date :: String,artist :: String,
                     album :: String, event :: String, circle :: String, duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",event="",circle="",artist="",duration=Nothing,track=0}

header :: String
header = "Title,Year,Date,Album,Event,Circle,Duration,Track"
convert :: Track -> String
convert t = "\"" ++ dequote (title t) ++ "\"," ++ show (year t) ++ "," ++ show (date t) ++ ",\"" ++ dequote (album t) ++ "\",\"" ++
             event t ++ "\",\"" ++ circle t ++ "\"," ++ maybe "" show (duration t) ++ "," ++ show (track t)

-- You are not expected to understand this.
getAlbum :: String -> IO ()
getAlbum a = do t <- fmap parseTags $ openURL a
                --TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2007-03-23"
                let dt = let target = dropWhile (TagText "Released" ~/=) t
                         in if null target then ""
                            else deleteParens $ fromTagText $ head $ tail $ filter isTagText target
                -- TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2009/02/08 (",
                -- TagOpen "a" [("href","/wiki/Category:Sunshine_Creation_42"),("title","Category:Sunshine Creation 42")],
                -- TagText "Sunshine Creation 42",TagClose "a",TagText ")",TagClose "td"
                let evnt = let stream = filter isTagText $ dropWhile (TagText "Released" ~/=) t
                           in if '(' == last (fromTagText $ head $ take 5 $ drop 1 stream) -- )
                              then fromTagText (filter isTagText (dropWhile (TagText "Released" ~/=) t) !! 2)
                              else ""
                -- "2007-03-23"
                let yr = if null dt then 0 else read (take 4 dt)::Int
                -- TagText "Album by CODE ZTS LABEL"
                let hasCrcl = [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
                unless (null hasCrcl || (yr==0 && null dt)) $ do
                    let crcl = lookForCircle t
                    -- TagText "Selfregards2 - Touhou Wiki - Characters, games, locations, and more"
                    let albm = reverse $ drop 55 $ reverse $ head [al | TagText al <- t,
                                 " - Touhou Wiki - Characters, games, locations, and more" `isInfixOf` al]
                    let dflt = empty { date = dt, year = yr, circle = crcl, album = albm, event = evnt }
                    let table = filter (\x -> not ("Disc" `isInfixOf` x || " CD" `isInfixOf` x)) $
                                 filter (not . all isSpace) $
                                  map (trim . fromTagText) $ filter isTagText $
                                   drop 5 $ takeWhile (TagOpen "table"
                                     [("class","navbox"),("cellspacing","0"),
                                      ("style","background:#FFFBEE;border-color:#A8A077;")] ~/=) $
                                       takeWhile (TagComment "" ~/=) $
                                        dropWhile (TagOpen "span"
                                         [("class","mw-headline"),("id","Tracks")] ~/=) t
                    let tracks = filter (not . null) $
                                  split (keepDelimsL $ whenElt
                                   (\x -> length x==3 && "." `isInfixOf` x &&  isJust(maybeRead x :: Maybe Int))) table
                    mapM_ (putStrLn . convert . trackToTrack dflt) tracks

-- TagText "Album by ",TagOpen "a"
-- [("href","/wiki/ALiCE%27S_EMOTiON"),("title","ALiCE'S EMOTiON")],TagText
-- "ALiCE'S EMOTiON",TagClose "a"
lookForCircle :: [Tag String] -> String
lookForCircle t = let c = head [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
                      res = if c == "Album by " then (let tg = (dropWhile (TagText "Album by " ~/=) t !! 2)
                        in if isTagText tg then fromTagText tg else
                            (if isTagOpen tg then fromAttrib "title" tg else "") ) else drop 9 c
                      in if "(page does not exist)" `isInfixOf` res then takeWhile (/='(') res else res -- )

-- ["01.","The mom","(04:07)","arrangement: ZTS","composition: ZTS","original title: The mom","source: Parhelia"]
trackToTrack :: Track -> [String] -> Track
trackToTrack tr t = tr { track = fromMaybe 0 (maybeRead (head t) :: Maybe Int),
                         title = t !! 1,
                         duration = if length t >=3 then Just (timeConverter $ deleteParens (t !! 2)) else Nothing,
                         artist =  lookForAnArtist t }

lookForAnArtist :: [String] -> String
lookForAnArtist t = let targets = dropWhile (\x -> not ("arrangement:" `isPrefixOf` x || "composition:" `isPrefixOf` x)) t
                        target
                            | null targets = ""
                            | last (head targets) == ':' = head targets ++ (targets !! 1)
                            | otherwise = head targets
                    in trim $ fromMaybe "" $ listToMaybe -- accept whichever credit prefix actually matches
                        [s | Just s <- [stripPrefix "arrangement:" target, stripPrefix "composition:" target]]

-- utility functions
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)
deleteParens, trim, dequote :: String -> String
deleteParens = trim . filter (\x -> x /= '(' && x /= ')')
trim = reverse . dropWhile isSpace . reverse . dropWhile isSpace
dequote = map (\x -> if x=='"' then '\'' else x) -- "')
timeConverter :: String -> Int
timeConverter n = let (m,s) = break (==':') n
                      m' = maybeRead m :: Maybe Int
                      s' = maybeRead (drop 1 s) :: Maybe Int
                  in (fromMaybe 0 m' * 60) + fromMaybe 0 s'
maybeRead :: Read a => String -> Maybe a
maybeRead = fmap fst . listToMaybe . reads

VGMdb scraping code

The following is a buggy program for scraping Touhou albums from VGMdb; it works on a limited subset of album pages but has an unknown number of fatal bugs. I abandoned it once I was offered read-only database access, which is what I actually used to get my VGMdb data; it is preserved here in case I ever need to go back.

import Text.HTML.TagSoup (fromTagText, isTagOpenName, isTagText, Tag(TagOpen,TagText), parseTags)
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isPrefixOf, sort)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)

main :: IO ()
main = do albumsURLs <- getAlbums
          albums <- mapM openURL (sort albumsURLs)
          let metadata = map toAlbum albums
          writeFile "vgmdb.csv" $ unlines (header : concatMap (map convert) metadata)

type Album = [Track]
data Track = Track { title :: String,
                     year :: Int,
                     date :: String,
                     album :: String,
                     circle :: String,
                     duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",circle="",duration=Nothing,track=0}
header :: String
header = "Title,Year,Date,Album,Circle,Duration,Track"
convert :: Track -> String
convert t = "\"" ++ title t ++ "\"," ++ show(year t) ++ "," ++ show (date t) ++ ",\"" ++
              album t ++ "\",\"" ++ circle t ++ "\"," ++ maybe "" show(duration t) ++ "," ++ show (track t)

-- example album link: 'TagOpen "a" [("class","albumtitle album-doujin"),
--                                   ("href","http://vgmdb.net/album/36901"),
--                                   ("title","Majo to Ringo to Samayou Kimi to")]'
getAlbums :: IO [String]
getAlbums = do touhou <- openURL "http://vgmdb.net/product/9"
               return [snd(atts !! 1) | TagOpen "a" atts <- parseTags touhou, snd(head atts)=="albumtitle album-doujin"]

-- need 'decodeString' to deal with Japanese glyphs; see http://stackoverflow.com/questions/10558003/how-to-get-utf8-rss-feed
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)

toAlbum :: String -> Album
toAlbum page = let tags = parseTags page
                   (yr,dt) = extractDate tags
                   albm = extractAlbum tags
                   crcl = extractCircle tags
                   files = extractMusic tags
               in map (\t -> Track {title = title t, year = yr, date = dt,
                                    album = albm, circle = crcl, duration = duration t,
                                    track = track t}) files


-- TagOpen "a" [("title","View albums released on Dec 30, 2011"),("href","/db/calendar.php?year=2011&month=12#20111230")]
extractDate :: [Tag String] -> (Int,String)
extractDate t = let (a:b:_) = map snd $ head [atts | TagOpen "a" atts <- t,
                                       let ttle = snd(head atts),
                                       "View albums released on " `isPrefixOf` ttle]
        in (read(reverse $ take 4 $ reverse a)::Int,
           tail$ snd $ break (=='#') b)

-- TagOpen "title" [],TagText "Gensou Rashinban - VGMdb",TagClose "title",
extractAlbum :: [Tag String] -> String
extractAlbum t = (\(TagText x) -> reverse $ drop 8 $ reverse x) (dropWhile (not . isTagOpenName "title") t !! 1)

-- [TagText "Published by",TagClose "b",TagClose "span",TagClose "td",TagText "\r\n",
-- TagOpen "td" [],TagOpen "a" [("href","/org/217")],TagOpen "span"
-- [("class","productname"),("lang","en"),("style","display:inline")],TagText "PopKorn"]
extractCircle :: [Tag String] -> String
extractCircle t = fromTagText(head (drop 8 (dropWhile (\x -> not(isTagText x && (fromTagText x)=="Published by")) t)))

extractMusic :: [Tag String] -> [Track]
extractMusic t = let tracks = filter (not . all isSpace) $
                               drop 1 $ dropWhile (/= "Disc 1") $ -- 'drop 1' rather than 'tail': no crash when "Disc 1" is absent
                                map fromTagText $ filter isTagText $
                                 takeWhile (\y -> not(isTagText y && (fromTagText y)=="Disc length")) $
                                  dropWhile (\x -> not(isTagText x && (fromTagText x)=="Tracklist")) t
                   in if length tracks `rem` 3 == 0 then threezip tracks else twozip tracks
       where
       twozip,threezip :: [String] -> [Track]
       threezip [] = []
       threezip (a:b:c:d) = empty {title=b,duration=Just (timeConverter c),track=read a} : threezip d
       threezip _ = []
       twozip [] = []
       twozip (a:b:d) = empty {title=b,duration=Nothing,track=read a} : twozip d
       twozip _ = []
       timeConverter :: String -> Int
       timeConverter n = let (m,s) = break (==':') n
                             -- tolerate malformed or missing "mm:ss" fields instead of crashing on 'read'
                             parse x = case reads x :: [(Int,String)] of
                                         [(i,_)] -> i
                                         _       -> 0
                         in (parse m * 60) + parse (drop 1 s)
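Both scrapers emit CSVs with a Year column and a per-track Duration in seconds, which is all the unemployment comparison needs. A sketch of the per-year tally in shell, run on a toy file in the VGMdb scraper's output format (the naive comma-splitting would misparse quoted titles containing commas, so a real run would want a proper CSV parser):

```shell
# Toy sample in the VGMdb scraper's output format: Year is field 2,
# Duration (seconds) is field 6. A real run would use the scraped vgmdb.csv.
cat > vgmdb.csv <<'EOF'
Title,Year,Date,Album,Circle,Duration,Track
"Track A",2007,"2007-03-23","Some Album","Some Circle",247,1
"Track B",2007,"2007-03-23","Some Album","Some Circle",100,2
"Track C",2009,"2009-02-08","Other Album","Other Circle",60,1
EOF
# Total seconds of music produced per year (skip header, skip missing durations):
awk -F, 'NR > 1 && $6 != "" { total[$2] += $6 }
         END { for (y in total) print y, total[y] }' vgmdb.csv | sort
# prints: 2007 347
#         2009 60
```

The yearly totals produced this way could then be merged by year against the FRED youth-unemployment series for the actual correlation in R.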