Touhou music by the numbers

Collect music metadata and look for patterns
statistics, sociology, Haskell, shell, R
2013-02-28 to 2015-02-28; abandoned; certainty: possible; importance: 1


Idea: correlate Touhou music production against Japanese youth unemployment: does the total production of music, as measured in seconds, increase with unemployment?

Opposite view: recessions dent production (perhaps because people are working harder and so have less free time, even if other people are unemployed?): http://www.gamesetwatch.com/2009/12/sound_current_yokohamas_mediam.php

While the turnout at M3 remains strong, at the same time an economic recession cannot help but touch a community whose activities rely on having free time. Furthermore while previously many hobbyists dreamed of someday breaking into the industry, more recently many also fear that game companies will begin cracking down on unlicensed tributes.

Data

Unemployment data source

Used the FRED Adjusted Unemployment Rate for Youth [15-24yo] in Japan (JPNURYNAA), downloaded as CSV: annual percentage, 2000-2011.

jpn <- read.table(stdin(),header=TRUE)
      DATE VALUE
2000-01-01   8.9
2001-01-01   9.1
2002-01-01   9.5
2003-01-01   9.6
2004-01-01   9.0
2005-01-01   8.1
2006-01-01   7.5
2007-01-01   7.5
2008-01-01   7.0
2009-01-01   8.9
2010-01-01   9.0
2011-01-01   8.1

Touhou data sources

TODO Suggestions: https://old.reddit.com/r/TOUHOUMUSIC/comments/19hh2m/touhou_music_databases_comprehensive_easily/ http://boards.4chan.org/jp/res/10559057

Alternative sources:

  • VGMdb Touhou entries, but it is substantially smaller, with metadata on <1389 albums
  • The open source alternative is ; looks like it has 190 albums, but 90%+ of them link to VGMdb, so I’m not sure I want to include them (waste of effort, and if someone just copied over all of VGMdb a few years ago, it’ll be badly misleading to any capture-recapture analysis of population size).
  • Touhouwiki.net is another source, with <1182 albums
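
The capture-recapture idea mentioned above can be sketched with the simplest two-source estimator (Lincoln-Petersen): if two independently compiled sources list n1 and n2 albums and m albums appear in both, the population is estimated as n1*n2/m. This is only an illustration, not an analysis performed here: the function name is made up, the overlap of 600 is a purely hypothetical placeholder, and 1389 and 1182 are the upper bounds quoted in the list. The independence assumption is exactly what a bulk copy from VGMdb would violate.

```shell
# Hypothetical helper: Lincoln-Petersen population estimate.
#   $1 = albums in source A, $2 = albums in source B, $3 = albums found in both
lincoln_petersen() {
    awk -v n1="$1" -v n2="$2" -v m="$3" 'BEGIN {
        if (m <= 0) { print "NA"; exit }   # no overlap: the estimate is undefined
        printf "%.0f\n", n1 * n2 / m
    }'
}

# 600 shared albums is an invented number, used only to show the call:
lincoln_petersen 1389 1182 600   # prints 2736
```

A real analysis would first have to match album titles across sources, which is the hard part and is not attempted here.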

Arrange Circle Database

“東方音団録 ~ Arrange Circle Database ver.3.0”: homepage & release info (debate, chat); product page. j-subculture.com quoted a partial total of $15, with Paypal and shipping outside Japan boosting that to ~$30! (Alternative proxies included Yokatta.)

An email on 2013-03-01 had elicited no reply by 2013-05-27, so I then requested a purchase on /r/TOUHOUMUSIC.

My request was filled on 2013-06-08 with an ISO and a ZIP file. The ISO seemed to be broken: file just calls it ‘data’, and when I mount it as a loopback iso9660 image, mount throws an error. I redownloaded it and compared the copies, but they were identical. The good news is that the ZIP file works fine. The data is in a .accdb file in a subfolder, which turns out to be the latest Microsoft Access database format. Unfortunately, this format is almost entirely unsupported by anything on Linux (except for a Java library); fortunately, an acquaintance had an Office 365 subscription and re-exported the .accdb file in the older Access .mdb format, which mdb-tools’s mdb-export successfully read and converted to UTF-8 CSV.

The CSV contains 1316 entries (the table name, アレンジサークルリスト, means “arrange circle list”), each giving a circle name, its kana reading, URL, genre, vocalist (if any), a main distributed CD (主な頒布CD), the degree of fidelity to the original songs (原曲アレンジの程度), and a one-line comment (一言); none of the fields is a release date. The entries look like this:

$ mdb-tables Toho_arrange_circle_database-gwern.mdb
アレンジサークルリスト
$ mdb-export Toho_arrange_circle_database-gwern.mdb `mdb-tables Toho_arrange_circle_database-gwern.mdb` > Toho_arrange_circle_database-gwern.csv
$ head Toho_arrange_circle_database-gwern.csv
    ID,pin,サークル名,ふりがな,URL,ジャンル,Vocal,主な頒布CD,原曲アレンジの程度,一言,memo
    1,,"っ´Д`)っゼロ式の処刑場(っ´Д`)っゼロ式の家・DESTRUCTIVE ANGEL)","ぜろしきのしょけいじょう","http://shinzanzeroshiki.fc2web.com/","メタル","男","Crazy Trancy Ecstasy","原曲維持","重厚感あふれる荘厳なメタル。(Black)",
    2,,"α music","あるふぁみゅーじっく","http://www19.atpages.jp/tatu4/","オーケストラ、ピアノ",,"東方風水華月","原曲維持",,
    3,,"凸凹えんたーていめんとすたじお","でこぼこえんたーていめんとすたじお","http://3rd.geocities.jp/deko_boko_es/index/","リコーダー、ゲームミュージック",,"Electronic Magus","原曲維持","8ビットあり、リコーダー生演奏ありと多彩。(Black)",
    4,,"[4989]","しくはっく","http://4989mm.littlestar.jp/","ロック","男女","musick for me","原曲維持","スローなロック風アレンジにハイトーン男声ボーカル、エレクトロも混ざったインストも。ドラマCDあり。(Black)",
    5,,"[ kapparecords])","かっぱれこーず","http://www5f.biglobe.ne.jp/~kapparecords/","ハードロック","男","SCARLET FANTASIA","原曲維持","生演奏ギタードラムベースと男声ボーカルがんばれ。(Black)",
    6,,"<echo>PROJECT","えこぷろじぇくと","http://echoproject.3rin.net/","ダンス、ポップス、ロック","女","eclat:","原曲重視~維持","女性ボーカルを軸にしてジャンルは何でもアレンジ。(Black)",
    7,,"#039","しゃーぷさんきゅー","http://sharp039.web.fc2.com/",,,"EMERGENCE",,,
    8,,"#ゆうかりんちゃんねる","ゆうかりんちゃんねる","http://yuukach.web.fc2.com/","オーケストラ、クラシック、ピアノ、エレクトロ、ロック",,"ゴリラ人間のための華麗なる大幻想曲集","原曲維持","クラシカルな構成と管弦を散りばめたオーケストラ、クラシック風インスト。バイオリンの倍音の響きが印象的。(Black)",
    9,,"10-GALLON(Digit Smith)","てんがろん","http://10-gallon.net/","ロック、エレクトロ、ハードロック","女","悪魔城レミリア","原曲維持","原曲メロをミドルテンポなロックとエレクトロに乗せて。(Black)",
$ tail -1 Toho_arrange_circle_database-gwern.csv
    1316,,"侘助","わびすけ","http://ameblo.jp/wa-bi-su-ke/","エレクトロ、ロック","女","東方乙女椿","原曲維持","速めのエレクトロ、ロックアレンジが多め。女声ボーカル。(Black)",

I am a little surprised that there are only 1316 entries. Either I’ve overestimated their thoroughness, or this is limited to a specific convention or something like that; I need to look into this more. This doesn’t include the track-level data I was hoping for, but a list of albums can still be useful for estimating completeness of capture.

Circle count

Crosscheck: number of circles at Comiket & Reitaisai? Circles self-classify by ‘genre code’, which is recorded in the official Comiket catalogues, from which most (all?) of these numbers are drawn:

  • C87: 1756 https://media.8chan.co/2hu/src/1419976746668.png

  • C86: “Statistics taken from the C86 catalogue.” http://ascii.jp/elem/000/000/917/917701/002_976x637.jpg https://i.imgur.com/R0uXljZ.jpg ~1938 circles; later estimate, 1910 https://webcatalog-free.circle.ms/Circle?genreCode=241&day=2 & i2.kym-cdn.com; another estimate, 1918 https://media.8chan.co/2hu/src/1419976746668.png

  • C85: “Grey: C85 statistics” https://i.imgur.com/R0uXljZ.jpg 2258 / 2272 (http://www.crunchyroll.com/anime-news/2013/11/01-1/top-doujinshi-events-most-popular-by-the-numbers original: http://yaraon.blog109.fc2.com/blog-entry-19664.html) / 2246 (according to i2.kym-cdn.com)

  • C84: 2526 (according to i2.kym-cdn.com)

  • C83: 2492 (according to i2.kym-cdn.com)

  • C82: Touhou=2670 / 2694 (according to i2.kym-cdn.com)

  • C81: Touhou=2690 / 2656 (according to i2.kym-cdn.com)

  • C80: Touhou=2808 / 2774 (according to i2.kym-cdn.com) & http://d.hatena.ne.jp/myrmecoleon/20131101/1383333933

  • http://www.sankakucomplex.com/2008/08/02/touhou-takeover/ (https://www.google.com/search?num=100&q=touhou%20comiket%20circle%20genre%20site%3Asankakucomplex.com%20-site%3Awww.sankakucomplex.com%2Ftag%2F) original: https://web.archive.org/web/20110722080729/http://addb.jp/index.php?Diary%2F2008-07-30 :

  • C75-C79: (http://i2.kym-cdn.com/photos/images/original/000/866/132/075.jpg unknown source) C75: 1387+1356 (https://web.archive.org/web/20110812023434/http://addb.jp/index.php?Diary%2F2008-12-17) / C76: 1739 (http://d.hatena.ne.jp/myrmecoleon/20101230/1293717241) / C77: 2372 (d.hatena.ne.jp) / C78: 2416+2394 (d.hatena.ne.jp) / C79: 2774 (d.hatena.ne.jp)

  • C74: 885

  • C73: 793

  • C72: 558

  • C71: 574

  • C70: 366

  • C69: 232

  • C68: 229

  • C67: 98

  • C66: 50

  • C65: 7 / 39 (en.touhouwiki.net)

  • C64: 0 / 12 (en.touhouwiki.net)

  • C63: 0 / 1 (according to http://en.touhouwiki.net/wiki/Release_Timeline#2002 ; asked about discrepancy http://en.touhouwiki.net/index.php?title=Talk:Release_Timeline&curid=46413&diff=331144&oldid=262193 )

  • Reitaisai (Hakurei Jinja Reitaisai / 博麗神社例大祭):

  • 2004-4-18 R1: 114

  • 2005-5-4 R2: 362

  • 2006-5-21 R3: 680

  • 2007-5-20 R4: 653

  • 2008-5-25 R5: 1086

  • http://thwiki.cc/%E4%BE%8B%E5%A4%A7%E7%A5%AD#.E5.8E.86.E5.B1.8A.E4.BF.A1.E6.81.AF :

  • 2009-3-8 R6: 2948

  • 2010-3-14 R7: 4050

  • 2011-5-8 R8: 4940 (see Wikipedia)

  • 2012-5-27 R9: 5058

  • 2013-5-26 R10: 5013

  • 2014-5-11 R11: 4312

  • 2015-5-10 R12: ?

  • http://www.comiket.co.jp/info-a/C77/C77CMKSymposiumPresentationEnglish.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080225.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080528.pdf

  • C76: all=>35000

  • C75: all=>35000

  • C73: all=>35000

  • C72: all=>35000

  • pg5 of CMKSymposiumPresentationEnglish: circles graph, C1-C76

  • pg6 of WhatIsEng080225: C1-C73

Torrent

Music source: Touhou lossy music collection v.15.2 (derived from the Touhou lossless music collection), 265.2GB of 44421 tracks from 4952 albums produced by <1,264 groups or “circles”.

$ find ~/torrent/Touhou\ lossy\ music\ collection/ -type f -name "*.mp3" | wc -l
  44421
$ ls torrent/Touhou\ lossy\ music\ collection/ | wc -l
   1264
$ ls torrent/Touhou\ lossy\ music\ collection/*/ | wc -l
   7477
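
A caveat on the last count: ls on a glob of directories prints a header line for each directory and a blank separator between them, so 7477 overstates the number of album folders, which is presumably why it exceeds the 4952 albums quoted above. A minimal sketch of a more direct count, assuming the circle/album two-level layout and a find that supports -mindepth/-maxdepth (GNU and BSD do; POSIX does not require them); the function names are hypothetical:

```shell
# Count first-level directories (circles) under the collection root $1.
count_circles() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Count second-level directories (albums) under the collection root $1.
count_albums() {
    find "$1" -mindepth 2 -maxdepth 2 -type d | wc -l | tr -d ' '
}

# Example call (path as used elsewhere in these notes):
# count_albums ~/torrent/"Touhou lossy music collection"
```

Directory names containing newlines would still be miscounted; that edge case is ignored here.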

File name, music length, and metadata year (if any) are extracted using exiftool:

events:

  • 例大祭["", "SP", "SP2", 2-9]: annual Reitaisai
  • M3*: annual Media Mix Market, eg. http://polymetrica.wordpress.com/2009/10/09/things-i-am-excited-about-04-m324/
  • C[63-82]: semi-annual Comiket
  • サンクリ[28-50]: annual? Sunshine Creation http://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%AA%E3%82%A8%E3%82%A4%E3%82%B7%E3%83%A7%E3%83%B3_%28%E5%90%8C%E4%BA%BA%E5%8D%B3%E5%A3%B2%E4%BC%9A%29
  • 東方紅楼夢[?2-8]: annual Koromu
  • 月の宴?2-5: annual? Feast of the Month
  • 紅のひろば?2-6: semiannual Red Square
  • 東方不敗小町?2-6, SP, ぷちこまち: Komachi
  • 杜の奇跡[15-16]
  • 東方杜郷想[2-3]
  • 幺樂団カァニバル!?2-3
  • 東方幻楽祭[2]: semiannual
  • コミコミ[12-14]
  • FF[9-17] ?
  • こみトレ[12-17]
  • COMIC1☆2-6
  • COMIC CITY大阪[63,73]
  • 恋魔理?2-3
  • 東方椰麟祭?2-3
  • 東方名華祭2

exiftool; json

Is exiftool’s length approximation trustworthy? Yes: in this sample it is always within a second of the full mp3info answer:

$ find "/home/gwern/torrent/Touhou lossy music collection/" -type f -name "*.mp3" \
        -exec mp3info -F -p "0:%02m:%02s " {} \; -exec exiftool -Duration {} \;
0:00:23 Duration                        : 23.12 s (approx)
0:04:28 Duration                        : 0:04:28 (approx)
0:02:44 Duration                        : 0:02:44 (approx)
0:04:52 Duration                        : 0:04:52 (approx)
0:03:56 Duration                        : 0:03:56 (approx)
0:01:44 Duration                        : 0:01:44 (approx)
0:04:34 Duration                        : 0:04:34 (approx)
0:02:30 Duration                        : 0:02:30 (approx)
0:03:02 Duration                        : 0:03:02 (approx)
0:03:43 Duration                        : 0:03:42 (approx)
0:03:23 Duration                        : 0:03:23 (approx)
0:03:11 Duration                        : 0:03:11 (approx)
0:04:22 Duration                        : 0:04:22 (approx)
0:03:13 Duration                        : 0:03:13 (approx)
0:04:04 Duration                        : 0:04:04 (approx)
0:03:58 Duration                        : 0:03:57 (approx)
0:05:24 Duration                        : 0:05:24 (approx)
0:04:17 Duration                        : 0:04:17 (approx)
0:03:14 Duration                        : 0:03:14 (approx)
0:01:59 Duration                        : 0:01:58 (approx)
0:03:21 Duration                        : 0:03:21 (approx)
0:02:35 Duration                        : 0:02:35 (approx)
0:04:20 Duration                        : 0:04:20 (approx)
...
# generate and parse and cleanup data
#
# takes ~30m:
# R> system("exiftool -extension mp3 -json -forcePrint
#           -Title -Year -Album -Artist -Duration -Genre -Track -Directory -FileName -FileSize -AudioBitrate
#           ~/torrent/Touhou\\ lossy\\ music\\ collection/*/* > ~/touhou.json")
library(rjson)
# download from https://www.gwern.net/docs/touhou/2013-torrent.json.xz and decompress with xz
json_data <- fromJSON(paste(readLines("2013-gwern-touhoutorrent.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=12, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
touhou$Directory <- sub("/home/gwern/torrent/Touhou lossy music collection/", "",
                         as.character(touhou$Directory))
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA
touhou$Year <- as.integer(as.character(touhou$Year))
# torrent doesn't cover 2013 music, and music predating the PC-98 games doesn't exist...
touhou$Year[touhou$Year<1990] <- NA
touhou$Year[touhou$Year>2012] <- NA
# Genre is "None" or " "? both useless and false (thanks, tagger); so it goes too:
touhou$Genre[touhou$Genre=="None"] <- NA
touhou$Genre[touhou$Genre==" "] <- NA
# turn the track lengths and file sizes into usable numbers on a common scale (seconds and MBs, respectively),
# and the bitrates into integer kbps
touhou$Length <- gsub(" \\(approx\\)","",as.character(touhou$Length))
touhou$AudioBitrate <- as.integer(sub(" kbps","",as.character(touhou$AudioBitrate)))
# exiftool leaves us "16 s"; if so, strip the " s" and turn it into an integer
# else, eg. "0:04:37"; split on colon,
# multiply hour by 3600 seconds, minutes=60 each, seconds=seconds; and sum it
interval <- function(x) { if (!is.na(x)) { if (grepl(" s",x)) as.integer(sub(" s","",x))
                                           else { y <- unlist(strsplit(x, ":"));
                                                  as.integer(y[[1]])*3600 + as.integer(y[[2]])*60 + as.integer(y[[3]]); }
                                                  }
                          else NA
                          }
touhou$Length <- sapply(touhou$Length,interval)
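# exiftool prints sizes like "512 kB" or "4.2 MB"; as.numeric keeps the fractional part that as.integer would truncate
filesize <- function(x) { if (grepl(" kB",x)) (as.numeric(sub(" kB","",x))/1000) else as.numeric(sub(" MB","",x))}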
touhou$FileSize <- sapply(touhou$FileSize, filesize)
# Serious work: turn the encoded information in Directory into usable columns. Not for the faint of heart.
#
# The Directory column looks like "[twith1450]/2009.03.08 TOHOMOHO [例大祭6]"
# The schema here is "[circle]/eventDate album [event]"
#
# "[Angelic Quasar]/2006.01.29 [AQSH-0003] Racial Ethnic Nation"
# "[Alstroemeria Records]/[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
# "[Aqua Style/ひえろぐらふ]/2010.05.24 [AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
brackets <- function(b) sub("\\]","", sub("\\[","",b))
# easy first step: parse out the leading group/circle (always there, terminated by forward-slash) as new column
touhou$Circle <- sapply(touhou$Directory, function(x) brackets(unlist(strsplit(as.character(x), "/"))[1]))
# destructively update by removing the group/circle, to make the next step easier
# this makes Directory looks like "2009.06.07 [PAER-0007] #01 -LILITH- [東方幻楽祭2]"
# or "[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
touhou$Directory <- sapply(touhou$Directory, function(x) unlist(strsplit(as.character(x), "/"))[2])
touhou$Date <- as.Date(sapply(touhou$Directory, function(x) substring(x, 1, 10)), format="%Y.%m.%d")
# and like before, we strip the event date that we've parsed out, leaving eg. "ピアノのための東方小品集 Op.1-1 [御射宮司祭]"
# or "月遊 [例大祭8]" or "[AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
touhou$Directory <- sapply(touhou$Directory, function(x) substring(x, 12))
# extracting the next parameter, the event the album was released at, is harder still
library(stringr)
# if the directory does not end in a right-bracket, there's no event info and we should bomb out
# else, grab w/regexp last pair of brackets with a space before (excludes any album numbering schemes) & trim
# that didn't work? then it must be one of the directories where there's no space before the bracketed event, retry without leading space
touhou$Event <- sapply(touhou$Directory, function(x) { if (str_sub(x,start=-1) == "]") { res <- brackets(unlist(str_split(x, " \\["))[2]); if (!is.na(res)) res else brackets(unlist(str_split(x, "\\["))[2]) } else x})
# if you examine the Event column, it's full of wrong entries. I have made a list of 20 event-prefixes (I hope), which
# we'll use as a whitelist by erasing anything which lacks all of the 20 prefixes.
isPrefix <- function(x,y) grepl(paste0("^",x), y)
events <- c("例大祭","M3","C","サンクリ","東方紅楼夢","月の宴","紅のひろば","東方不敗小町","杜の奇跡","東方杜郷想",
            "幺樂団カァニバル!","東方幻楽祭","コミコミ","FF","こみトレ","COMIC1","COMIC CITY大阪","恋魔理","東方椰麟祭","東方名華祭")
touhou$Event <- sapply(as.character(touhou$Event),
                       function(target) if (sum(sapply(events, function(e) isPrefix(e,target))) != 0) target else NA)
# this whitelist covers almost the entire sample, so I think it works well:
## R> sum(!is.na(touhou$Event))
## [1] 39190
## R> length(touhou$Event)
## [1] 41866
#
# one final thing, since (almost) all directories had Dates while not all files had Years; overwrite any missing Years
# based on the Date we just extracted
touhou$Year <- ifelse(is.na(touhou$Year), as.integer(format(touhou$Date, "%Y")), touhou$Year)
touhou$Directory <- NULL # clean up (rm() cannot delete a data-frame column)
# escape with the loot:
write.csv(touhou, file="2013-gwern-touhoumusic-torrent.csv", row.names=FALSE)
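
The duration handling in the script above (seconds for short tracks, h:mm:ss otherwise) can also be sketched outside R, which is handy for spot-checking individual exiftool values. This is a minimal shell/awk sketch; the function name is hypothetical, and it assumes the only two shapes exiftool emits are the ones seen in the sample output ("23.12 s" and "0:04:28", each optionally followed by " (approx)"):

```shell
# Hypothetical helper: convert an exiftool Duration string to whole seconds.
duration_to_seconds() {
    printf '%s\n' "$1" | awk '
        { sub(/ \(approx\)$/, "") }                    # drop the trailing " (approx)"
        / s$/ { sub(/ s$/, ""); print int($0); next }  # "23.12 s" -> 23
        { split($0, p, ":")                            # "0:04:37" -> hours, minutes, seconds
          print p[1] * 3600 + p[2] * 60 + p[3] }'
}

duration_to_seconds '0:04:37 (approx)'   # prints 277
duration_to_seconds '23.12 s (approx)'   # prints 23
```

Like the R version, fractional seconds are truncated rather than rounded.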

VGMdb

The Touhou project page turns out to be incomplete: each entry has to be manually annotated as related to Touhou. I was pointed to a search query which turned up many more results by looking for any page with the string “Touhou” in the “games” field.

The VGMdb administrators kindly gave me read-only access to their MySQL databases. I grabbed the entirety of the tables vgmdb_albums and vgmdb_tracks from the main VGMdb database and exported them as 2 comma-separated CSV files, renamed 2013-vgmdb-albums.csv and 2013-vgmdb-tracks.csv. Before loading the exports, I had to delete all escaped quotes, which the default R CSV parsing doesn’t handle. Each track row describes 1 track plus an album ID, so to turn each track row into an equivalent of the torrent rows, I need to fill in the album-level fields from the album table.

albums <- read.csv("https://www.gwern.net/docs/touhou/2013-vgmdb.csv")
albums <- with(albums,
           data.frame(albumid,reldate,publisher,game,albumtitles))
albums <- albums[grepl("ouhou", albums$game),]; albums$game <- NULL

tracks <- read.csv("2013-vgmdb-tracks.csv")
# drop unneeded columns (rm() only removes whole objects, not data-frame columns)
tracks <- tracks[, !(names(tracks) %in% c("tracklistid", "trackid", "disctitle", "disc"))]
tracks$length[tracks$length==0] <- NA # 0 is the default in the VGMdb schema
tracks <- tracks[tracks$albumid %in% albums$albumid,]
touhou <- merge(tracks, albums)
touhou$albumid <- NULL; rm(albums, tracks)
# deal with the 41 dates with the format "2005-09-00" (the 0th months or days are not real...);
# gsub rather than sub, so that a date with both month and day zeroed is repaired too
touhou$date <- as.Date(gsub("-00","-01",as.character(touhou$reldate)))
touhou$reldate <- NULL
touhou$year <- as.integer(format(touhou$date, "%Y"))
# upcase, rearrange to torrent's order
colnames(touhou) <- c("Track","Title","Length","Circle","Album","Date","Year")
touhou <- touhou[c(2,7,5,3,1,4,6)]
write.csv(touhou, file="2013-vgmdb-touhou.csv", row.names=FALSE)
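
The zero-month/zero-day cleanup in the script above can equally be applied to the exported CSV before it reaches R. A minimal sketch with sed, assuming release dates are always written YYYY-MM-DD; the function name is hypothetical. The substitution is global so that a date with both fields zeroed (a made-up example: 2005-00-00) also comes out valid:

```shell
# Hypothetical helper: replace a zero month and/or day with 01.
normalize_reldate() {
    printf '%s\n' "$1" | sed 's/-00/-01/g'
}

normalize_reldate '2005-09-00'   # prints 2005-09-01
normalize_reldate '2005-00-00'   # prints 2005-01-01
```

Because every match must start with a hyphen, the year is never touched, even for releases from 2000.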

touhouwiki.net

Personal downloads

4chan /jp/ C83 threads

A loose group of users on the /jp/ subforum collaborate each Comiket to upload and distribute doujin manga, games, and music released at that Comiket; some are uploaded by Comiket attendees, some are bought from resellers like , and many files are harvested from Japanese P2P filesharing networks like //. I compiled a list of ~400 files from the /r/TouhouMusic C83 thread (principally from the 4chan links) & the blog All Doujin Music, and gradually downloaded them from January to March 2013. After dead links, I was left with 400-500 files. Many are not music, or not even Touhou-related, so I hand-filtered albums, looking for signs of being Touhou doujin works (credits to ZUN, Touhou characters in the artwork, themes I recognized as Touhou, etc); when I was not sure, I erred on the side of exclusion. The final compilation yielded 3503 files (evenly split: 1776 Touhou vs 1728 “other”) with 953 Touhou music files.

# exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory
# -FileName -FileSize -NominalBitrate -Date -recurse
# ~/c83/touhou/ ~/c83/touhou/*/** ~/c83/touhou/*/*/** ~/c83/touhou/*/*/*/** > ~/2013-c83-downloads.json

library(rjson)
# load the exiftool JSON dump produced by the command above
json_data <- fromJSON(paste(readLines("2013-c83-downloads.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=13, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate", "Date")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
for (filter in c("/home/gwern/c83/touhou/", "\\[touhou.vnsharing.net\\]", " \\(320K\\+BK\\)", "/mp3",
                 " MP3v0", " v0", " \\(flac\\+scans\\)", " \\(128K\\)", " \\(V0\\)", " \\(320\\)",
                 " \\(mp3 320\\)", " \\(v0\\+jpg\\)"))
 { touhou$Directory <- sub(filter, "", as.character(touhou$Directory)) }
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA

Playback length:

find c83/touhou/ -name "*.ogg" -exec ogginfo {} \; | grep -F "Playback length"

4chan /jp/ C84 threads

591 Touhou music files:

/docs/touhou/2013-c84-downloads.json

4chan /jp/ C85 threads

449 Touhou music files:

/docs/touhou/2013-c85-download.json

4chan /jp/ Reitaisai 10 threads

Similarly to above, drawing on the /r/TouhouMusic discussion and manually pruning duplicates & non-Touhou files down to 491 music files.

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads.json

Reitaisai 10 torrent

In May & June 2013, an anonymous person compiled 67 albums and released two torrents of Reitaisai 10 albums (vol. 1, vol. 2):

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads-torrent.json

Analysis

touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
                   colClasses=c("character", "integer", "factor", "character", "integer", "factor",
                                "character", "character", "numeric", "integer", "factor", "Date"))

# do stuff with the data
# general correlations
t <- data.frame(touhou$Year, touhou$Length, touhou$FileSize, touhou$AudioBitrate)
cor(t,use="pairwise.complete.obs")
#                     touhou.Year touhou.Length touhou.FileSize touhou.AudioBitrate
# touhou.Year
# touhou.Length          -0.01091
# touhou.FileSize         0.04484       0.93915
# touhou.AudioBitrate     0.19188       0.11091         0.35499

# test the correlation between higher bitrate and larger files:
cor.test(touhou$FileSize, touhou$AudioBitrate)

# the genre metadata is useless!
sort(table(touhou$Genre), decreasing=TRUE)

# boxplot avg length per year
plot(touhou$Length ~ factor(touhou$Year))

Economics modeling:

jpn <- read.csv(stdin(),header=TRUE)
DATE,VALUE
2000-01-01,8.9
2001-01-01,9.1
2002-01-01,9.5
2003-01-01,9.6
2004-01-01,9.0
2005-01-01,8.1
2006-01-01,7.5
2007-01-01,7.5
2008-01-01,7.0
2009-01-01,8.9
2010-01-01,9.0
2011-01-01,8.1

# number of works per year does not correlate:
cor.test(jpn$VALUE[3:12], table(touhou$Year)[1:10])
    Pearson's product-moment correlation

data:  jpn$VALUE[3:12] and table(touhou$Year)[1:10]
t = -0.3053, df = 8, p-value = 0.768
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
 -0.6903  0.5602
sample estimates:
    cor
-0.1073
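
As a cross-check that does not need the full dataset, the same Pearson correlation can be recomputed from the two ten-year series that appear in these notes: the 2002-2011 unemployment rates and the 2002-2011 per-year track counts tabulated in the “Growth over time” section. A minimal shell/awk sketch (the function name pearson is made up for illustration; it reads one "x y" pair per line):

```shell
# Hypothetical helper: Pearson correlation of two columns read from stdin.
pearson() {
    awk '{ n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
         END { num = n*sxy - sx*sy
               den = sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
               printf "%.4f\n", num/den }'
}

# unemployment rate (%) and track count, 2002-2011
printf '%s\n' '9.5 13' '9.6 23' '9.0 255' '8.1 1241' '7.5 2070' \
              '7.5 2599' '7.0 5073' '8.9 10278' '9.0 9765' '8.1 7494' | pearson
# prints -0.1073
```

The sum-of-products form is adequate at this scale; for much larger values a two-pass, mean-centred computation would be numerically safer.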

model <- lm(table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12]); summary(model)
Call:
lm(formula = table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
    Min      1Q  Median      3Q     Max
-2128.6  -716.4    52.4   632.6  2253.4

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -2775278     342171   -8.11  8.3e-05
c(2002:2011)        1379        170    8.13  8.2e-05
jpn$VALUE[3:12]     1450        567    2.56    0.038

Residual standard error: 1400 on 7 degrees of freedom
Multiple R-squared: 0.905,  Adjusted R-squared: 0.878
F-statistic: 33.5 on 2 and 7 DF,  p-value: 0.00026



logModel <- lm(log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])
summary(logModel)
Call:
lm(formula = log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
   Min     1Q Median     3Q    Max
-1.218 -0.632  0.108  0.551  0.982

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1295.060    207.535   -6.24  0.00043
c(2002:2011)        0.652      0.103    6.34  0.00039
jpn$VALUE[3:12]    -0.725      0.344   -2.11  0.07301

Residual standard error: 0.849 on 7 degrees of freedom
Multiple R-squared: 0.906,  Adjusted R-squared: 0.879
F-statistic: 33.8 on 2 and 7 DF,  p-value: 0.000253

plot(c(2002:2011),table(touhou$Year)[1:10])
points(c(2002:2011),exp(predict(logModel)),type='l',col='blue')

Growth over time

How fast is the corpus of Touhou music growing?

Constant growth model: the first game was released in 1996, no? So that gives 17 years to accumulate 1.26TB or 1,260GB, or 74.1GB per year. The screenshot is downloading at 0kb/s, which is not useful, but it says 2640 days left, so we can estimate that he’s downloading at ~0.48GB per day (1260/2640), which over a year is 174GB, or 2.35x faster than the 74GB per year. So at that annual increase, OP is not doomed and can in fact catch up.
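
The constant-growth arithmetic can be reproduced in one small function. This is a sketch using only the three figures from the paragraph above (1,260GB, 17 years, 2640 days remaining); the function name and argument order are illustrative assumptions:

```shell
# Hypothetical helper: constant-growth comparison.
#   $1 = corpus size in GB, $2 = years over which it accumulated,
#   $3 = days the download client says remain
constant_growth() {
    awk -v total="$1" -v years="$2" -v days="$3" 'BEGIN {
        per_year = total / years          # average GB of new music per year
        dl_year  = total / days * 365     # GB downloaded per year at the implied speed
        printf "%.1f %.0f %.2f\n", per_year, dl_year, dl_year / per_year
    }'
}

constant_growth 1260 17 2640   # prints: 74.1 174 2.35
```

The three outputs are the corpus growth in GB/year, the implied download rate in GB/year, and their ratio; any ratio above 1 means the downloader gains on the corpus.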

Exponential growth model: a little trickier, since we can’t fit a formula from just the cumulative total and elapsed time; I need more data. So using my 2012 Touhou Lossy Torrent data, I can try to regress an exponential against the annual count… but wait! The amount of music does not seem to be increasing exponentially!

R> touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
+                    colClasses=c("character", "integer", "factor", "character",
+                                         "integer", "factor", "character",
+                                         "character", "numeric", "integer",
+                                         "factor", "Date"))
R> summary(touhou$Year)
R> perYear <- table(touhou$Year); perYear; plot(perYear)

 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
   13    23   255  1241  2070  2599  5073 10278  9765  7494  2999

Graph: http://i.imgur.com/23fMA5c.png

Looks like Touhou music’s growth peaked in 2009; this might reflect the torrent’s incompleteness, except the torrent is from 2012, and you’d expect coverage of 2010 or 2011 to be pretty good by that point. So the growth of the torrent overall looks more like a sigmoid or log:

R> runningTotal <- cumsum(table(touhou$Year)); runningTotal; plot(runningTotal)
 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
   13    36   291  1532  3602  6201 11274 21552 31317 38811 41810

http://i.imgur.com/Lv9vHrZ.png

So probably it’d be better to ask, ‘if 2012-rate growth continues, what is the ratio to his download speed?’

2012 added 19.5GB to the torrent:

R> # FileSize is in megabytes, so divide by 1000 for gigabytes
R> sum(touhou[touhou$Year==2012,]$FileSize, na.rm=TRUE) / 1000
[1] 19.5

Appendix

touhouwiki.net scraping code

Uses the TagSoup and split packages; emits CSV to standard out. It is a pile of kludges which I am a little ashamed to post publicly, and may not work for you.

import Text.HTML.TagSoup (fromAttrib, fromTagText, isTagOpen, isTagText,
                          parseTags, (~/=), Tag(TagComment, TagOpen,TagText))
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isInfixOf, isPrefixOf, nub, sort, stripPrefix)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)
import Data.Maybe (fromMaybe, isJust, listToMaybe)
import Data.List.Split (keepDelimsL, split, whenElt)
import Control.Monad (join, unless)

main :: IO ()
main = do albums1 <- getAlbums "http://en.touhouwiki.net/wiki/List_by_Groups"
          albums2 <- getAlbums "http://en.touhouwiki.net/wiki/List_of_Old_Touhou_Arrangement_Groups"
          let albumURLs = map ("http://en.touhouwiki.net"++) $ nub $ sort $ albums1 ++ albums2
          putStrLn header
          mapM_ getAlbum albumURLs

getAlbums :: String -> IO [String]
getAlbums index = do touhou <- openURL index
                     return $ drop 22 $ reverse [link | (TagOpen "a" ((ref, link):_)) <- parseTags touhou,
                                                        ref=="href",
                                                        "/wiki/" `isPrefixOf` link,
                                                        let fltr x = not (x `isInfixOf` link),
                                                        fltr "_Groups", fltr "Touhou_Wiki:", fltr "Special:",
                                                        fltr "Template:", fltr "Category:", fltr "Talk:"]

type Album = [Track]
data Track = Track { title :: String, year :: Int, date :: String,artist :: String,
                     album :: String, event :: String, circle :: String, duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",event="",circle="",artist="",duration=Nothing,track=0}

header :: String
header = "Title,Year,Date,Album,Event,Circle,Duration,Track"
convert :: Track -> String
convert t = "\"" ++ dequote (title t) ++ "\"," ++ show (year t) ++ "," ++ show (date t) ++ ",\"" ++ dequote (album t) ++ "\",\"" ++
             event t ++ "\",\"" ++ circle t ++ "\"," ++ maybe "" show (duration t) ++ "," ++ show (track t)

-- You are not expected to understand this.
getAlbum :: String -> IO ()
getAlbum a = do t <- fmap parseTags $ openURL a
                --TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2007-03-23"
                let dt = let target = dropWhile (TagText "Released" ~/=) t
                         in if null target then ""
                            else deleteParens $ fromTagText $ head $ tail $ filter isTagText target
                -- TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2009/02/08 (",
                -- TagOpen "a" [("href","/wiki/Category:Sunshine_Creation_42"),("title","Category:Sunshine Creation 42")],
                -- TagText "Sunshine Creation 42",TagClose "a",TagText ")",TagClose "td"
                let evnt = let stream = filter isTagText $ dropWhile (TagText "Released" ~/=) t
                           in if '(' == last (fromTagText $ head $ take 5 $ drop 1 stream) -- )
                              then fromTagText (filter isTagText (dropWhile (TagText "Released" ~/=) t) !! 2)
                              else ""
                -- "2007-03-23"
                let yr = if null dt then 0 else read (take 4 dt)::Int
                -- TagText "Album by CODE ZTS LABEL"
                let hasCrcl = [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
                unless (null hasCrcl || (yr==0 && null dt)) $ do
                    let crcl = lookForCircle t
                    -- TagText "Selfregards2 - Touhou Wiki - Characters, games, locations, and more"
                    let albm = reverse $ drop 55 $ reverse $ head [al | TagText al <- t,
                                 " - Touhou Wiki - Characters, games, locations, and more" `isInfixOf` al]
                    let dflt = empty { date = dt, year = yr, circle = crcl, album = albm, event = evnt }
                    let table = filter (\x -> not ("Disc" `isInfixOf` x || " CD" `isInfixOf` x)) $
                                 filter (not . all isSpace) $
                                  map (trim . fromTagText) $ filter isTagText $
                                   drop 5 $ takeWhile (TagOpen "table"
                                     [("class","navbox"),("cellspacing","0"),
                                      ("style","background:#FFFBEE;border-color:#A8A077;")] ~/=) $
                                       takeWhile (TagComment "" ~/=) $
                                        dropWhile (TagOpen "span"
                                         [("class","mw-headline"),("id","Tracks")] ~/=) t
                    let tracks = filter (not . null) $
                                  split (keepDelimsL $ whenElt
                                   (\x -> length x==3 && "." `isInfixOf` x &&  isJust(maybeRead x :: Maybe Int))) table
                    mapM_ (putStrLn . convert . trackToTrack dflt) tracks

-- TagText "Album by ",TagOpen "a"
-- [("href","/wiki/ALiCE%27S_EMOTiON"),("title","ALiCE'S EMOTiON")],TagText
-- "ALiCE'S EMOTiON",TagClose "a"
lookForCircle :: [Tag String] -> String
lookForCircle t = let c = head [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
                      res = if c == "Album by " then (let tg = (dropWhile (TagText "Album by " ~/=) t !! 2)
                        in if isTagText tg then fromTagText tg else
                            (if isTagOpen tg then fromAttrib "title" tg else "") ) else drop 9 c
                      in if "(page does not exist)" `isInfixOf` res then takeWhile (/='(') res else res -- )

-- ["01.","The mom","(04:07)","arrangement: ZTS","composition: ZTS","original title: The mom","source: Parhelia"]
trackToTrack :: Track -> [String] -> Track
trackToTrack tr t = tr { track = fromMaybe 0 (maybeRead (head t) :: Maybe Int),
                         title = t !! 1,
                         duration = if length t >=3 then Just (timeConverter $ deleteParens (t !! 2)) else Nothing,
                         artist =  lookForAnArtist t }

lookForAnArtist :: [String] -> String
lookForAnArtist t = let targets = dropWhile (\x -> not ("arrangement:" `isPrefixOf` x || "composition:" `isPrefixOf` x)) t
                        target
                            | null targets = ""
                            | last (head targets) == ':' = head targets ++ (targets !! 1)
                            | otherwise = head targets
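                    -- strip whichever prefix is present; chaining the two 'stripPrefix' calls
                    -- only matched text carrying both prefixes back to back
                    in trim $ case stripPrefix "arrangement:" target of
                                Just a  -> a
                                Nothing -> fromMaybe "" (stripPrefix "composition:" target)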

-- utility functions
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)
deleteParens, trim, dequote :: String -> String
deleteParens = trim . filter (\x -> x /= '(' && x /= ')')
trim = reverse . dropWhile isSpace . reverse . dropWhile isSpace
dequote = map (\x -> if x=='"' then '\'' else x) -- "')
timeConverter :: String -> Int
timeConverter n = let (m,s) = break (==':') n
                      m' = maybeRead m :: Maybe Int
                      s' = maybeRead (drop 1 s) :: Maybe Int
                  in (fromMaybe 0 m' * 60) + fromMaybe 0 s'
maybeRead :: Read a => String -> Maybe a
maybeRead = fmap fst . listToMaybe . reads
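
The grouping step inside getAlbum (start a new track whenever a token like "01." appears, and keep that token at the head of its group) is easier to see in isolation. What follows is a minimal base-only sketch of that idea, not the code I ran: the names isTrackNumber and groupTracks are hypothetical, and the predicate is an assumption that is stricter than the one above (it demands exactly two digits and a dot).

```haskell
import Data.Char (isDigit)

-- Hypothetical predicate (an assumption): True for tokens such as "01." or "12."
isTrackNumber :: String -> Bool
isTrackNumber [a, b, '.'] = isDigit a && isDigit b
isTrackNumber _           = False

-- Start a new group at every track-number token, keeping the token
-- as the first element of its group.
groupTracks :: [String] -> [[String]]
groupTracks []       = []
groupTracks (x : xs) =
  let (rest, more) = break isTrackNumber xs
  in (x : rest) : groupTracks more
```

Fed the flattened table text, each resulting group has the shape that trackToTrack expects (number, title, then duration and credits). Any tokens before the first track number come out as a leading group of their own.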

VGMdb scraping code

The following is a buggy program for scraping Touhou albums from VGMdb; it works on a limited subset of album pages, but has an unknown number of fatal bugs. I abandoned it once I was offered read-only database access, which is what I actually used to get my VGMdb data. I keep it here in case I ever need to go back to scraping.

import Text.HTML.TagSoup (fromTagText, isTagOpenName, isTagText, Tag(TagOpen,TagText), parseTags)
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isPrefixOf, sort)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)

main :: IO ()
main = do albumsURLs <- getAlbums
          albums <- mapM openURL (sort albumsURLs)
          let metadata = map toAlbum albums
          writeFile "vgmdb.csv" $ unlines (header : concatMap (map convert) metadata)

type Album = [Track]
data Track = Track { title :: String,
                     year :: Int,
                     date :: String,
                     album :: String,
                     circle :: String,
                     duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",circle="",duration=Nothing,track=0}
header :: String
header = "Title,Year,Date,Album,Circle,Duration,Track"
convert :: Track -> String
convert t = "\"" ++ title t ++ "\"," ++ show(year t) ++ "," ++ show (date t) ++ ",\"" ++
              album t ++ "\",\"" ++ circle t ++ "\"," ++ maybe "" show(duration t) ++ "," ++ show (track t)

-- example album link: 'TagOpen "a" [("class","albumtitle album-doujin"),
--                                   ("href","http://vgmdb.net/album/36901"),
--                                   ("title","Majo to Ringo to Samayou Kimi to")]'
getAlbums :: IO [String]
getAlbums = do touhou <- openURL "http://vgmdb.net/product/9"
               -- match on the attribute list directly so that anchors with fewer than two attributes are skipped, not fatal
               return [url | TagOpen "a" ((_,"albumtitle album-doujin"):(_,url):_) <- parseTags touhou]

-- need 'decodeString' to deal with Japanese glyphs; see http://stackoverflow.com/questions/10558003/how-to-get-utf8-rss-feed
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)

toAlbum :: String -> Album
toAlbum page = let tags = parseTags page
                   (yr,dt) = extractDate tags
                   albm = extractAlbum tags
                   crcl = extractCircle tags
                   files = extractMusic tags
               in map (\t -> Track {title = title t, year = yr, date = dt,
                                    album = albm, circle = crcl, duration = duration t,
                                    track = track t}) files


-- TagOpen "a" [("title","View albums released on Dec 30, 2011"),("href","/db/calendar.php?year=2011&month=12#20111230")]
extractDate :: [Tag String] -> (Int,String)
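extractDate t = case [(ttle, href) | TagOpen "a" ((_,ttle):(_,href):_) <- t,
                                      "View albums released on " `isPrefixOf` ttle] of
        -- the year is the last 4 characters of the title; the date is whatever follows '#' in the link
        ((a,b):_) -> (read (reverse $ take 4 $ reverse a) :: Int,
                      drop 1 $ dropWhile (/='#') b)
        -- no release link on the page: fall back to the same defaults as 'empty'
        []        -> (0, "")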

-- TagOpen "title" [],TagText "Gensou Rashinban - VGMdb",TagClose "title",
extractAlbum :: [Tag String] -> String
extractAlbum t = (\(TagText x) -> reverse $ drop 8 $ reverse x) (dropWhile (not . isTagOpenName "title") t !! 1)

-- [TagText "Published by",TagClose "b",TagClose "span",TagClose "td",TagText "\r\n",
-- TagOpen "td" [],TagOpen "a" [("href","/org/217")],TagOpen "span"
-- [("class","productname"),("lang","en"),("style","display:inline")],TagText "PopKorn"]
extractCircle :: [Tag String] -> String
extractCircle t = fromTagText(head (drop 8 (dropWhile (\x -> not(isTagText x && (fromTagText x)=="Published by")) t)))

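-- Take the text between "Tracklist" and "Disc length", skip ahead past "Disc 1", and drop
-- whitespace-only strings; what is left is a flat list of track number, title, and (if present) duration.
extractMusic :: [Tag String] -> [Track]
extractMusic t = let tracks = filter (not . all isSpace) $
                               drop 1 $ dropWhile (\z -> z /= "Disc 1") $
                                map fromTagText $ filter isTagText $
                                 takeWhile (\y -> not(isTagText y && (fromTagText y)=="Disc length")) $
                                  dropWhile (\x -> not(isTagText x && (fromTagText x)=="Tracklist")) t
                   -- Heuristic: a token count divisible by 3 is read as (number, title, duration) triples,
                   -- otherwise as (number, title) pairs; an album without durations whose track count
                   -- is divisible by 3 is therefore misparsed.
                   in if length tracks `rem` 3 == 0 then threezip tracks else twozip tracks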
       where
       twozip,threezip :: [String] -> [Track]
       threezip [] = []
       threezip (a:b:c:d) = empty {title=b,duration=Just (timeConverter c),track=read a} : threezip d
       threezip _ = []
       twozip [] = []
       twozip (a:b:d) = empty {title=b,duration=Nothing,track=read a} : twozip d
       twozip _ = []
       timeConverter :: String -> Int
       timeConverter n = let (m,s) = break (==':') n in ((read m :: Int) * 60) + (read (tail s) :: Int)
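
Some of the fatal bugs come from the partial functions in extractMusic (read and tail), which throw on any token that is not shaped as expected. As a hedged sketch of how that step could be made total, and not something I ran against VGMdb, the version below returns Maybe values instead of crashing. The names parseDuration and triples are hypothetical, and stopping at the first bad track number is an assumption about the desired behavior.

```haskell
import Text.Read (readMaybe)

-- Parse "m:ss" into seconds; malformed input gives Nothing rather than an exception.
parseDuration :: String -> Maybe Int
parseDuration s = case break (== ':') s of
  (m, ':' : sec) -> (\mins secs -> mins * 60 + secs) <$> readMaybe m <*> readMaybe sec
  _              -> Nothing

-- Hypothetical total counterpart of threezip: read (number, title, duration) triples
-- and stop at the first token that is not a track number (an assumed policy).
triples :: [String] -> [(Int, String, Maybe Int)]
triples (n : ttl : d : rest) = case readMaybe n of
  Just k  -> (k, ttl, parseDuration d) : triples rest
  Nothing -> []
triples _ = []
```

A track whose duration fails to parse is kept with Nothing in the duration slot, which matches how the Track record already represents a missing duration.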