Touhou music by the numbers

Collect music metadata and look for patterns (statistics, sociology, Haskell)
created: 28 Feb 2013; modified: 23 Jun 2017; status: abandoned; confidence: possible; importance: 1

Idea: correlate Touhou music production against Japanese youth unemployment: does the total production of music as measured in seconds increase with unemployment?

Opposite view: recessions dent production (perhaps because people are working harder and so have less free time, even if other people are unemployed?). From http://www.gamesetwatch.com/2009/12/sound_current_yokohamas_mediam.php:

While the turnout at M3 remains strong, at the same time an economic recession cannot help but touch a community whose activities rely on having free time. Furthermore while previously many hobbyists dreamed of someday breaking into the industry, more recently many also fear that game companies will begin cracking down on unlicensed tributes.

Data

Unemployment data source

Used FRED Adjusted Unemployment Rate for Youth [15-24yo] in Japan (JPNURYNAA); downloaded as CSV, annual percentage 2000-2011

jpn <- read.table(stdin(),header=TRUE)
      DATE VALUE
2000-01-01   8.9
2001-01-01   9.1
2002-01-01   9.5
2003-01-01   9.6
2004-01-01   9.0
2005-01-01   8.1
2006-01-01   7.5
2007-01-01   7.5
2008-01-01   7.0
2009-01-01   8.9
2010-01-01   9.0
2011-01-01   8.1

Touhou data sources

TODO Suggestions: http://www.reddit.com/r/TOUHOUMUSIC/comments/19hh2m/touhou_music_databases_comprehensive_easily/ http://boards.4chan.org/jp/res/10559057

alternative sources

  • VGMdb Touhou entries but it is substantially smaller with metadata on <1389 albums
  • The open source alternative is MusicBrainz; looks like it has 190 albums, but like 90%+ link to VGMdb, so I’m not sure I want to include them (waste of effort, and if someone just copied over all of VGMdb a few years ago, it’ll be badly misleading to any capture-recapture analysis of population size).
  • Touhouwiki.net another source, with <1182 albums
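The capture-recapture idea mentioned in the MusicBrainz item can be sketched as follows. This is only an illustration using the Chapman form of the Lincoln-Petersen estimator: `chapman` is a hypothetical helper name, and the overlap count is a made-up placeholder, since the number of albums the sources share has not been measured.

```r
# Chapman's bias-corrected Lincoln-Petersen estimator of population size.
# n1, n2: number of albums in each of two (assumed independent) sources
# m:      number of albums that appear in both sources
chapman <- function(n1, n2, m) {
  (n1 + 1) * (n2 + 1) / (m + 1) - 1
}

n_vgmdb      <- 1389  # upper bound quoted above for VGMdb
n_touhouwiki <- 1182  # upper bound quoted above for Touhouwiki.net
overlap      <- 600   # HYPOTHETICAL placeholder, not a measured value

estimate <- chapman(n_vgmdb, n_touhouwiki, overlap)
round(estimate)
```

The estimator assumes the two sources sample albums independently. If one database was bulk-copied from the other, the overlap is inflated and the population estimate is biased downward, which is the worry about MusicBrainz above.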

Arrange Circle Database

東方音団録 ~ Arrange Circle Database ver.3.0; homepage & release info (debate, chat), product page with j-subculture.com quoting a partial total of $15, Paypal and shipping outside Japan boosting to ~$30! (Alternative proxies included Yokatta.)

An email sent on 1 March 2013 had elicited no reply by 27 May, so I then requested a purchase on /r/TOUHOUMUSIC.

My request was filled on 8 June 2013 with an ISO and a ZIP file. The ISO seemed to be broken: file just calls it data, and when I try to mount it as a loopback iso9660 file, mount throws an error. I redownloaded it and compared the two copies, but they were identical. The good news is that the ZIP file works fine. The data is in a .accdb file in a subfolder, which turns out to be the latest Microsoft Access database format. Unfortunately, this format is almost entirely unsupported by anything on Linux (except for a Java library); fortunately, an acquaintance had an Office 365 subscription and re-exported the .accdb file in the older Access .mdb format, which mdb-tools’s mdb-export successfully read and converted to UTF-8 CSV.

The CSV seems to contain ~1316 entries, corresponding to ~1316 albums, each with the respective circle name, URL, genre, possibly vocalist, and 2 more fields I cannot figure out due to lack of Japanese proficiency (but none seem to be release dates). The entries look like this:

$ mdb-tables Toho_arrange_circle_database-gwern.mdb
アレンジサークルリスト
$ mdb-export Toho_arrange_circle_database-gwern.mdb `mdb-tables Toho_arrange_circle_database-gwern.mdb` > Toho_arrange_circle_database-gwern.csv
$ head Toho_arrange_circle_database-gwern.csv
    ID,pin,サークル名,ふりがな,URL,ジャンル,Vocal,主な頒布CD,原曲アレンジの程度,一言,memo
    1,,"っ´Д`)っゼロ式の処刑場(っ´Д`)っゼロ式の家・DESTRUCTIVE ANGEL)","ぜろしきのしょけいじょう","http://shinzanzeroshiki.fc2web.com/","メタル","男","Crazy Trancy Ecstasy","原曲維持","重厚感あふれる荘厳なメタル。(Black)",
    2,,"α music","あるふぁみゅーじっく","http://www19.atpages.jp/tatu4/","オーケストラ、ピアノ",,"東方風水華月","原曲維持",,
    3,,"凸凹えんたーていめんとすたじお","でこぼこえんたーていめんとすたじお","http://3rd.geocities.jp/deko_boko_es/index/","リコーダー、ゲームミュージック",,"Electronic Magus","原曲維持","8ビットあり、リコーダー生演奏ありと多彩。(Black)",
    4,,"[4989]","しくはっく","http://4989mm.littlestar.jp/","ロック","男女","musick for me","原曲維持","スローなロック風アレンジにハイトーン男声ボーカル、エレクトロも混ざったインストも。ドラマCDあり。(Black)",
    5,,"[ kapparecords])","かっぱれこーず","http://www5f.biglobe.ne.jp/~kapparecords/","ハードロック","男","SCARLET FANTASIA","原曲維持","生演奏ギタードラムベースと男声ボーカルがんばれ。(Black)",
    6,,"<echo>PROJECT","えこぷろじぇくと","http://echoproject.3rin.net/","ダンス、ポップス、ロック","女","eclat:","原曲重視~維持","女性ボーカルを軸にしてジャンルは何でもアレンジ。(Black)",
    7,,"#039","しゃーぷさんきゅー","http://sharp039.web.fc2.com/",,,"EMERGENCE",,,
    8,,"#ゆうかりんちゃんねる","ゆうかりんちゃんねる","http://yuukach.web.fc2.com/","オーケストラ、クラシック、ピアノ、エレクトロ、ロック",,"ゴリラ人間のための華麗なる大幻想曲集","原曲維持","クラシカルな構成と管弦を散りばめたオーケストラ、クラシック風インスト。バイオリンの倍音の響きが印象的。(Black)",
    9,,"10-GALLON(Digit Smith)","てんがろん","http://10-gallon.net/","ロック、エレクトロ、ハードロック","女","悪魔城レミリア","原曲維持","原曲メロをミドルテンポなロックとエレクトロに乗せて。(Black)",
$ tail -1 Toho_arrange_circle_database-gwern.csv
    1316,,"侘助","わびすけ","http://ameblo.jp/wa-bi-su-ke/","エレクトロ、ロック","女","東方乙女椿","原曲維持","速めのエレクトロ、ロックアレンジが多め。女声ボーカル。(Black)",

I am a little surprised that there are only 1316 entries. Either I’ve overestimated their thoroughness or this is limited to a specific convention or something like that… Need to look into this more. This doesn’t include the track-level data I was hoping for, but a list of albums can still be useful for estimating completeness of capture.

Circle count

Crosscheck: number of circles at Comiket & Reitaisai? Circles self-classify by genre code, and this is recorded in the official Comiket catalogues, from which most (all?) of these numbers are drawn.

  • C87: 1756 https://media.8chan.co/2hu/src/1419976746668.png
  • C86: "Statistics taken from the C86 catalogue." http://ascii.jp/elem/000/000/917/917701/002_976x637.jpg https://i.imgur.com/R0uXljZ.jpg : ~1938 circles; later estimate, 1910 (https://webcatalog-free.circle.ms/Circle?genreCode=241&day=2 & i2.kym-cdn.com); another estimate, 1918 (https://media.8chan.co/2hu/src/1419976746668.png)
  • C85: "Grey: C85 statistics" https://i.imgur.com/R0uXljZ.jpg : 2258 / 2272 (http://www.crunchyroll.com/anime-news/2013/11/01-1/top-doujinshi-events-most-popular-by-the-numbers original: http://yaraon.blog109.fc2.com/blog-entry-19664.html) / 2246 (according to i2.kym-cdn.com)
  • C84: 2526 (according to i2.kym-cdn.com)
  • C83: 2492 (according to i2.kym-cdn.com)
  • C82: Touhou=2670 / 2694 (according to i2.kym-cdn.com)
  • C81: Touhou=2690 / 2656 (according to i2.kym-cdn.com)
  • C80: Touhou=2808 / 2774 (according to i2.kym-cdn.com) & http://d.hatena.ne.jp/myrmecoleon/20131101/1383333933
  • http://www.sankakucomplex.com/2008/08/02/touhou-takeover/ (https://www.google.com/search?num=100&q=touhou%20comiket%20circle%20genre%20site%3Asankakucomplex.com%20-site%3Awww.sankakucomplex.com%2Ftag%2F) original: https://web.archive.org/web/20110722080729/http://addb.jp/index.php?Diary%2F2008-07-30 :
  • C75-C79: (http://i2.kym-cdn.com/photos/images/original/000/866/132/075.jpg unknown source)
  • C75: 1387+1356 (https://web.archive.org/web/20110812023434/http://addb.jp/index.php?Diary%2F2008-12-17)
  • C76: 1739 (http://d.hatena.ne.jp/myrmecoleon/20101230/1293717241)
  • C77: 2372 (d.hatena.ne.jp)
  • C78: 2416+2394 (d.hatena.ne.jp)
  • C79: 2774 (d.hatena.ne.jp)
  • C74: 885
  • C73: 793
  • C72: 558
  • C71: 574
  • C70: 366
  • C69: 232
  • C68: 229
  • C67: 98
  • C66: 50
  • C65: 7 / 39 (en.touhouwiki.net)
  • C64: 0 / 12 (en.touhouwiki.net)
  • C63: 0 / 1 (according to http://en.touhouwiki.net/wiki/Release_Timeline#2002 ; asked about discrepancy http://en.touhouwiki.net/index.php?title=Talk:Release_Timeline&curid=46413&diff=331144&oldid=262193 )

Reitaisai (Hakurei Jinja Reitaisai / 博麗神社例大祭):

  • 2004-4-18 R1: 114
  • 2005-5-4 R2: 362
  • 2006-5-21 R3: 680
  • 2007-5-20 R4: 653
  • 2008-5-25 R5: 1086
  • http://thwiki.cc/%E4%BE%8B%E5%A4%A7%E7%A5%AD#.E5.8E.86.E5.B1.8A.E4.BF.A1.E6.81.AF :
  • 2009-3-8 R6: 2948
  • 2010-3-14 R7: 4050
  • 2011-5-8 R8: 4940 (see Wikipedia)
  • 2012-5-27 R9: 5058
  • 2013-5-26 R10: 5013
  • 2014-5-11 R11: 4312
  • 2015-5-10 R12: ?
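As a quick sanity check on the figures above, the unambiguous counts can be typed into R directly. The sketch below uses only the single-valued Comiket entries (C66-C74) and the Reitaisai series; the variable and function names are mine, and events with several competing estimates are left out rather than guessed at.

```r
# Touhou circle counts per Comiket, C66-C74 (single-valued entries only)
comiket <- c(C66 = 50, C67 = 98, C68 = 229, C69 = 232, C70 = 366,
             C71 = 574, C72 = 558, C73 = 793, C74 = 885)
# Reitaisai counts, R1-R11, as listed above
reitaisai <- c(R1 = 114, R2 = 362, R3 = 680, R4 = 653, R5 = 1086, R6 = 2948,
               R7 = 4050, R8 = 4940, R9 = 5058, R10 = 5013, R11 = 4312)

# event-over-event ratios: values above 1 mean growth, below 1 mean decline
growth <- function(x) x[-1] / x[-length(x)]
round(growth(comiket), 2)
round(growth(reitaisai), 2)

# which Reitaisai had the most circles?
names(which.max(reitaisai))
```

Each ratio is labelled with the later event of the pair, so `growth(comiket)["C72"]` compares C72 against C71.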

http://www.comiket.co.jp/info-a/C77/C77CMKSymposiumPresentationEnglish.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080225.pdf / http://www.comiket.co.jp/info-a/WhatIsEng080528.pdf

  • C76: all=>35000
  • C75: all=>35000
  • C73: all=>35000
  • C72: all=>35000
  • pg5 of CMKSymposiumPresentationEnglish: circles graph, C1-C76
  • pg6 of WhatIsEng080225: C1-C73

Torrent

Music source: Touhou lossy music collection v.15.2 (derived from the Touhou lossless music collection), 265.2GB of 44421 tracks from 4952 albums produced by <1,264 groups or circles.

$ find ~/torrent/Touhou\ lossy\ music\ collection/ -type f -name "*.mp3" | wc -l
  44421
$ ls torrent/Touhou\ lossy\ music\ collection/ | wc -l
   1264
$ ls torrent/Touhou\ lossy\ music\ collection/*/ | wc -l
   7477

File name, music length, and metadata year (if any) are extracted using exiftool:

events:

  • 例大祭[,SP,SP2,2-9]: annual Reitaisai
  • M3*: annual Media Mix Market, eg. http://polymetrica.wordpress.com/2009/10/09/things-i-am-excited-about-04-m324/
  • C[63-82]: semi-annual Comiket
  • サンクリ[28-50]: annual? Sunshine Creation http://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%AA%E3%82%A8%E3%82%A4%E3%82%B7%E3%83%A7%E3%83%B3_%28%E5%90%8C%E4%BA%BA%E5%8D%B3%E5%A3%B2%E4%BC%9A%29
  • 東方紅楼夢[?2-8]: annual Koromu
  • 月の宴?2-5: annual? Feast of the Month
  • 紅のひろば?2-6: semiannual Red Square
  • 東方不敗小町?2-6, SP, ぷちこまち: Komachi
  • 杜の奇跡[15-16]
  • 東方杜郷想[2-3]
  • 幺樂団カァニバル!?2-3
  • 東方幻楽祭[2]: semiannual
  • コミコミ[12-14]
  • FF[9-17] ?
  • こみトレ[12-17]
  • COMIC1☆2-6
  • COMIC CITY大阪[63,73]
  • 恋魔理?2-3
  • 東方椰麟祭?2-3
  • 東方名華祭2

exiftool; json

Is exiftool’s length approximation trustworthy? Yes, in the sample below it is always within about a second of the full mp3info answer:

$ find "/home/gwern/torrent/Touhou lossy music collection/" -type f -name "*.mp3" \
        -exec mp3info -F -p "0:%02m:%02s " {} \; -exec exiftool -Duration {} \;
0:00:23 Duration                        : 23.12 s (approx)
0:04:28 Duration                        : 0:04:28 (approx)
0:02:44 Duration                        : 0:02:44 (approx)
0:04:52 Duration                        : 0:04:52 (approx)
0:03:56 Duration                        : 0:03:56 (approx)
0:01:44 Duration                        : 0:01:44 (approx)
0:04:34 Duration                        : 0:04:34 (approx)
0:02:30 Duration                        : 0:02:30 (approx)
0:03:02 Duration                        : 0:03:02 (approx)
0:03:43 Duration                        : 0:03:42 (approx)
0:03:23 Duration                        : 0:03:23 (approx)
0:03:11 Duration                        : 0:03:11 (approx)
0:04:22 Duration                        : 0:04:22 (approx)
0:03:13 Duration                        : 0:03:13 (approx)
0:04:04 Duration                        : 0:04:04 (approx)
0:03:58 Duration                        : 0:03:57 (approx)
0:05:24 Duration                        : 0:05:24 (approx)
0:04:17 Duration                        : 0:04:17 (approx)
0:03:14 Duration                        : 0:03:14 (approx)
0:01:59 Duration                        : 0:01:58 (approx)
0:03:21 Duration                        : 0:03:21 (approx)
0:02:35 Duration                        : 0:02:35 (approx)
0:04:20 Duration                        : 0:04:20 (approx)
...
# generate and parse and cleanup data
#
# takes ~30m:
# R> system("exiftool -extension mp3 -json -forcePrint
#           -Title -Year -Album -Artist -Duration -Genre -Track -Directory -FileName -FileSize -AudioBitrate
#           ~/torrent/Touhou\\ lossy\\ music\\ collection/*/* > ~/touhou.json")
library(rjson)
# download from https://www.gwern.net/docs/touhou/2013-torrent.json.xz and decompress with xz
json_data <- fromJSON(paste(readLines("2013-gwern-touhoutorrent.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=12, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
touhou$Directory <- sub("/home/gwern/torrent/Touhou lossy music collection/", "",
                         as.character(touhou$Directory))
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA
touhou$Year <- as.integer(as.character(touhou$Year))
# torrent doesn't cover 2013 music, and music predating the PC-98 games doesn't exist...
touhou$Year[touhou$Year<1990] <- NA
touhou$Year[touhou$Year>2012] <- NA
# Genre is "None" or " "? both useless and false (thanks, tagger); so it goes too:
touhou$Genre[touhou$Genre=="None"] <- NA
touhou$Genre[touhou$Genre==" "] <- NA
# turn the track lengths and bitrates into usable numbers (seconds and kbps, respectively; file sizes become MB further down)
touhou$Length <- gsub(" \\(approx\\)","",as.character(touhou$Length))
touhou$AudioBitrate <- as.integer(sub(" kbps","",as.character(touhou$AudioBitrate)))
# exiftool leaves us "16 s"; if so, strip the " s" and turn it into an integer
# else, eg. "0:04:37"; split on colon,
# multiply hour by 3600 seconds, minutes=60 each, seconds=seconds; and sum it
interval <- function(x) { if (!is.na(x)) { if (grepl(" s",x)) as.integer(sub(" s","",x))
                                           else { y <- unlist(strsplit(x, ":"));
                                                  as.integer(y[[1]])*3600 + as.integer(y[[2]])*60 + as.integer(y[[3]]); }
                                                  }
                          else NA
                          }
touhou$Length <- sapply(touhou$Length,interval)
filesize <- function(x) { if (grepl(" kB",x)) (as.integer(sub(" kB","",x))/1000) else as.integer(sub(" MB","",x))}
touhou$FileSize <- sapply(touhou$FileSize, filesize)
# Serious work: turn the encoded information in Directory into usable columns. Not for the faint of heart.
#
# The Directory column looks like "[twith1450]/2009.03.08 TOHOMOHO [例大祭6]"
# The schema here is "[circle]/eventDate album [event]"
#
# "[Angelic Quasar]/2006.01.29 [AQSH-0003] Racial Ethnic Nation"
# "[Alstroemeria Records]/[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
# "[Aqua Style/ひえろぐらふ]/2010.05.24 [AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
brackets <- function(b) sub("\\]","", sub("\\[","",b))
# easy first step: parse out the leading group/circle (always there, terminated by forward-slash) as new column
touhou$Circle <- sapply(touhou$Directory, function(x) brackets(unlist(strsplit(as.character(x), "/"))[1]))
# destructively update by removing the group/circle, to make the next step easier
# this makes Directory looks like "2009.06.07 [PAER-0007] #01 -LILITH- [東方幻楽祭2]"
# or "[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)"
touhou$Directory <- sapply(touhou$Directory, function(x) unlist(strsplit(as.character(x), "/"))[2])
touhou$Date <- as.Date(sapply(touhou$Directory, function(x) substring(x, 1, 10)), format="%Y.%m.%d")
# and like before, we strip the event date that we've parsed out, leaving eg. "ピアノのための東方小品集 Op.1-1 [御射宮司祭]"
# or "月遊 [例大祭8]" or "[AQUA-0031] 春宵一刻値千金 -シュンショウイッコク アタイセンキン-"
touhou$Directory <- sapply(touhou$Directory, function(x) substring(x, 12))
# extracting the next parameter, the event the album was released at, is harder still
library(stringr)
# if the directory does not end in a right-bracket, there's no event info and we should bomb out
# else, grab w/regexp last pair of brackets with a space before (excludes any album numbering schemes) & trim
# that didn't work? then it must be one of the directories where there's no space before the bracketed event, retry without leading space
touhou$Event <- sapply(touhou$Directory, function(x) { if (str_sub(x,start=-1) == "]") { res <- brackets(unlist(str_split(x, " \\["))[2]); if (!is.na(res)) res else brackets(unlist(str_split(x, "\\["))[2]) } else x})
# if you examine the Event column, it's full of wrong entries. I have made a list of 20 event-prefixes (I hope), which
# we'll use as a whitelist by erasing anything which lacks all of the 20 prefixes.
isPrefix <- function(x,y) grepl(paste0("^",x), y)
events <- c("例大祭","M3","C","サンクリ","東方紅楼夢","月の宴","紅のひろば","東方不敗小町","杜の奇跡","東方杜郷想",
            "幺樂団カァニバル!","東方幻楽祭","コミコミ","FF","こみトレ","COMIC1","COMIC CITY大阪","恋魔理","東方椰麟祭","東方名華祭")
touhou$Event <- sapply(as.character(touhou$Event),
                       function(target) if (sum(sapply(events, function(e) isPrefix(e,target))) != 0) target else NA)
# this whitelist covers almost the entire sample, so I think it works well:
## R> sum(!is.na(touhou$Event))
## [1] 39190
## R> length(touhou$Event)
## [1] 41866
#
# one final thing, since (almost) all directories had Dates while not all files had Years: overwrite Year with the
# year of the Date we just extracted (note this replaces every Year, not only the missing ones)
touhou$Year <- as.integer(format(touhou$Date, "%Y"))
touhou$Directory <- NULL # clean up (rm() only removes whole objects, not data-frame columns)
# escape with the loot:
write.csv(touhou, file="2013-gwern-touhoumusic-torrent.csv", row.names=FALSE)
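The directory schema described in the comments above ("[circle]/eventDate album [event]") can also be pulled apart with regular expressions. The following is a sketch, not the code that produced the CSV: `parse_directory` is a hypothetical helper, and the last example string (with a `[C77]` event) is invented for illustration.

```r
# Sketch: parse "[circle]/YYYY.MM.DD album [event]" directory names.
parse_directory <- function(d) {
  # circle: everything inside the leading brackets, up to the first "]/"
  circle <- sub("^\\[(.*?)\\]/.*$", "\\1", d, perl = TRUE)
  rest   <- sub("^\\[.*?\\]/", "", d, perl = TRUE)
  # date: a leading YYYY.MM.DD followed by a space, if present
  has_date <- grepl("^[0-9]{4}\\.[0-9]{2}\\.[0-9]{2} ", rest)
  date <- as.Date(ifelse(has_date, substring(rest, 1, 10), NA_character_),
                  format = "%Y.%m.%d")
  # strip the date only when one was actually found
  rest <- ifelse(has_date, substring(rest, 12), rest)
  # event: a trailing "[...]" group, if present
  has_event <- grepl("\\[[^]]*\\]$", rest)
  event <- ifelse(has_event, sub("^.*\\[([^]]*)\\]$", "\\1", rest), NA_character_)
  data.frame(circle, date, event, stringsAsFactors = FALSE)
}

examples <- c("[twith1450]/2009.03.08 TOHOMOHO [例大祭6]",
              "[Angelic Quasar]/2006.01.29 [AQSH-0003] Racial Ethnic Nation",
              "[Alstroemeria Records]/[ARCD0001] The regret of stars, but stars shine bright (C65) (mp3)",
              "[Some Circle]/2009.12.30 [XX-0001] Some Album [C77]")  # invented example
parsed <- parse_directory(examples)
parsed
```

Matching up to the first "]/" rather than splitting on "/" keeps circle names that themselves contain a slash, such as "[Aqua Style/ひえろぐらふ]", in one piece, and undated directories are left untruncated.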

VGMdb

The Touhou project page turns out to be incomplete: each entry had to be manually annotated as related to Touhou. I was pointed to a search query which turned up many more results by looking for any page with the string Touhou in the games field.

The VGMdb administrators kindly gave me read-only access to their MySQL databases. I grabbed the entirety of the tables vgmdb_albums and vgmdb_tracks from the main VGMdb database; I exported them as 2 CSV files with comma separators, renamed 2013-vgmdb-albums.csv and 2013-vgmdb-tracks.csv. Before loading the exports, I had to delete all escaped quotes; the default R CSV parsing doesn’t handle them. The track rows are 1 track with an album ID, so to turn each track record/row into an equivalent of the torrent rows, I need to fill in based on the album table.

albums <- read.csv("https://www.gwern.net/docs/touhou/2013-vgmdb.csv")
albums <- with(albums,
           data.frame(albumid,reldate,publisher,game,albumtitles))
albums <- albums[grepl("ouhou", albums$game),]; albums$game <- NULL

tracks <- read.csv("2013-vgmdb-tracks.csv")
# drop unneeded columns (rm() cannot remove data-frame columns)
tracks <- tracks[, !(names(tracks) %in% c("tracklistid", "trackid", "disctitle", "disc"))]
tracks$length[tracks$length==0] <- NA # 0 is the default in the VGMdb schema
tracks <- tracks[tracks$albumid %in% albums$albumid,]
touhou <- merge(tracks, albums)
touhou$albumid <- NULL; rm(albums, tracks)
# deal with the 41 dates with the format "2005-09-00" (the 0th months or days are not real...)
# gsub rather than sub, so that a date with both month and day zeroed is repaired too
touhou$date <- as.Date(gsub("-00","-01",as.character(touhou$reldate)))
touhou$reldate <- NULL
touhou$year <- as.integer(format(touhou$date, "%Y"))
# upcase, rearrange to torrent's order
colnames(touhou) <- c("Track","Title","Length","Circle","Album","Date","Year")
touhou <- touhou[c(2,7,5,3,1,4,6)]
write.csv(touhou, file="2013-vgmdb-touhou.csv", row.names=FALSE)
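The album/track join and the date repair above can be tried on toy data without database access. This is a sketch under assumptions: the rows are invented, and the `track` and `title` column names are my guesses at the track table's layout (only `albumid`, `length`, `reldate`, and `publisher` are taken from the code above).

```r
# toy stand-ins for the two VGMdb tables (invented rows)
albums <- data.frame(albumid   = c(1, 2),
                     reldate   = c("2005-09-00", "2007-03-23"),
                     publisher = c("Circle A", "Circle B"),
                     stringsAsFactors = FALSE)
tracks <- data.frame(albumid = c(1, 1, 2, 3),
                     track   = c(1, 2, 1, 1),
                     title   = c("a", "b", "c", "d"),
                     length  = c(240, 0, 187, 200),
                     stringsAsFactors = FALSE)

tracks$length[tracks$length == 0] <- NA  # 0 is the schema default, ie. unknown
joined <- merge(tracks, albums)          # inner join on the shared albumid column

# zeroed month/day fields are not valid dates, so map them to the 1st
joined$date <- as.Date(gsub("-00", "-01", joined$reldate))
joined$year <- as.integer(format(joined$date, "%Y"))
joined
```

Because `merge` defaults to an inner join, the track belonging to album 3 (which has no album row) is dropped, mirroring the `%in%` filter in the script.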

touhouwiki.net

Personal downloads

4chan /jp/ C83 threads

A loose group of 4chan users on the /jp/ subforum collaborate each Comiket to upload and distribute doujin manga, games, and music released at that Comiket; some are uploaded by Comiket attendees, some are bought from resellers like Comic Toranoana, and many files are harvested from Japanese P2P filesharing networks like Winny/Share/Perfect Dark. I compiled a list of ~400 files from the /r/TouhouMusic C83 thread (principally from the 4chan links) & the blog All Doujin Music and gradually downloaded them from January to March 2013. After dead links, I was left with 400-500 files. Many are not music, or even Touhou-related, so I hand-filtered albums, looking for signs of being Touhou doujin works (credits to ZUN, Touhou characters in the artwork, themes I recognized as Touhou, etc); when I was not sure, I erred on the side of exclusion. The final compilation yielded 3503 files (evenly split: 1776 Touhou vs 1728 other) with 953 Touhou music files.

# exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory
# -FileName -FileSize -NominalBitrate -Date -recurse
# ~/c83/touhou/ ~/c83/touhou/*/** ~/c83/touhou/*/*/** ~/c83/touhou/*/*/*/** > ~/2013-c83-downloads.json

library(rjson)
# download from https://www.gwern.net/docs/touhou/2013-torrent.json.xz and decompress with xz
json_data <- fromJSON(paste(readLines("2013-c83-downloads.json"), collapse=""))
touhou <- data.frame(matrix(unlist(json_data), ncol=13, byrow=TRUE))
colnames(touhou) <- c("SourceFile", "Title", "Year", "Album", "Artist", "Length", "Genre",
                      "Track", "Directory", "FileName", "FileSize", "AudioBitrate", "Date")
# Delete SourceFile column; redundant with Directory/FileName
touhou <- touhou[,-1]
for (filter in c("/home/gwern/c83/touhou/", "\\[touhou.vnsharing.net\\]", " \\(320K\\+BK\\)", "/mp3",
                 " MP3v0", " v0", " \\(flac\\+scans\\)", " \\(128K\\)", " \\(V0\\)", " \\(320\\)",
                 " \\(mp3 320\\)", " \\(v0\\+jpg\\)"))
 { touhou$Directory <- sub(filter, "", as.character(touhou$Directory)) }
touhou[touhou==""] <- NA
touhou[touhou=="-"] <- NA

Playback length:

find c83/touhou/ -name "*.ogg" -exec ogginfo {} \;|fgrep "Playback length"

4chan /jp/ C84 threads

591 Touhou music files:

/docs/touhou/2013-c84-downloads.json

4chan /jp/ C85 threads

449 Touhou music files:

/docs/touhou/2013-c85-download.json

4chan /jp/ Reitaisai 10 threads

As above, I drew on the /r/TouhouMusic discussion and manually pruned duplicates & non-Touhou files, leaving 491 music files.

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads.json

Reitaisai 10 torrent

In May & June 2013, an anonymous person compiled 67 albums and released two torrents of Reitaisai 10 albums (vol. 1, vol. 2).

exiftool -extension ogg -json -forcePrint -Title -Year -Album -Artist -Duration -Genre -TrackNumber -Directory -FileName -FileSize -NominalBitrate -Date */*.ogg > ~/2013-reitaisai-downloads-torrent.json

Analysis

touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
                   colClasses=c("character", "integer", "factor", "character", "integer", "factor",
                                "character", "character", "numeric", "integer", "factor", "Date"))

# do stuff with the data
# general correlations
t <- data.frame(touhou$Year, touhou$Length, touhou$FileSize, touhou$AudioBitrate)
cor(t,use="pairwise.complete.obs")
#                     touhou.Year touhou.Length touhou.FileSize touhou.AudioBitrate
# touhou.Year
# touhou.Length          -0.01091
# touhou.FileSize         0.04484       0.93915
# touhou.AudioBitrate     0.19188       0.11091         0.35499

# test the correlation between higher bitrate and larger files:
cor.test(touhou$FileSize, touhou$AudioBitrate)

# the genre metadata is useless!
sort(table(touhou$Genre), decreasing=TRUE)

# boxplot avg length per year
plot(touhou$Length ~ factor(touhou$Year))

Economics modeling:

jpn <- read.csv(stdin(),header=TRUE)
DATE,VALUE
2000-01-01,8.9
2001-01-01,9.1
2002-01-01,9.5
2003-01-01,9.6
2004-01-01,9.0
2005-01-01,8.1
2006-01-01,7.5
2007-01-01,7.5
2008-01-01,7.0
2009-01-01,8.9
2010-01-01,9.0
2011-01-01,8.1

# number of works per year does not correlate:
cor.test(jpn$VALUE[3:12], table(touhou$Year)[1:10])
    Pearson's product-moment correlation

data:  jpn$VALUE[3:12] and table(touhou$Year)[1:10]
t = -0.3053, df = 8, p-value = 0.768
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
 -0.6903  0.5602
sample estimates:
    cor
-0.1073

model <- lm(table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12]); summary(model)
Call:
lm(formula = table(touhou$Year)[1:10] ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
    Min      1Q  Median      3Q     Max
-2128.6  -716.4    52.4   632.6  2253.4

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -2775278     342171   -8.11  8.3e-05
c(2002:2011)        1379        170    8.13  8.2e-05
jpn$VALUE[3:12]     1450        567    2.56    0.038

Residual standard error: 1400 on 7 degrees of freedom
Multiple R-squared: 0.905,  Adjusted R-squared: 0.878
F-statistic: 33.5 on 2 and 7 DF,  p-value: 0.00026



logModel <- lm(log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])
summary(logModel)
Call:
lm(formula = log(table(touhou$Year)[1:10]) ~ c(2002:2011) + jpn$VALUE[3:12])

Residuals:
   Min     1Q Median     3Q    Max
-1.218 -0.632  0.108  0.551  0.982

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1295.060    207.535   -6.24  0.00043
c(2002:2011)        0.652      0.103    6.34  0.00039
jpn$VALUE[3:12]    -0.725      0.344   -2.11  0.07301

Residual standard error: 0.849 on 7 degrees of freedom
Multiple R-squared: 0.906,  Adjusted R-squared: 0.879
F-statistic: 33.8 on 2 and 7 DF,  p-value: 0.000253

plot(c(2002:2011),table(touhou$Year)[1:10])
points(c(2002:2011),exp(predict(logModel)),type='l',col='blue')
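The correlation and the linear model above can be reproduced without the full torrent CSV, using the FRED values and the per-year track counts tabulated under "Growth over time". A minimal sketch (the variable names are mine):

```r
# FRED youth unemployment (%), 2002-2011
unemployment <- c(9.5, 9.6, 9.0, 8.1, 7.5, 7.5, 7.0, 8.9, 9.0, 8.1)
# tracks per year in the torrent, 2002-2011
tracks <- c(13, 23, 255, 1241, 2070, 2599, 5073, 10278, 9765, 7494)
years  <- 2002:2011

# raw correlation, ignoring the time trend
r <- cor(unemployment, tracks)
round(r, 4)

# same linear model as above: time trend plus unemployment
fit <- lm(tracks ~ years + unemployment)
round(coef(fit))
```

The raw correlation is weak and negative, while the unemployment coefficient turns positive once the year trend is included; with only 10 annual observations, neither should be leaned on heavily.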

Growth over time

How fast is the corpus of Touhou music growing?

Constant growth model: the first game was released in 1996, no? So that gives 17 years to accumulate 1.26TB or 1,260GB, or 74.1GB per year. The screenshot shows a download speed of 0kb/s, which is not useful, but it also says 2640 days left, so we can estimate that he’s downloading at ~0.48GB per day (1260/2640); over a year that is ~174GB, which is 2.35x faster than the 74GB per year. So at that annual increase, OP is not doomed and can in fact catch up.
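The arithmetic of the constant-growth model can be spelled out as below. The catch-up time at the end is my own extension of the estimate: it assumes the download starts from zero today and that both rates stay constant.

```r
total_gb  <- 1260   # 1.26TB accumulated so far
years     <- 17     # 1996 to 2013
days_left <- 2640   # ETA reported in the screenshot

growth_per_year   <- total_gb / years         # new music per year
download_per_day  <- total_gb / days_left     # implied download speed
download_per_year <- download_per_day * 365
ratio <- download_per_year / growth_per_year  # how much faster than growth

# Catch-up time t solves: download_per_year * t = total_gb + growth_per_year * t
catch_up_years <- total_gb / (download_per_year - growth_per_year)

round(c(growth_per_year = growth_per_year, download_per_year = download_per_year,
        ratio = ratio, catch_up_years = catch_up_years), 2)
```

Under these assumptions the downloader needs roughly 12.6 years to draw level with a corpus that keeps growing at the historical average rate.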

Exponential growth model: a little trickier, since we can’t derive a formula just from the cumulative total and elapsed time. I need more data. So using my 2012 Touhou Lossy Torrent data, I can try to regress an exponential against the annual count… but wait! The amount of music does not seem to be increasing exponentially!

R> touhou <- read.csv("https://www.gwern.net/docs/touhou/2013-torrent.csv",
+                    colClasses=c("character", "integer", "factor", "character",
+                                         "integer", "factor", "character",
+                                         "character", "numeric", "integer",
+                                         "factor", "Date"))
R> summary(touhou$Year)
R> perYear <- table(touhou$Year); perYear; plot(perYear)

 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
   13    23   255  1241  2070  2599  5073 10278  9765  7494  2999

Graph: http://i.imgur.com/23fMA5c.png

Looks like Touhou music’s growth peaked in 2009; this might reflect the torrent’s incompleteness, except the torrent is from 2012, and you’d expect coverage of 2010 or 2011 to be pretty good by that point. So the growth of the torrent overall looks more like a sigmoid or log:

R> runningTotal <- cumsum(table(touhou$Year)); runningTotal; plot(runningTotal)
 2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012
   13    36   291  1532  3602  6201 11274 21552 31317 38811 41810

http://i.imgur.com/Lv9vHrZ.png
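One way to see why an exponential fits poorly is to look at year-over-year ratios, which would stay roughly constant (and above 1) under exponential growth. A sketch using only the per-year counts printed above:

```r
perYear <- c("2002" = 13, "2003" = 23, "2004" = 255, "2005" = 1241, "2006" = 2070,
             "2007" = 2599, "2008" = 5073, "2009" = 10278, "2010" = 9765,
             "2011" = 7494, "2012" = 2999)

# ratio of each year's track count to the previous year's
ratio <- perYear[-1] / perYear[-length(perYear)]
round(ratio, 2)

# peak year and running total
names(which.max(perYear))
cumsum(perYear)
```

The ratios swing widely in the early years and fall below 1 from 2010 onward, which is what the flattening cumulative curve reflects.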

So probably it’d be better to ask, if 2012-rate growth continues, what is the ratio to his download speed?

2012 added 19.5GB to the torrent:

R> # FileSize is in megabytes, so divide by 1000 to get gigabytes
R> sum(touhou[touhou$Year==2012,]$FileSize, na.rm=TRUE) / 1000
[1] 19.5
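Putting the 2012 figure next to the download estimate gives the ratio asked about above. The second calculation is a rough adjustment of my own: it assumes the 1.26TB collection is the lossless counterpart of the 265.2GB lossy torrent, which these notes do not state, and scales the 2012 figure accordingly.

```r
download_per_year <- 1260 / 2640 * 365  # GB/year, from the ETA estimate earlier
added_2012        <- 19.5               # GB added to the lossy torrent in 2012

# naive ratio: download speed vs. 2012 growth as measured in the lossy torrent
naive_ratio <- download_per_year / added_2012

# ASSUMPTION: 1260GB (lossless) and 265.2GB (lossy) describe the same music,
# so lossy sizes can be scaled up by their ratio
lossless_per_lossy <- 1260 / 265.2
adjusted_ratio <- download_per_year / (added_2012 * lossless_per_lossy)

round(c(naive = naive_ratio, adjusted = adjusted_ratio), 2)
```

Either way the download outpaces 2012-rate growth, though by a much smaller margin once the size difference between formats is allowed for; the torrent's 2012 coverage may also be incomplete.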

Appendix

touhouwiki.net scraping code

Uses the TagSoup and split packages (plus HTTP and utf8-string for fetching and decoding pages); emits CSV to standard out. It is a pile of kludges which I am a little ashamed to post publicly, and may not work for you.

import Text.HTML.TagSoup (fromAttrib, fromTagText, isTagOpen, isTagText,
                          parseTags, (~/=), Tag(TagComment, TagOpen,TagText))
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isInfixOf, isPrefixOf, nub, sort, stripPrefix)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)
import Data.Maybe (fromMaybe, isJust, listToMaybe)
import Data.List.Split (keepDelimsL, split, whenElt)
import Control.Monad (join, unless)

main :: IO ()
main = do albums1 <- getAlbums "http://en.touhouwiki.net/wiki/List_by_Groups"
          albums2 <- getAlbums "http://en.touhouwiki.net/wiki/List_of_Old_Touhou_Arrangement_Groups"
          let albumURLs = map ("http://en.touhouwiki.net"++) $ nub $ sort $ albums1 ++ albums2
          putStrLn header
          mapM_ getAlbum albumURLs

getAlbums :: String -> IO [String]
getAlbums index = do touhou <- openURL index
                     return $ drop 22 $ reverse [link | (TagOpen "a" ((ref, link):_)) <- parseTags touhou,
                                                        ref=="href",
                                                        "/wiki/" `isPrefixOf` link,
                                                        let fltr x = not (x `isInfixOf` link),
                                                        fltr "_Groups", fltr "Touhou_Wiki:", fltr "Special:",
                                                        fltr "Template:", fltr "Category:", fltr "Talk:"]

type Album = [Track]
data Track = Track { title :: String, year :: Int, date :: String,artist :: String,
                     album :: String, event :: String, circle :: String, duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",event="",circle="",artist="",duration=Nothing,track=0}

header :: String
header = "Title,Year,Date,Album,Event,Circle,Duration,Track"
convert :: Track -> String
convert t = "\"" ++ dequote (title t) ++ "\"," ++ show (year t) ++ "," ++ show (date t) ++ ",\"" ++ dequote (album t) ++ "\",\"" ++
             event t ++ "\",\"" ++ circle t ++ "\"," ++ maybe "" show (duration t) ++ "," ++ show (track t)

-- You are not expected to understand this.
getAlbum :: String -> IO ()
getAlbum a = do t <- fmap parseTags $ openURL a
                --TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2007-03-23"
                let dt = let target = dropWhile (TagText "Released" ~/=) t
                         in if null target then ""
                            else deleteParens $ fromTagText $ head $ tail $ filter isTagText target
                -- TagText "Released",TagClose "th",TagOpen "td" [],TagText "\n2009/02/08 (",
                -- TagOpen "a" [("href","/wiki/Category:Sunshine_Creation_42"),("title","Category:Sunshine Creation 42")],
                -- TagText "Sunshine Creation 42",TagClose "a",TagText ")",TagClose "td"
                let evnt = let stream = filter isTagText $ dropWhile (TagText "Released" ~/=) t
                           in if '(' == last (fromTagText $ head $ take 5 $ drop 1 stream) -- )
                              then fromTagText (filter isTagText (dropWhile (TagText "Released" ~/=) t) !! 2)
                              else ""
                -- "2007-03-23"
                let yr = if null dt then 0 else read (take 4 dt)::Int
                -- TagText "Album by CODE ZTS LABEL"
                let hasCrcl = [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
                unless (null hasCrcl || (yr==0 && null dt)) $ do
                    let crcl = lookForCircle t
                    -- TagText "Selfregards2 - Touhou Wiki - Characters, games, locations, and more"
                    let albm = reverse $ drop 55 $ reverse $ head [al | TagText al <- t,
                                 " - Touhou Wiki - Characters, games, locations, and more" `isInfixOf` al]
                    let dflt = empty { date = dt, year = yr, circle = crcl, album = albm, event = evnt }
                    let table = filter (\x -> not ("Disc" `isInfixOf` x || " CD" `isInfixOf` x)) $
                                 filter (not . all isSpace) $
                                  map (trim . fromTagText) $ filter isTagText $
                                   drop 5 $ takeWhile (TagOpen "table"
                                     [("class","navbox"),("cellspacing","0"),
                                      ("style","background:#FFFBEE;border-color:#A8A077;")] ~/=) $
                                       takeWhile (TagComment "" ~/=) $
                                        dropWhile (TagOpen "span"
                                         [("class","mw-headline"),("id","Tracks")] ~/=) t
                    let tracks = filter (not . null) $
                                  split (keepDelimsL $ whenElt
                                   (\x -> length x==3 && "." `isInfixOf` x &&  isJust(maybeRead x :: Maybe Int))) table
                    mapM_ (putStrLn . convert . trackToTrack dflt) tracks

-- TagText "Album by ",TagOpen "a"
-- [("href","/wiki/ALiCE%27S_EMOTiON"),("title","ALiCE'S EMOTiON")],TagText
-- "ALiCE'S EMOTiON",TagClose "a"
lookForCircle :: [Tag String] -> String
lookForCircle t = if "(page does not exist)" `isInfixOf` res
                  then trim (takeWhile (/='(') res) -- )
                  else res
  where c = head [cl | TagText cl <- t, "Album by " `isPrefixOf` cl]
        -- the name is either inline ("Album by X") or in the link following a bare "Album by "
        res | c /= "Album by " = drop 9 c
            | otherwise = case drop 2 (dropWhile (TagText "Album by " ~/=) t) of
                            (tg:_) | isTagText tg -> fromTagText tg
                                   | isTagOpen tg -> fromAttrib "title" tg
                            _ -> ""

-- ["01.","The mom","(04:07)","arrangement: ZTS","composition: ZTS","original title: The mom","source: Parhelia"]
trackToTrack :: Track -> [String] -> Track
trackToTrack tr t = tr { track = fromMaybe 0 (maybeRead (head t) :: Maybe Int),
                         title = if length t >= 2 then t !! 1 else "", -- a lone cell has no title
                         duration = if length t >= 3 then Just (timeConverter $ deleteParens (t !! 2)) else Nothing,
                         artist = lookForAnArtist t }

lookForAnArtist :: [String] -> String
lookForAnArtist t = let targets = dropWhile (\x -> not ("arrangement:" `isPrefixOf` x || "composition:" `isPrefixOf` x)) t
                        -- a bare label such as "arrangement:" means the name sits in the next cell
                        target = case targets of
                                   []      -> ""
                                   (x:y:_) | last x == ':' -> x ++ y
                                   (x:_)   -> x
                    -- accept either prefix; chaining the two stripPrefix calls would demand both at once
                    in trim $ fromMaybe (fromMaybe "" (stripPrefix "composition:" target)) (stripPrefix "arrangement:" target)

-- utility functions
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)
deleteParens, trim, dequote :: String -> String
deleteParens = trim . filter (\x -> x /= '(' && x /= ')')
trim = reverse . dropWhile isSpace . reverse . dropWhile isSpace
dequote = map (\x -> if x=='"' then '\'' else x) -- "')
timeConverter :: String -> Int
timeConverter n = let (m,s) = break (==':') n
                      m' = maybeRead m :: Maybe Int
                      s' = maybeRead (drop 1 s) :: Maybe Int
                  in (fromMaybe 0 m' * 60) + fromMaybe 0 s'
maybeRead :: Read a => String -> Maybe a
maybeRead = fmap fst . listToMaybe . reads
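
The duration handling above can be tried on its own. The following is a minimal sketch that copies `maybeRead`, `trim`, `deleteParens` and `timeConverter` from the scraper so it compiles alone; `cellToSeconds` is a hypothetical name introduced here for the composition that `trackToTrack` applies to the third cell of a track row.

```haskell
import Data.Char (isSpace)
import Data.Maybe (fromMaybe, listToMaybe)

-- Copied from the scraper above so that this block is self-contained.
maybeRead :: Read a => String -> Maybe a
maybeRead = fmap fst . listToMaybe . reads

trim, deleteParens :: String -> String
trim = reverse . dropWhile isSpace . reverse . dropWhile isSpace
deleteParens = trim . filter (\x -> x /= '(' && x /= ')')

-- Minutes and seconds around the first ':'; anything unparseable counts as 0.
timeConverter :: String -> Int
timeConverter n = let (m,s) = break (==':') n
                  in (fromMaybe 0 (maybeRead m) * 60) + fromMaybe 0 (maybeRead (drop 1 s))

-- Hypothetical helper (not in the original): a table cell such as "(04:07)" to seconds.
cellToSeconds :: String -> Int
cellToSeconds = timeConverter . deleteParens
```

In GHCi, `cellToSeconds "(04:07)"` evaluates to 247, and an empty or non-numeric cell evaluates to 0 rather than raising an exception. A duration with an hours field would be misread, since only the first two colon-separated numbers are consulted.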

VGMdb scraping code

The following is a buggy program for scraping Touhou albums from VGMdb: it works on a limited subset of album pages but has an unknown number of fatal bugs. I abandoned it once I was offered read-only database access, which is what I actually used to get my VGMdb data. I keep it here in case I ever need to go back to scraping.

import Text.HTML.TagSoup (fromTagText, isTagOpenName, isTagText, Tag(TagOpen,TagText), parseTags)
import Network.HTTP (getResponseBody, getRequest, simpleHTTP)
import Data.List (isPrefixOf, sort)
import Data.Char (isSpace)
import Codec.Binary.UTF8.String (decodeString)

main :: IO ()
main = do albumsURLs <- getAlbums
          albums <- mapM openURL (sort albumsURLs)
          let metadata = map toAlbum albums
          writeFile "vgmdb.csv" $ unlines (header : concatMap (map convert) metadata)

type Album = [Track]
data Track = Track { title :: String,
                     year :: Int,
                     date :: String,
                     album :: String,
                     circle :: String,
                     duration :: Maybe Int,
                     track :: Int } deriving Show
empty :: Track
empty = Track {title="",year=0,date="",album="",circle="",duration=Nothing,track=0}
header :: String
header = "Title,Year,Date,Album,Circle,Duration,Track"
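convert :: Track -> String
convert t = quote (title t) ++ "," ++ show (year t) ++ "," ++ quote (date t) ++ "," ++
              quote (album t) ++ "," ++ quote (circle t) ++ "," ++ maybe "" show (duration t) ++ "," ++ show (track t)
  where -- quote a field, doubling embedded double quotes so they cannot end the field early
        quote s = "\"" ++ concatMap (\c -> if c == '"' then "\"\"" else [c]) s ++ "\""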

-- example album link: 'TagOpen "a" [("class","albumtitle album-doujin"),
--                                   ("href","http://vgmdb.net/album/36901"),
--                                   ("title","Majo to Ringo to Samayou Kimi to")]'
getAlbums :: IO [String]
getAlbums = do touhou <- openURL "http://vgmdb.net/product/9"
               -- match the attribute list directly: an <a> with fewer than two attributes is skipped, not a crash
               return [url | TagOpen "a" ((_,"albumtitle album-doujin"):(_,url):_) <- parseTags touhou]

-- need 'decodeString' to deal with Japanese glyphs; see http://stackoverflow.com/questions/10558003/how-to-get-utf8-rss-feed
openURL :: String -> IO String
openURL url = fmap decodeString (simpleHTTP (getRequest url) >>= getResponseBody)

toAlbum :: String -> Album
toAlbum page = let tags = parseTags page
                   (yr,dt) = extractDate tags
                   albm = extractAlbum tags
                   crcl = extractCircle tags
                   files = extractMusic tags
               in map (\t -> Track {title = title t, year = yr, date = dt,
                                    album = albm, circle = crcl, duration = duration t,
                                    track = track t}) files


-- TagOpen "a" [("title","View albums released on Dec 30, 2011"),("href","/db/calendar.php?year=2011&month=12#20111230")]
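extractDate :: [Tag String] -> (Int,String)
extractDate t = case [(ttle,href) | TagOpen "a" ((_,ttle):(_,href):_) <- t,
                                    "View albums released on " `isPrefixOf` ttle] of
                  -- year: last 4 characters of the title; date: the URL fragment after '#'
                  ((a,b):_) -> (yearOf a, drop 1 $ dropWhile (/='#') b)
                  []        -> (0,"") -- no release link found
        where yearOf :: String -> Int
              yearOf a = case reads (reverse $ take 4 $ reverse a) of
                           [(y,"")] -> y
                           _        -> 0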

-- TagOpen "title" [],TagText "Gensou Rashinban - VGMdb",TagClose "title",
extractAlbum :: [Tag String] -> String
extractAlbum t = case dropWhile (not . isTagOpenName "title") t of
                   (_:TagText x:_) -> reverse $ drop 8 $ reverse x -- drop the " - VGMdb" suffix
                   _               -> ""

-- [TagText "Published by",TagClose "b",TagClose "span",TagClose "td",TagText "\r\n",
-- TagOpen "td" [],TagOpen "a" [("href","/org/217")],TagOpen "span"
-- [("class","productname"),("lang","en"),("style","display:inline")],TagText "PopKorn"]
extractCircle :: [Tag String] -> String
extractCircle t = case drop 8 (dropWhile (/= TagText "Published by") t) of
                    (TagText c:_) -> c
                    _             -> "" -- page layout differs from the example above

extractMusic :: [Tag String] -> [Track]
extractMusic t = let tracks = filter (not . all isSpace) $
                               drop 1 $ dropWhile (/= "Disc 1") $
                                map fromTagText $ filter isTagText $
                                 takeWhile (/= TagText "Disc length") $
                                  dropWhile (/= TagText "Tracklist") t
                   -- heuristic: number/title/duration triples if the cell count divides by 3, else number/title pairs;
                   -- this misfires on duration-less albums whose track count is a multiple of 3
                   in if length tracks `rem` 3 == 0 then threezip tracks else twozip tracks
       where
       twozip,threezip :: [String] -> [Track]
       threezip (a:b:c:d) = empty {title=b,duration=Just (timeConverter c),track=readInt a} : threezip d
       threezip _ = []
       twozip (a:b:d) = empty {title=b,duration=Nothing,track=readInt a} : twozip d
       twozip _ = []
       timeConverter :: String -> Int
       timeConverter n = let (m,s) = break (==':') n in (readInt m * 60) + readInt (drop 1 s)
       -- total replacement for 'read': unparseable input becomes 0
       readInt :: String -> Int
       readInt x = case reads x of
                     ((i,_):_) -> i
                     _         -> 0
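
One fragile spot in `extractMusic` is deciding between pairs and triples from the total cell count. A possible alternative, sketched here and not something the original program does, is to start a new group at every cell that looks like a track number, much as the Touhou Wiki scraper splits on its numbered cells. The names `isTrackNo` and `groupTracks` are hypothetical, and the sketch assumes track-number cells are purely numeric (as `read a` in the original implies), so a title consisting only of digits would be mistaken for a track number.

```haskell
import Data.Char (isDigit)

-- Hypothetical helper: does this cell look like a track number such as "01" or "12"?
isTrackNo :: String -> Bool
isTrackNo s = not (null s) && all isDigit s

-- Start a new group at every track-number cell.
-- Cells before the first track number (e.g. a "Disc 1" label) are dropped.
groupTracks :: [String] -> [[String]]
groupTracks cells = case dropWhile (not . isTrackNo) cells of
  []     -> []
  (n:xs) -> let (rest, more) = break isTrackNo xs
            in (n : rest) : groupTracks more
```

For example, `groupTracks ["01","A","02","B","03","C"]` gives `[["01","A"],["02","B"],["03","C"]]`, a six-cell duration-less listing that the divisible-by-3 test would read as two triples. Each group can then be inspected individually for an optional duration cell.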