R directory


“The Most ‘Abandoned’ Books on GoodReads”, Branwen 2019

GoodReads: “The Most ‘Abandoned’ Books on GoodReads”, Gwern Branwen (2019-12-09):

Which books on GoodReads are most difficult to finish? Estimating proportions in December 2019 gives an entirely different result than absolute counts.

What books are hardest for a reader who starts them to finish, and most likely to be abandoned? I scrape a crowdsourced tag, abandoned, from the GoodReads book social network on 2019-12-09 to estimate the conditional probability of a book being abandoned.

The default GoodReads tag interface presents only raw counts of tags, not counts divided by total ratings ( = reads). This conflates popularity with probability of being abandoned: a popular but rarely-abandoned book may have more abandoned tags than a less popular but often-abandoned book. There is also residual error from the winner’s curse where books with fewer ratings are more mis-estimated than popular books. I fix that to see what more correct rankings look like.
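The difference between ranking by raw counts, raw proportions, and shrunken proportions can be illustrated with a minimal sketch. It assumes a simple beta-binomial shrinkage in which every book's abandon rate is pulled toward a common prior; the prior strength and the three books are made up for illustration and are not taken from the GoodReads scrape or from the essay's actual model.

```python
# Hypothetical (abandoned-tag count, total ratings) pairs; NOT real GoodReads data.
books = {
    "popular, rarely abandoned": (500, 1_000_000),
    "niche, often abandoned": (30, 1_000),
    "obscure, barely rated": (2, 10),
}


def raw_rate(abandons, ratings):
    """Abandoned tags divided by ratings (ratings stand in for reads)."""
    return abandons / ratings


def shrunken_rate(abandons, ratings, prior_abandons=1.0, prior_finishes=99.0):
    """Posterior mean under an assumed Beta(prior_abandons, prior_finishes) prior.

    Books with few ratings are pulled strongly toward the prior mean (1% here),
    which counteracts the winner's curse among rarely rated books.
    """
    return (abandons + prior_abandons) / (ratings + prior_abandons + prior_finishes)


def ranking(score):
    """Book names sorted from highest to lowest score."""
    return sorted(books, key=lambda name: score(*books[name]), reverse=True)


by_count = ranking(lambda abandons, ratings: abandons)
by_raw_rate = ranking(raw_rate)
by_shrunken_rate = ranking(shrunken_rate)

for label, order in [("count", by_count), ("raw rate", by_raw_rate),
                     ("shrunken rate", by_shrunken_rate)]:
    print(f"{label:>13}: {order[0]}")
```

With these made-up numbers each method puts a different book first: the popular book wins on counts, the barely rated book wins on raw proportion, and the well-sampled niche book wins once the estimates are shrunk.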

Correcting for both changes the top-5 ranking completely, from (raw counts):

  1. The Casual Vacancy, J. K. Rowling
  2. Catch-22, Joseph Heller
  3. American Gods, Neil Gaiman
  4. A Game of Thrones, George R. R. Martin
  5. The Book Thief, Markus Zusak

to (shrunken posterior proportions):

  1. Black Leopard, Red Wolf, Marlon James
  2. Space Opera⁠, Catherynne M. Valente
  3. Little, Big, John Crowley
  4. The Witches: Salem, 1692⁠, Stacy Schiff
  5. Tender Morsels, Margo Lanagan

I also consider a model adjusting for covariates (author/​average-rating/​year), to see what books are most surprisingly often-abandoned given their pedigrees & rating etc. Abandon rates increase the newer a book is, and the lower the average rating.

Adjusting for those, the top-5 are:

  1. The Casual Vacancy, J. K. Rowling
  2. The Chemist⁠, Stephenie Meyer
  3. Infinite Jest, David Foster Wallace
  4. The Glass Bead Game, Hermann Hesse
  5. Theft by Finding: Diaries (1977–2002), David Sedaris

Books at the top of the adjusted list appear to reflect a mix of highly-popular authors changing genres, and ‘prestige’ books which are highly-rated but a slog to read.

These results are interesting because they highlight that people read books for many reasons (such as marketing campaigns, literary prestige, or following a popular author), and that this is reflected in their decision whether to continue reading or to abandon a book.

“Estimating Distributional Models With Brms: Additive Distributional Models”, Bürkner 2019

“Estimating Distributional Models with brms: Additive Distributional Models”, Paul Bürkner (2019-08-29):

This vignette provides an introduction on how to fit distributional regression models with brms. We use the term distributional model to refer to a model, in which we can specify predictor terms for all parameters of the assumed response distribution.

In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (eg. scale or shape parameters) are estimated as auxiliary parameters assuming them to be constant across observations. This assumption is so common that most researchers applying regression models are often (in my experience) not aware of the possibility of relaxing it. This is understandable insofar as relaxing this assumption drastically increases model complexity and thus makes models hard to fit. Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue.

…In the examples so far, we did not have multilevel data and thus did not fully use the capabilities of the distributional regression framework of brms. In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (ie. splines) into the model. In many applications, we have no or only a very vague idea what the relationship between a predictor and the response looks like. A very flexible approach to tackle this problem is to use splines and let them figure out the form of the relationship.

“Dog Cloning For Special Forces: Breed All You Can Breed”, Branwen 2018

Clone: “Dog Cloning For Special Forces: Breed All You Can Breed”, Gwern Branwen (2018-09-18):

Decision analysis of whether cloning the most elite Special Forces dogs is a profitable improvement over standard selection procedures. Unless training is extremely cheap or heritability is extremely low, dog cloning is hypothetically profitable.

Cloning is widely used in animal & plant breeding despite steep costs due to its advantages; more unusual recent applications include creating entire polo horse teams and reported trials of cloning in elite police/​Special Forces war dogs. Given the cost of dog cloning, however, can this ever make more sense than standard screening methods for selecting from working dog breeds, or would the increase in successful dog training be too low under all reasonable models to turn a profit?

I model the question as one of expected cost per dog with the trait of successfully passing training, success in training being a dichotomous liability threshold with a polygenic genetic architecture; given the extreme level of selection possible in selecting the best among already-elite Special Forces dogs and a range of heritabilities, this predicts clones’ success probabilities. To approximate the relevant parameters, I look at some reported training costs and success rates for regular dog candidates, broad dog heritabilities, and the few current dog cloning case studies reported in the media.

Since none of the relevant parameters are known with confidence, I run the cost-benefit equation for many hypothetical scenarios, and find that in a large fraction of them covering most plausible values, dog cloning would improve training yields enough to be profitable (in addition to its other advantages).
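The liability-threshold reasoning above can be sketched as follows. This is a simplified illustration, not the essay's actual model: it assumes a standard-normal liability, assumes that a clone shares the donor's full genotype so that the donor-clone liability correlation equals the heritability, and uses made-up costs, base rate, and donor score.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def clone_success_probability(base_rate, donor_liability_sd, heritability):
    """P(clone passes training) under a liability-threshold model.

    base_rate: fraction of ordinary candidates who pass (sets the threshold).
    donor_liability_sd: donor's liability in SDs above the candidate mean.
    heritability: assumed donor-clone liability correlation (must be < 1 here).
    """
    threshold = STANDARD_NORMAL.inv_cdf(1 - base_rate)
    # Bivariate-normal conditional: mean = r * z, variance = 1 - r^2.
    expected_liability = heritability * donor_liability_sd
    residual_sd = (1 - heritability ** 2) ** 0.5
    return 1 - NormalDist(expected_liability, residual_sd).cdf(threshold)


def cost_per_successful_dog(cost_per_candidate, success_probability):
    """Expected spending needed to obtain one dog that passes training."""
    return cost_per_candidate / success_probability


# All inputs below are hypothetical placeholders.
base_rate = 0.10          # 10% of ordinary candidates pass
training_cost = 50_000    # cost of training one candidate
cloning_cost = 50_000     # extra cost of producing one clone

p_clone = clone_success_probability(base_rate, donor_liability_sd=3.0, heritability=0.5)
standard_cost = cost_per_successful_dog(training_cost, base_rate)
cloned_cost = cost_per_successful_dog(training_cost + cloning_cost, p_clone)
print(f"clone pass rate {p_clone:.2f}; standard {standard_cost:,.0f} vs cloned {cloned_cost:,.0f}")
```

Sweeping `heritability`, `base_rate`, and the two costs over a grid gives the kind of scenario table described above; with zero heritability the clone's pass rate falls back to the base rate and cloning only adds cost.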

As further illustration of the use-case of screening for an extreme outcome based on a partial predictor, I consider the question of whether height PGSes could be used to screen the US population for people of NBA height, which turns out to be reasonably doable with current & future PGSes.

“On Having Enough Socks”, Branwen 2017

Socks: “On Having Enough Socks”, Gwern Branwen (2017-11-22):

Personal experience and surveys on running out of socks; discussion of socks as small example of human procrastination and irrationality, caused by lack of explicit deliberative thought where no natural triggers or habits exist.

After running out of socks one day, I reflected on how ordinary tasks get neglected. Anecdotally and in 3 online surveys, people report often not having enough socks, a problem which correlates with rarity of sock purchases and demographic variables, consistent with a neglect/​procrastination interpretation: because there is no specific time or triggering factor to replenish a shrinking sock stockpile, it is easy to run out.

This reminds me of akrasia on minor tasks, ‘yak shaving’, and the nature of disaster in complex systems: lack of hard rules lets errors accumulate, without any ‘global’ understanding of the drift into disaster (or at least inefficiency). Humans on a smaller scale also ‘drift’ when they engage in System I reactive thinking & action for too long, resulting in cognitive biases⁠. An example of drift is the generalized human failure to explore/​experiment adequately, resulting in overly greedy exploitative behavior of the current local optimum. Grocery shopping provides a case study: despite large gains, most people do not explore, perhaps because there is no established routine or practice involving experimentation. Fixes for these things can be seen as ensuring that System II deliberative cognition is periodically invoked to review things at a global level, such as developing a habit of maximum exploration at first purchase of a food product, or annually reviewing possessions to note problems like a lack of socks.

While socks may be small things, they may reflect big things.

“ZMA Sleep Experiment”, Branwen 2017

ZMA: “ZMA Sleep Experiment”, Gwern Branwen (2017-03-13):

A randomized blinded self-experiment of the effects of ZMA (zinc+magnesium+vitamin B6) on my sleep; results suggest small benefit to sleep quality but are underpowered and damaged by Zeo measurement error/​data issues.

I ran a blinded randomized self-experiment on the effect of 2.5g of nightly ZMA powder on Zeo-recorded sleep data during March–October 2017 (n = 127). The linear model and SEM model show no statistically-significant effects or high posterior probability of benefits, although all point-estimates were in the direction of benefits. Data quality issues reduced the available dataset, rendering the experiment particularly underpowered and the results more inconclusive. I decided not to continue using ZMA after running out; ZMA may help my sleep but I need to improve data quality before attempting any further sleep self-experiments on it.

“Long Bets As Charitable Giving Opportunity”, Branwen 2017

Long-Bets: “Long Bets as Charitable Giving Opportunity”, Gwern Branwen (2017-02-24):

Evaluating Long Bets as a prediction market shows it is dysfunctional and poorly-structured; despite the irrationality of many users, it is not good even as a way to raise money for charity.

Long Bets is a 15-year-old real-money prediction market run by the Long Now Foundation for incentivizing forecasts/​bets about long-term events of social importance such as technology or the environment. I evaluate use of Long Bets as a charitable giving opportunity by winning bets and directing the earnings to a good charity by making forecasts for all available bet opportunities and ranking them by expected value after adjusting for opportunity cost (defined by expected return of stock market indexing) and temporally discounting. I find that while there are ~41 open bets which I expect have positive expected value if counter-bets were accepted, few or none of my counter-bets were accepted. In general, LB has had almost zero activity for the past decade, and has not incentivized much forecasting. This failure is likely caused by its extreme restriction to even-odds bets, no return on bet funds (resulting in enormous opportunity costs), and lack of maintenance or publicity. All of these issues are highly likely to continue barring extensive changes to Long Bets, and I suggest that Long Bets should be wound down.
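The opportunity-cost adjustment can be made concrete with a small sketch. It assumes an even-odds bet whose funds earn no return until resolution, and compares it with investing the same stake in an index fund and donating the proceeds at the same date. The 7% index return, 3% discount rate, and the stake are hypothetical, and the function names are illustrative rather than taken from the essay.

```python
def bet_present_value(p_win, stake, years, discount_rate):
    """Discounted expected donation from an even-odds bet.

    A win sends both stakes (2 * stake) to the chosen charity after `years`;
    the funds are assumed to earn nothing in the meantime.
    """
    return p_win * 2 * stake / (1 + discount_rate) ** years


def index_present_value(stake, years, index_return, discount_rate):
    """Discounted donation from investing the stake in an index fund instead."""
    return stake * ((1 + index_return) / (1 + discount_rate)) ** years


def net_value(p_win, stake, years, index_return=0.07, discount_rate=0.03):
    """Expected gain of betting over indexing; negative means indexing wins."""
    return (bet_present_value(p_win, stake, years, discount_rate)
            - index_present_value(stake, years, index_return, discount_rate))


# Hypothetical $1,000 stake at various confidence levels and horizons.
for p_win, years in [(0.9, 0), (0.9, 10), (0.99, 10), (0.99, 20)]:
    print(f"p={p_win:.2f}, {years:>2} years: net {net_value(p_win, 1_000, years):>8.0f}")
```

Under these assumed rates, even a bet the forecaster is 90% sure of winning loses to indexing once it takes a decade to resolve, which illustrates why even odds combined with no return on bet funds leave few attractive bets.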

“The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017

Coin-flip: “The Kelly Coin-Flipping Game: Exact Solutions”, Gwern Branwen, Arthur B., nshepperd, FeepingCreature, Gurkenglas (2017-01-19):

Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.

Haghani & Dewey 2016 experiment with a double-or-nothing coin-flipping game where the player starts with $30.4 ($25.0 in 2016 dollars) and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $303.8 ($250.0 in 2016 dollars). Most of their subjects fail to play well, earning an average $110.6 ($91.0 in 2016 dollars), compared to Haghani & Dewey 2016’s heuristic benchmark of ~$291.6 (~$240.0 in 2016 dollars) in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem as it ignores the ceiling and limited number of plays.

We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming for calculating the value function, with implementations in R, Haskell⁠, and C. We also provide a closed-form exact value formula in R & Python, several approximations using Monte Carlo/​random forests⁠/​neural networks, visualizations of the value function, and a Python implementation of the game for the OpenAI Gym collection. We find that optimal play yields $246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing $155.6 in potential profit. Comparing decision trees and the Kelly criterion for various horizons (bets left), the relative advantage of the decision tree strategy depends on the horizon: it is highest when the player can make few bets (at b = 23, with a difference of ~$36), and decreases with number of bets as more strategies hit the ceiling.
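The dynamic-programming idea can be sketched in a few lines. This is a minimal illustration rather than any of the implementations mentioned above: it assumes bets are restricted to whole dollars, uses the original 2016 figures ($25 start, $250 ceiling, 60% win probability), and computes the value function bottom-up from the last round. Because of the whole-dollar restriction, it is not claimed to reproduce the exact figures quoted above.

```python
def value_function(rounds, cap=250, win_probability=0.6):
    """Expected final wealth under optimal play, indexed by current wealth.

    values[w] is the value of holding w dollars with `rounds` bets left.
    With no bets left, the value of w dollars is simply w.
    """
    values = [float(wealth) for wealth in range(cap + 1)]
    for _ in range(rounds):
        # Bellman backup: try every whole-dollar bet and keep the best one.
        values = [
            max(
                win_probability * values[min(wealth + bet, cap)]
                + (1 - win_probability) * values[wealth - bet]
                for bet in range(wealth + 1)
            )
            for wealth in range(cap + 1)
        ]
    return values


def kelly_fraction(win_probability):
    """Kelly bet fraction for an even-money bet: 2p - 1."""
    return 2 * win_probability - 1


one_round = value_function(1)
ten_rounds = value_function(10)
print(f"value of $25 with 1 bet left:   {one_round[25]:.2f}")
print(f"value of $25 with 10 bets left: {ten_rounds[25]:.2f}")
print(f"Kelly fraction at 60%: {kelly_fraction(0.6):.2f}")
```

With a single bet left the optimal move is to stake everything, since nothing remains to protect; the Kelly fraction of 20% only becomes a reasonable guide when many bets remain and the ceiling is far away. Calling `value_function(300)` runs the full game but takes several seconds in pure Python.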

In the Kelly game, the maximum winnings, number of rounds, and edge are fixed; we describe a more difficult generalized version in which the 3 parameters are drawn from Pareto, normal, and beta distributions and are unknown to the player (who can use Bayesian inference to try to estimate them during play). Upper and lower bounds are estimated on the value of this game. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only $1. Two deep reinforcement learning agents, DQN & DDPG⁠, are implemented but DQN fails to learn and DDPG doesn’t show acceptable performance, indicating better deep RL methods may be required to solve the generalized Kelly game.

“Banner Ads Considered Harmful”, Branwen 2017

Ads: “Banner Ads Considered Harmful”, Gwern Branwen (2017-01-08):

9 months of daily A/B-testing of Google AdSense banner ads on my website indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.

One source of complexity & JavaScript use on my website is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg. social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13–16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.

Thus, banner ads on my website appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

“Internet WiFi Improvement”, Branwen 2016

WiFi: “Internet WiFi improvement”, Gwern Branwen (2016-10-20):

After putting up with slow glitchy WiFi Internet for years, I investigate improvements. Upgrading the router, switching to a high-gain antenna, and installing a buried Ethernet cable all offer increasing speeds.

My laptop in my apartment receives Internet via a WiFi repeater to another house, yielding slow speeds and frequent glitches. I replaced the obsolete WiFi router, which increased connection speeds somewhat, but they remained inadequate. For a better solution, I used a directional antenna to connect directly to the new WiFi router, which, contrary to my expectations, yielded a ~6× increase in speed. Extensive benchmarking of all possible arrangements of laptops/dongles/repeaters/antennas/routers/positions shows that the antenna+router is inexpensive and near-optimal in speed, and that the only possible improvement would be a hardwired Ethernet line, which I installed a few weeks later after learning it was not as difficult as I thought it would be.

“‘Genius Revisited’ Revisited”, Branwen 2016

Hunter: “‘Genius Revisited’ Revisited”, Gwern Branwen (2016-06-19):

A book study of surveys of the high-IQ elementary school HCES concludes that high IQ is not predictive of accomplishment; I point out that results are consistent with regression to the mean from extremely early IQ tests and small total sample size.

Genius Revisited documents the longitudinal results of a high-IQ/​gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg. a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.

I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making 2 statistical errors: base-rate neglect⁠, and regression to the mean⁠.

First, their standards fall prey to a base-rate fallacy and even extreme predictive value of IQ would not predict 1 or more Nobel prizes because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.

Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the regression to the mean caused by the relatively low correlation of early childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using early-childhood/​adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES’s associated high-IQ/​gifted-and-talented high school, Hunter High, which has access to the adolescents’ more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
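Both statistical points can be checked with back-of-the-envelope arithmetic. In this sketch the childhood-adult correlation is not taken from the essay; it is back-calculated from the two means quoted above (157 and 133) under the standard assumption of a population mean of 100. The cohort size and per-person Nobel probability in the base-rate half are hypothetical placeholders.

```python
POPULATION_MEAN = 100.0


def regressed_score(childhood_score, correlation, mean=POPULATION_MEAN):
    """Expected adult score given a childhood score and the test-retest correlation."""
    return mean + correlation * (childhood_score - mean)


def implied_correlation(childhood_score, adult_score, mean=POPULATION_MEAN):
    """Correlation that would regress `childhood_score` down to `adult_score`."""
    return (adult_score - mean) / (childhood_score - mean)


def probability_of_no_laureates(cohort_size, per_person_probability):
    """Chance that nobody in the cohort wins, assuming independent outcomes."""
    return (1 - per_person_probability) ** cohort_size


r = implied_correlation(157, 133)
expected_adult_iq = regressed_score(157, r)
print(f"implied childhood-adult correlation: {r:.2f}")
print(f"expected adult mean: {expected_adult_iq:.0f}")

# Hypothetical: 500 students, each with a 1-in-10,000 chance (far above
# the '1 in millions' population odds) of a Nobel prize.
p_none = probability_of_no_laureates(500, 1 / 10_000)
print(f"P(no Nobel laureates): {p_none:.2f}")
```

A correlation of roughly 0.58 is enough to move the expected adult mean from 157 to 133, and even with odds inflated a hundredfold over the population rate, a cohort of a few hundred would most likely contain no laureates.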

This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be accurately selected as kindergartners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores⁠; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.

“CO2/ventilation Sleep Experiment”, Branwen 2016

CO2: “CO2/ventilation sleep experiment”, Gwern Branwen (2016-06-05):

Self-experiment on whether changes in bedroom CO2 levels affect sleep quality

Some psychology studies find that CO2 impairs cognition, and some sleep studies find that better ventilation may improve sleep quality. Use of a Netatmo air quality sensor reveals that closing my bedroom tightly to reduce morning light also causes CO2 levels to spike overnight to 7x daytime levels. To investigate the possible harmful effects, I run a self-experiment randomizing an open bedroom door and a bedroom box fan (2x2) and analyze the data using a structural equation model of air quality effects on a latent sleep factor with measurement error⁠.

“Candy Japan’s New Box A/B Test”, Branwen 2016

Candy-Japan: “Candy Japan’s new box A/B test”, Gwern Branwen (2016-05-06):

Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.

I analyze an A/​B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.

“Calculating The Gaussian Expected Maximum”, Branwen 2016

Order-statistics: “Calculating The Gaussian Expected Maximum”, Gwern Branwen (2016-01-22):

In generating a sample of n datapoints drawn from a normal/​Gaussian distribution, how big on average the biggest datapoint is will depend on how large n is. I implement a variety of exact & approximate calculations from the literature in R to compare efficiency & accuracy.

In generating a sample of n datapoints drawn from a normal/​Gaussian distribution with a particular mean/​SD, how big on average the biggest datapoint is will depend on how large n is. Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/​good the worst/​best datapoint will be (eg. the score of the winner in a multi-player game).

The order statistic of the mean/​average/​expectation of the maximum of a draw of n samples from a normal distribution has no exact formula, unfortunately, and is generally not built into any programming language’s libraries.

I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic. The overall best approach is to calculate the exact order statistics for the n range of interest using numerical integration via lmomco and cache them in a lookup table, rescaling the mean/​SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large n such as n > 100.
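The approaches compared above can be sketched outside R as well. The following is an illustrative Python version, not the essay's code: the exact value is obtained with a plain Simpson's-rule integration (standing in for lmomco), the Blom approximation uses α = 0.375, and the Elfving variant uses α = π/8. The integration limits and step count are arbitrary choices.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def exact_expected_max(n, lower=-10.0, upper=10.0, steps=4000):
    """E[max of n standard normals] by Simpson's rule (steps must be even).

    The density of the maximum is n * pdf(x) * cdf(x)**(n - 1).
    """
    def integrand(x):
        return x * n * STANDARD_NORMAL.pdf(x) * STANDARD_NORMAL.cdf(x) ** (n - 1)

    width = (upper - lower) / steps
    total = integrand(lower) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * width)
    return total * width / 3


def blom_expected_max(n, alpha=0.375):
    """Blom-style approximation: quantile at (n - alpha) / (n - 2*alpha + 1)."""
    return STANDARD_NORMAL.inv_cdf((n - alpha) / (n - 2 * alpha + 1))


def elfving_expected_max(n):
    """Same formula with Elfving's alpha = pi / 8."""
    return blom_expected_max(n, alpha=math.pi / 8)


def rescale(expected_max, mean, sd):
    """Convert a standard-normal result to a normal with the given mean/SD."""
    return mean + sd * expected_max


# A small lookup table of exact values, as suggested above.
lookup = {n: exact_expected_max(n) for n in (2, 10, 100)}
for n, exact in lookup.items():
    print(f"n={n:>3}: exact {exact:.4f}, Blom {blom_expected_max(n):.4f}, "
          f"Elfving {elfving_expected_max(n):.4f}")
```

For example, `rescale(lookup[10], 100, 15)` gives the expected top score among 10 draws from a normal distribution with mean 100 and SD 15.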

“Embryo Selection For Intelligence”, Branwen 2016

Embryo-selection: “Embryo Selection For Intelligence”, Gwern Branwen (2016-01-22):

A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016-2017 state-of-the-art

With genetic predictors of a phenotypic trait, it is possible to select embryos during an in vitro fertilization process to increase or decrease that trait. Extending the work of Shulman & Bostrom 2014⁠/​Hsu 2014⁠, I consider the case of human intelligence using SNP-based genetic prediction, finding:

  • a meta-analysis of GCTA results indicates that SNPs can explain >33% of variance in current intelligence scores, and >44% with better-quality phenotype testing
  • this sets an upper bound on the effectiveness of SNP-based selection: a gain of 9 IQ points when selecting the top embryo out of 10
  • the best 2016 polygenic score could achieve a gain of ~3 IQ points when selecting out of 10
  • the marginal cost of embryo selection (assuming IVF is already being done) is modest, at $1,822.7 ($1,500.0 in 2016 dollars) + $243.0 ($200.0 in 2016 dollars) per embryo, with the sequencing cost projected to drop rapidly
  • a model of the IVF process, incorporating number of extracted eggs, losses to abnormalities & vitrification & failed implantation & miscarriages from 2 real IVF patient populations, estimates feasible gains of 0.39 & 0.68 IQ points
  • embryo selection is currently unprofitable (mean: −$435.0; −$358.0 in 2016 dollars) in the USA under the lowest estimate of the value of an IQ point, but profitable under the highest (mean: $7,570.3; $6,230.0 in 2016 dollars). The main constraint on selection profitability is the polygenic score; under the highest value, the NPV EVPI of a perfect SNP predictor is $29.2b ($24.0b in 2016 dollars) and the EVSI per education/SNP sample is $86.3k ($71.0k in 2016 dollars)
  • under the worst-case estimate, selection can be made profitable with a better polygenic score, which would require n > 237,300 using education phenotype data (and much less using fluid intelligence measures)
  • selection can be made more effective by selecting on multiple phenotype traits: considering an example using 7 traits (IQ/​height/​BMI/​diabetes/​ADHD⁠/​bipolar/​schizophrenia), there is a factor gain over IQ alone; the outperformance of multiple selection remains after adjusting for genetic correlations & polygenic scores and using a broader set of 16 traits.
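The upper-bound figure in the list above (about 9 IQ points from the best of 10 embryos when SNPs explain 33% of variance) can be roughly reproduced by simulation. This sketch rests on assumptions that are not spelled out in the summary: an IQ standard deviation of 15, sibling embryos differing on only half of the population's additive genetic variance, and the gain being measured against an average embryo. It ignores all IVF losses.

```python
import random


def simulated_selection_gain(n_embryos, variance_explained, trials=50_000, seed=0,
                             iq_sd=15.0, sibling_share=0.5):
    """Average IQ gain from implanting the top-scoring of n_embryos.

    Each embryo's predicted deviation from the family mean is drawn from a
    normal distribution whose SD is sqrt(variance_explained * sibling_share) * iq_sd.
    """
    rng = random.Random(seed)
    score_sd = (variance_explained * sibling_share) ** 0.5 * iq_sd
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, score_sd) for _ in range(n_embryos))
    return total / trials


upper_bound_gain = simulated_selection_gain(n_embryos=10, variance_explained=0.33)
print(f"simulated gain, best of 10 at 33% variance explained: {upper_bound_gain:.1f} IQ points")
```

The simulated gain comes out a little above 9 points, in line with the upper bound quoted in the list; lowering `variance_explained` or `n_embryos` shows how quickly the gain shrinks for weaker predictors or smaller batches.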

“The Power of Twins: The Scottish Milk Experiment”, Branwen 2016

Milk: “The Power of Twins: The Scottish Milk Experiment”, Gwern Branwen (2016-01-12):

In discussing a large Scottish public health experiment, Student noted that it would’ve been vastly more efficient using a twin experiment design; I fill in the details with a power analysis.

Randomized experiments require more subjects the more variable each datapoint is to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences with the same data, or less data to be collected, which can be done by balancing observed characteristics between control and experimental datapoints.

A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people do, due to shared genetics & family environment. In 1931, the great statistician Student (William Sealy Gosset) noted problems with an extremely large (n = 20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight). He claimed that the experiment could have been done far more cost-effectively, with >95% fewer children, if it had been conducted using twins, and that 100 identical twins would have been more accurate than 20,000 children. He did not, however, provide any calculations or data demonstrating this.

I revisit the issue and run a power calculation on height indicating that Student’s claims were correct and that the experiment would have required ~97% fewer children if run with twins.

This reduction is not unique to the Scottish milk experiment on height/​weight, and in general, one can expect a reduction of 89% in experiment sample sizes using twins rather than regular people, demonstrating the benefits of using behavioral genetics in experiment design⁠/​power analysis⁠.
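Why twins shrink the required sample can be seen from the textbook normal-approximation sample-size formulas. This is a simplified sketch and not the power analysis run in the essay: it assumes a two-sided test, equal variances, and a within-pair correlation (`twin_correlation`) whose value here is hypothetical, as is the effect size.

```python
from statistics import NormalDist


def per_group_sample_size(effect_sd_units, alpha=0.05, power=0.80):
    """Subjects per arm for an unpaired two-group comparison (normal approximation)."""
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    return 2 * (z_total / effect_sd_units) ** 2


def twin_pairs_needed(effect_sd_units, twin_correlation, alpha=0.05, power=0.80):
    """Twin pairs needed when one twin is treated and the co-twin is the control.

    The variance of a within-pair difference is 2 * sigma^2 * (1 - r), so the
    unpaired requirement is multiplied by (1 - r).
    """
    return per_group_sample_size(effect_sd_units, alpha, power) * (1 - twin_correlation)


# Hypothetical inputs: a 0.2 SD effect and a within-pair correlation of 0.9.
unpaired_children = 2 * per_group_sample_size(0.2)
twin_children = 2 * twin_pairs_needed(0.2, twin_correlation=0.9)
reduction = 1 - twin_children / unpaired_children
print(f"unpaired: {unpaired_children:.0f} children; twins: {twin_children:.0f} children; "
      f"reduction {reduction:.0%}")
```

Under this approximation the fractional saving equals the within-pair correlation, so the more alike the twins are on the outcome, the larger the reduction in sample size.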

“World Catnip Surveys”, Branwen 2015

Catnip-survey: “World Catnip Surveys”, Gwern Branwen (2015-11-15):

International population online surveys of cat owners about catnip and other cat stimulant use.

In compiling a meta-analysis of reports of catnip response rates in domestic cats, yielding a meta-analytic average of ~2⁄3, the available data suggests heterogeneity from cross-country differences in rates (possibly for genetic reasons) but is insufficient to definitively demonstrate the existence of or estimate those differences (particularly a possible extremely high catnip response rate in Japan). I use Google Surveys August–September 2017 to conduct a brief 1-question online survey of a proportional population sample of 9 countries about cat ownership & catnip use, specifically: Canada, the USA, UK, Japan, Germany, Brazil, Spain, Australia, & Mexico. In total, I surveyed n = 31,471 people, of whom n = 9,087 are cat owners, of whom n = 4,402 report having used catnip on their cat, and of whom n = 2,996 report a catnip response.

The survey yields catnip response rates of Canada (82%), USA (79%), UK (74%), Japan (71%), Germany (57%), Brazil (56%), Spain (54%), Australia (53%), and Mexico (52%). The differences are substantial and of high posterior probability, supporting the existence of large cross-country differences. In additional analysis, the other conditional probabilities of cat ownership and trying catnip with a cat appear to correlate with catnip response rates; this intercorrelation suggests a “cat factor” of some sort influencing responses, although what causal relationship there might be between proportion of cat owners and proportion of catnip-responder cats is unclear.
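The three conditional probabilities mentioned above can be computed directly from the pooled survey counts. The sketch below uses only the totals quoted in the summary; per-country counts are not given there, so only the pooled rates are shown.

```python
# Pooled counts quoted in the survey summary.
surveyed = 31_471
cat_owners = 9_087
tried_catnip = 4_402
responders = 2_996


def proportion(part, whole):
    """Conditional proportion: how many of `whole` are also in `part`."""
    return part / whole


ownership_rate = proportion(cat_owners, surveyed)        # P(owns a cat)
trial_rate = proportion(tried_catnip, cat_owners)        # P(tried catnip | owns a cat)
response_rate = proportion(responders, tried_catnip)     # P(response | tried catnip)

print(f"cat ownership:   {ownership_rate:.1%}")
print(f"tried catnip:    {trial_rate:.1%}")
print(f"catnip response: {response_rate:.1%}")
```

The pooled response rate is close to the ~2⁄3 meta-analytic average, even though the per-country rates listed above range from 52% to 82%.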

An additional survey of a convenience sample of primarily US Internet users about catnip is reported, although the improbable catnip response rates compared to the population survey suggest the respondents are either highly unrepresentative or the questions caused demand bias.

“Catnip Immunity and Alternatives”, Branwen 2015

Catnip: “Catnip immunity and alternatives”, Gwern Branwen (2015-11-07):

Estimation of catnip immunity rates by country with meta-analysis and surveys, and discussion of catnip alternatives.

Not all cats respond to the catnip stimulant; the rate of responders is generally estimated at ~70% of cats. A meta-analysis of catnip response experiments since the 1940s indicates the true value is ~62%. The low quality of studies and the reporting of their data makes examination of possible moderators like age, sex, and country difficult. Catnip responses have been recorded for a number of species both inside and outside the Felidae family; of them, there is evidence for a catnip response in the Felidae, and, more uncertainly, the Paradoxurinae, and Herpestinae.

To extend the analysis, I run large-scale online surveys measuring catnip response rates globally in domestic cats, finding high heterogeneity but considerable rates of catnip immunity worldwide.

As a piece of practical advice for cat-hallucinogen sommeliers, I treat catnip response & finding catnip substitutes as a decision problem, modeling it as a Markov decision process where one wishes to find a working psychoactive at minimum cost. Bol et al 2017 measured multiple psychoactives simultaneously in a large sample of cats, permitting prediction of responses conditional on not responding to others. (The solution to the specific problem is to test in the sequence catnip → honeysuckle → silvervine → Valerian⁠.)

For discussion of cat psychology in general, see my Cat Sense review.

“Bitter Melon for Blood Glucose”, Branwen 2015

Melon: “Bitter Melon for blood glucose”, Gwern Branwen (2015-09-14):

Analysis of whether bitter melon reduces blood glucose in one self-experiment and utility of further self-experimentation

I re-analyze a bitter-melon/​blood-glucose self-experiment, finding a small effect of increasing blood glucose after correcting for temporal trends & daily variation, giving both frequentist & Bayesian analyses. I then analyze the self-experiment from a subjective Bayesian decision-theoretic perspective, cursorily estimating the costs of diabetes & benefits of intervention in order to estimate Value Of Information for the self-experiment and the benefit of further self-experimenting; I find that the expected value of more data (EVSI) is negative and further self-experimenting would not be optimal compared to trying out other anti-diabetes interventions.

“RNN Metadata for Mimicking Author Style”, Branwen 2015

RNN-metadata: “RNN Metadata for Mimicking Author Style”, Gwern Branwen (2015-09-12):

Teaching a text-generating char-RNN to automatically imitate many different authors by labeling the input text by author; additional experiments include imitating Geocities and retraining GPT-2 on a large Project Gutenberg poetry corpus.

Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.
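The inline-metadata idea can be sketched as a small preprocessing step. The tag format (upper-cased author name followed by `|`) and the function names are illustrative assumptions, not the format used in the original experiment.

```python
def author_tag(author, sep="|"):
    """Hypothetical tag format: 'Mark Twain' -> 'MARK_TWAIN|'."""
    return author.upper().replace(" ", "_") + sep

def label_lines(text, author):
    """Prefix every non-empty line of `text` with the author's tag."""
    tag = author_tag(author)
    return "\n".join(tag + line for line in text.splitlines() if line.strip())

def build_corpus(books):
    """books: iterable of (author, text) pairs -> one labeled training corpus."""
    return "\n".join(label_lines(text, author) for author, text in books)

def sampling_prompt(author):
    """Seed text asking a trained model to continue in the given author's style."""
    return author_tag(author)
```

Because each training line carries its label, a model trained on the output of `build_corpus` can be steered at sampling time by starting generation from `sampling_prompt("Mark Twain")`.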

I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.

More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.

“When Should I Check The Mail?”, Branwen 2015

Mail-delivery: “When Should I Check The Mail?”, Gwern Branwen (2015-06-21):

Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis, model comparison, optimizing check times with a loss function⁠, and optimal data collection.

Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/​postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).
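A minimal sketch of the data-generating process and a maximum-likelihood fit, written in Python rather than the R, JAGS, and ABC code the essay describes. The delivery-time distribution (Normal, mean 11:00, SD 20 minutes), the checking window, and the grid search are all assumptions made for illustration.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def simulate_checks(n, mu, sigma, rng, earliest=540.0, latest=840.0):
    """One mailbox check per day at a random time (minutes after midnight).

    Only whether the mail had arrived by the check is recorded, never the
    delivery time itself.
    """
    data = []
    for _ in range(n):
        delivered_at = rng.gauss(mu, sigma)
        checked_at = rng.uniform(earliest, latest)
        data.append((checked_at, delivered_at <= checked_at))
    return data

def log_likelihood(data, mu, sigma, eps=1e-12):
    """Each check contributes log F(t) if mail was there, else log(1 - F(t))."""
    total = 0.0
    for checked_at, arrived in data:
        p = min(max(normal_cdf(checked_at, mu, sigma), eps), 1.0 - eps)
        total += math.log(p if arrived else 1.0 - p)
    return total

def grid_mle(data, mus, sigmas):
    """Crude maximum likelihood by exhaustive search over a parameter grid."""
    return max(((m, s) for m in mus for s in sigmas),
               key=lambda ms: log_likelihood(data, *ms))

rng = random.Random(1)
data = simulate_checks(400, mu=660.0, sigma=20.0, rng=rng)
mu_hat, sigma_hat = grid_mle(data, range(540, 841, 2), range(5, 65, 5))
```

Even though no delivery time is ever observed directly, the yes/no answers at known check times are enough to locate the mean delivery time.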

“Statistical Notes”, Branwen 2014

Statistical-notes: “Statistical Notes”, Gwern Branwen (2014-07-17):

Miscellaneous statistical stuff

Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.
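One way to set up such a model is a grid approximation in which the large poll is, with some probability, uninformative. This is a sketch under stated assumptions rather than the ABC or MCMC implementations the notes describe: the poll counts, the 50% failure probability, the flat prior, and the choice to model a "totally mistaken" poll as uniform over its possible outcomes are all hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def posterior_grid(small, large, p_broken, steps=999):
    """Grid posterior over the true proportion under a flat prior.

    small, large: (successes, n). With probability p_broken the large poll
    carries no information (modeled as uniform over its possible outcomes).
    """
    k1, n1 = small
    k2, n2 = large
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    weights = []
    for p in grid:
        like_small = math.exp(log_binom_pmf(k1, n1, p))
        like_large = ((1.0 - p_broken) * math.exp(log_binom_pmf(k2, n2, p))
                      + p_broken / (n2 + 1))
        weights.append(like_small * like_large)
    total = sum(weights)
    return grid, [w / total for w in weights]

# Hypothetical polls: 10/20 in the small one, 600/1000 in the large one.
grid, post = posterior_grid(small=(10, 20), large=(600, 1000), p_broken=0.5)
post_mean = sum(p * w for p, w in zip(grid, post))
```

The posterior is a mixture of a narrow component around the large poll's estimate and a broad component driven only by the small poll, which is why the credible interval has to stretch to cover both.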

“Complexity No Bar to AI”, Branwen 2014

Complexity-vs-AI: “Complexity no Bar to AI”, Gwern Branwen (2014-06-01):

Critics of AI risk suggest diminishing returns to computing (formalized asymptotically) means AI will be weak; this argument relies on a large number of questionable premises and ignoring additional resources, constant factors, and nonlinear returns to small intelligence advantages, and is highly unlikely.

Computational complexity theory describes the steep increase in computing power required for many algorithms to solve larger problems; frequently, the increase is large enough to render problems a few times larger totally intractable. Many of these algorithms are used in AI-relevant contexts. It has been argued that this implies that AIs will fundamentally be limited in accomplishing real-world tasks better than humans because they will run into the same computational complexity limit as humans, and so the consequences of developing AI will be small, as it is impossible for there to be any large fast global changes due to human or superhuman-level AIs. I examine the assumptions of this argument and find it neglects the many conditions under which computational complexity theorems are valid and so the argument doesn’t work: problems can be solved more efficiently than complexity classes would imply, large differences in problem solubility between humans and AIs are possible, greater resource consumption is possible, the real-world consequences of small differences on individual tasks can be large on agent impacts, such consequences can compound, and many agents can be created; any of these independent objections being true destroys the argument.

“Spearman’s Rho for the AMH Copula: a Beautiful Formula”, Machler 2014

“Spearman’s Rho for the AMH Copula: a Beautiful Formula”, Martin Mächler (2014-06):

We derive a beautiful series expansion for Spearman’s rho⁠, ρ(θ) of the Ali-Mikhail-Haq (AMH) copula with parameter θ which is also called α or θ. Further, via experiments we determine the cutoffs to be used for practically fast and accurate computation of ρ(θ) for all θ ∈ [−1,1].

[Keywords: Archimedean copulas, Spearman’s rho.]
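The abstract does not quote the expansion itself. The sketch below assumes it has the form ρ(θ) = 3 Σ_{k≥1} θ^k / C(k+2, 2)², which reproduces the known endpoint values of Spearman's rho for the AMH copula (4π² − 39 at θ = 1 and 33 − 48 ln 2 at θ = −1). It uses a plain truncated sum, not the cutoff scheme the paper determines experimentally.

```python
import math

def spearman_rho_amh(theta, terms=2000):
    """Truncated series 3 * sum_{k>=1} theta^k / C(k+2, 2)^2 for theta in [-1, 1].

    The series form is an assumption (see the lead-in); `terms` is an
    arbitrary truncation point chosen for this sketch.
    """
    if not -1.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= theta
        total += power / math.comb(k + 2, 2) ** 2
    return 3.0 * total
```

The terms shrink like 1/k⁴ even at θ = ±1, so a fixed truncation is adequate for illustration; choosing the number of terms as a function of θ is the kind of speed-versus-accuracy question the cutoffs address.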

“Bacopa Quasi-Experiment”, Branwen 2014

Bacopa: “Bacopa Quasi-Experiment”, Gwern Branwen (2014-05-06):

A small 2014-2015 non-blinded self-experiment using Bacopa monnieri to investigate effect on memory/​sleep/​self-ratings in an ABABA design; no particular effects were found.

Bacopa is a supplement herb often used for memory or stress adaptation. Its chronic effects reportedly take many weeks to manifest, with no important acute effects. Out of curiosity, I bought 2 bottles of Bacognize Bacopa pills and ran a non-randomized non-blinded ABABA quasi-self-experiment from June 2014 to September 2015, measuring effects on my memory performance, sleep, and daily self-ratings of mood/productivity. For analysis, a multi-level Bayesian model on two memory performance variables was used to extract per-day performance, factor analysis was used to extract a sleep index from 9 Zeo sleep variables, and the 3 endpoints were modeled as a multivariate Bayesian time-series regression with splines. Because of the slow onset of chronic effects, small effective sample size, definite temporal trends probably unrelated to Bacopa, and noise in the variables, the results were, as expected, ambiguous, and do not strongly support any correlation between Bacopa and memory/sleep/self-rating (+/-/- respectively).

“The Sort --key Trick”, Branwen 2014

Sort: “The sort --key Trick”, Gwern Branwen (2014-03-03):

Commandline folklore: sorting files by filename or content before compression can save large amounts of space by exposing redundancy to the compressor. Examples and comparisons of different sorts.

Programming folklore notes that one way to get better lossless compression efficiency is by the precompression trick of rearranging files inside the archive to group ‘similar’ files together and expose redundancy to the compressor, in accordance with information-theoretical principles. A particularly easy and broadly-applicable way of doing this, which does not require using any unusual formats or tools and is fully compatible with the default archive methods, is to sort the files by filename and especially file extension.

I show how to do this with the standard Unix command-line sort tool, using the so-called “sort --key trick”, and give examples of the large space-savings possible from my archiving work for personal website mirrors and for making darknet market mirror datasets where the redundancy at the file level is particularly extreme and the sort --key trick shines compared to the naive approach.
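The effect can be reproduced in a toy setting. The sketch below is a Python analogue, not the Unix `sort` invocation from the essay: it orders hypothetical in-memory files by extension before compressing them. The file names and contents are invented, and the 20 KB size is chosen deliberately so that zlib's 32 KB look-back window can reach the previous file of the same type only when the files are grouped.

```python
import os
import random
import zlib

def extension_key(path):
    """Sort key: file extension first, then the full path."""
    return (os.path.splitext(path)[1], path)

def compressed_size(files, order):
    """Concatenate file contents in `order` and return the zlib-compressed size."""
    return len(zlib.compress(b"".join(files[name] for name in order), 9))

def random_blob(rng, size):
    return bytes(rng.getrandbits(8) for _ in range(size))

# Toy archive: every file of a given type shares one 20 KB pseudo-random blob.
rng = random.Random(0)
blobs = {".html": random_blob(rng, 20_000), ".jpg": random_blob(rng, 20_000)}
files = {f"page{i}{ext}": blob for i in range(3) for ext, blob in blobs.items()}

by_name = sorted(files)                          # page0.html, page0.jpg, page1.html, ...
by_extension = sorted(files, key=extension_key)  # all .html first, then all .jpg

size_by_name = compressed_size(files, by_name)
size_by_extension = compressed_size(files, by_extension)
```

In name order the two file types alternate, so each duplicate lies beyond the compressor's window and nothing is saved; grouped by extension, each repeat is found and the output shrinks accordingly.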

“2013 LLLT Self-experiment”, Branwen 2013

LLLT: “2013 LLLT self-experiment”, Gwern Branwen (2013-12-20):

An LLLT user’s blinded randomized self-experiment in 2013 on the effects of near-infrared light on a simple cognitive test battery: positive results

A short randomized & blinded self-experiment on near-infrared LED light stimulation of one’s brain yields statistically-significant dose-related improvements to 4 measures of cognitive & motor performance. Concerns include whether the blinding succeeded and why the results are so good.

“Darknet Market Archives (2013–2015)”, Branwen 2013

DNM-archives: “Darknet Market Archives (2013–2015)”, Gwern Branwen (2013-12-01):

Mirrors of ~89 Tor-Bitcoin darknet markets & forums 2011–2015, and related material.

Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services providing escrow services between buyers & sellers transacting in Bitcoin or other cryptocoins, usually for drugs or other illegal/​regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model in 2011.

From 2013–2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could.

This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB uncompressed) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research.

This page documents the download, contents, interpretation, and technical methods behind the scrapes.

“Darknet Market Mortality Risks”, Branwen 2013

DNM-survival: “Darknet Market mortality risks”, Gwern Branwen (2013-10-30):

Survival analysis of lifespans, deaths, and predictive factors of Tor-Bitcoin darknet markets

I compile a dataset of 87 public English-language darknet markets (DNMs) 2011–2016 in the vein of the famous Silk Road 1⁠, recording their openings/​closing and relevant characteristics. A survival analysis indicates the markets follow a Type TODO lifespan, with a median life of TODO months. Risk factors include TODO. With the best model, I generate estimates for the currently-operating markets.

“Drugs 2.0: Your Crack’s in the Post”, Power 2013

2013-power: “Drugs 2.0: Your Crack’s in the Post”, Mike Power (2013-10-19):

May 2013 overview of Silk Road 1’s rise, powered by Tor & Bitcoin, enabling safe and easy online drug sales through the mail.

This is an annotated transcript of the chapter “Your Crack’s in the Post” (pg219–244) & an excerpt from the chapter “Prohibition in the Digital Age” (pg262), of Drugs 2.0: The Web Revolution That’s Changing How the World Gets High⁠, Mike Power (2013-05-02); it is principally on the topic of Bitcoin⁠, Tor⁠, and Silk Road 1⁠.

“Creatine Cognition Meta-analysis”, Branwen 2013

Creatine: “Creatine Cognition Meta-analysis”, Gwern Branwen (2013-09-06):

Does creatine increase cognitive performance? Maybe for vegetarians but probably not.

I attempt to meta-analyze conflicting studies about the cognitive benefits of creatine supplementation. The wide variety of psychological measures used by uniformly small studies hampers any aggregation. 3 studies measured IQ and turn in a positive result, but one suggesting that vegetarianism causes half the benefit. Discussions indicate that publication bias is at work. Given the variety of measures, small sample sizes, publication bias, possible moderators, and small-study biases, any future creatine studies should use the most standard measures of cognitive function like RAPM in a reasonably large pre-registered experiment.

“Lunar Circadian Rhythms”, Branwen 2013

Lunar-sleep: “Lunar circadian rhythms”, Gwern Branwen (2013-07-26):

Is sleep affected by the phase of the moon? An analysis of several years of 4 Zeo users’ sleep data shows no lunar cycle.

I attempt to replicate, using public Zeo-recorded sleep datasets, a finding of a monthly circadian rhythm affecting sleep in a small sleep lab. I find only small non-statistically-significant correlations, despite the analysis being well-powered.

“2013 Lewis Meditation Results”, Branwen 2013

Lewis-meditation: “2013 Lewis meditation results”, Gwern Branwen (2013-07-12):

Multilevel modeling of effect of small group’s meditation on math errors

A small group of Quantified Selfers tested themselves daily on arithmetic and engaged in a month of meditation. I analyze their scores with a multilevel model with per-subject grouping, and find the expected result: a small decrease in arithmetic errors which is not statistically-significant, with practice & time-of-day effects (but not day-of-week or weekend effects). This suggests that a longer experiment with twice as many experimenters would be needed to detect this effect.

“Alerts Over Time”, Branwen 2013

Google-Alerts: “Alerts Over Time”, Gwern Branwen (2013-07-01):

Does Google Alerts return fewer results each year? A statistical investigation

Has Google Alerts been sending fewer results the past few years? Yes. Responding to rumors of its demise, I investigate the number of results in my personal Google Alerts notifications 2007-2013, and find no overall trend of decline until I look at a transition in mid-2011 where the results fall dramatically. I speculate about the cause and implications for Alerts’s future.

“Magnesium Self-Experiments”, Branwen 2013

Magnesium: “Magnesium Self-Experiments”, Gwern Branwen (2013-05-13):

3 magnesium self-experiments on magnesium l-threonate and magnesium citrate.

Encouraged by TruBrain’s magnesium & my magnesium l-threonate use, I design and run a blind random self-experiment to see whether magnesium citrate supplementation would improve my mood or productivity. I collected ~200 days of data at two dose levels. The analysis finds that the net effect was negative, but a more detailed look shows time-varying effects with a large initial benefit negated by an increasingly-negative effect. Combined with my expectations, the long half-life, and the higher-than-intended dosage, I infer that I overdosed on the magnesium. To verify this, I will be running a followup experiment with a much smaller dose.

“Caffeine Wakeup Experiment”, Branwen 2013

Caffeine: “Caffeine wakeup experiment”, Gwern Branwen (2013-04-07):

Self-experiment on whether consuming caffeine immediately upon waking results in less time in bed & higher productivity. The results indicate a small and uncertain effect.

One trick to combat morning sluggishness is to get caffeine extra-early by using caffeine pills shortly before or upon trying to get up. From 2013-2014 I ran a blinded & placebo-controlled randomized experiment measuring the effect of caffeine pills in the morning upon awakening time and daily productivity. The estimated effect is small and the posterior probability relatively low, but a decision analysis suggests that since caffeine pills are so cheap, it would be worthwhile to conduct another experiment; however, increasing Zeo equipment problems have made me hold off additional experiments indefinitely.

“Predicting Google Closures”, Branwen 2013

Google-shutdowns: “Predicting Google closures”, Gwern Branwen (2013-03-28):

Analyzing predictors of Google abandoning products; predicting future shutdowns

Prompted by the shutdown of Google Reader, I ponder the evanescence of online services and wonder what is the risk of them disappearing. I collect data on 350 Google products launched before March 2013, looking for variables predictive of mortality (web hits, service vs software, commercial vs free, FLOSS, social networking, and internal vs acquired). Shutdowns are unevenly distributed over the calendar year or Google’s history. I use logistic regression & survival analysis (which can deal with right-censoring) to model the risk of shutdown over time and examine correlates. The logistic regression indicates socialness, acquisitions, and lack of web hits predict being shut down, but the results may not be right. The survival analysis finds a median lifespan of 2824 days with a roughly Type III survival curve (high early-life mortality); a Cox regression finds results similar to the logistic regression: socialness, free, acquisition, and long life predict lower mortality. Using the best model, I make predictions about probability of shutdown of the most risky and least risky services in the next 5 years (up to March 2018). (All data & R source code is provided.)
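To show how survival analysis copes with products that are still alive when the data are collected, here is a minimal Kaplan-Meier estimator in Python. It is an illustration only, not the R code provided with the essay, and the function names are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve as a list of (time, survival probability).

    durations: lifetime so far of each product.
    observed:  True if the shutdown was seen, False if the product was still
               alive at the end of data collection (right-censored).
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = removed = 0
        while i < len(events) and events[i][0] == t:
            deaths += 1 if events[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored products leave the risk set without a death
    return curve

def median_survival(curve):
    """First time at which estimated survival drops to 50% or below, if any."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

Censored products still count toward the number at risk until the moment they drop out of observation, which is how the estimator uses their partial information instead of discarding them or treating them as shutdowns.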

“Weather and My Productivity”, Branwen 2013

Weather: “Weather and My Productivity”, Gwern Branwen (2013-03-19):

Rain or shine affect my mood? Not much.

It is often said that weather affects our mood, and that people in sunnier places are happier because of it. Curious about the possible effect (it could be worth controlling for in my future QS analyses or attempting to imitate benefits inside my house eg. brighter lighting), I combine my long-term daily self-ratings with logs from the nearest major official weather stations, which offer detailed weather information about temperature, humidity, precipitation, cloud cover, wind speed, brightness etc, and try to correlate them.

In general, despite considerable data, there are essentially no bivariate correlations, nothing in several versions of a linear model, and nothing found by a random forest. It would appear that weather does not correlate with my self-ratings to any detectable degree, much less cause them.

“Potassium Sleep Experiments”, Branwen 2012

Potassium: “Potassium sleep experiments”, Gwern Branwen (2012-12-21):

2 self-experiments on potassium citrate effects on sleep: harm to sleep when taken daily or in the morning

Potassium and magnesium are minerals that many Americans are deficient in. I tried using potassium citrate and immediately noticed difficulty sleeping. A short randomized (but not blinded) self-experiment of ~4g potassium taken throughout the day confirmed large negative effects on my sleep. A longer followup randomized and blinded self-experiment used standardized doses taken once a day early in the morning, and also found some harm to sleep, and I discontinued potassium use entirely.

“2012 Election Predictions”, Branwen 2012

2012-election-predictions: “2012 election predictions”, Gwern Branwen (2012-11-05):

Compiling academic and media forecasters’ 2012 American Presidential election predictions and statistically judging correctness; Nate Silver was not the best.

Statistically analyzing in R hundreds of predictions compiled for ~10 forecasters of the 2012 American Presidential election, and ranking them by Brier, RMSE, & log scores; the best overall performance seems to be by Drew Linzer and Wang & Holbrook, while Nate Silver appears somewhat over-rated and the famous Intrade prediction market turns in a disappointing overall performance.
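For reference, the three scoring rules can be written in a few lines. This is a Python sketch of the standard definitions, not the R code used in the analysis; the numbers in the usage lines are made up.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error of probability forecasts against 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes):
    """Mean log-probability assigned to what actually happened (higher is better)."""
    return sum(math.log(p if o else 1.0 - p) for p, o in zip(probs, outcomes)) / len(probs)

def rmse(predicted, actual):
    """Root-mean-squared error for point predictions such as vote shares."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(predicted, actual)) / len(predicted))

# Made-up forecasts: 80% and 30% win probabilities for two states won and lost.
example_brier = brier_score([0.8, 0.3], [1, 0])
example_log = log_score([0.8, 0.3], [1, 0])
example_rmse = rmse([50.0, 52.0], [51.0, 51.0])
```

The Brier and log scores apply to probability forecasts of binary outcomes, while RMSE applies to numeric predictions; the log score penalizes confident misses much more heavily than the Brier score does.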

“‘HP: Methods of Rationality’ Review Statistics”, Branwen 2012

hpmor: “‘HP: Methods of Rationality’ review statistics”, Gwern Branwen (2012-11-03):

Recording fan speculation for retrospectives; statistically modeling reviews for ongoing story with R

The unprecedented gap in Methods of Rationality updates prompts musing about whether readership is increasing enough & what statistics one would use; I write code to download reviews, clean it, parse it, load into R, summarize the data & depict it graphically, run linear regression on a subset & all reviews, note the poor fit, develop a quadratic fit instead, and use it to predict future review quantities.

Then, I run a similar analysis on a competing fanfiction to find out when they will have equal total review-counts. A try at logarithmic fits fails; fitting a linear model to the previous 100 days of MoR and the competitor works much better, and they predict a convergence in <5 years.

A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters, consistent with (but far from proving) the original theory that the later chapters’ delays are having negative effects.

“LSD Microdosing RCT”, Branwen 2012

LSD-microdosing: “LSD microdosing RCT”, Gwern Branwen (2012-08-20):

Self-experiment with sub-psychedelic doses of LSD; no benefit

Some early experimental studies with LSD suggested that doses of LSD too small to cause any noticeable effects may improve mood and creativity. Prompted by recent discussion of this claim and the purely anecdotal subsequent evidence for it, I decided to run a well-powered randomized blind trial of 3-day LSD microdoses from September 2012 to March 2013. No beneficial effects reached statistical-significance and there were worrisome negative trends. LSD microdosing did not help me.

“DNM-related Arrests, 2011–2015”, Branwen 2012

DNM-arrests: “DNM-related arrests, 2011–2015”, Gwern Branwen (2012-07-14):

A census database of all publicly-reported arrests and prosecutions connected to the Tor-Bitcoin drug darknet markets 2011-2015, and analysis of mistakes.

I compile a table and discussion of all known arrests and prosecutions related to English-language Tor-Bitcoin darknet markets (DNMs) such as Silk Road 1, primarily 2011–2015, along with discussion of how they came to be arrested.

“Treadmill Desk Observations”, Branwen 2012

Treadmill: “Treadmill desk observations”, Gwern Branwen (2012-06-19):

Notes relating to my use of a treadmill desk and 2 self-experiments showing walking treadmill use interferes with typing and memory performance.

It has been claimed that doing spaced repetition review while on a walking treadmill improves memory performance. I did a randomized experiment August 2013 – May 2014 and found that using a treadmill damaged my recall performance.

“A/B Testing Long-form Readability on”, Branwen 2012

AB-testing: “A/B testing long-form readability on”, Gwern Branwen (2012-06-16):

A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.

To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/​B tests since June 2012. As expected, most do not show any meaningful difference.

“Dual N-Back Meta-Analysis”, Branwen 2012

DNB-meta-analysis: “Dual n-Back Meta-Analysis”, Gwern Branwen (2012-05-20):

Does DNB increase IQ? What factors affect the studies? Probably not: gains are driven by studies with weakest methodology like apathetic control groups.

I meta-analyze the >19 studies up to 2016 which measure IQ after an n-back intervention, finding (over all studies) a net gain (medium-sized) on the post-training IQ tests.

The size of this increase on IQ test score correlates highly with the methodological concern of whether a study used active or passive control groups⁠. This indicates that the medium effect size is due to methodological problems and that n-back training does not increase subjects’ underlying fluid intelligence but the gains are due to the motivational effect of passive control groups (who did not train on anything) not trying as hard as the n-back-trained experimental groups on the post-tests. The remaining studies using active control groups find a small positive effect (but this may be due to matrix-test-specific training, undetected publication bias, smaller motivational effects, etc.)

I also investigate several other n-back claims, criticisms, and indicators of bias, finding:

“Redshift Sleep Experiment”, Branwen 2012

Redshift: “Redshift sleep experiment”, Gwern Branwen (2012-05-09):

Self-experiment on whether screen-tinting software such as Redshift/f.lux affects sleep times and sleep quality; Redshift lets me sleep earlier but doesn’t improve sleep quality.

I ran a randomized experiment from 2012–2013 with a free program (Redshift) which reddens screens at night to avoid tampering with melatonin secretion & sleep, measuring sleep changes with my Zeo. With 533 days of data, the main result is that Redshift causes me to go to sleep half an hour earlier but otherwise does not improve sleep quality.

“Iodine and Adult IQ Meta-analysis”, Branwen 2012

Iodine: “Iodine and Adult IQ meta-analysis”, Gwern Branwen (2012-02-29):

Iodine improves IQ in fetuses; adults as well? A meta-analysis of relevant studies says no.

Iodization is one of the great success stories of public health intervention: iodizing salt costs pennies per ton, but as demonstrated in randomized & natural experiments, prevents goiters, cretinism, and can boost population IQs by a fraction of a standard deviation in the most iodine-deficient populations.

These experiments are typically done on pregnant women, and results suggest that the benefits of iodization diminish throughout the trimesters of a pregnancy. So does iodization benefit normal healthy adults, potentially even ones in relatively iodine-sufficient Western countries?

Compiling existing post-natal iodization studies which use cognitive tests, I find that, outliers aside, the benefit appears to be nearly zero, and so iodization likely does not help normal healthy adults, particularly those in Western countries.

“LW Anchoring Experiment”, Branwen 2012

Anchoring: “LW anchoring experiment”, Gwern Branwen (2012-02-27):

Do mindless positive/​negative comments skew article quality ratings up and down?

I do an informal experiment testing whether LessWrong karma scores are susceptible to a form of anchoring based on the first comment posted; a medium-large effect size is found. However, the data do not fit the assumed normal distribution, so there may or may not be any actual anchoring effect.

“Vitamin D Sleep Experiments”, Branwen 2012

Vitamin-D: “Vitamin D sleep experiments”, Gwern Branwen (2012):

Self-experiment on vitamin D effects on sleep: harmful taken at night, no or beneficial effects when taken in the morning.

Vitamin D is a hormone endogenously created by exposure to sunlight; due to historically low outdoors activity levels, it has become a popular supplement and I use it. Some anecdotes suggest that vitamin D may have circadian and zeitgeber effects due to its origin, and is harmful to sleep when taken at night. I ran a blinded randomized self-experiment on taking vitamin D pills at bedtime. The vitamin D damaged my sleep and especially how rested I felt upon wakening, suggesting vitamin D did have a stimulating effect which obstructed sleep. I conducted a followup blinded randomized self-experiment on the logical next question: if vitamin D is a daytime cue, then would vitamin D taken in the morning show some beneficial effects? The results were inconclusive (but slightly in favor of benefits). Given the asymmetry, I suggest that vitamin D supplements should be taken only in the morning.

“Silk Road 1: Theory & Practice”, Branwen 2011

Silk-Road: “Silk Road 1: Theory & Practice”, Gwern Branwen (2011-07-11):

History, background, visiting, ordering, using, & analyzing the drug market Silk Road 1

The cypherpunk movement laid the ideological roots of Bitcoin and the online drug market Silk Road; balancing previous emphasis on cryptography, I emphasize the non-cryptographic market aspects of Silk Road which is rooted in cypherpunk economic reasoning, and give a fully detailed account of how a buyer might use market information to rationally buy, and finish by discussing strengths and weaknesses of Silk Road, and what future developments are predicted by cypherpunk ideas.

“Tea Reviews”, Branwen 2011

Tea: “Tea Reviews”, Gwern Branwen (2011-04-13):

Teas I have drunk, with reviews and future purchases; focused primarily on oolongs and greens. Plus experiments on water.

Electric kettles are faster, but I was curious how much faster my electric kettle heated water to high or boiling temperatures than does my stove-top kettle. So I collected some data and compared them directly, trying out a number of statistical methods (principally: nonparametric & parametric tests of difference, linear & beta regression models, and a Bayesian measurement error model). My electric kettle is faster than the stove-top kettle (the difference is both statistically-significant p≪0.01 & the posterior probability of difference is P ≈ 1), and the modeling suggests time to boil is largely predictable from a combination of volume, end-temperature, and kettle type.

“Archiving URLs”, Branwen 2011

Archiving-URLs: “Archiving URLs”, Gwern Branwen (2011-03-10):

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.

Links on the Internet last forever or a year, whichever comes first. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, with the losses compounding.

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive⁠. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.

As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).

“Website Traffic”, Branwen 2011

Traffic: “Website Traffic”, Gwern Branwen (2011-02-03):

Meta page describing editing activity, traffic statistics, and referrer details, primarily sourced from Google Analytics (2011-present).

On a semi-annual basis, since 2011, I review website traffic using Google Analytics; although what most readers value is not what I value, I find it motivating to see total traffic statistics reminding me of readers (writing can be a lonely and abstract endeavour), and useful to see which referrers are major. The site typically enjoys steady traffic in the 50–100k range per month, with occasional spikes from social media, particularly Hacker News; over the first decade (2010–2020), there were 7.98m pageviews by 3.8m unique users.

“Zeo Sleep Self-experiments”, Branwen 2010

Zeo: “Zeo sleep self-experiments”, Gwern Branwen (2010-12-28):

EEG recordings of sleep and my experiments with things affecting sleep quality or duration: melatonin, potassium, vitamin D, etc.

I discuss my beliefs about Quantified Self⁠, and demonstrate with a series of single-subject design self-experiments using a Zeo. A Zeo records sleep via EEG; I have made many measurements and performed many experiments. This is what I have learned so far:

  1. the Zeo headband is wearable long-term
  2. melatonin improves my sleep
  3. one-legged standing does little
  4. vitamin D at night damages my sleep & vitamin D in the morning does not affect my sleep
  5. potassium (over the day but not so much the morning) damages my sleep and does not improve my mood/​productivity
  6. small quantities of alcohol appear to make little difference to my sleep quality
  7. I may be better off changing my sleep timing by waking up somewhat earlier & going to bed somewhat earlier
  8. lithium orotate does not affect my sleep
  9. Redshift causes me to go to bed earlier
  10. ZMA: inconclusive results slightly suggestive of benefits

“Nootropics”, Branwen 2010

Nootropics: “Nootropics”, Gwern Branwen (2010-01-02)

“Who Wrote The ‘Death Note’ Script?”, Branwen 2009

Death-Note-script: “Who Wrote The ‘Death Note’ Script?”, Gwern Branwen (2009-11-02):

Internal, external, and stylometric evidence point to the leaked script for the live-action Hollywood Death Note being real.

I give a history of the 2009 leaked script, discuss internal & external evidence for its realness (including stylometrics), and then give a simple step-by-step Bayesian analysis of each point. We finish with high confidence in the script being real, discussion of how this analysis was surprisingly enlightening, and what followup work the analysis suggests would be most valuable.
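A step-by-step Bayesian analysis of this kind can be sketched as multiplying prior odds by one likelihood ratio per piece of evidence. The sketch below assumes the pieces of evidence are conditionally independent, and the prior and the ratios are placeholders invented for illustration; they are not the values used in the essay.

```python
from math import prod

def posterior_probability(prior_probability, likelihood_ratios):
    """Update a prior probability using one likelihood ratio per piece of evidence.

    Each ratio is P(evidence | real) / P(evidence | fake); multiplying them
    assumes the pieces of evidence are conditionally independent.
    """
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)

# Placeholder values for illustration only (NOT the essay's numbers).
ILLUSTRATIVE_PRIOR = 0.5
ILLUSTRATIVE_RATIOS = [2.0, 1.5, 3.0, 0.8]

posterior = posterior_probability(ILLUSTRATIVE_PRIOR, ILLUSTRATIVE_RATIOS)
print(f"posterior probability the script is real: {posterior:.3f}")
```

Working in odds keeps each update a single multiplication, which makes it easy to see how much any one piece of evidence moves the conclusion.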

“Miscellaneous”, Branwen 2009

Notes: “Miscellaneous”, Gwern Branwen (2009-08-05):

Misc thoughts, memories, proto-essays, musings, etc.

We usually clean up after ourselves, but sometimes, we are expected to clean before (ie. after others) instead. Why?

Because in those cases, pre-cleanup is the same amount of work, but game-theoretically better whenever a failure of post-cleanup would cause the next person problems.

“Dual N-Back FAQ”, Branwen 2009

DNB-FAQ: “Dual n-Back FAQ”, Gwern Branwen (2009-03-25):

A compendium of DNB, WM⁠, IQ information up to 2015.

Between 2008 and 2011, I collected a number of anecdotal reports about the effects of n-backing; there are many other anecdotes out there, but the following are a good representation—for what they’re worth.

“In Defense of Inclusionism”, Branwen 2009

In-Defense-Of-Inclusionism: “In Defense of Inclusionism”, Gwern Branwen (2009-01-15):

Iron Law of Bureaucracy: the downwards deletionism spiral discourages contribution and is how Wikipedia will die.

English Wikipedia is in decline. As a long-time editor & former admin, I was deeply dismayed by the process. Here, I discuss UI principles, changes in Wikipedian culture, and the large-scale statistical evidence of decline; run small-scale experiments demonstrating the harm; and conclude with parting thoughts.

“Brms: an R Package for Bayesian Generalized Multivariate Non-linear Multilevel Models Using Stan”, Bürkner 2022

“brms: an R package for Bayesian generalized multivariate non-linear multilevel models using Stan”, Paul Bürkner:

The brms package provides an interface to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan⁠, which is a C++ package for performing full Bayesian inference. The formula syntax is very similar to that of the package lme4 to provide a familiar and simple interface for performing regression analyses.

A wide range of response distributions are supported, allowing users to fit—among others—linear, robust linear, count data, survival, response times, ordinal, zero-inflated, and even self-defined mixture models all in a multilevel context. Further modeling options include non-linear and smooth terms, auto-correlation structures, censored data, missing value imputation⁠, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Multivariate models (ie. models with multiple response variables) can be fit, as well.

Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs.

Model fit can easily be assessed and compared with posterior predictive checks, cross-validation, and Bayes factors.