I discuss my beliefs about Quantified Self, and demonstrate with a series of single-subject design self-experiments using a Zeo. A Zeo records sleep via EEG; I have made many measurements and performed many experiments. This is what I have learned so far:

1. the Zeo headband is wearable long-term
2. melatonin improves my sleep
3. one-legged standing does little
4. Vitamin D (at night) damages my sleep
5. Vitamin D (in morning) improves my morning mood
6. potassium (over the day but not so much the morning) damages my sleep and does not improve my mood/productivity

Quantified Self (QS) is a movement with many faces and as many variations as participants, but the core of everything is this: experiment with things that can improve your life.

# What is QS?

Quantified Self is not expensive devices, or meet-ups, or videos, or even ebooks telling you what to do. Those are means to an end. If reading this page does anything, my hope is to pass on to some readers the Quantified Self attitude: a playful, thoughtful attitude of wondering whether this thing affects that other thing and what implications could be easily tested. Science without the capital S or the belief that only scientists are allowed to think.

That’s all Quantified Self is, no matter how simple or complicated your devices, no matter how automated your data collection, no matter whether you found a pedometer lying around or hand-engineered your own EEG headset.

Quantified Self is simply about having ideas, gathering some data, seeing what it says, and improving one’s life based on the data. If gathering data is too hard and would make your life worse off - then don’t do it! If the data can’t make your life better - then don’t do it! Not every idea can or should be tested.

The QS cycle is straightforward and flexible:

1. Have an idea
2. Gather data
3. Test the data
4. Make a change; GOTO 1

Any of these steps can overlap: you may be collecting sleep data long before you have the idea (in the expectation that you will have an idea), or you may be making the change as part of the data in an experimental design, or you may inadvertently engage in a natural experiment before wondering what the effects were (perhaps the baby wakes you up on random nights and lets you infer the costs of poor sleep).

The point is not publishable scientific rigor. If you are the sort of person who wants to run such rigorous self-experiments, fantastic! The point is making your life better, for which scientific certainty is not necessary: imagine you are choosing between two sleep pills of equal price and equal safety; the first will make you fall asleep 1 minute faster and has been validated in countless scientific trials, while the second has, in the past week, ended the sweaty nightmares that have plagued you every few days since childhood but, alas, has only a few small trials in its favor. Which would you choose? I would choose the second pill!

To put it in more economic/statistical terms, what we want from a self-experiment is for it to give us a confidence just good enough to tell whether the expected value of our idea is more than the idea will cost. But we don’t need more confidence unless we want to persuade other people! (So from this perspective, it is possible to do a QS self-experiment which is too good. Much like one can overpay for safety and buy too much insurance - like extra warranties on electronics such as video game consoles, a notorious rip-off.)

## What QS Is Not: (Just) Data Gathering

One failure mode which is particularly dangerous for QSers is to overdo the data collection and collect masses of data they never use. Famous computer entrepreneur & mathematician Stephen Wolfram exemplified this for me in March 2012 with his lengthy blog post The Personal Analytics of My Life, in which he did some impressive graphing and exploration of data from 1989 to 2012: a third of a million (!) emails, full keyboard logging, calendar, phone call logs (with missed calls included), a pedometer, revision history of his tome A New Kind of Science, file types accessed per date, parsing scanned documents for dates, a treadmill, and perhaps more he didn’t mention.

Wolfram’s dataset is well-depicted in informative graphs, breathtaking in its thoroughness, and even more impressive for its duration. So why do I read his post with sorrow? I am sad for him because I have read the post several times, and as far as I can see, he has not benefited in any way from his data collection, with one minor exception:

Very early on, back in the 1990s, when I first analyzed my e-mail archive, I learned that a lot of e-mail threads at my company would, by a certain time of day, just resolve themselves. That was a useful thing to know, because if I jumped in too early I was just wasting my time.

Nothing else in his life was better 1989-2012 because he did all this, and he shows no indication that he will benefit in the future (besides having a very nifty blog post). And just reading through his post with a little imagination suggests plenty of experiments he could do:

1. He mentions that 7% of his keystrokes are the Backspace key.

This seems remarkably high and must be slowing down his typing by a nontrivial amount. Why doesn’t he try a typing tutor to see if he can improve his typing skill, or learn the keyboard shortcuts in his text editor? If he is wasting >7% of all his typing (because he had to type what he is Backspacing over, of course), then he is wasting typing time, slowing things down, adding frustration to his computer interactions and, worst, putting himself at greater risk of crippling RSI.
2. How often does he access old files? Since he records access to all files, he can ask whether all the logging is paying for itself.
3. Is there any connection between the steps his pedometer records and things like his mood or emailing? Exercise has been linked to many benefits, both physical and mental, but on the other hand, walking isn’t a very quick form of exercise. Which effect predominates? This could have the practical consequence of scheduling a daily walk just as he tries to make sure he can have dinner with his family.
4. Does a flurry of emails or phone calls disrupt his other forms of productivity that day? For example, while writing his book would he have been better off barricading himself in solitude or working on it in between other tasks?
5. His email counts are astonishingly high in general:

Is answering so many emails really necessary? Perhaps he has put too much emphasis on email communication, or perhaps this indicates he should delegate more - or if running Mathematica is so time-consuming, perhaps he should re-evaluate his life and ask whether that is what he truly wants to do now. I have no idea what the answers to any of these questions are or whether an experiment of any kind could be run on them, but these are key life decisions which could be prompted by the data - but weren’t.

Another QS piece (It’s Hard to Stay Friends With a Digital Exercise Monitor) struck me when the author, Jenna Wortham, reflected on her experience with her Nike+ FuelBand motion sensor:

The forgetfulness and guilt I experienced as my FuelBand honeymoon wore off is not uncommon, according to people who study behavioral science. The collected data is often interesting, but it is hard to analyze and use in a way that spurs change. It doesn’t trigger you to do anything habitually, said Michael Kim, who runs Kairos Labs, a Seattle-based company specializing in designing social software to influence behavior…Mr. Kim, whose résumé includes a stint as director of Xbox Live, the online gaming system created by Microsoft, said the game-like mechanisms of the Nike device and others like it were not enough for the average user. Points and badges do not lead to behavior change, he said.

One thinks of a saying of W. Edwards Deming: Experience by itself teaches nothing. Indeed. A QS experiment is a 4-legged beast: if any leg is far too short or far too long, it can’t carry our burdens.

And with Wolfram and Wortham, we see that 2 legs of the poor beast have been amputated. They collected data, but they had no ideas and they made no changes in their life; and because QS was not part of their life, it soon left their life. Wortham seems to have dropped the approach entirely, and Wolfram may only persevere for as long as the data continues to be useful in demonstrating the abilities of his company’s products.

# Zeo QS

On Christmas 2010, I received one of Zeo Inc’s (founded 2003, shutting down 2013) Zeo bedside units after long coveting one and dreaming of using it for all sorts of sleep-related questions. (As of February 2013, the bedside unit seems to’ve been discontinued; the most comparable Zeo Inc. product seems to be the Zeo Sleep Manager Pro, ~$90.) With it, I began to apply my thoughts about Quantified Self. A Zeo is a scaled-down (one-electrode) EEG sensor-headband, which happens to have an alarm clock attached. The EEG data is processed to estimate whether one is asleep and what stage of sleep one is in. Zeo breaks sleep down into waking, REM, light, and deep. (The phases aren’t necessarily that physiologically distinct.) It’s been compared with regular polysomnography by Zeo Inc and others and seems to be reasonably accurate. (Since regular sleep tests cost thousands of dollars per session and are of questionable external validity since they are a very different setting than your own bedroom, I am fine with a Zeo being just reasonably accurate.) The data is much better than what you would get from more popular methods like cellphones with accelerometers, since an accelerometer only knows if you are moving or not, which isn’t a very reliable indicator of sleep1. (You could just be lying there staring at the ceiling, wide awake. Or perhaps the cat is kneading you while you are in light sleep.) As well, half the interest is how exactly sleep phases are arranged and how long the cycles are; you could use that information to devise a custom polyphasic schedule or just figure out a better nap length than the rule-of-thumb of 20 minutes. And the price isn’t too bad - $150 for the normal Zeo as of February 2012. (The basic mobile Zeo is much cheaper, but I’ve seen people complain about it and apparently it doesn’t collect the same data as the more expensive mobile version or the original bedside unit.)

# Tests

“A thinker sees his own actions as experiments & questions - as attempts to find out something. Success and failure are for him answers above all.” (Friedrich Nietzsche, The Happy Science #41)

I personally want the data for a few distinct purposes, but in the best Quantified Self vein, mostly experimenting:

1. more thoroughly quantifying the benefits of melatonin

• and dose levels: 1.5mg may be too much. I should experiment with a variety: 0.1, 0.5, 1.0, 1.5, and 3mg?
2. quantifying the costs of modafinil
3. testing benefits of huperzine-A2
4. designing & starting polyphasic sleep
5. assisting lucid dreaming
6. reducing sleep time in general (better & less sleep)
7. investigating effects of n-backing:

• do n-backing just before sleep, and see whether percentages shift (more deep sleep as the brain grows/changes?) or whether one sleeps better (fewer awakenings, less light sleep).
• do n-backing after waking up, to look for correlation between good/bad sleeps and performance (one would expect good sleep ~> good scores).
• test the costs of polyphasic sleep on memory3
8. (positive) effect of Seth Roberts’s one-legged standing on sleep depth/efficiency
9. possible sleep reductions due to meditation
10. serial cable uses:

• quantifying meditation (eg. length of gamma frequencies)
• rank music by distractibility?
• measure focus over the day and during specific activities (eg. correlate frequencies against n-backing performance)
11. testing benefit of using Redshift/f.lux to adjust monitor color temperature
12. measuring negative effect of nicotine on sleep & determining appropriate buffer
13. test claims of sleep benefits from magnesium

I have tried to do my little self-experiments as well as I know how to, and hopefully my results are less bogus than the usual anecdotes one runs into online. What I would really like is for other people (especially Zeo owners) to replicate my results. To that end I have taken pains to describe my setups in complete detail so others can use them, and provided the data and complete R or Haskell programs used in analysis. If anyone replicates my results in any fashion, please contact me and I would be happy to link your self-experiment here!

# First impressions

## First night

Christmas morning, I unpacked it and admired the packaging, and then looked through the manual. The base-station/alarm-clock seems pretty sturdy and has a large clear screen. The headband seemed comfortable enough that it wouldn’t bother me. The various writings with it seemed rather fluffy and preppy, but I had done my technical homework beforehand, so I could ignore their crap.

Late that night (quite late, since the girls stayed up playing Fable 3 and Xbox Kinect dancing games and whatnot), I turned in wearily. I had noticed that the alarm seemed to be set for ~3:30 AM, but I was very tired from the long day and from having taken my melatonin, and didn’t investigate further - I mean, what electronic device would ship with the alarm both enabled and set for a bizarre time? It wasn’t worth bothering the other sleeper by turning on the light and messing with it. I put on the headband, verified that the Zeo seemed to be doing stuff, and went to sleep. Come 3 AM, and the damn music goes off! I hit snooze, too discombobulated to figure out how to turn off the alarm.

So that explains the strange Zeo data for the first day:

The major surprise in this data was how quickly I fell asleep: 18 minutes. I had always thought that I took much longer to fall asleep, more like 45 minutes, and had budgeted accordingly; but apparently being deluded about when you are awake and asleep is common - which leads into an interesting philosophical point: if your memories disagree with the Zeo, who should you believe? The rest of the data seemed too messed up by the alarm to learn anything from.

# Uses

## Meditation

One possible application for Zeo was meditation. If it’s measuring via EEG, then presumably it’s learning something about how relaxed and activity-less one’s mind is. I’m not seeking enlightenment, just calmness, which would seem to be in the purview of an EEG signal. (As Charles Babbage said, errors made using insufficient data are still less than errors made using no data at all.) But alas, I meditated for a solid 25 minutes and the Zeo stubbornly read at the same wake level the entire time; I then read my Donald Keene book, Modern Japanese diaries, for a similar period with no change at all. It is possible that the 5-minute averaging (Zeo measures every 2 seconds) is hiding useful changes, but probably it’s simply not picking up any real differences. Oh well.

## Smart alarm

The second night I had set the alarm to a more reasonable time, and also enabled its smart alarm mode (SmartWake), where the alarm will go off up to 30 minutes early if you are ever detected to be awake or in light sleep (as opposed to REM or deep sleep). One thing I forgot to do was take my melatonin; I keep my supplements in the car and there was a howling blizzard outside. It didn’t bother me since I am not addicted to melatonin.

In the morning, the smart alarm mode seemed to work pretty well. I woke up early in a good mood, thought clearly and calmly about the situation - and went back to sleep. (It’s a holiday, after all.)
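The smart-alarm rule described above can be sketched in a few lines of Python. This only illustrates the stated rule (fire up to 30 minutes early if the sleeper is awake or in light sleep); it is not Zeo's actual firmware logic. The function name, the per-minute stage list, and the stage labels are all assumptions made up for the example.

```python
def smart_wake_minute(stages, alarm_minute, window=30):
    """Return the minute at which a SmartWake-style alarm would fire.

    stages: hypothetical per-minute sleep-stage labels
            ("wake", "light", "rem", "deep"), indexed from bedtime.
    alarm_minute: the minute the alarm is set for.
    window: how many minutes early the alarm is allowed to fire.
    """
    earliest = max(0, alarm_minute - window)
    for minute in range(earliest, alarm_minute):
        if stages[minute] in ("wake", "light"):
            return minute  # fire early: the sleeper is easy to rouse
    return alarm_minute    # never caught in a light phase: fire on time


# Toy night (invented): deep sleep, then REM, then light sleep from minute 465.
night = ["deep"] * 440 + ["rem"] * 25 + ["light"] * 15
fire_at = smart_wake_minute(night, alarm_minute=480)
```

On this toy night the alarm fires at the first light-sleep minute inside the 30-minute window, rather than at the set time.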

Around 15 May 2011, I gave up on the original headband - it was getting too dirty to get good readings - and decided to rip it apart to see what it was made of, and to order a new set of three for $35 (which seems reasonable given the expensive material that the contacts are made of - silver fabric); they then cost $50. A little googling found me a coupon, FREESHIP, but apparently it only applied to the Zeo itself and so the pads were actually $40, or ~$13 a piece. I won’t say that buying replacement headbands semi-annually is something that thrills me, but $20 a year for sleep data is a small sum. Certainly it’s more cost-effective than most of the nootropics I have used. (Full disclosure: 9 months after starting this page, Zeo offered me a free set of sensors. I used them and when the news broke about Zeo going out of business, I bought another set.)

In the future, I might try to make my own; eok.gnah claims that buying the silver fabric is apparently cheaper than ordering from Zeo, marciot reports success in making headbands, and it seems one can even hook up other sensors to the headband. Another alternative is, since the Zeo headband is a one-electrode EEG headset, to take an approach similar to the EEG people and occasionally add small dabs of conductive paste, since fairly large quantities are cheap (eg. 12oz for $30).

# Melatonin

Before writing my melatonin advocacy article, I had used melatonin regularly for 6+ years, ever since I discovered (somewhen in high school or college) that it was useful for enforcing bedtimes and seemed to improve sleep quality; when I posted my writeup to LessWrong, people were naturally a little skeptical of my specific claim that it improved the quality of my sleep such that I could reduce scheduled time by an hour or so. Now that I had a Zeo, wouldn’t it be a good idea to see whether it did anything, lo these many years later?

## Melatonin analysis

The following section represents 5 or 6 months of data (raw CSV data; guide to Zeo CSV). My basic dosage was 1.5mg of melatonin taken 0-30 minutes before going to sleep. The data is very noisy (especially towards the end, perhaps as the headband got dirty), but hopefully the overall conclusions are not entirely untrustworthy. Let’s look at some averages. Zeo’s website lets you enter in a 3-valued variable and then graph the average day for each variable against a particular recorded property like ZQ or total length of REM sleep. I defined one dummy variable, and decided that a 0 would correspond to not using melatonin, 1 would correspond to using it, and 2 would correspond to using a double-dose or more (on the rare occasions I felt I needed sleep insurance). The following additional NHST-style4 analyses of p-values56 are done by importing the CSV into R; given all the issues with self-experimentation (these melatonin days weren’t even blinded), the p-values should be treated as gross guesses, where <0.01 indicates I should take it seriously, <0.05 is pretty good, <0.10 means I shouldn’t sweat it, and anything bigger than 0.20 is, at most, interesting while >0.5 means ignore it; we’ll also look at correcting for multiple comparisons7, for the heck of it.
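As a rough illustration of the "average per dummy-variable level" comparison described above, here is a minimal Python sketch (the actual analysis was done in R). The rows are invented purely so the example runs and are not my data; the column names are hypothetical and do not match the real Zeo CSV export.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; these are NOT real Zeo data,
# and the column names are hypothetical rather than the Zeo export's.
SAMPLE_CSV = """date,melatonin,total_sleep_min,rem_min
2011-01-01,0,470,150
2011-01-02,1,410,120
2011-01-03,0,450,140
2011-01-04,1,400,115
2011-01-05,2,390,110
"""

def group_means(csv_text, flag_column, value_column):
    """Average value_column separately for each level of flag_column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[int(row[flag_column])].append(float(row[value_column]))
    return {flag: mean(values) for flag, values in sorted(groups.items())}

averages = group_means(SAMPLE_CSV, "melatonin", "total_sleep_min")
for flag, avg in averages.items():
    print(f"melatonin={flag}: mean total sleep {avg:.1f} min")
```

The same function can be pointed at any other recorded column (REM, ZQ, morning feel) by changing `value_column`.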

As expected, when using melatonin, total sleep and times awoken were both reduced (p=0.083 & p=0.431); the total sleep fell by 12% (456 minutes to 402 minutes, effect size d=0.37):

The calculated effect size is a little worrisome; at a medium-small d=0.37, power calculations indicate that we’re close to the limit of what the data can reliably tell us9. (Effect size is as important as significance values and in many contexts, more important, as it measures how strong the phenomenon is. A mnemonic: p-values are about whether the effect exists, and d-values are whether we care. For a visualization of effect sizes, see Windowpane as a Jar of Marbles.) Part of the problem is that too many days wound up being useless, and each day costs us information and reduces our true sample size.
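To make the effect-size and power ideas concrete, here is a small Python sketch. It computes Cohen's d from two samples using the pooled standard deviation, and approximates the power of a two-sided two-sample test with a normal approximation. This is not the power calculation actually used for this page; the function names and any sample sizes passed in are assumptions for illustration.

```python
import math
from statistics import NormalDist, mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test
    with n_per_group observations in each condition."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = abs(d) * math.sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)

# Toy samples (invented) just to show the call; d here is 0.5.
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])

# Hypothetical question: how does power at d=0.37 change with usable nights?
power_by_n = {n: approx_power(0.37, n) for n in (25, 50, 100, 200)}
```

Under this approximation power rises steadily with the number of usable nights per condition, which is why every discarded night matters at an effect size this small.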

Deep sleep and time in wake were both apparently unaffected (p=0.88 & p=0.53); time in wake apparently had too small a sample to draw much conclusion:

Surprisingly, total REM sleep fell 20% (145 to 116 minutes), for a large & statistically-significant reduction (p=0.011):

Given that the ZQ metric seems to be primarily determined by total sleep and REM sleep, it’s not surprising - given the REM graph - that ZQ also fell (p=0.087):

REM’s average fell by 29 minutes, deep sleep fell by 1 minute, but total sleep fell by 54 minutes; this implies that light sleep fell by 24 minutes. (The averages were 254.2 & 233.3; p=0.22) I am not sure what to make of this. While my original heuristic of a one hour reduction turns out to be surprisingly accurate, I had expected light and deep sleep to take most of the time hit. Do I get enough REM sleep? I don’t know how I would answer that.

I did feel fine on the days after melatonin use, but I didn’t track it very systematically. The best I have is the morning feel parameter, which the Zeo asks you on waking up; in practice I entered the values as: a 2 means I woke feeling poor or unrested, 3 was fine or mediocre, and 4 was feeling good. When we graph the average of morning feel against melatonin use or non-use, we find that melatonin was noticeably better (2.95 vs 3.17, p=0.18):

None of the metrics are strong enough to survive multiple-comparison correction10, sadly.
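As a quick check of that claim, here is a Python sketch that applies Holm's step-down correction to the eight p-values quoted above. Which correction was originally used is not stated here, so Holm (and the plain Bonferroni threshold alongside it) is an assumption, and the function name is hypothetical.

```python
def holm_rejections(p_values, alpha=0.05):
    """Return the names of tests that stay significant under Holm's
    step-down correction (at least as powerful as plain Bonferroni)."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    survivors = []
    for rank, (name, p) in enumerate(ordered):
        if p > alpha / (m - rank):
            break  # the first failure ends the procedure
        survivors.append(name)
    return survivors

# p-values as quoted in the analysis above.
melatonin_p = {
    "total sleep": 0.083,
    "times awoken": 0.431,
    "deep sleep": 0.88,
    "time in wake": 0.53,
    "REM sleep": 0.011,
    "ZQ": 0.087,
    "light sleep": 0.22,
    "morning feel": 0.18,
}

survivors = holm_rejections(melatonin_p)
bonferroni_threshold = 0.05 / len(melatonin_p)
```

With eight tests, the strictest threshold is 0.05/8 = 0.00625; the smallest p-value here (0.011, for REM) is above it, so nothing survives under either Holm or Bonferroni.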

And also unfortunately, this dataseries doesn’t distinguish between addiction to melatonin or benefits from melatonin - perhaps the 3.2 is my normal sleep quality and the 2.9 comes from a withdrawal of sorts. The research on melatonin doesn’t indicate any addiction effect, but who knows? So, onwards to other measures of mental performance. Also unfortunately, during this period, I didn’t regularly do my n-backing either, so there’d be little point trying to graph that. What I spent a lot of my free time doing was editing gwern.net, so it might be worth looking at whether nights on melatonin correspond to increased edits the next day. In this graph of edits, the red dots are days without melatonin and the green are days with melatonin; I don’t see any clear trend, although it’s worth noting almost all of the very busy days were melatonin days:

If I were to run further experiments, I would definitely run them double-blind, and maybe even test <1.5mg doses as well to see if I’ve been taking too much; 3mg turned out to be excessive, and there are one or two studies indicating that <1mg doses are best for normal people. I wound up using 1.5mg doses. (There could be 3 conditions: placebo, 0.75mg, and 1.5mg. For looking at melatonin effect in general, the data on 2 dosages could be combined. Melatonin has a short half-life, so probably there would be no point in random blocks of more than 2-3 days11: we can randomize each day separately and assume that days are independent of each other.)
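A per-day randomization like the one proposed above could be generated with something as simple as the following Python sketch. The condition labels come from the text; the function name, the 30-day length, and the seed are arbitrary assumptions. For real blinding, someone else (or a sealed file) would have to hold the schedule and prepare the pills.

```python
import random

def daily_schedule(n_days, conditions=("placebo", "0.75mg", "1.5mg"), seed=None):
    """Independently assign one condition to each day.

    Days are drawn separately, matching the assumption that days are
    independent; group sizes are therefore only roughly balanced.
    """
    rng = random.Random(seed)
    return [rng.choice(conditions) for _ in range(n_days)]

# Hypothetical 30-day run; fixing the seed makes the schedule reproducible.
schedule = daily_schedule(30, seed=2013)
```

Drawing each day independently is the simplest design. A shuffled, balanced list would guarantee equal group sizes at the cost of slight dependence between days.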

Worth comparing are Jayson Virissimo’s preliminary results:

According to the preliminary [Zeo] data, while on melatonin, I seemed to get more total sleep, more REM sleep, less deep sleep, and wake up about the same number of times each night. Because this isn’t enough data to be very confident in the results, I plan on continuing this experiment for at least another 4 months (2 on and 2 off of melatonin) and will analyze the results for the [statistical] significance and magnitude of the effects (if there really are any) while throwing out the outliers (since my sleep schedule is so erratic).

## Value of Information (VoI)

See also the discussion as applied to ordering modafinil and testing nootropics

We all know it’s possible to spend more time figuring out how to save time on a task than we would actually save, like rearranging books on a shelf or cleaning up in the name of efficiency (xkcd even has a cute chart listing the break-even points for various possibilities, Is It Worth The Time?), and similarly, it’s possible to spend more money trying to save money than one would actually save; less appreciated is that the same thing is also possible to do with gaining information.

The value of an experiment is the information it produces. What is the value of information? Well, we can take the economic tack and say value of information is the value of the decisions it changes. (Would you pay for a weather forecast about somewhere you are not going to? No. Or a weather forecast about your trip where you have to make that trip, come hell or high water? Only to the extent you can make preparations like bringing an umbrella.)

Wikipedia says that for a risk-neutral person, value of perfect information is value of decision situation with perfect information - value of current decision situation. (Imperfect information is just weakened perfect information: if your information was not 100% reliable but 99% reliable, well, that’s worth 99% as much.)

The decision is the binary take or not take. Melatonin costs ~$10 a year (if you buy in bulk during sales, as I did). Suppose I had perfect information it worked; I would not change anything, so the value is $0. Suppose I had perfect information it did not work; then I would stop using it, saving me $10 a year in perpetuity, which has a net present value12 (at 5% discounting) of $205. So the best-case value of perfect information - the case in which it changes my actions - is $205, because it would save me from blowing $10 every year for the rest of my life. My melatonin experiment is not perfect since I didn’t randomize or double-blind it, but I had a lot of data and it was well powered, with something like a >90% chance of detecting the decent effect size I expected, so the imperfection is just a loss of 10%, down to $184. From my previous research and personal use over years, I am highly confident it works - say, 80%13. If the experiment says melatonin works, the information is useless to me since I continue using melatonin, and if the experiment says it doesn’t, then let’s assume I decide to quit melatonin14 and then save $10 a year or $184 total. What’s the expected value of obtaining the information, given these two outcomes? $(80\% \times \$0) + (20\% \times \$184) = \$36.8$. Or another way, redoing the net present value: $\frac{10 - 0}{\ln 1.05} \times 0.9 \times 0.2 \approx \$36.8$. At a minimum-wage opportunity cost of $7 an hour, $36.8 is worth 5.25 hours of my time. I spent much time on screenshots, summarizing, and analysis, and I’d guess I spent closer to 10-15 hours all told. This worked-out example demonstrates that when a substance is cheap and you are highly confident it works, a long costly experiment may not be worth it. (Of course, I would have done it anyway due to factors not included in the calculation: to try out my Zeo, learn a bit about sleep experimentation, do something cool, and have something neat to show everyone.)
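The arithmetic above is easy to reproduce. The following Python sketch recomputes the net present value of the perpetuity, the expected value of the experiment, and the break-even hours from the figures in the text ($10/year, 5% discounting, ~90% power, 20% chance the result changes my decision, $7/hour). The function names are made up for this illustration.

```python
import math

def perpetuity_npv(annual_saving, discount_rate):
    """Present value of a saving received every year forever,
    using the continuous-discounting form annual / ln(1 + r)."""
    return annual_saving / math.log(1 + discount_rate)

def expected_value_of_experiment(annual_saving, discount_rate, power, p_change):
    """NPV of the saving, shrunk by the experiment's imperfection (power)
    and by the probability that the result changes the decision."""
    return perpetuity_npv(annual_saving, discount_rate) * power * p_change

npv = perpetuity_npv(10, 0.05)                            # roughly $205
evoi = expected_value_of_experiment(10, 0.05, 0.9, 0.2)   # roughly $37
break_even_hours = evoi / 7                               # at $7/hour

print(f"NPV ${npv:.2f}, value of experiment ${evoi:.2f}, "
      f"break-even {break_even_hours:.2f} hours")
```

Carried through without rounding the intermediate $184, the result is about $36.89 and 5.27 hours rather than $36.8 and 5.25 hours. The conclusion is unchanged: the experiment only pays for itself if it takes about five hours or less.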

## Melatonin data

The data looked much better than the first night, except for a big 2-hour gap where I vaguely recall the sensor headband having slipped off. (I don’t think it was because it was uncomfortable but due to shifting positions or something.) Judging from the cycle of sleep phases, I think I lost data on a REM peak. The REM peaks interest me because it’s a standard theory of polyphasic sleeping that thriving on 2 or 3 hours of sleep a day is possible because REM (and deep sleep) is the only phase that truly matters, and REM can dominate sleep time through REM rebound and training.

Besides that, I noticed that time to sleep was 19 minutes that night. I also had forgotten to take my melatonin. Hmm…

Since I’ve begun this inadvertent experiment, I’ll try continuing it, alternating days of melatonin usage. I claim in my melatonin article that usage seems to save about 1 hour of sleep/time, but there’s several possible avenues. One could be quicker to fall asleep; one could awake fewer times; and one could have greater percentage of REM or deep sleep, reducing light sleep. (Light sleep doesn’t seem very useful; I sometimes feel worse after light sleep.)

During the afternoon, I took a quick nap. I’m not a very good napper, it seems - only the first 5 minutes registered as even light sleep.

A dose of melatonin (1.5mg) and off to bed a bit early. I’m a little more impressed with the smart alarm; since I’m hard-of-hearing and audio alarms rarely if ever work, I usually use a Sonic Alert vibrating alarm clock. But in the morning I woke up within a minute of the alarm, despite the lack of vibration or flashing lights. (The chart doesn’t reflect this, but as a previous link says, distinguishing waking from sleeping can be difficult and the transitions are the least trustworthy parts of the data.)

The data was especially good today, with no big gaps:

You can see an impressively regular sleep cycle, cycling between REM and light sleep.
What’s disturbing is the relative lack of deep sleep - down 4-5% (and there wasn’t a lot to begin with). I suspect that the lack of deep sleep indicates I wasn’t sleeping very well, but not badly enough to wake up, and this is probably due either to light from the Zeo itself - I only figured out how to turn it off a few days later - or my lack of regular blankets and use of a sleeping bag. But the awakenings around 4-6 AM and on other days has made me suspicious that one of the cats is bothering me around here and I’m just forgetting it as I fall asleep.

The next night is another no-melatonin night. This time it took 79 minutes to fall asleep. Very bad, but far from unprecedented; this sort of thing is why I was interested in melatonin in the first place. Deep sleep is again limited in dispersion, with a block at the beginning and end, but mostly a regular cycle between light and REM:

Melatonin night, and 32 minutes to sleep. (I’m starting to notice a trend here.) Another fairly regular cycle of phases, with some deep sleep at the beginning and end; 32 minutes to fall asleep isn’t great but much better than 79 minutes.

Perhaps I should try a biphasic schedule where I sleep for an hour at the beginning and end? That’d seem to pick up most of my deep sleep, and REM would hopefully take care of itself with REM rebound. Need to sum my average REM & deep sleep times (that sum seems to differ quite a bit, eg one fellow needs 4+ hours. My own need seems to be similar) so I don’t try to pick a schedule doomed to fail.

Another night, no melatonin. Time to sleep, just 18 minutes and the ZQ sets a new record even though my cat Stormy woke me up in the morning15:

I personally blame this on being exhausted from 10 hours working on my transcription of The Notenki Memoirs. But a data point is a data point.

I spend New Year’s Eve pretty much finishing The Notenki Memoirs (transcribing the last of the biographies, the round-table discussion, and editing the images for inclusion), which exhausts me a fair bit as well; the champagne doesn’t help, but between that and the melatonin, I fall asleep in a record-setting 7 minutes. Unfortunately, the headband came off somewhere around 5 AM:

A cat? Waking up? Dunno.

Another relatively quick falling asleep night at 20 minutes. Which then gets screwed up as I simply can’t stay asleep and then the cat begins bothering the heck out of me in the early morning:

Melatonin night, which subjectively didn’t go too badly; 20 minutes to sleep. But lots of wake time (long enough wakes that I remembered them) and 2 or 3 hours not recorded (probably from adjusting my scarf and the headband):

Accidentally did another melatonin night (thought Monday was a no-melatonin night). Very good sleep - set records for REM especially towards the late morning which is curious. (The dreams were also very curious. I was an Evangelion character (Kaworu) tasked with riding that kind of carnival-like ride that goes up and drops straight down.) Also another quick falling asleep:

Rather than 3 melatonin nights in a row, I skipped melatonin this night (and thus will have it the next one). Perhaps because I went to sleep so very late, and despite some awakenings, this was a record-setting night for ZQ and TODO deep sleep or REM sleep? :

I also switched the alarm sounds 2 or 3 days ago to forest sounds; they seem somewhat more pleasant than the beeping musical tones.

The next night, data is all screwed up. What happened there? It didn’t even record the start of the night, though it seemed to be active and working when I checked right before going to sleep. Odd.

Next 2 days aren’t very interesting; first is no-melatonin, second is melatonin:

One of my chief Zeo complaints was the bright blue-white LCD screen.
I had resorted to turning the base station over and surrounding it with socks to block the light. Then I looked closer at the labels for the buttons and learned that the up-down buttons changed the brightness and the LCD screen could be turned off. And I had read the part of the manual that explained that. D’oh!

Off, but no data on the 22nd. No idea what the problem is - the headset seems to have been on all night.

On with a double-dose of melatonin because I was going to bed early; as you can see, didn’t work:

Off, no data on the 24th. On, no data on the 25th. I don’t know what went wrong on these two nights. The 27th (on for melatonin) yielded no data because, frustratingly, the Zeo was printing a write-protected error on its screen; I assumed it had something to do with uploading earlier that day - perhaps I had yanked it out too quickly - and put it back in the computer, unmounted and went to eject it. But the memory card splintered on me! It was stuck and the end was splintering and little needles of plastic breaking off. I couldn’t get it out and gave up. The next day (I slept reasonably well) I went back with a pair of needle-nose pliers. I had a backup memory card. After much trial and error, I figured out the card had to be FAT-formatted and have a directory structure that looked like ZEO/ZEOSLEEP.DAT. So that’s that.

• 30: on
• 31: off
• 1: on
• 2: off
• 3: on

Unfortunately, this night continues a long run of no data. Looking back, it doesn’t seem to have been the fault of the new memory card, since some nights did have enough data for the Zeo website to generate graphs. I suspect that the issue is the pad getting dirty after more than a month of use. I hope so, anyway. I’ll look around for rubbing alcohol to clean it.

That night initially starts badly - the rubbing alcohol seemed to do nothing.
After some messing around, I figure out that the headband seems to have loosened over the weeks and so while the sensor felt reasonably snug and tight and was transmitting, it wasn’t snug enough. I tighten it considerably and actually get some decent data:

• 5: on
• 7: on
• 8: off
• 9: on
• 11: on?

The previous night, I began paying closer attention to when it was and was not reading me (usually the latter). Pushing hard on it made it eventually read me, but tightening the headband hadn’t helped the previous several nights. Pushing and not pushing, I noticed a subtle click. Apparently the band part with the metal sensor pad connects to the wireless unit by 3 little black metal nubs; 2 were solidly in place, but the third was completely loose. Suspicious, I try pulling on the band without pushing on the wireless unit - leaving the loose connection loose. Sure enough, no connection was registered. I push on the unit while loosening the headband - and the connection worked. I felt I finally had solved it. It wasn’t a loose headband or me pulling it off at night or oils on the metal sensors or a problem with the SD card. I was too tired to fix it when I had the realization, but resolved the next morning to fix it by wrapping a rubber band around the wireless unit and band. This turned out to not interfere with recharging, and when I took a short nap, the data looked fine and gapless. So! The long data drought is hopefully over.

On the 15th of February, I had a very early flight to San Francisco. That night and every night from then on, I was using melatonin, so we’ll just include all the nights for which any sensible data was gathered. Oddly enough, the data and ZQs seem bad (as one would expect from sleeping on a couch), but I wake up feeling fairly refreshed.

By this point we have the idea of how the sleep charts work, so I will simply link them rather than display them.
Then I took a long break on updating this page; when I had a month or two of data, I uploaded to Zeo again, and buckled down and figured out how to have ImageMagick crop pages. The shell script (for screenshots of my browser, YMMV) is `for file in *.png; do mogrify +repage -crop 700x350+350+285 "$file"; done`

General observations: almost all these nights were on melatonin. Not far into this period, I realized that the little rubber band was not working, and I hauled out my red electrical tape and tightened it but good; and again, you can see the transition from crappy recordings to much cleaner recordings. The rest of February:

March:

April:

April 4th was one of the few nights that I was not on melatonin during this timespan; I occasionally take a weekend and try to drop all supplements and nootropics besides the multivitamins and fish oil, which includes my melatonin pills. This night (or more precisely, that Sunday evening) I also stayed up late working on my computer, getting into bed at 12:25 AM. You can see how well that worked out. During the 2 AM wake period, it occurred to me that I didn’t especially want to sacrifice a day to show that computer work can make for bad sleep (which I already have plenty of citations for in the Melatonin essay), and I gave in, taking a pill. That worked out much better, with a relatively normal number of wakings after 2 AM and a reasonable amount of deep & REM sleep.

# Exercise

## One-legged standing

Seth Roberts found that for him, standing a lot helped him sleep. This seems very plausible to me - more fatigue to repair, closer to ancestral conditions of constant walking - and tallied with my own experience. (One summer I worked at Yawgoog Scout Camp, where I spent the entire day on my feet; I always slept very well though my bunk was uncomfortable.) He also found that stressing his legs by standing on one at a time for a few minutes also helped him sleep. That did not seem as plausible to me. But still worth trying: standing is free, and if it does nothing, at least I got a little more exercise.

Roberts tried a fairly complicated randomized routine. I am simply alternating days as with melatonin (note that I have resumed taking melatonin every day). My standing method is also simple; for 5 minutes, I stand on one leg, rise up onto the ball of my foot (because my calves are in good shape), and then sink down a foot or two and hold it until the burning sensation in my thigh forces me to switch to the other leg. (I seem to alternate every minute.) I walk my dog most every day, so the effect is not as simple as some moderate exercise that day; in the next experiment, I might try 5 minutes of dumbbell bicep curls instead.

### One-legged standing analysis

The initial results were promising. Of the first 5 days, 3 are on and 2 are off; all 3 on-days had higher ZQs than the 2 off-days. Unfortunately, the full time series did not seem to bear this out. Looking at the ~70 recorded days between 11 June 2011 and 27 August 2011 (raw CSV data), the averages looked like this (as before, the 3 means the intervention was used, 0 that it was not):

With the melatonin analysis, I had counted it as a success for melatonin inasmuch as sleep time had substantially fallen but sleep quality and mental performance, as far as I could tell, had remained the same. R analysis[16]:

• total sleep time (deep/REM/total) fell (513.8 vs 502.4, p=0.43)
• REM fell (168.8 vs 152.7, p=0.018)
• hence ZQ fell, which is ambiguous - good or bad? (96.05 vs 92.5, p=0.19)
• but number of awakenings fell, which is good (4.18 vs 4, p=0.75)
• as did total time awake which is good (12.8 vs 11.7, p=0.8)
• morning feel improved, also good (2.81 vs 2.71, p=0.3)

No p-values survived multiple-comparison correction[17].
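As a rough cross-check on that statement, here is a minimal Python sketch that applies a Holm-Bonferroni correction to the six p-values listed in the bullets. The choice of Holm's method is an assumption made for illustration; the correction actually used lives in the footnoted R code and may differ.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values, in the original order."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must never decrease as the raw p-values grow.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# p-values copied from the bullet list above.
p_by_metric = {
    "total sleep": 0.43,
    "REM": 0.018,
    "ZQ": 0.19,
    "awakenings": 0.75,
    "time awake": 0.8,
    "morning feel": 0.3,
}
adjusted = dict(zip(p_by_metric, holm_adjust(list(p_by_metric.values()))))
survivors = [name for name, p in adjusted.items() if p < 0.05]
print(adjusted["REM"], survivors)
```

Under this assumption even the smallest raw value (REM, p=0.018) is scaled by 6 to about 0.108, so nothing clears the 0.05 threshold.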

While I did not replicate Roberts’s setup exactly in the interest of time and ease, and obviously it was not blinded, I tried to compensate with an unusually large sample: 69 nights of data. This was a mixed experiment: there seems to be an effect, but none of the changes seem to have large effect sizes or strong p-values.

The one-legged standing was not in exclusion to melatonin use, but I had used it most every night. I thought I might go on using one-legged standing, perhaps skipping it on nights when I am up particularly late or lack the willpower, but I’ve abandoned it because it is a lot of work to use and the result looked weak. In the future, I should look into whether walks before bedtime help.

# Vitamin D

## Background

Seth Roberts has speculated that vitamin D, despite its myriad other benefits, may harm sleep when taken in the evening and help sleep when taken in the morning, based on some anecdotes (with 2 null results). The anecdotes are nearly worthless as sleep is pretty variable (look above or below, and you’ll see swings of over 20 ZQ points night to night), and just a little carelessness or selection bias will persuade one that there is a major effect where there is none - especially since they are not using Zeos or accelerometers or even giving basic quantities like “I felt bad in the morning 3/5 days”. But I began to wonder. Vitamin D is a chemical intimately involved in circadian rhythms (a zeitgeber), with some connections to systems involved in sleep (“The steroid hormone of sunlight soltriol (vitamin D) as a seasonal regulator of biological activities and photoperiodic rhythms”); given its links to the early day and sunlight, one would expect it to affect sleep for the worse.

To see what existing research there was, if any, I checked the 49 hits in PubMed and the first 10 pages of Google Scholar for “vitamin D sleep”. For the most part, hits were completely irrelevant, and the most relevant ones like “Vitamins and Sleep: An Exploratory Study” did not cover any relationship between vitamin D and sleep, much less the timing of vitamin D consumption. There’s some speculation the elderly may sleep badly in part due to lack of vitamin D (“Some new food for thought: The role of vitamin D in the mental health of older adults”), but the only hard results I found were weak or tangential: a correlation with daytime sleepiness in Taiwanese dialysis patients[18], a correlation with later sleep in American women[19], and of course a correlation with earlier sleep in Japanese women[20]. This reads like noise.

In June 2012, after I finished my 2 experiments, a preprint appeared for Medical Hypotheses: The world epidemic of sleep disorders is linked to vitamin D deficiency, Gominak & Stumpf 2012; the lead author, unfortunately, had little to tell me when I emailed her, indicating that the use of vitamin D was not systematic or recorded:

• I don’t know about the overarching claims (I suspect most of the problem is lighting, and general demands on time), but the trial itself seems really important, especially since neither Roberts nor I had the slightest idea about it but seem to have reached similar results
• the 2 patients suggested it, in an interesting example of the value of self-experimentation
• the authors cover much more specific potential connections between vitamin D and sleep than just circadian rhythms
• the methodology section is non-existent: how were these 1500 patients picked? How long did each use vitamin D? How is sleep quality being measured? Are these results consistent or inconsistent with our two cases of morning mood/restedness improvement but little else? (Even if they were inconsistent, that could be explained by neither of us being sleep disorder sufferers and the effect being weaker in us.) Unfortunately, neither Roberts nor I have taken vitamin D blood tests (as far as I know), so we cannot verify that the authors’ 60-80ng/ml range is what we fell into, but it’s plausible.

In July 2012, preprints of Huang et al 2012 became available; it is a case series - the authors followed a group of veterans with chronic pain who received vitamin D supplements, finding improvements to pain but also reduction in sleep latency and increase in sleep duration. While I did not observe any effect on latency or duration in my following experiments, this would still be a promising datapoint but unfortunately, the sample had substantial dropout, and had no control group (hence no randomizing or blinding). This renders the study not very useful - the improvements being perhaps just regression toward the mean or a selection bias.

Blogger Chris L looked back in August 2012 on ~1 year of Zeo data and a quasi-experiment in which he started with 4000IU of vitamin D supplementation, then 5000IU, then none; he took them at night, then switched to morning; the results were that the length of his deep sleep started high, dropped, and then recovered. He interprets this as evidence that too much vitamin D hurts sleep.

## Vitamin D at night hurts?

### Setup

I decided to run a small double-blind experiment much like the Adderall and other trials. My Vitamin D is 360 5000IU softgels by Healthy Origins, bought on iHerb.com. The gel-capsules contain cholecalciferol dissolved in olive oil. This made preparing placebo pills a little more difficult. I wound up puncturing the capsules, squeezing out the olive oil contents into a new capsule (they were too wide to push in) and then pushing in the empty shell; all 20 were topped off with ordinary white baking flour. (I used up the last of my creatine preparing the placebos for the Modalert day trial.) For the 20 placebo pills, I spooned in some olive oil to each and topped them off with flour as well. Each set went into its own identical Tupperware container. The process was a little messier than I had hoped, but the pills seem like they will work.

The procedure at night will be: in the dark[21] immediately before putting on the Zeo headband and going to bed, I will take my usual melatonin pill; then I will take the two containers blindly; mix them up; select a pill from one to take, and put the selected container on the shelf next to the Zeo. In the morning, I will see which one I took. (The Vitamin D olive oil was distinctly more yellow than the green placebo olive oil.) If I took placebo, I will take my usual daily dose of Vitamin D, and if active, I will skip it. This hopefully will blind me and keep constant my total Vitamin D intake. (This procedure may need to be amended with something more like the modafinil/Adderall procedure: a bag with replacement of the consumed placebos.) If I get a run of one kind of pills, I will re-balance the numbers.

Based on the first 10 days’ ZQs, I predict I’ll find in the final data set:

1. increased sleep latency; probably at least another 10 minutes to fall asleep, as my mind seems to churn away with ideas of things to do
2. increased awakenings; not that many, maybe 1 or 2 on average
3. decreased ZQ; by around 5-10 points (a large effect, on par with melatonin)

My best guess is that the ZQ hit is coming from reduced deep sleep, or maybe reduced deep & REM sleep. I don’t think the total amount of sleep has changed.

Roberts theorizes that besides vitamin D damaging sleep, it could actively improve your sleep if taken in the morning. As it happens, in this setup, on placebo days I do take vitamin D in the morning - so wouldn’t one expect to see scores improve on the nights following a placebo night (a vitamin D morning), regardless of whether that night was vitamin D or placebo? A quick analysis of the first 24 nights showed the lagged nights to average a ZQ of 94.5. My monthly averages for October and November were 96, so there is no obvious improvement here.

One thing I suspect but cannot confirm - since I do not have a heart rate monitor - is that ~10 minutes after taking the vitamin D pills, my heart rate increases. Not to any uncomfortable or worrisome degree, but when one expects one’s heart rate to go down after going to bed, even a small increase in the opposite direction is noticeable. On the 12th, I finally got around to writing down this impression; then I searched online a bit and found that low vitamin D levels are associated with arrhythmia and other issues, but so are very high levels, and increased heart rates in the studies and anecdotes are associated with higher heart rates[22]. I’m not worried about the heart rate, but I am concerned that this is defeating the double-blinding: if all I have to do is notice my heart rate (and lying swaddled in bed in complete silence, it would be hard for me not to), then I’ve unblinded myself before falling asleep. Other stimulants like caffeine or sulbutiamine might similarly increase my heart rate, but they’d obviously also interfere with sleep, so I can’t create any active placebo even if I wanted to start over. (One promising future gadget is the Basis wristwatch which measures, among other things, heart-rate; I look forward to the early reviews.)

### Vitamin D data

The data (trimmed CSV), covering January-February 2012:

| Date | Pill | Quality[23] | ZQ | Guess |
|---|---|---|---|---|
| 31D-1J | active | bad | 84 | right 70% |
| 1-2 | placebo | better | 93 | right 65% |
| 2-3 | active | well | 94 | 50% |
| 3-4 | active | poor | 86 | right 60% |
| 4-5 | placebo | well | 98 | wrong 60% |
| 5-6 | active | mediocre | 86 | 50% |
| 6-7 | placebo | OK | ??[24] | right 65% |
| 7-8 | placebo | good | 90 | right 60% |
| 8-9 | active | poor | 84 | right 65% |
| 9-10 | placebo | good | 95 | right 65% |
| 10-11 | active | good | 100 | wrong 70% |
| 11-12 | active | mediocre | 92 | right 70% |
| 12-13 | active | mediocre | 88 | 50% |
| 13-14 | active | poor | 100 | right 60% |
| 14-15 | placebo | poor | 83 | wrong 60% |
| 15-16 | active | poor | 101 | right 55% |
| 16-17 | placebo | mediocre | 90 | 50% |
| 17-18 | placebo | mediocre | 88 | right 60% |
| 18-19 | placebo | good | 100 | 50% |
| 19-20 | active | poor | 86 | 50% |
| 20-21 | active | mediocre | 85 | 50% |
| 21-22 | placebo | OK | 91 | right 60% |
| 22-23 | placebo | OK | 106 | right 65% |
| 23-24 | active | poor | 91 | right 65% |
| 24-25 | active | 1 | 79 | right 75% |
| 25-26 | placebo | 3 | 85 | right 65% |
| 26-27 | active | 2 | ??[25] | right 55% |
| 28-29 | active | 3 | 85 | 50% |
| 29-30 | active | 3 | 93 | wrong 55% |
| 30-31 | placebo | 3 | 100 | right 60% |
| 31J-1F | active | 3 | 94 | 50% |
| 1F-2F | active | 2 | 89 | right 60% |
| 2-3 | active | 1 | 83 | right 70% |
| 3-4 | placebo | 2 | 81 | wrong 70% |
| 5-6 | placebo | 3 | 98 | right 65% |
| 6-7 | active | 2 | 88 | 50% |
| 7-8 | active | 2 | 94 | right 55% |
| 8-9 | active | 3 | 94 | wrong 75% |
| 9-10 | placebo | 3 | 92 | 50% |
| 10-11 | placebo | 3 | 95 | right 60% |
| 11-12 | placebo | 3 | 103 | right 75% |
| 12-13 | placebo | 3 | 84 | right 70% |

(Data input was for Other Disruptions 3; 0 = placebo, 1 = vitamin D.)

### Vitamin D analysis

From a quick look at the prediction confidences, I was usually correct but perhaps underconfident: my log score (a proper scoring rule) relative to a random guesser is 5.4[26], which is even better than my guesses in my Adderall experiment.
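The log score can be recomputed from the Guess column of the data table. This is a minimal sketch, assuming base-2 logarithms and a coin-flip (50%) baseline; neither assumption is stated in the text here, and the footnoted calculation may be set up differently. The tally below was counted by hand from the table.

```python
import math

# (guess was right?, stated confidence) -> number of nights, tallied from the table.
# The 11 nights guessed at 50% are omitted: they score exactly 0 against a coin flip.
tally = {
    (True, 0.55): 3,
    (True, 0.60): 8,
    (True, 0.65): 8,
    (True, 0.70): 4,
    (True, 0.75): 2,
    (False, 0.55): 1,
    (False, 0.60): 2,
    (False, 0.70): 2,
    (False, 0.75): 1,
}

def relative_log_score(tally, baseline=0.5):
    """Sum of log2(probability given to what happened) minus the baseline's log2 score."""
    total = 0.0
    for (right, confidence), nights in tally.items():
        # A wrong guess at 70% means only 30% was assigned to the truth.
        p_truth = confidence if right else 1.0 - confidence
        total += nights * (math.log2(p_truth) - math.log2(baseline))
    return total

score = relative_log_score(tally)
print(round(score, 1))  # 5.4
```

Under these assumptions the total comes out at about 5.39 bits, which agrees with the quoted 5.4. A positive total means the stated confidences beat always answering 50%.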

Looking at the data averages in the Zeo website, it looked like ZQ & total & REM sleep fell, deep increased slightly, time awake & awakenings both increased, and morning feel decreased. The R analysis (pessimistic[27], optimistic[28]):

1. ZQ fell, 93.4 vs 89.3 (optimistic: p=0.062; pessimistic: p=0.029)
2. Total Z fell, 533.4m vs 512.3m (p=0.07; p=0.031)
3. Time in REM fell, 175.6m vs 160.8m (p=0.022; p=0.008)
4. Time in Deep increased, 55m vs 56.74 (but p=0.54!)
5. Time in Wake increased, 26.3m vs 28.1 (but p=0.74!)
6. Awakenings increased, 7.58 vs 8.26 (but p=0.4)
7. Morning Feel decreased, 2.84 vs 2.32 (p=0.0053; p=0.0033)
8. Time to Z increased, 17.58m vs 20.74m (but p=0.51; p=0.25)[29]

So we see the Deep & Wakening & Time to Z changes are weak-to-irrelevant, but on the other hand, we get statistically-significant changes in REM & Morning Feel (the latter even survives a multiple comparison correction[30]), which look like they’d have pretty decent effect sizes too[31]. (I only calculated the effect size for ZQs: something like 0.58[32], which is medium-sized[33].)

Going back to my predictions after the first 10 days, they’re sort of right:

1. sleep latency was increased, but not statistically-significantly and only by ~3m (17.58m vs 20.74m), which is less than half the predicted 10 minutes
2. increased awakenings was less than 1 additional awakening (compared to predicted 1-2) and didn’t reach statistical significance
3. ZQ did decrease, and by roughly 4.1 points (reached statistical significance), but that’s a bit under the tentatively predicted range of 5-10 points

Finally, a re-look at lagged days; the final data set yielded 20 lagged nights when the others were deleted, 14 when obvious outliers like ZQ <70 were removed, for a total of 6 placebo nights and 8 vitamin D nights. The ZQs averaged 92.6 - between the placebo ZQ average of 93.4 and the vitamin D ZQ average of 89.3, but much closer to placebo than vitamin D, despite being more vitamin D than placebo. The p-value would be pretty awful, though, for such small differences - just 14 datapoints?

My conclusion?

Vitamin D hurts sleep when taken at night. I know no reason to take vitamin D at that time, so even with anecdotal data, I will avoid it at that time entirely.

### VoI

For background on value of information calculations, see the first calculation.

The first experiment I had no opinion on. I actually did sometimes take vitamin D in the evening when I hadn’t gotten around to it earlier (I take it for its anti-cancer and SAD effects). There was no research background, and the anecdotal evidence was of very poor quality. Still, it was plausible since vitamin D is involved in circadian rhythms, so I gave it 50% and decided to run an experiment. What effect would perfect information that it did negatively affect my sleep have? Well, I’d definitely switch to taking it in the morning and would never take it in the evening again, which would change maybe 20% of my future doses, and what was the negative effect? It couldn’t be that bad or I would have noticed it already (like I noticed sulbutiamine made it hard to get to sleep). I’m not willing to change my routines very much to improve my sleep, so I would be lying if I estimated that the value of eliminating any vitamin D-related disturbance was more than, say, 10 cents per night; so the total value of affected nights would be $0.10 \times 0.20 \times 365.25 = 7.3$ dollars per year. On the plus side, my experiment design was high quality and ran for a fair number of days, so it would surely detect any sleep disturbance from the randomized vitamin D, so say 90% quality of information. This gives $\frac{7.3 - 0}{\ln 1.05} \times 0.90 \times 0.50 = 67.3$, justifying <9.6 hours. Making the pills took perhaps an hour, recording used up some time, and the analysis took several hours to label & process all the data, play with it in R, and write it all up in a clean form for readers. Still, I don’t think it took almost 10 hours of work, so I think this experiment ran at a profit.
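The arithmetic in that paragraph can be replayed in a few lines. This is a minimal sketch; the function and variable names are illustrative, and the $7/hour figure is an assumption back-solved from the "<9.6 hours" bound rather than a number stated in this section.

```python
import math

def value_of_information(annual_gain, discount_rate, info_quality, prior):
    """Value of a perpetual annual gain, discounted, then scaled by how likely
    the experiment is to detect the effect and by the prior that it exists."""
    net_present_value = annual_gain / math.log(1 + discount_rate)
    return net_present_value * info_quality * prior

# $0.10 per night x 20% of doses affected x nights per year.
annual_gain = 0.10 * 0.20 * 365.25

voi = value_of_information(
    annual_gain,
    discount_rate=0.05,   # the 1.05 inside the logarithm
    info_quality=0.90,    # "90% quality of information"
    prior=0.50,           # "I gave it 50%"
)

HOURLY_VALUE = 7.0  # assumption: implied by 67.3 / 9.6, not given in the text
break_even_hours = voi / HOURLY_VALUE

print(round(annual_gain, 3), round(voi, 1), round(break_even_hours, 1))
```

Carrying the unrounded annual gain of 7.305 through gives roughly 67.4 rather than 67.3; the small gap comes from rounding to 7.3 before dividing. The break-even time stays at about 9.6 hours either way.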

## Vitamin D at morn helps?

### Setup

The logical next thing to test is whether there is any benefit to sleep by taking vitamin D in the morning as compared to not taking vitamin D at all, since we have already established that evening is worse than morning. (Besides anecdotes, Seth Roberts reported - after I concluded my experiment - that his own non-blind varying of doses seemed to help his subjective restedness but didn’t influence anything else.) I would expect any benefits in the morning to be attenuated compared to the evening effect: the morning is simply many hours away from going to bed again in the evening, giving time for many events to affect the ultimate sleep. So this experiment will run for longer than the previous 40 days of 20/20: 56 days of 28/28; per Roberts’s suggestion, I will not randomize individual days but 8 paired blocks of 7 days. (Multiple days to give any slow effects time to manifest, which seem eminently possible with a fat-soluble vitamin like vitamin D; 7 days, so we don’t cycle around the week but instead have exactly the same number of eg. active Sundays and placebo Sundays since sleep often varies systematically over the week.)

I prepare 27 placebo pills & 27 actives as before, stored in separate baggies. To randomize blocks of 7-days - I will fill 2 opaque containers with 7 placebo and 7 actives (with a label on the inside of the active container), and pick a container at random to use for the next 7 days. I will take one each morning upon awakening, closing my eyes. On the 8th morning, the first container will be empty, so I set it aside and open the second; when the second is emptied, I will look inside it to see whether it has the label, which lets me infer which one it was, and record whether the 2 weeks were active/placebo or placebo/active. The 2 containers will be refilled as before, and blocks 3-4 will begin. I will do this 4 times, at which point I will analyze the data.
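The container procedure amounts to randomizing the order within each pair of 7-day blocks. Here is a minimal sketch in Python that swaps the physical containers for a seeded pseudo-random generator; the function name and the seed are illustrative, not part of the original setup.

```python
import random

def randomize_paired_blocks(n_pairs=4, block_days=7, seed=None):
    """Return a day-by-day schedule in which each pair of blocks contains one
    active block and one placebo block, in random order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["active", "placebo"]
        rng.shuffle(pair)  # plays the role of picking a container at random
        for arm in pair:
            schedule.extend([arm] * block_days)
    return schedule

schedule = randomize_paired_blocks(seed=2012)
print(len(schedule), schedule.count("active"), schedule.count("placebo"))
```

Because each pair always holds one block of each kind, the design is balanced at 28/28 by construction, and every weekday appears equally often in both arms as long as no days are dropped.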

Analysis will be the same Zeo parameters as before, but this time augmented by a simple mood indicator: 1-5, with 3 being an ordinary mildly productive day and 1 being a “my car caught on fire and was totaled” day (real data-point), recorded at the end of the day just before bed. (I considered a more complex mood indicator, the BOMS, while setting up my lithium experiment, but rejected it as being too heavy-weight for long-term use, and subjectively, my mood doesn’t vary that much.)

### Morning data

1. Blocks:
• 17-25F: guess: placebo (last pill used morning 25; swapped jars and consumed pill from second jar the morning of 26); actual: placebo
• 26F-8M: skipped multiple days for modafinil (omit March 1, 2); actual: active
2. Blocks:
• 9M-15M: guess: active actual: placebo
• 16-25: active (omit March 21)
3. Blocks:
• 26M-1A: guess: placebo actual: placebo
• 2A-8: active
4. Blocks:
• 9A-19: (omit April 11, 12) guess: placebo actual: placebo
• 20-27: active (omit April 21, 22)

Placebo/active coded as 0/1 in SSCF.1[34] in the CSV export. Mood was coded as fractional integers as the Mood column.

### Morning analysis

As before, we fire up R and analyze the spreadsheet[35] with the usual assumptions[36] about independence of the daily observations:

1. ZQ increased, 94.3 vs 90.6 (p=0.34/0.16[37])
2. Total Z increased, 526.3m vs 510.6m (p=0.44/0.21)
3. Time in REM increased, 163.1m vs 157.2m (p=0.48/0.23)
4. Time in Deep increased, 66.8m vs 64m (p=0.45/0.23)
5. Time in Wake decreased, 23m vs 27 (p=0.36/0.18)
6. Awakenings decreased, 7.53 vs 7.77 (p=0.78/0.39)
7. Morning Feel increased, 3.16 vs 2.62 (p=0.005/0.003!)
8. Time to Z did not change, 25.32m vs 25.33 (p=0.99/0.5)
9. Mood did not change (or decreased), 3.02 vs 3.1 (p=0.65)

As a quick look at the p-values indicates, of the 9 parameters, only 1 reaches any kind of significance: Morning Feel. It survives multiple-comparison correction[38], and has a healthy effect size of 0.7.[39]

All the other changes are junk, including ones I was fairly sure would change, like Time to Z or Mood. (Mood arguably was affected by an exogenous event - my car burning ruined that week - but still!) Morning Feel particularly stands out because it was not just the most statistically-significant, but the most by a lot - none of the others even approached p=0.10, and Morning Feel was 2 orders of magnitude smaller, p=0.005. (This also lines up with Roberts’s own observation that the only metric clearly affected was restedness.) I have no idea how vitamin D could improve only my morning mood without affecting any of the other parameters. If it was improving my sleep in general with fewer awakenings or something, it should have shown up strongly on those parameters; or if vitamin D deficiency was causing depressive symptoms, it should have shown up in the overall Mood metric etc.

If one looks at 2 R-generated graphs[40] for Mood and Morning Feel, it looks like the effect on Morning Feel is being driven by a greater quantity of “4” mornings, and not so much by a reduction in number of bad “2” mornings. (I also noticed that when my car burned, my Mood takes a clearly visible fall for a week, while my sleep looks like it was affected less - it seems that during that period, waking up was literally the best part of the day…)

So, vitamin D seems to improve my mood when I wake up in the morning, which is good, but that’s it. Given how cheap it is and its apparent anti-cancer properties (among others), I will keep taking it but definitely in the morning and not at night. It’s still useful.

(This experiment also afforded me a chance to test Seth Roberts’s reaction to faked data which contradicted his vitamin D theory; he did not take it gracefully, which is useful to know in weighing his future opinions.)

### Control quality control

Like with melatonin, we might wonder: is taking vitamin D causing effects on the control days as well? With melatonin, the concern I often hear voiced is whether melatonin might in some way be addictive or suppress normal melatonin secretion, in which case the observed difference between control and experimental days - which we interpreted as improvement - may actually be the opposite, a negative effect caused by a sort of withdrawal (lowered melatonin secretion levels, since the body has not yet adapted to the absence of melatonin supplements and will not when supplementation resumes the next day).

In the case of vitamin D, I find the results (no effect on anything except Morning Feel) sufficiently surprising that I wonder if this fat-soluble vitamin was causing effects over periods even longer than a week; and that the true results were that both control and experimental weeks were better than unsupplemented weeks, but that Morning Feel was the only variable which reacted to placebo fast enough to show up as a difference. The previously-mentioned August 2012 report of Chris L that an increase of 1k IU in his vitamin D supplementation reduced his deep sleep with month-long lags reinforces my suspicion: with such a long lag, any reduction in my deep sleep would go unnoticed. A completely dry multi-month long control group is necessary.

The solution most obvious to me, although I don’t know if it’s statistically correct, is to drop the vitamin D or melatonin for a long enough period that any long-term effects should have disappeared, and then compare this abstention period to the supposed control weeks. If the abstention weeks are worse than the control weeks, then this supports the long-term interpretation; if the abstention weeks are similar to the control weeks, then we can eliminate the long-term interpretation; and if the abstention weeks are better than the control weeks, then we ought to be puzzled and start thinking about other possibilities. (Not enough data/power? Misinterpreted results? Or, the original morning experiment was in spring, while the abstention periods were summer/autumn - does sleep get worse in summer, perhaps due to heat?)

I won’t bother with blinding this one since it’s just a double-check of an unlikely possibility. (If one wanted to blind it, the procedure would be the same as before, but with big blocks: say, 2 blocks of 62 days, first pick randomized, or blocks of 31 days, with 4 blocks randomized in 2 pairs.) This experiment is easy enough to run: simply stop taking vitamin D. To avoid the temptation to cheat on days I am feeling down, it’s easiest to just wait until I run out of vitamin D and procrastinate on ordering a fresh supply until a bunch of days have passed.

The vitamin D experiment terminated in April; the last day of vitamin D was 2 July 2012; and I resumed 6 September 2012 with the end of the dataset being 31 October 2012.

#### Analysis

The question is simple: does the Morning Feel differ between the control days in the original Vitamin D morning experiment and the vitamin-less days of the later, sustained abstention period? Was there something funky about the original control days, was there some sort of vitamin D bleed-over or maybe some sort of long-term effect which we could describe as contamination or dependency?

The short answer is: no. When we compare the two groups of days, the Morning Feel ratings have identical means, as we expected.

A Bayesian MCMC analysis[41] (using the BEST library) produces the following graphical summary, which shows the two groups almost completely overlapping on means, with the key graph in the lower-right corner: there is no visible effect size at all (centered on 0), much less an effect size of d>=0.1 which we might take seriously as indicating a real difference:

More precisely, the summary statistics indicate that the difference in means & medians is usually -0.03 (negligibly small), the full range of effect size estimates is -0.4678744 to 0.4142259, and 44.4% of the possibilities were simply zero effect size.

(I did the old frequentist t-test as well: p=0.9975[42].)

### VoI

For background on value of information calculations, see the first calculation.

With the vitamin D theory partially vindicated by the previous experiment, I became fairly sure that vitamin D in the morning would benefit my sleep somehow: 70%. Benefit how? I had no idea, it might be large or small. I didn’t expect it to be a second melatonin, improving my sleep and trimming it by 50 minutes, but I hoped maybe it would help me get to sleep faster or wake up less. The actual experiment turned out to show, with very high confidence, absolutely no change except in my mood upon awakening in the morning.

What is the value of information for this experiment? Essentially - nothing! Zero!

1. if the experiment had shown any benefit, I obviously would have continued taking it in the morning
2. if the experiment had shown no effect, I would have continued taking it in the morning to avoid incurring the evening penalty discovered in the previous experiment
3. if the experiment had shown the unthinkable, a negative effect, it would have to be substantial to convince me to stop taking vitamin D altogether and forfeit its other health benefits, and it’s not worth bothering to analyze an outcome I would have given <=5% chance to.

Of course, I did it anyway because it was cool and interesting! (Estimated time cost: perhaps half the evening experiment, since I manually recorded less data and had the analysis worked out from before.)

# Potassium

## Potassium day use

In October 2012, I bought some potassium citrate on a lark after noting that the daily RDA and my diet suggested that I was massively deficient. The first night I slept terribly, taking what felt like hours to fall asleep and then waking up frequently - due to either the potassium or a fan left on; the second night with potassium, I turned off the fan but slept poorly again. My suspicions were aroused. I began recording sleep data.

### Background

Partway through the process, I searched Google Scholar and Pubmed (human trials) for “potassium sleep”; I checked the first 70 results of both. A general Google search turned up mostly speculation on the relationship of potassium deficiency and sleep. The only useful citation was “Potassium affects actigraph-identified sleep”, Drennan et al 1991; actigraphs likely aren’t as good as a Zeo, and n=6, but the study is directly relevant. Only 2 actigraph results reached statistical significance: a small improvement in sleep efficiency (the percentage of time lying in bed spent actually sleeping) and a bigger benefit in WASO (time awake during sleep time; this probably drove the sleep efficiency).

### Data

The first night (10/12) involved falling asleep in 30 minutes rather than my usual 19.6±11.9, waking up 12 times (5.9±3.4), and spending ~90 minutes awake (18.1±16.2). The next day (10/13) I took a similar dose and double-checked the fan before bed: 25 minutes to fall asleep, 10 awakenings, 35 minutes awake, but I woke fairly rested. So it seems like the fan was only partly to blame. (From here on, each night is abbreviated as minutes to fall asleep/awakenings/minutes awake.) The third day (10/14) I omitted any potassium: 21/8/29. Fourth (10/15) on again with an evening dose: 54/7/24. Fifth (10/16), off: 16/2/6. Sixth (10/17), on with a halved dose: 33/3/6. Seventh (10/18), off: 17/6/7. Eighth (10/20), half: 33/6/15. (At this point I began randomizing consumption between on and off; since this is preliminary, I didn’t bother with blinding potassium consumption.) Ninth (10/21), on: 25/7/9. Tenth (10/22), on: 18/8/10. 11th (10/23), off: 26/4/10. 12th (10/24), off: 33/7/16. 13th (10/25), on: 32/7/13. 14th (10/26), on: 21/5/8. 15th, on: 34/2/1. 16th, off: 16/7/15. 17th, on: 29/8/20. 18th, on: 17/10/17. 19th, off: 36/9/24. 20th (11/1), on: 21/4/19. 21st (11/2), off: 29/7/16. 22nd (11/3), on: 26/7/10. 23rd (11/4), on: 16/4/11. 24th (11/5), off: 21/4/17. 25th (11/6), on: 19/9/24.

11 Nov, on: 15/3/08. 13 Nov, off: 11/8/21. 14 Nov, off: 18/8/22. 15 Nov, on: 30/8/16. 16 Nov, off: 20/7/12. 17 Nov, on: 34/8/20. 18 Nov, on: 12/8/22. 19 Nov, off: 24/8/14. 20 Nov, on: 26/4/39. 21 Nov, off: 15/6/14. 22 Nov, on: 26/8/29. 23 Nov, on: 23/4/8. 24 Nov, off: 24/3/5. 25 Nov, on: 27/7/15. 26 Nov, on: 30/10/17. 27 Nov, off: 42/12/13. 28 Nov, off: 40/11/42. 29 Nov, off: 19/14/50. 30 Nov, off: 32/8/39. (Here I counted the sample-sizes and realized the off days were drastically under-represented, reducing statistical power; so I have eliminated randomization and gone off potassium.) 1 Dec, off: 28/10/15. 2 Dec, off: 37/8/20. 3 Dec, off: 36/6/18. 4 Dec, off: 19/9/33. 5 Dec, off: 25/8/27. 6 Dec, off: 30/13/45. (Now balanced, resuming randomization.) 7 Dec, on: 31/9/60. 8 Dec, off: 22/9/23. 9 Dec, off: 11/5/21. 10 Dec, on: 30/4/10. 11 Dec, on: 22/9/50. 13 Dec, off: 20/5/6. 14 Dec, off: 33/13/25. 15 Dec, on: 26/11/22. 16 Dec, off: 33/12/28. 17 Dec, off: 42/9/31. 18 Dec, off: 31/9/61. 19 Dec, on: 23/8/18.

### Analysis

#### Sleep disturbances

If potassium was disturbing my sleep, I didn’t necessarily want to wait for any one metric of wakefulness to reach significance; rather, I wanted to combine them into a single metric of sleep problems: time to fall asleep (latency), number of awakenings, and time spent awake. (With all 3, higher is worse.) Number of awakenings tends to vary over a smaller range than time to fall asleep or time spent awake - a normal value for the former might be 5, rather than 30 for the latter; to compensate for that, we convert each metric into standard-deviation units (a z-score) indicating how unusual eg. 10 awakenings is and whether it is more unusual than it taking 15 minutes to fall asleep. Then we can do a standard test. We can graph the data at each step, starting with all the data on an overlapping chart[^43] (this is not per day):

Nights off potassium are colored blue and nights on potassium are red; it looks like red dots are higher than blues, overall, but the trend is not clear. So we convert each individual datapoint to its respective number of standard deviations from the mean[^44]:

The trend has become much clearer, but the final step is to add each day’s scores to get an overall measure[^45]:

Now the difference has become dramatic: one can almost draw a line separating both groups without any errors. As one would expect given this graphical evidence, a Bayesian two-group test reports that there is ~0 chance that the true effect size is 0, and the most likely effect size is a dismaying d=-1.1[^46]:

A t-test agrees:[^47] p=0.0000037. (There is no need for multiple correction in this instance.) This confirms my subjective impression.
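The standardize-then-sum procedure can be sketched in R on a small subset of the nights listed in the data section (10/14 through 10/26, plain on/off nights only). This is a rough illustration rather than the analysis actually run: the column names other than `Disturbance` are my own, each metric is standardized within this 10-night subset rather than against a longer baseline, and with so few nights the numbers will not match the full results.

```r
# Subset of nights from the data section: latency / awakenings / minutes awake.
# Column names (except Disturbance) are illustrative assumptions.
nights <- data.frame(
  Potassium  = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  Latency    = c(21, 54, 16, 17, 25, 18, 26, 33, 32, 21),
  Awakenings = c(8, 7, 2, 6, 7, 8, 4, 7, 7, 5),
  TimeAwake  = c(29, 24, 6, 7, 9, 10, 10, 16, 13, 8)
)

# Convert each metric to z-scores so that awakenings (small numbers) and
# minutes (large numbers) are on a comparable scale.
z <- scale(nights[, c("Latency", "Awakenings", "TimeAwake")])

# One overall score per night: the sum of the three z-scores (higher = worse).
nights$Disturbance <- rowSums(z)

# Standard two-group comparison of the combined score.
t.test(Disturbance ~ Potassium, data = nights)
```

Summing z-scores weights the three metrics equally; that is a design choice, and a weighted sum would be just as easy if one metric mattered more.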

#### Mood/productivity

A secondary question is whether potassium delivered any waking benefits. At the end of each day I write down a rating (2-4) of how happy and/or productive I felt that day. Does this self-rating show any effect? Here’s a plot of each day colored by whether it was a potassium day:

There is little visible effect, and the formal Bayesian[^48] analysis is as weak as the sleep disturbances are strong:

So there is no apparent benefit from the potassium.

### Conclusion

This experiment was hastily done and has several weaknesses, some I mentioned before; in ascending order of importance:

1. dosage was not uniform

    Number of dosages varied from day to day as was convenient and doses were measured approximately with a spoon (since 4 grams is a pretty substantial amount, after all). Here is another objection I don’t think matters: lower than average doses may contribute to an underestimate of the effect size… but that implies that the effect size is even more extreme than -1.1! We are interested in problems that would shrink the effect size back to 0, not imply that it’s even worse than -1.1.
2. the randomization was incomplete

    As covered in the data section, there was a severe imbalance in sample size for each condition, so I stopped randomization for about a week. Intuitively, I don’t think there was anything special about that week in regard to getting very good sleep (as would be necessary to contribute to an overestimated effect size), but if anyone disagreed, it would not be hard to exclude those days and use the rest.
3. no blinding was done

    I am not sure how much this matters. I had no expectation that potassium would affect my sleep at all, one user specifically denied any effect, the only study suggested I’d find improvements, I did not want to find a negative effect much less such a severe effect, and the sheer strength of the effect over a multi-month period is a bit more than I would expect from any expectancy or placebo effect.
4. timing was not uniform

    Of the issues, this is the most important. If potassium has some stimulating effects as anecdotes claim, then timing may be causing all the sleep disturbances and not potassium per se. It might be exactly like vitamin D in this respect: taken in the evening, it badly damages sleep but taken in the morning, it does nothing or it improves sleep.

If I were to do a followup experiment, it would be blinded & randomized as usual, with consistent doses (eliminating objections 1-3), but more importantly, the dose would be consumed upon awakening.

I am not sure I will bother with a followup experiment. Potassium is not of particular interest to me, my existing supply is low after months of consumption, I observed no subjective improvements on consumption, and so I am not inclined to run the risk of damaging more months of sleep. Other people can do that.

## Potassium morning use

As it happened, I managed to retrieve my pill-making machine and spare gel capsules, and I do hate to waste perfectly good potassium citrate powder, so I decided to do a morning experiment. I made 3x24 potassium pills and 3x24 brown rice pills (out of flour); I take one set of 3 pills each morning, randomly picking. This procedure addresses all 4 issues, and will answer the question about whether potassium’s sleep disturbance is due to a timing issue like that of caffeine and vitamin D. Analysis will be the same as before: 3 metrics of sleep disturbance, and then daily self-rating. (I didn’t devise a paired-blocks setup since my marked containers were in use elsewhere; as often happens I ran out of one set of pills first, the rice placebo pills, on 10 February 2013, and made another batch of 24 rice placebo pills. The last potassium pill was 21 February 2013.)

### Analysis

Subjectively, I noticed nothing on what turned out to be the potassium days, unlike in the first experiment.

#### Sleep disturbances

Running the analysis the same way as before, we get a small increase in sleep disturbances (d=0.15, higher is worse) but the effect could easily be nothing[^49]:

I suspect there really is an underlying causal effect: the first experiment indicated a large increase in sleep disturbances, and a much smaller one is in line with my expectations of the effect of a smaller standardized dose first thing upon waking.

But practically speaking, this small disturbance would be acceptable if it came with some benefit.

#### Mood/productivity

The results look almost identical to before[^50]:

### Conclusion

A much higher-quality experiment with more favorable conditions for potassium showed a result consistent with some harm to my sleep, and no benefit. I will not continue using potassium.

# X

In the middle of the five-fold experiment, I paused part of it to run a more interesting self-experiment using X; I included sleep metrics to check for disturbances. X did not seem to affect latency, total sleep, or awakenings, but did improve morning feel (d=0.42), although not statistically-significantly after multiple correction. Unfortunately, given that it seemed to negatively affect more important metrics like the self-rating of mood/productivity & creativity, this is not nearly enough to begin to justify use of X for me.

# In progress

Someone suggested that instead of running experiments serially, with limited sample sizes (because I am impatient to try the next interesting suggestion), I could instead take a step up in statistical sophistication and use a factorial experiment design: use multiple experimental interventions simultaneously for a much larger sample size, and then run ANOVA analyses rather than simpler two-sample t-tests. No less than R.A. Fisher praises multifactorial experiments as being more efficient: squeezing more data out of a given sample. Hence, I thought a crazy thought: my lithium experiment was going to run for ~360 days, and so I kept putting it off. But what if I ran multiple experiments for 360 days? If I had 4 or 5, then by the end of the year, I would have 5 results to show, and I would have the statistical equivalent of more than n=72 ($\frac{360}{5}$) for each experiment. Win-win.

Classic multifactorial designs arrange to have every possible combination of the n experiments happen on some day or other (such an arrangement is called a full factorial design). However, with 5 experiments, each of which has 2 states (on and off), that means I only have $2^5=32$ possible arrangements, all of which ought to be covered over 360 days, terminating in March 2013. (It actually will take much longer, as I paused the lithium sub-experiment for several months to run the X self-experiment.)

So I will be lazy and will independently randomize each experiment. What are my 5 chosen interventions?
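As a quick sanity check on the lazy approach, here is a small R simulation of how independent daily coin flips spread 360 days over the 32 possible arrangements. It is a sketch under simplifying assumptions: every intervention is flipped independently each day (whereas lithium is actually randomized in 7-day blocks), and the seed and variable names are arbitrary choices of mine.

```r
set.seed(360)  # arbitrary; only so the sketch is reproducible
interventions <- c("lithium", "redshift", "pushups", "meditation", "masturbation")
n_days <- 360

# One independent fair coin flip per intervention per day (TRUE = on).
# Simplification: ignores lithium's 7-day blocks.
flips <- matrix(runif(n_days * length(interventions)) < 0.5,
                nrow = n_days, dimnames = list(NULL, interventions))

# All possible on/off arrangements: 2^5 = 32 rows.
arrangements <- expand.grid(rep(list(c(FALSE, TRUE)), length(interventions)))
n_arrangements <- nrow(arrangements)

# Label each day with its arrangement (e.g. "01101") and count days per label.
day_label <- apply(flips, 1, function(day) paste(as.integer(day), collapse = ""))
days_per_arrangement <- table(day_label)

n_arrangements                # 32
n_days / n_arrangements       # 11.25 days per arrangement on average
length(days_per_arrangement)  # how many arrangements actually occurred
range(days_per_arrangement)   # fewest and most days any arrangement received
```

With an average of about 11 days per arrangement, it is unlikely that any arrangement is missed entirely, though the counts will be uneven in a way a hand-built factorial schedule would avoid.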

## Lithium

Rationale & procedure in the Nootropics page. Randomized in 7-day paired blocks. Blinded.

## Redshift/f.lux

My earlier melatonin experiment found it helped me sleep. Melatonin secretion is also influenced by the color of light (some references can be found in my melatonin article), specifically blue light tends to suppress melatonin secretion while redder light does not affect it. (This makes sense: blue/white light is associated with the brightest part of the day, while reddish light is the color of sunsets.) Electronics and computer monitors frequently emit white or blue light. (The recent trend of bright blue LEDs is particularly deplorable in this regard.) Besides the plausible suggestion about melatonin, reddish light impairs night vision less and is easier to see under dim conditions: you may want a blazing white screen at noon so you can see something, but in a night setting, that is like staring for hours straight into a fluorescent light.

Hence, you would like to both dim your monitor and also shift the color temperature towards the lower, redder end of the spectrum with a utility like Redshift.

But does it actually work? An experiment is called for!

The suggested mechanism is through melatonin secretion. So we’d look at all the usual sleep metrics plus mood plus an additional one: what time I go to bed. One of the reasons I became interested in melatonin was as a way of getting myself to go to bed rather than stay up until 3 AM - a chemically enforced bedtime - and it seems plausible that if Redshift is reducing the interference of the computer monitor, it will make me less likely to stay up later (“but I don’t feel sleepy yet”).

### Power calculation

The earlier melatonin experiment found somewhat weak effects with >100 days of data, and one would expect that actually consuming 1.5mg of melatonin would be a stronger intervention than simply shifting my laptop screen color. (What if I don’t use my laptop that night? What if I’m surrounded by white lights?) 30 days is probably too small, judging from the other experiments; 60 is more reasonable, but 90 feels more plausible.

It may be time to learn some more statistics, specifically how to do statistical power calculations for sample size determination. As I understand it, a power calculation is an equation balancing 4 quantities: your sample size, the effect size, the significance level (eg the old p<0.05), and the power (the chance of detecting a real effect); if you have 3, you can deduce the fourth. So if you already knew your sample size and your effect size and had picked a significance level, you could predict how likely your results would be to reach it. In this specific case, we can specify our significance and power at the usual levels, and we can guess at the effect size, but we want to know what sample size we should have.

Let’s pin down the effect size: we expect any Redshift effect to be weaker than melatonin supplementation, and the most striking change in melatonin (the reduction in total sleep time by ~50 minutes) had an effect size of 0.37. As usual, R has a bunch of functions we can use. Stealing shamelessly from an R guide, and reusing the means and standard deviations from the melatonin experiment, we can begin asking questions like: suppose I wanted a 90% chance of my experiment producing a solid result of p<0.01 (not 0.05, so I can do multiple correction) if the Redshift data looks like the melatonin data and acts the same way?

install.packages("pwr", depend = TRUE)
library(pwr)
pwr.t.test(d=(456.4783-407.5312)/131.4656,power=0.9,sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation

n = 96.63232
d = 0.3723187
sig.level = 0.01
power = 0.9
alternative = greater

NOTE: n is number of *pairs*

n is pairs of days, so each n is one day on, one day off; so it requires 194 days! Ouch, but OK, that was making some assumptions. What if we say the effect size was halved?

pwr.t.test(d=((456.4783-407.5312)/131.4656)/2,power=0.9,sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation

n = 378.3237

That’s much worse (as one should expect - the smaller an effect or desired p-value or chance you don’t have the power to observe it, the more data you need to see it). What if we weaken the power and significance level to 0.5 and 0.05 respectively?

pwr.t.test(d=((456.4783-407.5312)/131.4656)/2,power=0.5,sig.level=0.05,type="paired",alternative="greater")

Paired t test power calculation

n = 79.43655
d = 0.1861593

This is more reasonable, since n=80 or 160 days will fit within the experiment but look at what it cost us: it’s now a coin-flip that the results will show anything, and they may not pass multiple correction either. But it’s also very expensive to gain more certainty - if we halve that 50% chance of finding nothing, it basically doubles the number of pairs of days we need from 79 to 157:

pwr.t.test(d=((456.4783-407.5312)/131.4656)/2,power=0.75,sig.level=0.05,type="paired",alternative="greater")

Paired t test power calculation

n = 156.5859
d = 0.1861593

Statistics is a harsh master. What if we solve the equation for a different variable, power or significance? Maybe I can handle 200 days, what would 100 pairs buy me in terms of power?

pwr.t.test(d=((456.4783-407.5312)/131.4656)/2,n=100,sig.level=0.05,type="paired",alternative="greater")

Paired t test power calculation

n = 100
d = 0.1861593
sig.level = 0.05
power = 0.5808219

Just 58%. (But at p=0.01, n=100 only buys me 31% power, so it could be worse!) At 120 pairs/240 days, I get 65% power, so it may all be doable. I guess it’ll depend on circumstances: ideally, a Redshift trial will involve no work on my part, so the real question becomes what quicker sleep experiments does it stop me from running and how long can I afford to run it? Would it painfully overlap with things like the lithium trial?

Speaking of the lithium trial, the plan is to run it for a year. What would 2 years of Redshift data buy me even at p=0.01?

pwr.t.test(d=((456.4783-407.5312)/131.4656)/2,n=365,sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation

n = 365
d = 0.1861593
sig.level = 0.01
power = 0.8881948

Nice! Of course, we have to expect to lose a good deal of statistical power due to interference/uncertainty from the other simultaneous experiments that will be fed into the ANOVA, but I don’t know how to calculate that.
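The one-off queries above can also be tabulated in a single pass. The sketch below swaps in base R's `power.t.test` (from the `stats` package) for `pwr.t.test`, so nothing needs installing; as far as I can tell the two use the same noncentral-t calculation for a one-sided paired test, so the figures should agree with those printed above. The helper name `power_at` and the particular sample sizes are my own choices.

```r
# Halved melatonin effect size, as in the calculations above.
d_half <- ((456.4783 - 407.5312) / 131.4656) / 2

# Hypothetical helper: power of a one-sided paired t-test with n pairs.
# sd = 1 so that delta is interpreted directly as a standardized effect size.
power_at <- function(n, sig) {
  power.t.test(n = n, delta = d_half, sd = 1, sig.level = sig,
               type = "paired", alternative = "one.sided")$power
}

pairs <- c(80, 100, 120, 157, 365)
power_table <- data.frame(
  pairs = pairs,
  days  = 2 * pairs,                             # each pair = one day on, one off
  p05   = sapply(pairs, power_at, sig = 0.05),   # power at p < 0.05
  p01   = sapply(pairs, power_at, sig = 0.01)    # power at p < 0.01
)
print(round(power_table, 2))
```

Reading across a row shows what a stricter significance level costs; reading down a column shows how slowly power grows with additional days.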

### Experiment

OK, power calculations aside, how exactly to run it? I don’t expect any bleed-over from day to day, so we randomize on a per-day basis. Each day must either have Redshift running or not. Redshift is run from cron every 15 minutes: `*/15 * * * * redshift -o`. (This is to deal with logouts, shutdowns, freezes, etc., that might kill Redshift as a persistent daemon.) We’ll change the crontab so that at the beginning of each day it runs:

@daily redshift -x; if ((RANDOM \% 2 < 1));
then touch ~/.redshift; echo "$(date +"\%d \%b \%Y"): on" >> ~/redshift.log;
else rm -f ~/.redshift; echo "$(date +"\%d \%b \%Y"): off" >> ~/redshift.log; fi

Then the Redshift call simply includes a check for the file’s existence:

*/15  * * * * if [ -f ~/.redshift ]; then redshift -o; fi

Now we have completely automatic randomization and logging of the experiment. As long as I don’t screw things up by deleting either file or uninstalling Redshift, and I keep using my Zeo, all the data is gathered and labeled nicely until I finish the experiment and do the analysis. Non-blinded, or perhaps I should say quasi-blinded - I initially don’t know, but I can check the logs or file to see what that day was, and obviously I will at some point in the night notice whether the monitor is reddened or not.
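When it comes time for analysis, the log has to be turned into a per-day variable that can be merged with the Zeo data. A minimal R sketch, assuming the log lines look like `12 Jan 2013: on` (the format the cron job above is meant to write); the example lines are made up, the column names are mine, and parsing `%b` month abbreviations depends on an English locale.

```r
# Made-up example lines; in practice: log_lines <- readLines("~/redshift.log")
log_lines <- c("10 Jan 2013: on", "11 Jan 2013: off", "12 Jan 2013: on")

# Split each line into its date part and its on/off part.
parts <- strsplit(log_lines, ": ", fixed = TRUE)

redshift <- data.frame(
  # Date parsing assumes English month abbreviations (locale-dependent).
  Date     = as.Date(vapply(parts, `[`, character(1), 1), format = "%d %b %Y"),
  # TRUE on days when Redshift was randomized on.
  Redshift = vapply(parts, `[`, character(1), 2) == "on"
)

table(redshift$Redshift)  # how many days fell in each condition
```

The resulting `Date` column is what one would join against the sleep records, and the table gives an early warning if the two conditions are badly unbalanced.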

As it turned out, I received a proof that I was not noticing the randomization. On 11 January 2013, due to Internet connectivity problems, I was idling on my computer and thought to myself that I hadn’t noticed Redshift turn my screen salmon-colored in a while, and I happened to idly try `redshift -x` (reset the screen to normal) and then `redshift -o` (immediately turn the screen red) - but neither did anything at all. Busy with other things, I set the anomaly aside until, a few days later, I traced the problem to a package I had uninstalled back on 25 September 2012 because my system didn’t use it - which it did not, but this had the effect of removing another package which turned out to set the default video driver to the proper driver, and so removing it forced my system to a more primitive driver which apparently did not support Redshift functionality[^51]! And I had not noticed for 3 solid months. This was a frustrating incident, but since it took me so long to notice, I am going to keep the 3 months’ data and keep them in the off category - this is not nearly as good as if those 3 months had varied (since now the on category will be underpopulated), but it seems better than just deleting them all.

So to recap: the experiment is 100+ days with Redshift randomized on or off by a shell script, affecting the usual sleep metrics plus time of bed. The expectation is that lack of Redshift will produce a weak negative effect: increasing awakenings & time awake & light sleep, increasing overall sleep time, and also pushing back bedtime.

### VoI

For background on value of information calculations, see the first calculation.

Like the modafinil day trial, this is another value-less experiment justified by its intrinsic interest. I expect the results will confirm what I believe: that red-tinting my laptop screen will result in less damage to my sleep by not forcing lower melatonin levels with blue light. The only outcome that might change my decisions is if the use of Redshift actually worsens my sleep, but I regard this as highly unlikely. It is cheap to run as it is piggybacking on other experiments, and all the randomizing & data recording is being handled by 2 simple shell scripts.

## Push-ups

Rather than dumbbells (might be hard to find in the dark), I decided to try out push-ups since I routinely do 25 push-ups after showering and it ought to be mentally easy to shift those push-ups to before/after bedtime. As before, alternate-day, but with a twist: on-days, I do the push-ups immediately before going to bed, but off-days entail immediately upon awakening. (I don’t exercise enough in general.) I began 21 September 2011.

I interrupted the experiment for a long period to run the vitamin D experiments; when I resumed on 8 May 2012, I decided to avoid the alternate-day procedure and instead randomize morning vs evening push ups with a coin. Non-blinded.

On 13 November 2012, I decided I was sufficiently convinced that exercise immediately before bed was damaging my sleep latency that I didn’t want to continue to pay the price of worse sleep, and I discontinued this variable. Hopefully the previous data will be sufficient to confirm or disconfirm any effect.

## Meditation

The practice of meditation can be time-intensive; a claimed anecdotal benefit is that one sleeps less and so the time requirement isn’t as bad as it may seem.

Meditation has been linked with sleep changes multiple times; see “Meditation and Its Regulatory Role on Sleep”. In particular, “Meditation acutely improves psychomotor vigilance, and may decrease sleep need” (Kaul et al 2010) found a correlation between long meditation and reduced sleep need. The general link seems plausible - that deliberate relaxation may reduce the need for another kind of relaxation (although I doubt meditation is going as far as reducing synaptic weights as the synaptic homeostasis hypothesis predicts, which I discuss in Drug heuristics) - but I can think of at least 2 plausible ways the correlation would not be causation (1. those with less sleep need can afford to spend time on meditation; 2. meditation is partially sleep so there’s no correlation or causation to explain).

Randomized on a daily basis: either 20-30[^52] minutes of meditation or none. (I am not sure what a good placebo would be so I will omit it.) Non-blinded. My meditation is nothing fancy: simple breath-following (based on early chapters of Mindfulness in Plain English).

Plausibly, any decrease in sleep need could be due to long-term changes in the brain itself, as meditation is known to affect areas like the prefrontal cortex. Kaul et al 2010 above did not randomize the long-term meditators’ use of meditation or apparently investigate whether sleep time averages correlated with meditation. If the changes are long-term, then there will be relatively little variation during the 360 days and instead a gradual trend of less sleep. If no clear effect shows up in the analysis, I’ll try a before-after comparison: compare n days before the experiment started to n days after the experiment and see if there is a difference in the averages.

### Power calculation

Kaul et al 2010 describes the long-term meditators as spending 2-3 hrs/day in meditation. (Their experiment used novices who meditated for 1 hour.) If meditation indeed reduces sleep time, but I am meditating for only $\frac{1}{3}$ of an hour, can I detect any effect?

The difference between the long-term meditators and their normal Indian counterparts was 5.2 hours of sleep per day versus 7.8. Assuming the worst case of 3 hours of meditation, this implies that meditation is indeed a net cost in time (5.2 + 3 = 8.2 > 7.8), but also that each hour of meditation is equivalent to almost an hour of sleep ($\frac{7.8-5.2}{3}=0.866...$). So at that conversion rate, 20 minutes of meditation translates to ~17.32 minutes less sleep. We will steal code and data from the previous Redshift power calculation: assume the same control sleep, same standard deviation, and subtract 17.32 from the control to get the true mean of the intervention:

install.packages("pwr", depend = TRUE)
library(pwr)
pwr.t.test(d=(456.4783 - (456.4783 - 17.32))/131.4656,power=0.5,type="paired",alternative="greater")

Paired t test power calculation
n = 157.237

# we're getting 360 days or 180 pairs; let's ask for more than 50-50 power;
# what does n = 180 buy us? Not much!
pwr.t.test(d=(456.4783 - (456.4783 - 17.32))/131.4656,power=0.55,type="paired",alternative="greater")

Paired t test power calculation

n = 181.9631

# how many pairs *do* we need for good results?
pwr.t.test(d=(456.4783 - (456.4783 - 17.32))/131.4656,power=0.75,
sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation
n = 521.5252

pwr.t.test(d=(456.4783 - (456.4783 - 17.32))/131.4656,power=0.56,
sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation
n = 356.2923

This is discouraging. With 180 pairs, we only have a 55% chance of seeing anything at p=0.05? That’s awful! But there’s no point in looking further into this power calculation: I’m not going to be doing a paired t-test, after all, but some sort of ANOVA, and I’m not sure how much power the interfering experiments cost me. The first calculation is the most important: to satisfy somewhat reasonable criteria (50% power at p=0.05), I need ~157 pairs, somewhat fewer than the ~180 pairs I will get, which leaves some margin of safety.

### VoI

For background on value of information calculations, see the first calculation.

I find meditation useful when I am screwing around and can’t focus on anything, but I don’t meditate as much as I might because I lose half an hour. Hence, I am interested in the suggestion that meditation may not be as expensive as it seems because it reduces sleep need to some degree: if for every two minutes I meditate, I need one less minute of sleep, that halves the time cost - I spend 30 minutes meditating, gain back 15 minutes from sleep, for a net time loss of 15 minutes. So if I meditate regularly but there is no substitution, I lose out on 15 minutes a day. Figure I skip one day in every three; that’s a total lost time of $\frac{15 \times \frac{2}{3} \times 365.25}{60}=61$ hours a year or $427 at minimum wage. I find the theory somewhat plausible (60%), and my year-long experiment has roughly a 55% chance of detecting the effect size (estimated based on the sleep reduction in an Indian sample of meditators). So $\frac{427-0}{\ln 1.05} \times 0.60 \times 0.55=2888$. The experiment itself is unusually time-intensive, since it involves ~180 sessions of meditation, which if I am overpaying translates to 45 hours ($\frac{180 \times 15}{60}$) of wasted time or $315. But even including the design and analysis, that’s less than the calculated value of information.
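The arithmetic of this value-of-information estimate is easy to re-run with different guesses. A small R sketch follows; the variable names are mine, and the $7/hour wage is not stated above but inferred from the figures ($427 for 61 hours, $315 for 45 hours).

```r
wage <- 7                       # dollars/hour; inferred from 427/61, an assumption
minutes_lost_per_session <- 15  # net loss if sleep does not substitute
fraction_of_days <- 2 / 3       # meditating 2 days out of every 3

# Hours lost per year if there is no substitution: 60.875, rounded to 61.
hours_lost_per_year <- minutes_lost_per_session * fraction_of_days * 365.25 / 60
annual_loss <- round(hours_lost_per_year) * wage   # dollars per year

discount_rate <- 0.05  # annual discount rate used for the perpetuity
p_theory <- 0.60       # my probability that the theory is true
p_detect <- 0.55       # chance the experiment detects the effect

# Discounted value of the annual loss, weighted by both probabilities.
voi <- (annual_loss - 0) / log(1 + discount_rate) * p_theory * p_detect

# Cost of the experiment: ~180 sessions of 15 possibly-wasted minutes each.
experiment_cost <- 180 * 15 / 60 * wage

round(voi)        # 2888
experiment_cost   # 315
```

The estimate is most sensitive to the two probabilities: halving either one halves the value, while the experiment's cost stays fixed.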

This example demonstrates that drugs aren’t the only expensive things for which you should do extensive testing.

## Masturbation

Orgasm has been linked occasionally with changes in sleep latency, although one 1985 experimental study found no changes. Schenck et al 2007 covers some inconclusive followup studies on related matters like whether arousal or brief viewing of porn interferes with sleep (no).

Randomized on a daily basis before going to bed; no placebo, but abstinence. Non-blinded. Since the theory has always been about a very short-term effect, there’s no need to worry about daytime activities. (This would only matter if I were testing something like the folk wisdom that masturbation reduces testosterone levels, where the timing is not as important as the quantity.)

## Treadmill

My expectations are that the treadmill will increase how much I sleep, decrease sleep latency, and possibly have a small negative effect on productivity (which may be offset by an improvement in mood and less need to get a daily walk). If it were intense aerobic fitness, I might expect an increase in cognitive abilities of various sorts, but it’s not, so I don’t expect any effect on Mnemosyne scores.

### Power

Starting it part way, I lose potential power: there are only ~330 days left. The effect of most interest is productivity, where I expect a negative effect, but we also need a more stringent p-value since we’re looking at so many variables; so 330 samples gives a floor on detectable effect size of

pwr.t.test(n=(330/2),power=0.75,sig.level=0.01,type="paired",alternative="less")

Paired t test power calculation

n = 165
d = -0.2355713

Not that great. We may wind up being able to conclude nothing about the effect on productivity; similarly for sleep - the effect would have to be comparable to vitamin D or melatonin to be detectable.
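To see how the detection floor moves with the number of days available, the same calculation can be repeated for several sample sizes. This sketch uses base R's `power.t.test` rather than `pwr.t.test`, and it reports the magnitude of the smallest detectable standardized effect, dropping the negative sign; the helper name `detectable_d` and the alternative day counts are my own choices.

```r
# Hypothetical helper: smallest standardized effect (in absolute value)
# detectable by a one-sided paired t-test with days/2 pairs.
detectable_d <- function(days, power = 0.75, sig = 0.01) {
  # sd defaults to 1, so the solved delta is a standardized effect size.
  power.t.test(n = days / 2, power = power, sig.level = sig,
               type = "paired", alternative = "one.sided")$delta
}

floor_330 <- detectable_d(330)  # the ~330 days left in this experiment
floor_330

# How the floor changes with fewer or more days:
sapply(c(180, 330, 360, 730), detectable_d)
```

The floor shrinks only with the square root of the sample size, so even doubling the days does not come close to halving the detectable effect.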

### VoI

The VoI calculation for this investigation is very difficult: the treadmill may improve sleep, and it may improve or worsen productivity, but regardless it provides very valuable exercise; scrapping the practice has immediate cash value; but none of this is certain and there are few guides from experimental studies.

If it turns out the treadmill is not helpful, I can probably sell it for ~$100 based on prices listed in Craigslist. If it’s helpful, I gain considerable exercise (1-2MPH implies an 8-hour day could be 8-16 miles of exercise a day!) with the related benefits. I strongly suspect that this much exercise would influence my sleep for the better, but I’m not sure the treadmill desk really does allow for productivity like regular sitting does. If it does reduce productivity somewhat but I otherwise can adapt, it’s probably still a net gain because of the extra exercise. However, a small-to-medium decrease - let’s say an effect size of d<=-0.4 - would be enough to cause me to scrap the treadmill. This is highly unlikely. The large sample gives a very good shot at detecting it. Running the experiment is relatively easy since the treadmill desk can be set up and put away in ~5 minutes. Without running numbers on this one, my best guess is that the VoI is negative; so this is another experiment I am doing because it is interesting and other people may find it interesting, rather than because running the experiment makes economic sense.

# Appendix

### Inverse correlation of sleep quality with productivity?

Curiously, playing around with the full potassium data after the 2013 morning experiment, poor sleep quality seemed to correlate with higher mood/productivity ratings:

cor.test(pot$Disturbance, pot$MP)

Pearson's product-moment correlation

data:  pot$Disturbance and pot$MP
t = 1.224, df = 49, p-value = 0.2269
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
-0.1085 0.4275
sample estimates:
cor
0.1722

#### Hypotheses

While not statistically-significant, this inverse correlation comes as a surprise and I thought it worth thinking about more. I have a couple theories on what could be going on:

1. it could be an artifact and actually better sleep means better performance: I’ve always been concerned about the possibility of off-by-one errors in my data or analyses. If better sleep meant better performance (as one would naively suspect), and either sleep data or performance data was shifted by one day, then you would observe the exact opposite. One would have to carefully check the data and make sure every field is referring to the time it should. If an entry records 10hrs sleep for 3 February 2012, does that refer to sleep that morning which is necessary because you were awake during 2 February 2012, or does it refer to the sleep you engage in that evening (you go to bed at 11pm 3 February 2012 and that is the sleep data being used)? This seems unlikely, since such an error should screw up all sorts of other analyses (for example such a flip ought to have claimed that potassium would help sleep, if days were being reversed).
2. it could be that on productive days, you leap out of bed; but if you are depressed, unmotivated, apathetic, you might hang around in bed for a while after the alarm rings. Depressed people sometimes sleep more than regular people; for pretty much this reason, I’d guess. This could be checked by looking at sleep quality indicators in the beginning or middle of the night. For example time to fall asleep (higher on more productive days in this sample), or percentage in deep sleep (mostly done towards the beginning and middle of a sleep; seemed to be lower for productive days). One could try to test the sluggard hypothesis: how much past an alarm one snoozed.
3. it’s a temporary correlation of this time period, perhaps related to the potassium, perhaps not. This is testable: with more data, does the correlation shrink or go away?
4. I have sometimes wondered if I am depressed. One of the curious facts about depression is that sleep deprivation can temporarily relieve the symptoms of depression in people who prefer evenings (owls), and I am indeed an owl. What does this imply? We can do some back-of-the-envelope estimates. Wikipedia reports a very high depression incidence; we’ll call it a 25% lifetime risk. But presumably the treatment only works if one is actually in a depressive episode, and while it’s unclear what the distribution or length of depression period (as opposed to individual episodes) might be, it seems to be closer to years than months or decades, so we’ll put it at ~3 years out of an adult lifespan of ~60 years or a per-year risk of $\frac{1}{20}=0.05$. On closer examination of Selvi et al 2006, the morning/evening split only appears with the total sleep deprivation procedure (morning types see their mood worsen, evening sees it improve) while with partial sleep deprivation both groups seem to see an improvement in their mood; since I rarely skip sleep entirely and such nights are dropped from the Zeo data, the total sleep deprivation results are irrelevant, but then my chronotype being evening doesn’t matter. Finally, the sleep deprivation papers estimate <60% effectiveness in the depressed, so that knocks the possibility that both I am depressed and partial sleep deprivation helps me to <0.025. 2.5% is not a large possibility; and my vague speculation and a small inverse correlation do not seem like they would increase that possibility a lot.

(If it’s not these, I don’t have any suggestion on why it might be. Why would poor sleep either cause productivity or be caused by something that later also causes productivity?)

#### Analysis

But before rashly assuming I am depressive or engaging in personally costly self-experiments like sleep deprivation, I decided on 26 April 2013 to check the correlation on a larger dataset. Typing up my full self-rating dataset of 416 days and cleaning up all the data[54], I rechecked the correlation: r=0.0665[55]. This is noticeably smaller (hence, less practically relevant) than the previous correlation, is also not statistically-significant, and shrinking is what one would expect from a spurious relationship.

To be more sure, I reused some of the techniques from my analysis of the effect of weather on my mood/productivity (specifically, ordinal logistic regression) and looked for a relationship; the result was similar, an odds ratio which was inverse but close to no effect (1.057[56]). More importantly, when all the other variables are taken into account in the logistic regression, things change[57]: with other data to condition on, the inverse relationship of sleep quality with mood/productivity reverses and becomes the expected relationship (an increase in sleep disturbances predicts lower mood/productivity); many of the other variables turn out to be far stronger predictors (bigger odds); and some of the signs look odd (how can total sleep time predict increased mood/productivity, yet increasing all forms of sleep - REM/light/deep - predicts decreased mood/productivity‽). I attempted to construct a simpler model, which wound up ignoring any metric of sleep disturbance and ignoring all but 3 variables, and concluding that Morning Feel was the most important predictor[58] - which makes a lot of sense to me, and confirms my previous experiments’ focusing on the Morning Feel variable.

Given this weakening and in the absence of any corroborating information, I consider it highly unlikely that the original correlation is reflecting an anti-depressant effect due to sleep deprivation. A followup in a few years may be warranted to see if a larger still dataset will shrink the correlation closer to zero.
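The off-by-one worry in the first hypothesis can be probed directly by computing the correlation at a few day-shifts and seeing which alignment is strongest. The following is only a sketch: `lagged_cor` is a hypothetical helper written for this illustration, and the two vectors are made-up numbers rather than my data.

```r
# Correlation between x[t] and y[t + lag]; lag = 0 is the ordinary same-day correlation.
# `lagged_cor` is a hypothetical helper for this sketch, not part of any package.
lagged_cor <- function(x, y, lag) {
  n <- length(x)
  stopifnot(length(y) == n, abs(lag) < n - 1)
  if (lag >= 0) {
    cor(x[seq_len(n - lag)], y[(1 + lag):n], use = "complete.obs")
  } else {
    cor(x[(1 - lag):n], y[seq_len(n + lag)], use = "complete.obs")
  }
}

# Made-up example: 'mood' is 'disturbance' recorded one day late,
# so the true alignment is lag = +1
disturbance <- c(1, 2, 3, 4, 5, 6)
mood        <- c(9, 1, 2, 3, 4, 5)

# Correlations at lag -1, 0, +1; a misaligned dataset shows its peak away from lag 0
sapply(c(-1, 0, 1), function(k) lagged_cor(disturbance, mood, k))
```

If the strongest (or most sensible) relationship in real data showed up at a lag of ±1 rather than 0, that would be a reason to re-check which night each row refers to.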
## SDr lucid dreaming: exploratory data analysis

In October 2012, an acquaintance offered me an extract from his free-form data on lucid dreaming which he had been compiling since 2004, to see what insights I could extract. In May 2013, I augmented it with another 60 entries.

### Data cleaning

The original text was a serious mess, and I put several hours into cleaning it up and organizing it into something more sensible. This wasn’t enough, so I wrote an ugly Haskell program to parse it into a quasi-CSV file:

import Data.List (isInfixOf, isPrefixOf, intercalate)
import Data.List.Split (splitOn) -- http://hackage.haskell.org/package/split

main :: IO ()
main = do txt <- readFile "2012-sdr-dream.txt"
          -- data lines are everything that is not a '#' comment
          let txt' = filter (not . isPrefixOf "#") $ lines txt
          -- the CSV header is stored in a comment line beginning "# Sleep Date,"
          let header = drop 2 $ head $ filter (isPrefixOf "# Sleep Date,") $ lines txt
          let fields = map (splitOn ",") txt'
          let csvs = map convert fields
          putStrLn $ unlines (header : map show csvs)

-- one parsed journal entry; 'show' renders it as a CSV row
data CSVEntry = CSVEntry { sleepDate :: String, totalZ :: Int,
                           wakeTime :: String, intensity :: String, recall :: String,
                           emotion :: String, interrupted :: Bool, melatonin :: Bool, lucid :: String }
instance Show CSVEntry where
    show a = intercalate "," [sleepDate a, if totalZ a == 0 then "" else show (totalZ a),
                              wakeTime a, intensity a, recall a, emotion a,
                              if interrupted a then "1" else "0", if melatonin a then "1" else "0", lucid a]

-- pick each field out of the comma-separated chunks by its label
convert :: [String] -> CSVEntry
convert xs = CSVEntry { sleepDate = safeHead $ filter (\x -> isInfixOf "." x || isInfixOf "20" x) xs,
                        totalZ = timeToMinutes $ drop 12 $ safeHead $ filter (isInfixOf "dreamtime: ") xs,
                        wakeTime = drop 7 $ safeHead $ filter (isInfixOf "wake: ") xs,
                        intensity = drop 6 $ safeHead $ filter (isInfixOf "int: ") xs,
                        recall = drop 9 $ safeHead $ filter (isInfixOf "recall: ") xs,
                        emotion = drop 6 $ safeHead $ filter (isInfixOf "emo: ") xs,
                        lucid = drop 8 $ safeHead $ filter (isInfixOf "lucid: ") xs,
                        interrupted = any (isInfixOf "interrupted") xs,
                        melatonin = any (isInfixOf "melatonin") xs }
  where
    -- assumed definition: the first match, or "" if the field is absent
    safeHead :: [String] -> String
    safeHead []    = ""
    safeHead (x:_) = x

-- clock hour:minute to total minutes: timeToMinutes "4:30" ~> 270
timeToMinutes :: String -> Int
timeToMinutes a = if null a then 0 else let (x,y) = break (==':') a
                                        in read x * 60 + read (tail y)

### Analysis

This was usable. My next question was: since none of his routines were randomized and correlations were all that one could extract, what correlations were in his data?

table <- read.csv("http://www.gwern.net/docs/2013-sdr-dream.csv")
summary(table)
Sleep.Date     Total.Z        Wake.Time     Intensity        Recall         Emotion
2011.10.02:  2   Min.   : 120           :217   Min.   :0.10   Min.   :0.000   Min.   :-0.50
2011.11.26:  2   1st Qu.: 480   16:00   :  3   1st Qu.:0.30   1st Qu.:0.200   1st Qu.: 0.00
2012.02.28:  2   Median : 600   11:00   :  2   Median :0.40   Median :0.300   Median : 0.20
2012.04.15:  2   Mean   : 613   13:23:00:  2   Mean   :0.44   Mean   :0.367   Mean   : 0.18
2012.06.21:  2   3rd Qu.: 720   19:17:00:  2   3rd Qu.:0.50   3rd Qu.:0.500   3rd Qu.: 0.40
2013.01.23:  2   Max.   :1320   4:55:00 :  2   Max.   :7.00   Max.   :1.000   Max.   : 0.70
(Other)   :316   NA's   :8      (Other) :100   NA's   :94     NA's   :26      NA's   :296
Interrupted     Melatonin          Lucid      Day.quality
Min.   :0.00   Min.   :0.0000   Min.   :0.0   Min.   :0.10
1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.1   1st Qu.:0.30
Median :0.00   Median :0.0000   Median :0.2   Median :0.40
Mean   :0.07   Mean   :0.0762   Mean   :0.2   Mean   :0.42
3rd Qu.:0.00   3rd Qu.:0.0000   3rd Qu.:0.2   3rd Qu.:0.52
Max.   :1.00   Max.   :1.0000   Max.   :0.6   Max.   :0.70
NA's   :76                      NA's   :319   NA's   :312

# These 2 date fields haven't been turned into anything useful, so we'll just delete them:
table$Wake.Time <- NULL; table$Sleep.Date <- NULL

# Warning: 'Lucid' has just 9 datapoints, and 'Melatonin' just 6!
# Table cleaned up heavily by hand from default R output:
# deleted duplicates, censored any correlation -0.1<x<0.1 etc.
cor(table,use="pairwise.complete.obs")
Recall  Emotion Interrupted Melatonin  Lucid  Day.quality
Total.Z                                    -0.12    -0.43  0.56
Intensity    0.35     0.37                           0.79
Recall                0.16      -0.16       0.14    -0.15
Emotion                          0.28      -0.14
Interrupted                                          0.91
Melatonin                                                  0.25

Much of the data is too impoverished to draw any suggestions from. The remaining correlations are:

• Intensity/Recall: r=0.35

The causality is likely Intensity->Recall; either one is probably impossible to experimentally manipulate.
• Intensity/Emotion: r=0.37

Causality could go either way or to a third factor; Emotion might be manipulable by intending to dream of disturbing topics, but might not.
• Interrupted/Recall: r=-0.16
• Interrupted/Emotion: r=0.28

Interruption is experimentally manipulable by eg. an alarm clock or roommate. Recall might be improved by some change in journaling, for example doing it at your bed instead of waiting until you’re on your computer. The positive correlation with Emotion suggests that, per the WILD methodology of lucid dreaming (see LaBerge & Rheingold, Exploring the World of Lucid Dreaming), a temporary awakening does increase the chance of a lucid dream (laden with emotion).
• Melatonin interestingly correlates with both day quality and with reduced sleep; this is interesting because Total.Z increasing also increased Day.quality so it’s not clear how melatonin could do both at the same time if more sleep is otherwise better. The correlations may be statistically-significant but the data is too wretched and the melatonin/day-quality variables too few to say anything further.
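Because `cor(..., use="pairwise.complete.obs")` computes each coefficient from whichever rows happen to have both variables, it helps to print how many rows stand behind each cell before reading anything into it. A minimal sketch, using a made-up 5-row data frame (the column names mirror the real ones, but the values are hypothetical):

```r
# Made-up data with the same kind of missingness as the dream log
d <- data.frame(Total.Z   = c(480, 600, NA, 720, 540),
                Lucid     = c(NA, 0.2, NA, NA, 0.1),
                Melatonin = c(0, 0, 1, 0, 1))

# 1 where a value was recorded, 0 where it is missing
obs <- ifelse(is.na(as.matrix(d)), 0, 1)

# Cell [i, j] = number of rows in which both variable i and variable j were recorded
n_pairs <- t(obs) %*% obs
n_pairs

# The matching correlations; a cell backed by only 2 rows is necessarily +1 or -1
cor(d, use = "pairwise.complete.obs")
```

Reading the two matrices side by side makes it obvious which large correlations rest on a handful of nights.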

(One observation that came to mind working on cleaning the data was that collection was very sparse, sporadic, and accidental-looking.)

So these general points suggest 3 future overlapping approaches:

1. deliberate use of interruptions (maybe randomized), to investigate effect on lucid dreaming
2. more systematic usage (perhaps randomized or blinded) of melatonin, to allow correlations or causal inferences to other variables
3. attacking the unsystematic data collection (perhaps it’s too much trouble to do all those variables each day?) by getting a Zeo to handle part of the data collection for you.

1. The obvious and cheaper alternative to the Zeo would be the Fitbit, one of the accelerometers. There aren’t many comparisons; Diana Sherman compared one night, and Joe Betts-LaCroix compared ~38 nights of data. In both cases, the Fitbit seemed to be pretty similar to the Zeo at estimating total sleep time (the only thing it can measure). Betts-LaCroix explicitly recommends the Zeo, but I’m not clear on whether that is due to the better data quality or because Fitbit made it hard to impossible for him to extract the detailed Fitbit data while Zeo offers easy exporting. In any case, I already have the Zeo and I’ve come to like the detailed information.

2. I had previously tried huperzine-A and subjectively noticed no effect from it, but I had no way of really noticing any effect on sleep, and Timothy Ferriss in his The Four-hour Body claims:

Taking 200 milligrams of huperzine-A 30 minutes before bed can increase total REM by 20-30%. Huperzine-A, an extract of Huperzia serrata, slows the breakdown of the neurotransmitter acetylcholine. It is a popular nootropic (smart drug), and I have used it in the past to accelerate learning and increase the incidence of lucid dreaming. I now only use huperzine-A for the first few weeks of language acquisition, and no more than three days per week to avoid side effects. Ironically, one documented side effect of overuse is insomnia. The brain is a sensitive instrument, and while generally well tolerated, this drug is contraindicated with some classes of medications. Speak with your doctor before using.

3. My own suspicion, given the existence of neuron-level sleep in mice, poor self-monitoring in humans, and anecdotal reports about polyphasic sleep, is that polyphasic sleep is a real & workable phenomenon but that it comes at the price of a large chunk of mental performance.

4. Kruschke 2012 argues that there is no need for people to use the old framework of p-values and null hypotheses etc, with their many well-known philosophical difficulties and misleading interpretations - interpretations I, alas, perpetuate in my analyses with my use of statistical significance:

Nevertheless, some people have the impression that conclusions from NHST and Bayesian methods tend to agree in simple situations such as comparison of two groups: Thus, if your primary question of interest can be simply expressed in a form amenable to a t-test, say, there really is no need to try and apply the full Bayesian machinery to so simple a problem. (Brooks, 2003, p. 2694) This article shows, to the contrary, that Bayesian parameter estimation provides much richer information than the NHST t-test, and that its conclusions can differ from those of the NHST t-test. Decisions based on Bayesian parameter estimation are better founded than NHST, whether the decisions of the two methods agree or not. The conclusion is bold but simple: Bayesian parameter estimation supersedes the NHST t-test.

Unfortunately, while I have no love for NHST, I did find it much easier to use the NHST concepts & code when learning how to do these analyses. In the future, hopefully I can switch to Bayesian techniques.

5. The analysis session in the R interpreter:

# Read in data w/ variable names in header; uninteresting columns deleted in OpenOffice.org

# "SSCF 10" is the variable ('sleep stealer custom factor')
# I edited the CSV to convert all '3' to '1' (& so a binary)
t.test(ZQ~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  ZQ by SSCF.10
t = 1.7467, df = 53.249, p-value = 0.08645
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-1.282982 18.602275
sample estimates:
mean in group 0 mean in group 1
84.95652        76.29688

t.test(Total.Z~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Total.Z by SSCF.10
t = 1.7696, df = 53.227, p-value = 0.08252
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-6.526613 104.420635
sample estimates:
mean in group 0 mean in group 1
456.4783        407.5312

t.test(Time.to.Z~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Time.to.Z by SSCF.10
t = 1.1337, df = 27.207, p-value = 0.2668
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-3.337966 11.587966
sample estimates:
mean in group 0 mean in group 1
19.000          14.875

t.test(Time.in.Wake~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Time.in.Wake by SSCF.10
t = 0.6333, df = 46.823, p-value = 0.5296
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-2.012675  3.861860
sample estimates:
mean in group 0 mean in group 1
6.565217        5.640625

t.test(Time.in.REM~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Time.in.REM by SSCF.10
t = 2.6237, df = 62.064, p-value = 0.01093
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
6.509461 48.165811
sample estimates:
mean in group 0 mean in group 1
145.4783        118.1406

t.test(Time.in.Deep~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Time.in.Deep by SSCF.10
t = 0.1547, df = 35.496, p-value = 0.8779
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.743305  9.021837
sample estimates:
mean in group 0 mean in group 1
57.21739        56.57812

t.test(Awakenings~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Awakenings by SSCF.10
t = 0.7956, df = 40.835, p-value = 0.4309
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.6648722  1.5290027
sample estimates:
mean in group 0 mean in group 1
2.869565        2.437500

t.test(Morning.Feel~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Morning.Feel by SSCF.10
t = -1.3606, df = 44.188, p-value = 0.1805
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.5496275  0.1065633
sample estimates:
mean in group 0 mean in group 1
2.952381        3.173913

t.test(Time.in.Light~SSCF.10, data=mydata)

Welch Two Sample t-test

data:  Time.in.Light by SSCF.10
t = 1.2354, df = 46.426, p-value = 0.2229
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-13.15648  54.99751
sample estimates:
mean in group 0 mean in group 1
254.2174        233.2969

6. The following R re-runs the above analysis but with options that increase the power of the results - but only if some assumptions about melatonin’s effects are correct; see later Vitamin D discussion of optimistic and pessimistic assumptions.

t.test(ZQ~SSCF.10, data=mydata, var.equal=TRUE);
t.test(Total.Z~SSCF.10, data=mydata, var.equal=TRUE)

Two Sample t-test

data:  ZQ by SSCF.10
t = 1.5091, df = 85, p-value = 0.135
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-2.749923 20.069217
sample estimates:
mean in group 0 mean in group 1
84.95652        76.29688

Two Sample t-test

data:  Total.Z by SSCF.10
t = 1.5291, df = 85, p-value = 0.13
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-14.69983 112.59385
sample estimates:
mean in group 0 mean in group 1
456.4783        407.5312

t.test(Time.to.Z~SSCF.10, data=mydata, alternative="less", var.equal=TRUE);
t.test(Time.in.Wake~SSCF.10, data=mydata, alternative="less", var.equal=TRUE);

Two Sample t-test

data:  Time.to.Z by SSCF.10
t = 1.4588, df = 85, p-value = 0.9258
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 8.827328
sample estimates:
mean in group 0 mean in group 1
19.000          14.875

Two Sample t-test

data:  Time.in.Wake by SSCF.10
t = 0.5783, df = 85, p-value = 0.7177
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 3.58344
sample estimates:
mean in group 0 mean in group 1
6.565217        5.640625

t.test(Time.in.REM~SSCF.10, data=mydata, var.equal=TRUE);
t.test(Time.in.Deep~SSCF.10, data=mydata, var.equal=TRUE)

Two Sample t-test

data:  Time.in.REM by SSCF.10
t = 2.1307, df = 85, p-value = 0.036
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
1.827653 52.847619
sample estimates:
mean in group 0 mean in group 1
145.4783        118.1406

Two Sample t-test

data:  Time.in.Deep by SSCF.10
t = 0.163, df = 85, p-value = 0.8709
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.156889  8.435422
sample estimates:
mean in group 0 mean in group 1
57.21739        56.57812

t.test(Awakenings~SSCF.10, data=mydata, alternative="less", var.equal=TRUE);
t.test(Morning.Feel~SSCF.10, data=mydata, alternative="greater", var.equal=TRUE)

Two Sample t-test

data:  Awakenings by SSCF.10
t = 0.7756, df = 85, p-value = 0.7799
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 1.358521
sample estimates:
mean in group 0 mean in group 1
2.869565        2.437500

Two Sample t-test

data:  Morning.Feel by SSCF.10
t = -1.2918, df = 65, p-value = 0.8995
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
-0.5076874        Inf
sample estimates:
mean in group 0 mean in group 1
2.952381        3.173913

t.test(Time.in.Light~SSCF.10, data=mydata, alternative="less", var.equal=TRUE)

Two Sample t-test

data:  Time.in.Light by SSCF.10
t = 1.1324, df = 85, p-value = 0.8697
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 51.64192
sample estimates:
mean in group 0 mean in group 1
254.2174        233.2969
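
How the `alternative=` option interacts with the sign of the observed difference can be seen on a toy example. This is a sketch on made-up numbers (the vectors `off` and `on` are hypothetical, not Zeo data): for a t-test, the one-sided p-value is half the two-sided one when the difference points in the hypothesized direction, and one minus that half when it points the other way, which is how values like 0.9258 arise above.

```r
# Hypothetical minutes-to-fall-asleep under two conditions (made-up numbers)
off <- c(19, 22, 17, 25, 21)
on  <- c(14, 16, 13, 18, 15)

# Pooled-variance tests, as with var.equal=TRUE above
two_sided <- t.test(off, on, var.equal = TRUE)$p.value
greater   <- t.test(off, on, var.equal = TRUE, alternative = "greater")$p.value
less      <- t.test(off, on, var.equal = TRUE, alternative = "less")$p.value

# Here mean(off) > mean(on), so "greater" is the matching direction
c(two_sided = two_sided, greater = greater, less = less)
all.equal(greater, two_sided / 2)       # direction matches: p is halved
all.equal(less, 1 - two_sided / 2)      # direction mismatched: p goes towards 1
```

So a one-sided test only buys power when the direction is specified correctly in advance, with the first group in the formula taken as the reference.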

7. The usual way to correct for the issue of multiple comparisons inflating results (a big problem in epidemiology and why their results are so often false) is to use a Bonferroni correction - if I look at the p-values for 7 Zeo metrics, I wouldn’t consider any to be statistically-significant at p=0.05 unless they were actually statistically-significant at $\frac{0.05}{7}=0.00714\approx0.007$, which is even more stringent than the rarer p=0.01 criterion. With the even stronger criterion p=0.007, it’s a safe bet that none of my tests give statistically-significant results. Which may be the right thing to conclude, since all my data is just n=1 and unreliable in many ways, but still, the Bonferroni correction is not being very helpful here.

The caveat is that the Bonferroni correction is intended for use on independent data, while the Zeo metrics are all very dependent, some by definition (eg. ZQ is defined partly as what the REM sleep length was, AFAIK). So while the Bonferroni correction will still do the job of only letting through really statistically-significant data, it’ll do so by throwing out way more potentially good results than one has to. (It’ll avoid some false positives by making many false negatives.) So what should we do?

Andy McKenzie suggested limiting our false discovery rate by using the method of Benjamini & Hochberg 1995:

…let’s say that you test 6 hypotheses, corresponding to different features of your Zeo data. You could use a t-test for each, as above. Then aggregate and sort all the p-values in ascending order. Let’s say that they are 0.001, 0.013, 0.021, 0.030, 0.067, and 0.134.

Assume, arbitrarily, that you want the overall false discovery rate to be 0.05, which is in this context called the q-value. You would then sequentially test, from the last value to the first, whether the current p-value is less than $\frac{\text{the current index} \times \text{the false discovery rate}}{\text{the overall number of hypotheses}}$. You stop when you get to the first true inequality and call the p-values of the rest of the hypotheses [statistically-]significant.

So in this example, you would stop when you correctly call $0.030 < \frac{4 \times 0.05}{6}$, and only the hypotheses corresponding to the first four [smallest] p-values would be called [statistically-]significant.
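
The quoted step-up procedure can be written out by hand and compared against R's built-in `p.adjust`. This is a sketch: `bh_significant` is a hypothetical helper named for this illustration, and it uses the usual "less than or equal" form of the comparison, which makes no difference for these six p-values.

```r
# p-values from the quoted example, and the chosen false discovery rate
p <- c(0.001, 0.013, 0.021, 0.030, 0.067, 0.134)
q <- 0.05

# Hypothetical helper: TRUE for each hypothesis the step-up rule calls significant
bh_significant <- function(p, q) {
  n <- length(p)
  ord <- order(p)                        # positions of p sorted ascending
  thresholds <- seq_len(n) * q / n       # index * FDR / number of hypotheses
  passing <- which(p[ord] <= thresholds)
  k <- if (length(passing) > 0) max(passing) else 0   # largest index that passes
  significant <- rep(FALSE, n)
  if (k > 0) significant[ord[seq_len(k)]] <- TRUE     # everything up to that index
  significant
}

manual  <- bh_significant(p, q)
builtin <- p.adjust(p, method = "BH") <= q

manual
identical(manual, builtin)
```

The built-in route is the one used in the footnotes below; the hand-rolled version is only there to show where the cutoff comes from.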

8. (456.4783 - 407.5312) / sd(mydata$Total.Z, TRUE) = 0.3723188

9. One weakness is that I didn’t randomize selection of melatonin or no-melatonin, instead doing an alternating days design at my convenience. But if we were to wave that away, the power seems… OK. I have ~141 days of data, of which 90-100 are usable, giving me maybe <50 pairs? If I fire up R and load in the two means and the standard deviation (which we know from calculating the effect size), and then play with the numbers, then to get an 85% chance I could find an effect at p=0.01:

library(pwr)
pwr.t.test(d=(456.4783 - 407.5312) / 131.4656,power=0.85,sig.level=0.01,type="paired",alternative="greater")

Paired t test power calculation

n = 84.3067
d = 0.3723187
sig.level = 0.01
power = 0.85
alternative = greater

NOTE: n is number of *pairs*

84 pairs is more than 50 pairs. If I weaken the p=0.01 to 0.05, it looks like I should have had an 85% shot at detecting the effect:

pwr.t.test(d=(456.4783 - 407.5312) / 131.4656,power=0.85,sig.level=0.05,type="paired",alternative="greater")

Paired t test power calculation

n = 53.24355

So, it’s not great, but it’s at least not terribly underpowered.

10. If we correct for multiple comparisons (see previous footnote) at q-value=0.05, of our 9 p-values (0.01093, 0.08252, 0.08645, 0.1805, 0.2229, 0.2668, 0.4309, 0.5296, 0.8779), none of them survive:

R> p.adjust(c(0.01093, 0.08252, 0.08645, 0.1805, 0.2229, 0.2668, 0.4309, 0.5296, 0.8779), method="BH") < 0.05
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Oh well.

11. Blocking is a style of variation on a simple randomized design where instead of considering each day separate and randomizing a single day, we instead randomize pairs of days, or more; so instead of flipping our coin to decide whether this week is placebo, we flip our coin to decide whether this week will be placebo & next active or this week active & next placebo. This has 2 big advantages which justify the complexity:

    1. Often, I’m worried about simple randomization leading to an imbalance in control vs experimental; if I’m only getting 20 total datapoints on something, then randomization could easily lead to something like 14 control and 6 experimental datapoints - throwing out a lot of statistical power compared to 10 control and 10 experimental! Why am I losing power? Because data is subject to diminishing returns: each new point reduces the standard error of your estimates less than the previous one did (since the total error shrinks as, roughly, inverse of the square root of the total sample size; the difference between √1 and √2 is bigger and shrinks error more than √2 vs √3, etc). So the extra 4 control datapoints reduce the error less than the lost 4 experimental datapoints would have, and this leaves me with a final answer less precise than if it had been exactly 10:10. (If diminishing returns isn’t intuitive, imagine taking it to an extreme: is 10:10 just as good as 5:15? As good as 2:18? How about 0:20?) But if I pair days like this, then I know I will get exactly 10:10.
    2. Blocking is the natural way to handle multiple-day effects or trends: if I think lithium operates slowly, I will pair entire weeks or months, rather than days and hoping enough experimental and control days form runs which will reveal any trend rather than wash it out in averaging.

12. The net present value formula is the annual savings divided by the natural log of 1 plus the discount rate, out to eternity. Exponential discounting means that a bond that expires in 50 years is worth a surprisingly similar amount to one that continues paying out forever. For example, a 50 year bond paying $10 a year at a discount rate of 5% is worth

sum (map (\t -> 10 / (1 + 0.05)^t) [1..50]) ~> 182.5

but if that same bond never expires, it’s worth

10 / log 1.05 = 204.9

or just $22.4 more! My own expected longevity is ~50 more years, but I prefer to use the simple natural log formula rather than the more accurate summation. Either way is interesting; Vaniver:

…possibly a way to drive it home is to talk about dividing by log 1.05, which is essentially multiplying by 20.5. If you can make a one-time investment that pays off annually until you die, that’s worth 20.5 times the annual return, and multiplying the value of something by 20 can often move it from not worth thinking about to worth thinking about.

13. Vaniver notes that one reason I might be less confident than you would expect is that many substances or supplements lose effect over time as one’s body regains homeostasis and compensates for the substance, building tolerance. Which is quite true, and a major reason I tested melatonin - I was sure it worked for me in the past, but did it still work?

14. For simplicity, in all my VoI calculations I assume that I’ll stop buying the supplement (or doing the activity) if I hit a negative result. The proper way a real analyst would do this value of information question would be to say that the negative result gives us additional information which changes the expected-value of melatonin use. In my melatonin article, I calculated that since melatonin saved me close to an hour while each dose cost literally a penny or two, the value was astronomical: $2350.60 a year! By Bayes’ formula, if I started with 80% confidence and had a 95% accurate test, a negative result drops my 80% all the way down to 17%. We get this by using a derivation of Bayes’s theorem:

$P(a \mid b) = \frac{P(b \mid a) \times P(a)}{(P(b \mid a) \times P(a)) + (P(b \mid \neg a) \times P(\neg a))} = \frac{0.05 \times 0.8}{(0.05 \times 0.8) + (0.95 \times 0.2)} = 0.174$

But ironically if I now believed that melatonin only had a 17% chance of doing something helpful rather than nothing at all (as compared to my original 80% belief), well, 17% of $2350 (~$400) is still way more money than the melatonin cost ($10), so I’d use it anyway! Would it make sense to iterate again and test melatonin a second time? Well, what does the calculation say? We have a new prior of 17.4%; what happens if we get a negative result again? $\frac{0.05 \times 0.174}{(0.05 \times 0.174) + (0.95 \times 0.826)} = 0.011$ and then the expected value is $0.011 \times 2350 \approx 25.8$, which is not much more than the cost of $10, and given the difficult-to-quantify possibility of negative long-term health effects, is not enough of a profit to really entice me.
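
The repeated update is easy to script, which also avoids rounding the prior between steps. A minimal sketch, assuming (as the footnote does) a test that is 95% accurate in both directions; `update_on_negative` is a hypothetical helper name, and the value and cost figures are the ones used in the text.

```r
# Posterior probability that the supplement works after a *negative* result.
# prior:    P(works) before the experiment
# accuracy: assumed symmetric test accuracy,
#           P(positive | works) = P(negative | does not work)
update_on_negative <- function(prior, accuracy = 0.95) {
  false_negative <- (1 - accuracy) * prior    # it works, but the test says no
  true_negative  <- accuracy * (1 - prior)    # it does not work, and the test says no
  false_negative / (false_negative + true_negative)
}

annual_value <- 2350   # yearly value figure used in the text
annual_cost  <- 10     # yearly cost figure used in the text

after_one <- update_on_negative(0.80)        # one negative experiment
after_two <- update_on_negative(after_one)   # a second negative experiment

round(c(after_one, after_two), 3)
# Expected annual value of continuing after 1 and 2 negative results, vs the cost
round(c(after_one, after_two) * annual_value, 1)
after_two * annual_value > annual_cost
```

Changing `accuracy` or the starting prior shows how sensitive the "keep using it anyway" conclusion is to those two guesses.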

15. Technology Review editor Emily Singer noticed the same problem when using her Zeo.

16. The R interpreter session, loading a CSV as before:

t.test(Total.Z~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  Total.Z by SSCF.11
t = 0.8035, df = 48.455, p-value = 0.4256
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-17.09298  39.85727
sample estimates:
mean in group 0 mean in group 3
513.7750        502.3929

t.test(Time.in.REM~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  Time.in.REM by SSCF.11
t = 2.4322, df = 54.225, p-value = 0.01834
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
2.827521 29.343907
sample estimates:
mean in group 0 mean in group 3
168.8000        152.7143

t.test(ZQ~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  ZQ by SSCF.11
t = 1.3406, df = 56.26, p-value = 0.1854
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-1.754009  8.854009
sample estimates:
mean in group 0 mean in group 3
96.05           92.50

t.test(Awakenings~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  Awakenings by SSCF.11
t = 0.3177, df = 62.634, p-value = 0.7518
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.9259517  1.2759517
sample estimates:
mean in group 0 mean in group 3
4.175           4.000

t.test(Time.in.Wake~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  Time.in.Wake by SSCF.11
t = 0.2617, df = 65.521, p-value = 0.7944
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.033471  9.154900
sample estimates:
mean in group 0 mean in group 3
12.77500        11.71429

t.test(Morning.Feel~SSCF.11, data=mydata)

Welch Two Sample t-test

data:  Morning.Feel by SSCF.11
t = 1.0479, df = 64.65, p-value = 0.2986
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.1427046  0.4577229
sample estimates:
mean in group 0 mean in group 3
2.871795        2.714286
17. If we correct for multiple comparisons (see previous footnote on the Bonferroni correction) at q-value=0.05, of our 6 p-values (0.01834, 0.1854, 0.2986, 0.4256, 0.7518, 0.7944), none of them survive:

R> p.adjust(c(0.01834, 0.1854, 0.2986, 0.4256, 0.7518, 0.7944), method="BH") < 0.05
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Oh well! Statistics is a harsh mistress indeed.
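
For readers curious what `p.adjust(..., method="BH")` is doing, the Benjamini-Hochberg adjustment can be reproduced by hand. A minimal sketch, assuming only base R; the helper name `bh_adjust` is hypothetical:

```r
# Benjamini-Hochberg adjusted p-values, written out step by step.
bh_adjust <- function(p) {
    m <- length(p)
    o <- order(p, decreasing = TRUE)       # indices from largest to smallest p-value
    scaled <- p[o] * m / (m:1)             # p * m / rank, where rank 1 is the smallest p
    adjusted <- pmin(1, cummin(scaled))    # enforce monotonicity and cap at 1
    adjusted[order(o)]                     # return in the original order
}

p <- c(0.01834, 0.1854, 0.2986, 0.4256, 0.7518, 0.7944)
bh_adjust(p)
all.equal(bh_adjust(p), p.adjust(p, method = "BH"))
bh_adjust(p) < 0.05
```

The smallest p-value, 0.01834, is multiplied by 6 (the number of tests) to give 0.11, which is why it fails the 0.05 threshold.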

18. …The increased odds of high PSQI score for greater hemoglobin level and for high ESS score for use of vitamin D analogues were unexpected results for which we cannot speculate about the cause or association and that may simply be spurious findings arising from statistical analysis.

19. This study found a [statistically-]significant relationship between circadian phase of sleep and dietary Vitamin D intake. Later sleep acrophase, an indicator of sleep timing, was associated with more dietary Vitamin D. For most people, most Vitamin D is obtained through sunlight(44), though dietary Vitamin D is usually obtained through supplementation, usually in pills or in dairy products(44). It is currently unknown why those who consumed more Vitamin D would demonstrate a sleep phase delay, especially since in this same subject group, those exposed to more light had earlier circadian acrophases(45).

20. Late midpoint of sleep was [statistically-]significantly negatively associated with the percentage of energy from protein and carbohydrates, and the energy-adjusted intake of cholesterol, potassium, calcium, magnesium, iron, zinc, vitamin A, vitamin D, thiamin, riboflavin, vitamin B(6), folate, rice, vegetables, pulses, eggs, and milk and milk products.

21. The problem was the original vitamin D3 capsule: I couldn’t squeeze out all the oil, so I settled for squeezing out most, and then pushing the original capsule into the new capsule. So they contain everything they should, but they have a visible bubble inside them (the original capsule). Hence, the need for literal blinding. Otherwise, they’re pretty good: identical shape and weight.

22. See the general remarks in LiveStrong, Vitamin D warning: Too much can harm your heart, and the 2009 study Relation of serum 25-hydroxyvitamin D to heart rate and cardiac work (from the National Health and Nutrition Examination Surveys).

23. For Quality & ZQ: higher = better

24. Headband came loose at some point, data useless

25. Headband came loose at some point, data useless

26. The preponderance of True is because while recording the scores, I normalized them; in retrospect, I shouldn’t’ve bothered:

logBinaryScore = sum . map (\(result,p) -> if result then 1 + logBase 2 p else 1 + logBase 2 (1-p))
logBinaryScore [(True,0.50),(True,0.50),(True,0.50),(True,0.50),(True,0.50),(True,0.50),(True,0.50),
(True,0.50),(True,0.50),(True,0.50),(True,0.50),(True,0.55),(True,0.55),(True,0.55),
(True,0.60),(True,0.60),(True,0.60),(True,0.60),(True,0.60),(True,0.60),(True,0.60),
(True,0.60),(True,0.65),(True,0.65),(True,0.65),(True,0.65),(True,0.65),(True,0.65),
(True,0.65),(True,0.65),(True,0.70),(True,0.70),(True,0.70),(True,0.70),(True,0.75),
(True,0.75),(False,0.55),(False,0.6),(False,0.6),(False,0.7),(False,0.7),(False,0.75)]
5.4
27. The usual session:

mydata <- read.csv("http://www.gwern.net/docs/2012-zeo-vitamind.csv")
t.test(ZQ~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  ZQ by SSCF.12
t = 1.9284, df = 36.417, p-value = 0.06163
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.2107102  8.4258132
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087

t.test(Total.Z~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Total.Z by SSCF.12
t = 1.8776, df = 33.985, p-value = 0.06904
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-1.738464 43.953567
sample estimates:
mean in group 0 mean in group 1
533.3684        512.2609

t.test(Time.in.REM~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Time.in.REM by SSCF.12
t = 2.4099, df = 31.155, p-value = 0.02205
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
2.278176 27.332808
sample estimates:
mean in group 0 mean in group 1
175.6316        160.8261

t.test(Time.in.Deep~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Time.in.Deep by SSCF.12
t = -0.6214, df = 38.235, p-value = 0.538
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.404045  3.925784
sample estimates:
mean in group 0 mean in group 1
55.00000        56.73913

t.test(Time.in.Wake~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Time.in.Wake by SSCF.12
t = -0.3332, df = 32.679, p-value = 0.7411
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-12.588395   9.046061
sample estimates:
mean in group 0 mean in group 1
26.31579        28.08696

t.test(Awakenings~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Awakenings by SSCF.12
t = -0.8485, df = 37.998, p-value = 0.4015
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-2.3089027  0.9450583
sample estimates:
mean in group 0 mean in group 1
7.578947        8.260870

t.test(Morning.Feel~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Morning.Feel by SSCF.12
t = 2.99, df = 32.596, p-value = 0.005276
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
0.1672538 0.8805931
sample estimates:
mean in group 0 mean in group 1
2.842105        2.318182

t.test(Time.to.Z~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  Time.to.Z by SSCF.12
t = -0.6687, df = 32.607, p-value = 0.5084
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-12.779474   6.459107
sample estimates:
mean in group 0 mean in group 1
17.57895        20.73913
28. Note that our previous options were a little pessimistic:

1. the R t-test function does not assume the variances are equal, even though I believe they would be
2. nor does it assume that the effect of Vitamin D on ZQ is bad (that it decreases ZQ, as I expected).

If we were willing to make those assumptions, then the tests have fewer possibilities to consider, and can make stronger claims. Let’s take ZQ as an example. The straight Welch Two Sample t-test gives us p=0.06, as we already saw. But if we were willing to assume equal variance, that buys us a small reduction, from p=0.0616 to p=0.0581; a one-sided test (null or negative effect only) gives us p=0.0308; and finally, assuming equal variance and testing for null/negative (rather than positive/null/negative) gives us p=0.029 - substantially better than the original p=0.06:

t.test(ZQ~SSCF.12, data=mydata)

Welch Two Sample t-test

data:  ZQ by SSCF.12
t = 1.9284, df = 36.417, p-value = 0.06163
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.2107102  8.4258132
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087

t.test(ZQ~SSCF.12, data=mydata, alternative="great")

Welch Two Sample t-test

data:  ZQ by SSCF.12
t = 1.9284, df = 36.417, p-value = 0.03082
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
0.5124493       Inf
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087

t.test(ZQ~SSCF.12, data=mydata, var.equal=TRUE)

Two Sample t-test

data:  ZQ by SSCF.12
t = 1.951, df = 40, p-value = 0.05809
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.1476258  8.3627288
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087

t.test(ZQ~SSCF.12, data=mydata, alternative="great", var.equal=TRUE)

Two Sample t-test

data:  ZQ by SSCF.12
t = 1.951, df = 40, p-value = 0.02905
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
0.5623672       Inf
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087
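
The two equal-variance p-values can be recomputed from the reported t statistic and degrees of freedom alone. A sketch using base R's `pt()`, taking t = 1.951 and df = 40 from the equal-variance output above (the variable names are illustrative):

```r
t_stat <- 1.951   # t statistic reported by the equal-variance test
df     <- 40      # degrees of freedom reported by the equal-variance test

# two-sided: probability of a t at least this extreme in either direction
p_two_sided <- 2 * pt(-abs(t_stat), df)
# one-sided ("greater"): probability of a t at least this large
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

round(p_two_sided, 4)
round(p_one_sided, 4)
```

When the observed difference lies in the predicted direction, the one-sided p-value is exactly half the two-sided one, which accounts for most of the drop from 0.06 to 0.03.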

We’ll re-run the analysis with appropriate options:

t.test(ZQ~SSCF.12, data=mydata, alternative="great", var.equal=TRUE);
t.test(Total.Z~SSCF.12, data=mydata, alternative="great", var.equal=TRUE)

Two Sample t-test

data:  ZQ by SSCF.12
t = 1.951, df = 40, p-value = 0.02905
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
0.5623672       Inf
sample estimates:
mean in group 0 mean in group 1
93.36842        89.26087

Two Sample t-test

data:  Total.Z by SSCF.12
t = 1.9204, df = 40, p-value = 0.03098
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
2.599576      Inf
sample estimates:
mean in group 0 mean in group 1
533.3684        512.2609

t.test(Time.in.REM~SSCF.12, data=mydata, alternative="great", var.equal=TRUE);
t.test(Time.in.Deep~SSCF.12, data=mydata, var.equal=TRUE)

Two Sample t-test

data:  Time.in.REM by SSCF.12
t = 2.4936, df = 40, p-value = 0.00844
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
4.807836      Inf
sample estimates:
mean in group 0 mean in group 1
175.6316        160.8261

Two Sample t-test

data:  Time.in.Deep by SSCF.12
t = -0.6225, df = 40, p-value = 0.5371
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.385555  3.907294
sample estimates:
mean in group 0 mean in group 1
55.00000        56.73913

t.test(Time.in.Wake~SSCF.12, data=mydata, alternative="less", var.equal=TRUE);
t.test(Awakenings~SSCF.12, data=mydata, alternative="great", var.equal=TRUE)

Two Sample t-test

data:  Time.in.Wake by SSCF.12
t = -0.3427, df = 40, p-value = 0.3668
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 6.931845
sample estimates:
mean in group 0 mean in group 1
26.31579        28.08696

Two Sample t-test

data:  Awakenings by SSCF.12
t = -0.8513, df = 40, p-value = 0.8002
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
-2.030783       Inf
sample estimates:
mean in group 0 mean in group 1
7.578947        8.260870

t.test(Morning.Feel~SSCF.12, data=mydata, alternative="great", var.equal=TRUE);
t.test(Time.to.Z~SSCF.12, data=mydata, alternative="less", var.equal=TRUE)

Two Sample t-test

data:  Morning.Feel by SSCF.12
t = 2.8647, df = 39, p-value = 0.003344
alternative hypothesis: true difference in means is greater than 0
95% confidence interval:
0.2157825       Inf
sample estimates:
mean in group 0 mean in group 1
2.842105        2.318182

Two Sample t-test

data:  Time.to.Z by SSCF.12
t = -0.6878, df = 40, p-value = 0.2478
alternative hypothesis: true difference in means is less than 0
95% confidence interval:
-Inf 4.57606
sample estimates:
mean in group 0 mean in group 1
17.57895        20.73913
29. This one really surprised me. I was convinced it was taking longer to get to sleep (which it was), and that the effect was consistent enough that it’d reach at least p=0.1 or so.

30. Correcting for multiple comparisons at q-value=0.05, of our 8 pessimistic p-values (0.005276, 0.02205, 0.06163, 0.06904, 0.4015, 0.5084, 0.538, 0.7411), 1 survives:

R> p.adjust(c(0.005276, 0.02205, 0.06163, 0.06904,0.4015, 0.5084, 0.538, 0.7411), method="BH") < 0.05
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Remarkable - the first time a p-value survived. (That was the Morning Feel one.)

31. A quick and dirty way is to look at the difference in the 2 means and compare it to the normal range (intuitively, something like a standard deviation or two). In this case, my ZQ normally ranges from the mid-80s to the very low 100s (a range of ~20 points), and the difference in ZQ means is 4 points. So that’s 1/5 the normal variability - that’s nothing to sneeze at! (By a strange coincidence, these are almost the same sort of numbers as for grades in class, so we could analogize: not taking vitamin D at night would improve my sleep grades by one letter. When I was in school, I would not have turned my nose up at such an intervention.)

For comparison, my previous (unblinded non-randomized) experiment with melatonin over half a year or so found a shift in mean ZQs of ~8 points. I was really impressed with melatonin when I first started (as many people have been), so to find vitamin D is half as hurtful as melatonin was helpful… I didn’t have any plans to take vitamin D in the evening rather than morning, but I sure as heck don’t now!

32. Cohen’s d is the one I see most often; that’s $\frac{\text{bigger mean}-\text{smaller mean}}{\text{standard deviation}}$; placebo had a bigger ZQ mean of 93.3684 and Vitamin D a smaller ZQ mean of 89.2609, and an online tool told me the standard-deviation of all the ZQs was 7.0198, so: $\frac{93.3684-89.2609}{7.0198}=0.585$
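
The same calculation is easy to do in R rather than with an online tool. A minimal sketch, plugging in the means and standard deviation quoted in this footnote; the helper `cohens_d` is a hypothetical name, and it divides by the standard deviation of all the ZQs (as this footnote does) rather than by the pooled within-group standard deviation that many textbooks use:

```r
# Cohen's d as defined in this footnote: difference in means over a standard deviation.
cohens_d <- function(bigger_mean, smaller_mean, std_dev) {
    (bigger_mean - smaller_mean) / std_dev
}

d <- cohens_d(93.3684, 89.2609, 7.0198)
round(d, 3)

# With the raw data loaded as `mydata`, the inputs could be computed directly, e.g.:
#   tapply(mydata$ZQ, mydata$SSCF.12, mean, na.rm = TRUE)
#   sd(mydata$ZQ, na.rm = TRUE)
```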

33. Wikipedia: For Cohen’s d an effect size of 0.2 to 0.3 might be a small effect, around 0.5 a medium effect and 0.8 to infinity, a large effect.

34. I originally input the data as Other Disruptions 4 through the Zeo web interface, since I assumed that if Other Disruptions 3 was SSCF.12, that would put the data into SSCF.13 - but it turns out that does not get exported in the CSV! Apparently the CSV is limited to 1-3. So I edited the exported CSV and just reused SSCF.1. Hopefully Zeo Inc. will fix the export functionality, since it’s very frustrating to be able to see the data used in the Cause & Effect tool, for example, but not export it.

35. The interpreter session (pessimistic assumptions; optimistic left as exercise for the reader):

mydata <- read.csv("http://www.gwern.net/docs/2012-zeo-vitamind-morning.csv")
t.test(ZQ~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  ZQ by SSCF.1
t = -0.9582, df = 49.202, p-value = 0.3427
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-11.394801   4.036468
sample estimates:
mean in group 0 mean in group 1
90.63333        94.31250

t.test(Total.Z~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Total.Z by SSCF.1
t = -0.7836, df = 44.939, p-value = 0.4374
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-55.87065  24.57482
sample estimates:
mean in group 0 mean in group 1
510.6333        526.2812

t.test(Time.in.REM~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Time.in.REM by SSCF.1
t = -0.7204, df = 42.881, p-value = 0.4752
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-22.26792  10.54709
sample estimates:
mean in group 0 mean in group 1
157.2333        163.0938

t.test(Time.in.Deep~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Time.in.Deep by SSCF.1
t = -0.7593, df = 59.53, p-value = 0.4507
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-10.10198   4.54365
sample estimates:
mean in group 0 mean in group 1
64.03333        66.81250

t.test(Time.in.Wake~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Time.in.Wake by SSCF.1
t = 0.9282, df = 56.632, p-value = 0.3572
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-4.627848 12.623682
sample estimates:
mean in group 0 mean in group 1
26.96667        22.96875

t.test(Awakenings~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Awakenings by SSCF.1
t = 0.2798, df = 54.725, p-value = 0.7807
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-1.451225  1.922059
sample estimates:
mean in group 0 mean in group 1
7.766667        7.531250

t.test(Morning.Feel~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Morning.Feel by SSCF.1
t = -2.8975, df = 58.965, p-value = 0.005272
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.9054147 -0.1657060
sample estimates:
mean in group 0 mean in group 1
2.62069         3.15625

t.test(Time.to.Z~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Time.to.Z by SSCF.1
t = 0.0055, df = 59.368, p-value = 0.9957
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-7.617116  7.658783
sample estimates:
mean in group 0 mean in group 1
25.33333        25.31250

t.test(Mood~SSCF.1, data=mydata)

Welch Two Sample t-test

data:  Mood by SSCF.1
t = 0.4591, df = 48.049, p-value = 0.6482
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.2407499  0.3832499
sample estimates:
mean in group 0 mean in group 1
3.09000         3.01875
36. Gustavo Lacerda wondered if the two-sample t-test was really a justifiable test to use - could days be correlated, in which case the p-values would be overstated and my results actually much weaker than they look? He suggested testing my full Zeo dataset to see whether Morning Feel can be predicted from day to day by a (relatively) simple linear autocorrelation regression looking at all previous recorded days:

data <- read.csv("http://www.gwern.net/docs/gwern-zeodata.csv")
# Master Zeo export file is periodically updated; your results may not be identical
n <- length(data$Morning.Feel); n
[1] 686
reg <- lm(Morning.Feel[2:n] ~ Morning.Feel[1:(n-1)], data=data)
summary(reg)

Call:
lm(formula = Morning.Feel[2:n] ~ Morning.Feel[1:(n - 1)], data = data)

Residuals:
Min     1Q Median     3Q    Max
-1.863 -0.742  0.197  0.258  1.318

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               2.6217     0.1219   21.51   <2e-16
Morning.Feel[1:(n - 1)]   0.0603     0.0424    1.42     0.16

Residual standard error: 0.723 on 554 degrees of freedom
(129 observations deleted due to missingness)
Multiple R-squared: 0.00364,    Adjusted R-squared: 0.00185
F-statistic: 2.03 on 1 and 554 DF,  p-value: 0.155

# Given that pretty much all the ratings are 2, 3, or 4, and the r^2 is <0.1
# with a residual error of 0.7, that doesn't seem very correlated?
# although the _p_ does indicate there's a real (but very small) correlation from
# day to day, so I guess the p-values may be a *little* overstated
cor(data$Morning.Feel[2:n], data$Morning.Feel[1:(n-1)], use = "complete.obs")
[1] 0.06037

# 129 observations missing? What's going on? Well, what does the data look like?
data$Morning.Feel
[1] NA  2  3  3  4  3  3  2 NA NA  4  4 NA  3 NA  2  4  4 NA  4  3  3  3  4  2  3  2  3 NA  3 NA
[32] NA  4 NA  4 NA NA NA NA NA NA NA NA NA NA NA NA NA  3  4 NA NA  4  4  3  4 NA NA NA NA NA NA
[63] NA  4 NA  2  3  3 NA NA  3 NA  3  3 NA  2 NA NA NA NA  3 NA NA NA NA NA NA NA  3  4 NA  4  3
[94]  3  3  4  4  3  3  3  2  3  3  2  3  3  3  2 NA  3  3  4  3 NA  3 NA  3 NA  3  3  3 NA  3  3
[125] NA NA NA NA NA  2 NA NA  3  2  3 NA NA NA NA NA NA  3  2  3  2  2  2  2  2  3  3  3  3 NA  3
[156]  3  2  2  3  3  2  3  2  3 NA  2 NA NA  4  3  3  3  2  3 NA  4  3  2  3  3  3  3  3  3  4  3
[187]  4  3  3  3  3  3  2  3  2  3  3  3 NA  3  1  4 NA  3  2  4  4  2  2  3  3  3  3  3  3  3  3
[218]  3  3  4  3  3  2  2  3  3  2  3  3  3  2  2  3  3  3  3  3  4  3  3  2  2  2  1  2  3  3 NA
[249]  3  3  3  3  3  3  3  3  2  3  2  3  2  3  3  3  2  3  3  2  3  3  3  3  4  3  3  4  3  4  2
[280]  3 NA  3  3  2  2  2  3  3  3  3  2  3  3  2  2  2  3  3  2  2  3  2  3  3  3  3  3  3  2  3
[311]  3  2  1  3  4  3  2  3  3  2  2  3  3  3  1  2 NA  2  3  2  2  3  3  2  3  3 NA  3 NA  3  3
[342]  2  3  2  2  3  3  3  3  1  3  3  3  2  1  3 NA  2  3  3  3  3  2  1  2  2  3  2  2  3  3  3
[373]  3  3  4  3  2  3  3  3  2  2  3 NA  3  2  3  4  4  3  3  2  4  3  2  3  3  4  3  4  3  3 NA
[404]  2  2  3  3  3  4  4  3  1  3  3  2  4  3  3  3  2  3  2  4  2  4  3  3  3  4 NA  2  3  3  3
[435]  3  2  1  2  2  3  2  3  1  4  3  3  4  3  3  2  2  2  2  3  1  3  3  3  4  3  3  2  3  3  4
[466]  4  2  2  3  3  2  2  4  3  3  3  2  3  2  2  3  2  3  2  3  2  3  2  3  2  3  3  3  2  3  3
[497]  2  3  1  2  3  3  3  3  2  2  3  3  1  3  2  3  3  4  1  3  4  1  4  3  4  3  3  2  3  2 NA
[528]  3  4  2  4  3  3  3  4  4  1  3  2  3  3  3  2  3  4  3  3  2  3  3  3  4  2  2  2  3  3  3
[559]  4  4  1  3  3  3  4  3  4  3  3  1  1  2  3  2  3  3  4  3  3  3  2  2  3  4  4  1  4  4  3
[590]  4  3  3  3  3  3  2  3  3  2  3  3  2  3  4  2  2  3  1  3  3  2  3  3  2  2  3  4  3  2  1
[621]  3  3  3  3  2  4  2  3  3  3  3  4  3  3  3 NA  3 NA  4  3  2  2  2  2  3  3  3  4  3  2  3
[652]  2  3  3  1  3  4  3  3  4  4  4  2  3  2  1  4  2  4  3  2  3  3  3  3  2  3  4  2  2  2  2
[683]  3  4  3  4

# ah, I just wasn't good about recording "Morning Feel" early on, and since then
# there have been occasional slips (literally, with the headband)
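
To see where dropped observations like these come from: the lag-1 regression needs both yesterday's and today's rating, so a single `NA` can knock out up to two day-pairs. A toy sketch (the vector `feel` is invented for illustration, not taken from the Zeo export):

```r
# Invented ratings with gaps, standing in for data$Morning.Feel
feel <- c(NA, 2, 3, 3, 4, NA, 3, 2)
n <- length(feel)

previous <- feel[1:(n - 1)]   # rating on day t-1
current  <- feel[2:n]         # rating on day t

# a day-pair is usable only if both ratings were recorded
usable <- complete.cases(previous, current)
sum(usable)     # number of pairs that survive
sum(!usable)    # number of pairs dropped for missingness

# lag-1 correlation over the usable pairs only
lag1 <- cor(current, previous, use = "complete.obs")
```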

And by the way, instead of regressing Morning.Feel[n] on Drug[n] (a discrete variable taking values in {0,1}), it would make more sense to regress on an Exponentially-Weighted Moving Average of Drug, such as $Drug\left[n-1\right]+\left(\frac{1}{2}×Drug\left[n-2\right]\right)+\left(\frac{1}{4}×Drug\left[n-3\right]\right)+...$, which models how much drug is present in the body. In the above example, I’m assuming a half-life of 1 day, so lambda=$\frac{1}{2}$. You could arguably select the lambda that gives you the best fit; just be wary of multiple testing.
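
A sketch of how such a regressor could be built in R. The function name `drug_ewma` and the toy dosing vector are assumptions for illustration; `lambda` is the per-day decay factor from the formula, and days before the start of the record are treated as drug-free:

```r
# Exponentially-weighted moving average of past doses:
#   load[n] = drug[n-1] + lambda*drug[n-2] + lambda^2*drug[n-3] + ...
drug_ewma <- function(drug, lambda = 0.5) {
    load <- numeric(length(drug))   # load[1] stays 0: nothing taken before day 1
    for (n in seq_along(drug)[-1]) {
        load[n] <- drug[n - 1] + lambda * load[n - 1]
    }
    load
}

drug <- c(1, 0, 0, 1)   # toy record: dose on days 1 and 4
drug_ewma(drug)
# [1] 0.00 1.00 0.50 0.25

# The result could then replace the 0/1 indicator in a regression, e.g.
#   lm(Morning.Feel ~ drug_ewma(Drug), data = mydata)
# where Morning.Feel and Drug are columns of a hypothetical data frame.
```

The recursive form avoids summing over the whole history for every day, and trying several values of `lambda` is exactly where the multiple-testing caveat applies.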

37. See previous footnotes about optimistic versus pessimistic assumptions about equal variance & correctly-predicted direction of effect for why these two p-values differ.

38. Correcting for multiple comparisons at q-value=0.05, of our 9 pessimistic p-values, 1 survives:

R> p.adjust(c(0.34, 0.44, 0.48, 0.45, 0.36, 0.78, 0.005, 0.99, 0.65), method="BH") < 0.05
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
39. We extract the standard deviation for Morning Feel with sd(mydata$Morning.Feel, TRUE) = 0.7682213; so Cohen’s d = $\frac{3.15625-2.62069}{0.7682213}=0.697≈0.7$

40. Code written by Ben Wieland:

library("ggplot2")
sleep <- read.csv("http://www.gwern.net/docs/2012-zeo-vitamind-morning.csv")
qplot(as.Date(Sleep.Date,format="%m/%d/%Y"), weight=Mood, data=sleep,
geom="bar", binwidth=1, fill=factor(sleep$SSCF.1, labels=c("placebo","active")),
ylab="Mood", xlab="Date")+scale_fill_discrete(name="treatment")

# to save:
ggsave("mood.png")

qplot(as.Date(Sleep.Date,format="%m/%d/%Y"), weight=Morning.Feel, data=sleep,
geom="bar", binwidth=1, fill=factor(sleep$SSCF.1, labels=c("placebo","active")),
ylab="Morning Feel", xlab="Date")+scale_fill_discrete(name="treatment")
ggsave("morning.feel.png")

41. The BEST analysis is powerful and provides much more information than a simple t-test would, but the various parameters in the table or the image are not self-explanatory; the curious should read Bayesian estimation supersedes the t test (Kruschke 2012). In the CSV, an SSCF.1 of 0 indicates membership in the original experiment, 1 indicates the dry period July-September, 2 indicates the vitamin D resumption post-original-experiment, and 3 indicates the vitamin D resumption post-September. So:

# set up data
mydata <- read.csv("http://www.gwern.net/docs/2012-zeo-vitamind-morning-control.csv")
originalcontrol <- subset(mydata, SSCF.1==0)
newcontrol <- subset(mydata, SSCF.1==1)
# clean missing data
originalcontrol <- originalcontrol$Morning.Feel[!is.na(originalcontrol$Morning.Feel)]
newcontrol <- newcontrol$Morning.Feel[!is.na(newcontrol$Morning.Feel)]
# run BEST MCMC group estimations
source("BEST.R")
mcmc = BESTmcmc(originalcontrol, newcontrol)
BESTplot(originalcontrol, newcontrol, mcmc, TRUE, ROPEeff=c(-0.1,0.1))

SUMMARY.INFO
PARAMETER        mean      median        mode     HDIlow     HDIhigh pcgtZero
mu1        2.82199912  2.82184675  2.82109419  2.5425634   3.1008251       NA
mu2        2.84712376  2.84744246  2.84233569  2.6205415   3.0777439       NA
muDiff    -0.02512464 -0.02542602 -0.03361140 -0.3874754   0.3339228 44.43593
sigma1     0.72900731  0.71760315  0.69447083  0.5330477   0.9474278       NA
sigma2     0.88825472  0.88350888  0.87346099  0.7192899   1.0690516       NA
sigmaDiff -0.15924742 -0.16410108 -0.17383105 -0.4269052   0.1171290 12.08159
nu        41.98417254 33.62743916 17.74077514  3.2649758 104.0648983       NA
nuLog10    1.51048794  1.52669380  1.57284008  0.8699835   2.1138309       NA
effSz     -0.03198943 -0.03143175 -0.04438195 -0.4678744   0.4142259 44.43593

42. As usual:

mydata <- read.csv("http://www.gwern.net/docs/2012-zeo-vitamind-morning-control.csv")
originalcontrol <- subset(mydata, SSCF.1==0)
newcontrol <- subset(mydata, SSCF.1==1)
t.test(originalcontrol$Morning.Feel, newcontrol$Morning.Feel)

Welch Two Sample t-test

data:  originalcontrol$Morning.Feel and newcontrol$Morning.Feel
t = -0.0031, df = 67.67, p-value = 0.9975
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.3466966  0.3456190
sample estimates:
mean of x mean of y
2.827586  2.828125

43. The generating R code (see later analysis footnote for definitions of data variables like offtimeawake etc):

plot(c(1:32), offtimeawake, col="blue", xlab="nth", ylab="latency/awakenings/awake (raw)")
points(c(1:32), offlatency, col="blue")
points(c(1:32), offawakenings, col="blue")
points(c(1:30), ontimeawake, col="red")
points(c(1:30), onlatency, col="red")
points(c(1:30), onawakenings, col="red")

44. After running zscore on each data variable, we repeat the previous code but with ylab="latency/awakenings/awake (standardized)" in the call to plot.

45. Assuming the zscore conversion has been done:

plot(c(1:32), offtimeawake+offlatency+offawakenings, col="blue", xlab="nth", ylab="standardized sleep disturbance score")
points(c(1:30), ontimeawake+onlatency+onawakenings, col="red")

46. The previously described composite measure and BEST test:

# all the non-potassium days
offlatency <- c(11,15,16,16,17,18,20,21,21,24,24,26,29,33,36,42,40,19,32,28,37,36,19,25,
30,22,11,20,33,33,42,31)
offawakenings <- c(8,6,2,7,6,8,7,4,8,3,8,4,7,7,9,12,11,14,8,10,8,6,9,8,13,9,5,5,13,12,9,9)
offtimeawake <- c(21,14,6,15,7,22,12,17,29,5,14,10,16,16,24,13,42,50,39,15,20,18,33,27,45,
23,21,6,25,28,31,61)
# all the potassium days
onlatency <- c(12,15,16,17,18,19,21,21,23,25,25,26,26,26,27,29,30,30,32,33,33,34,34,
54,30,31,30,22,26,23)
onawakenings <- c(8,3,4,10,8,9,4,5,4,10,7,4,7,8,7,8,12,8,7,3,6,2,8,7,10,9,4,9,11,8)
ontimeawake <- c(22,08,11,17,10,24,19,8,8,35,9,39,10,29,15,20,90,16,13,6,15,1,20,24,
17,60,10,50,22,18)
# normalize
zscore <- function(x,y) mapply(function(a) (a - mean(y))/sd(y), x)
offlatency <- zscore(offlatency, c(offlatency, onlatency))
onlatency <- zscore(onlatency, c(offlatency, onlatency))
offawakenings <- zscore(offawakenings, c(offawakenings, onawakenings))
onawakenings <- zscore(onawakenings, c(offawakenings, onawakenings))
offtimeawake <- zscore(offtimeawake, c(offtimeawake, ontimeawake))
ontimeawake <- zscore(ontimeawake, c(offtimeawake, ontimeawake))
# zip together with sum to get a single measure of how deviate a night was
off <- offlatency + offawakenings + offtimeawake
on <- onlatency + onawakenings + ontimeawake
# usual Bayesian two-group test
source("BEST.R")
mcmcChain = BESTmcmc(off, on)
postInfo = BESTplot(off, on, mcmcChain) # graph
postInfo

SUMMARY.INFO
PARAMETER    mean  median    mode   HDIlow HDIhigh pcgtZero
mu1        0.1664  0.1655  0.1421 -0.71894  1.0555       NA
mu2        2.4256  2.4210  2.4035  1.81175  3.0478       NA
muDiff    -2.2592 -2.2592 -2.2318 -3.34666 -1.1853    0.006
sigma1     2.3939  2.3607  2.2695  1.78291  3.0915       NA
sigma2     1.6189  1.5988  1.5786  1.11009  2.1614       NA
sigmaDiff  0.7750  0.7606  0.7341 -0.03236  1.6317   97.205
nu        32.0045 23.2730  9.6599  2.33645 88.0997       NA
nuLog10    1.3607  1.3669  1.4214  0.67234  2.0337       NA
effSz     -1.1141 -1.1107 -1.0959 -1.69481 -0.5433    0.006

47. Reusing the standardized data from before:

t.test(off, on)

Welch Two Sample t-test

data:  off and on
t = -4.479, df = 56.49, p-value = 3.702e-05
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-3.368 -1.287
sample estimates:
mean of x mean of y
0.1785    2.5059

48. As before, we use BEST (the self-rating is mostly normal):

Potassium <- c(1,1,0,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,1,0,1,1,0,1,0,0,0,0,1,0,0,0,1,0,1,1,
0,1,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,1)
MP <- c(4,4,3,4,4,3,3,2,3,3,3,3,4,4,3,4,2,2,2,3,4,3,4,3,4,3,4,4,3,3,2,3,2,4,4,3,4,2,3,4,2,
3,3,2,2,2,3,2,3,3,4,2,3,4,3,4,3,3,2,2,3,4,4,3,4,2,2,3,2)
pot <- data.frame(Potassium, MP)
# first graph:
library(ggplot2)
qplot(data=pot, y=MP, color=Potassium)
# analysis:
source("BEST.R")
off <- pot$MP[pot$Potassium == 0]
on <- pot$MP[pot$Potassium == 1]
mcmcChain = BESTmcmc(off, on)
postInfo = BESTplot(off, on, mcmcChain) # graph
postInfo

SUMMARY.INFO
PARAMETER      mean   median     mode  HDIlow  HDIhigh pcgtZero
mu1         3.02651  3.02686  3.03576  2.7780   3.2677       NA
mu2         3.10432  3.10390  3.07921  2.7939   3.4127       NA
muDiff     -0.07782 -0.07736 -0.07786 -0.4728   0.3119    34.96
sigma1      0.75685  0.74855  0.73261  0.5834   0.9427       NA
sigma2      0.83168  0.81845  0.79169  0.6133   1.0677       NA
sigmaDiff  -0.07483 -0.07033 -0.05617 -0.3755   0.2195    31.15
nu         47.52944 39.43237 23.78338  4.6350 111.4156       NA
nuLog10     1.58217  1.59585  1.63348  0.9931   2.1316       NA
effSz      -0.09844 -0.09761 -0.10476 -0.5879   0.3897    34.96

t.test(off, on)

Welch Two Sample t-test

data:  off and on
t = -0.3938, df = 59.93, p-value = 0.6951
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.4520  0.3033
sample estimates:
mean of x mean of y
3.026     3.100

49. See previously for explanation:

pot <- read.csv("http://www.gwern.net/docs/2013-gwern-potassium-morning.csv")
# standardize & combine into a single equally-weighted synthetic index z-score
pot$Disturbance <- scale(pot$Time.to.Z) + scale(pot$Awakenings) + scale(pot$Time.in.Wake)
on <- pot[pot$Potassium==1,]$Disturbance
off <- pot[pot$Potassium==0,]$Disturbance

source("BEST.R")
mcmcChain = BESTmcmc(off, on)
postInfo = BESTplot(off, on, mcmcChain) # graph
postInfo

SUMMARY.INFO
PARAMETER    mean   median     mode  HDIlow HDIhigh pcgtZero
mu1        0.1329  0.13224  0.11468 -0.6505  0.9203       NA
mu2       -0.2626 -0.26479 -0.22430 -1.1154  0.5966       NA
muDiff     0.3956  0.39838  0.37996 -0.7724  1.5327    75.39
sigma1     1.9961  1.96663  1.89699  1.3978  2.6302       NA
sigma2     1.9403  1.90682  1.86314  1.2797  2.6697       NA
sigmaDiff  0.0558  0.06166  0.04212 -0.8615  0.9499    55.85
nu        33.0593 24.28680  9.49415  1.7036 90.8230       NA
nuLog10    1.3674  1.38537  1.47058  0.6392  2.0655       NA
effSz      0.2054  0.20334  0.18368 -0.3619  0.8119    75.39

50. on/off defined and BEST loaded in previous analysis:

mcmcChain = BESTmcmc(off$MP, on$MP)
postInfo = BESTplot(off$MP, on$MP, mcmcChain) # graph
postInfo

SUMMARY.INFO
PARAMETER       mean   median     mode  HDIlow  HDIhigh pcgtZero
mu1         2.999866  2.99993  2.99749  2.7134   3.2884       NA
mu2         2.955535  2.95571  2.95990  2.6391   3.2689       NA
muDiff      0.044331  0.04465  0.05384 -0.3831   0.4669    58.29
sigma1      0.739736  0.72787  0.71017  0.5371   0.9685       NA
sigma2      0.731523  0.71670  0.68979  0.5081   0.9827       NA
sigmaDiff   0.008212  0.01087  0.01340 -0.3210   0.3419    52.76
nu         41.545632 33.20153 18.29201  2.5717 103.6089       NA
nuLog10     1.502165  1.52116  1.55933  0.8486   2.1209       NA
effSz       0.060755  0.06100  0.07764 -0.5064   0.6339    58.29

51. The geeky details: I found an error line in the X logs which appeared only when I invoked Redshift; the driver was fbdev and not the correct radeon, which mystified me further, until I read various bug reports and forum problems and wondered why radeon was not loading but the only non-fbdev error message indicated that some driver called ati was failing to load instead. Then I read that ati was the default wrapper over radeon, but then I saw that the package was not installed, installed it, noticed it was pulling in as a dependency useless Mach64 drivers, and had a flash: perhaps I had uninstalled the useless Mach64 drivers, forcing the package providing ati to be uninstalled too, permitted its uninstallation because I knew it was not the package providing radeon, which then caused the ati load to fail and to not then load radeon but X succeeding in loading fbdev which does not support Redshift, leading to a permanent failure of all uses of Redshift. Phew! I was right.

52. I don’t use a timer, but instead count 400 full breaths. Depending on how fast and shallowly I breathe, this runs from 20-35 minutes (eg. 16 May 2012’s meditation ran 33 minutes long). To be conservative, I will assume the meditation is only 20 minutes. In mid-October, I bought and began using instead a timer which could be set to 15 minutes.

53. Fortunately, I had used Amphetype for typing practice for 3 years prior to finding the treadmill, so I could compare my daily treadmill typing sessions to a very long dataseries. The graph looks like WPM (but not Accuracy) may have been damaged, but it’s not clear at all: we should do statistics.
Amphetype stores the graphed data in a SQLite database, which after a little tinkering I figured out how to extract the WPM & Accuracy scores: sqlite3 -batch gwern.db 'SELECT w real, wpm real, accuracy real FROM result;' > ~/stats.txt Which gives a file like 1233502576.01172|70.2471151325281|0.981412639405205 1233502634.48339|80.9762013034008|0.989159891598916 1233502677.26434|74.0623733171948|0.988326848249027 ... The pipes are delimiters, which I replaced with commas. The first field is a date-stamp as expressed in seconds since the Unix epoch; they can be converted to more readable dates like so: $ date --date '@1308320681.44771'
Fri Jun 17 10:24:41 EDT 2011
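
The same clean-up can also be done inside R instead of by hand. This is only a minimal sketch, assuming the export has the three pipe-delimited fields shown above; the column names `Timestamp`, `WPM`, and `Accuracy` are my own labels, and the three rows are just the sample lines from the export:

```r
# The three sample rows from the export, inlined so the sketch is self-contained;
# with the real file one would pass the path to stats.txt instead of `text =`.
raw <- "1233502576.01172|70.2471151325281|0.981412639405205
1233502634.48339|80.9762013034008|0.989159891598916
1233502677.26434|74.0623733171948|0.988326848249027"

# column names are illustrative labels, not anything Amphetype defines
typing <- read.table(text = raw, sep = "|",
                     col.names = c("Timestamp", "WPM", "Accuracy"))

# seconds since the Unix epoch -> calendar time (UTC here; `date` printed local time)
typing$Time <- as.POSIXct(typing$Timestamp, origin = "1970-01-01", tz = "UTC")
typing$Date <- as.Date(typing$Time)

print(typing[, c("Date", "WPM", "Accuracy")])
```

With a `Date` column in place, splitting into before and after sets becomes a comparison of `typing$Date` against the date of the first treadmill session, rather than a manual scan through the file.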

I went through the 2870 lines until I found the first treadmill session, which I did on June 16. After splitting, deleting the date-stamps, and adding a CSV header like WPM,Accuracy, I had 2285 entries for 2012-gwern-amphetype-before.csv and 585 for 2012-gwern-amphetype-after.csv. Then it is easy to load the CSVs into R and test:

before <- read.csv("http://www.gwern.net/docs/2012-gwern-amphetype-before.csv")
after <- read.csv("http://www.gwern.net/docs/2012-gwern-amphetype-after.csv")
t.test(before$WPM, after$WPM); t.test(before$Accuracy, after$Accuracy)

Welch Two Sample t-test

data:  before$WPM and after$WPM
t = -12.0053, df = 899.744, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-6.068614 -4.363238
sample estimates:
mean of x mean of y
82.34342  87.55935

Welch Two Sample t-test

data:  before$Accuracy and after$Accuracy
t = -4.4027, df = 940.588, p-value = 1.192e-05
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.0023275753 -0.0008923092
sample estimates:
mean of x mean of y
0.9875169 0.9891269

What? Using a treadmill made my average WPM go up 5 WPM? And my average accuracy increased by 0.002 (about 0.2 percentage points)? And both are highly statistically-significant (not a surprise, given how many entries there were)? What’s going on - this is the exact opposite of what I expected! The key is the low mean of the before data: I type much faster than 82 WPM now. What happened was that I spent 3 years practicing. Given that I was improving, it is wrong to compare the treadmill typing data against a long-run average. It would be better to lop off the first half of the before data to get a fairer comparison with after, since I began to plateau around then. Redoing the tests:

t.test(before$WPM[1144:2285], after$WPM); t.test(before$Accuracy[1144:2285], after$Accuracy)
Welch Two Sample t-test

data:  before$WPM[1144:2285] and after$WPM
t = -5.3227, df = 1132.508, p-value = 1.232e-07
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-3.429192 -1.581982
sample estimates:
mean of x mean of y
85.05376  87.55935

Welch Two Sample t-test

data:  before$Accuracy[1144:2285] and after$Accuracy
t = -1.3091, df = 1156.249, p-value = 0.1908
alternative hypothesis: true difference in means is not equal to 0
95% confidence interval:
-0.0012902667  0.0002575397
sample estimates:
mean of x mean of y
0.9886105 0.9891269

This is more reasonable: only a 2.5 WPM gain from the treadmill. A gain that small could be explained as just a placebo effect: me wanting to justify the time I’ve sunk into the treadmill and typing practice every day. It’s still a little surprising, but the result seems solid. (If we drop every score before 2000 instead of 1144, the difference continues to shrink but still favors the treadmill. We have to go to scores 2100-2285 before the treadmill starts to lose, but with 2200-2285 the treadmill wins!) Accuracy seems largely unaffected.
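
To see why the choice of cutoff matters so much, here is a rough sketch on synthetic, noise-free data (not the Amphetype scores). The learning curve and its parameters are invented purely for illustration, and no treadmill effect is built in at all; `gain_from` is a hypothetical helper:

```r
# Synthetic, noise-free stand-in for the real scores (NOT the Amphetype data):
# speed climbs along a learning curve and then flattens out.
n_before <- 2285
n_after  <- 585
session  <- seq_len(n_before + n_after)
wpm      <- 70 + 17 * (1 - exp(-session / 400))   # hypothetical learning curve
before   <- wpm[1:n_before]
after    <- wpm[(n_before + 1):(n_before + n_after)]

# mean(after) - mean(before), using only the "before" scores from `start` onwards
gain_from <- function(start) mean(after) - mean(before[start:n_before])

cutoffs <- c(1, 1144, 2000, 2100, 2200)
gains   <- sapply(cutoffs, gain_from)
print(data.frame(cutoff = cutoffs, apparent_gain = round(gains, 3)))
```

Even with no intervention in the simulated data, the later block looks faster, and the apparent gain shrinks as earlier practice scores are dropped. Real scores are noisy, so, unlike this idealized curve, the sign can flip at late cutoffs where few scores remain.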

54. The exact processing steps, for those curious:

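zeo <- read.csv("~/wiki/docs/gwern-zeodata.csv")
zeo$Sleep.Date <- as.Date(zeo$Sleep.Date, format="%m/%d/%Y")
# `mp` (not loaded here) must already hold the daily ratings, with columns Date & MP
zeo$MP <- ordered(mp[mp$Date %in% zeo$Sleep.Date,]$MP)
# sum of z-scores: a single equally-weighted index of sleep disturbance
zeo$Disturbance <- scale(zeo$Time.to.Z) + scale(zeo$Awakenings) + scale(zeo$Time.in.Wake)
# keep only nights with both a disturbance score and a morning-feel rating
zeo <- zeo[!is.na(zeo$Disturbance) & !is.na(zeo$Morning.Feel),]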
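
For readers unfamiliar with `scale()`, a small sketch of what the Disturbance index does. The five nights below are made-up numbers rather than Zeo data; only the column names follow the Zeo export:

```r
# Toy nights (made-up numbers, not Zeo data); columns named as in the Zeo export
nights <- data.frame(Time.to.Z    = c(10, 25, 15, 40, 5),
                     Awakenings   = c(2, 5, 3, 8, 1),
                     Time.in.Wake = c(5, 20, 10, 45, 0))

# scale() centers each column to mean 0 and rescales it to SD 1, so the three
# measures contribute equally despite different units (minutes vs counts)
z <- scale(nights)
nights$Disturbance <- as.numeric(rowSums(z))
print(nights)
```

Night 4, the worst on all three measures, gets the highest index, and night 5 gets the lowest. The sum of three z-scores is centered on zero, but its standard deviation is generally not 1.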

55.

zeo <- read.csv("http://www.gwern.net/docs/2013-gwern-sleepdisturbances-productivity.csv")
cor.test(zeo$Disturbance, as.integer(zeo$MP))

Pearson's product-moment correlation

data:  zeo$Disturbance and as.integer(zeo$MP)
t = 1.344, df = 414, p-value = 0.1798
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
-0.03045  0.16102
sample estimates:
cor
0.06589
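
As a sanity check, the reported t statistic and p-value can be recovered from the correlation and its degrees of freedom alone. A short sketch using only the numbers printed above:

```r
# Values copied from the cor.test() output above
r  <- 0.06589
df <- 414                      # n - 2, so n = 416 nights

# standard t statistic for testing a Pearson correlation against zero
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided

print(round(c(t = t_stat, p = p_value, r_squared = r^2), 4))
```

The squared correlation is under half a percent, so even taken at face value Disturbance accounts for almost none of the variance in MP.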
56. We regress the ordinal outcome (MP) on a continuous predictor (Disturbance):

# turn into an ordinal variable
zeo$MP <- ordered(zeo$MP)

library(MASS)
lmodel <- polr(MP ~ Disturbance, data = zeo); summary(lmodel)
...
Coefficients:
             Value Std. Error t value
Disturbance 0.0553     0.0429    1.29

Intercepts:
    Value  Std. Error t value
1|2 -4.413  0.450     -9.808
2|3 -0.990  0.110     -8.965
3|4  1.101  0.113      9.711

Residual Deviance: 915.66
AIC: 923.66

exp(lmodel$coefficients)
Disturbance
1.057
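
The exponentiated coefficient is an odds ratio: in `polr`'s parameterization, the multiplicative change in the odds of a higher MP rating per one-unit increase in Disturbance. A sketch of the arithmetic from the printed estimate and standard error; the Wald interval is my own addition (a normal approximation), not part of the original output:

```r
# Estimate and standard error copied from the summary(lmodel) output above
beta <- 0.0553
se   <- 0.0429

odds_ratio <- exp(beta)
# approximate 95% Wald interval (normal approximation; not in the original output)
wald_ci <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

print(round(c(OR = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 3))
```

The interval straddles 1, which agrees with the small t value of 1.29: the data do not distinguish this effect from none.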
57. Try out more variables:

almodel <- polr(MP ~ Disturbance + ZQ + Total.Z + Time.to.Z + Time.in.Wake + Time.in.REM +
Time.in.Light + Time.in.Deep + Awakenings + Morning.Feel, data = zeo); almodel

Coefficients:
  Disturbance            ZQ       Total.Z     Time.to.Z  Time.in.Wake   Time.in.REM Time.in.Light
    -0.431623     -0.276236      0.307941      0.045819      0.003266     -0.246901     -0.272593
 Time.in.Deep  Morning.Feel
    -0.227003      0.205541

Intercepts:
1|2     2|3     3|4
-2.9105  0.5465  2.6902

Residual Deviance: 903.01
AIC: 927.01
58. Reduced by cutting out extraneous variables using stepwise regression:

salmodel <- step(almodel); summary(salmodel)
...
Coefficients:
               Value Std. Error t value
Time.to.Z     0.0163    0.00713    2.29
Time.in.Deep -0.0152    0.00823   -1.85
Morning.Feel  0.1906    0.12683    1.50

Intercepts:
    Value  Std. Error t value
1|2 -4.457  0.785     -5.675
2|3 -1.011  0.649     -1.557
3|4  1.113  0.649      1.713

Residual Deviance: 907.60
AIC: 919.60
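
The three AIC values can be checked by hand, since for these fits AIC is the residual deviance plus twice the number of estimated parameters (slopes plus the 3 intercepts). A small sketch using the printed deviances; the model labels are my own shorthand:

```r
# Residual deviances copied from the three polr fits above; each model also
# estimates 3 intercepts (cut-points) in addition to its slope coefficients
models <- data.frame(
  model    = c("Disturbance only", "all variables", "stepwise"),
  deviance = c(915.66, 903.01, 907.60),
  slopes   = c(1, 9, 3)
)
models$params <- models$slopes + 3
models$AIC    <- models$deviance + 2 * models$params

print(models[order(models$AIC), ])
```

The full model has the lowest deviance but the worst AIC, because its extra slopes cost more in penalty than they gain in fit. The stepwise model's AIC is lowest partly by construction, since `step()` searches for exactly that.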