> Weather is often said to affect our mood, and that people in sunnier places are happier *because* of that.
> Curious about the possible effect (it could be worth controlling for in my future QS analyses or attempting to imitate benefits inside my house eg brighter lighting), I combine my long-term daily self-ratings with logs from the nearest major official weather stations, which offer detailed weather information about temperature, humidity, precipitation, cloud cover, wind speed, brightness etc, and try to correlate them.
>
> In general, despite considerable data, there are essentially no bivariate correlations, nothing in several versions of a linear model, and nothing found by a random forest. It would appear that weather does not correlate with my self-ratings to any detectable degree, much less cause it.

[Science has found]{.smallcaps} that [lighting & temperature](http://blog.bufferapp.com/the-science-of-how-room-temperature-and-lighting-affects-our-productivity) & [the seasons](!Wikipedia "Seasonal affective disorder") matter to one's mood, focus, sleep and other things (but not, apparently, one's [life](/docs/nootropics/2013-lucas.pdf "'Does Life Seem Better on a Sunny Day? Examining the Association Between Daily Weather Conditions and Life Satisfaction Judgments', Lucas & Lawless 2013") [satisfaction](/docs/psychology/2015-feddersen.pdf "'Subjective wellbeing: why weather matters', Feddersen et al 2015") or [psychiatric disorders](http://www.badscience.net/2009/01/part-432-in-which-i-get-a-bit-overinterested-and-look-up-waaay-too-many-references/ "...is there good evidence of season having an impact on our collective mood? Seasonal affective disorder is its own separate thing. If you look at the evidence on the population’s mood, depression, and suicide changing over the seasons, you do, in fact, find a glorious mess."), which may be related to complex features of light ([Vyssoki et al 2014](/docs/psychology/2014-vyssoki.pdf "Direct Effect of Sunshine on Suicide"))). Certainly, these things having large effects makes a lot of sense---who doesn't feel gloomy on a rainy day or happier on a sunny day?
But things everyone knows often turn out to be wrong; and in my case, it seems like my morning use of [vitamin D](/Nootropics#vitamin-d) and evening [melatonin](/Melatonin) supplementation may be screening off the environmental effects. So one rainy and then sunny day in March 2013, I wondered if I could find an influence of rain or darkness on my own mood or productivity.
As it happens, since 16 February 2012, I have been daily writing down my impression of whether my mood & productivity (MP) that day was average, below-average, or above; this gave me 399 ratings to work with. I didn't record any weather data, but the nice thing about weather data is that---unlike quotidian but potentially important data like the proverbial what you had for breakfast, which is lost if not immediately recorded---the weather is concrete, objective, easily quantifiable, economically important, and of universal interest. Hence, we would expect weather data to be available online so I can find it, and see what aspects of the weather influence my self-rating (if any).
# Data
A convenient source of data seems to be [Weather Underground](!Wikipedia "Weather Underground (weather service)"). Going via [API](http://allthingsr.blogspot.com/2012/04/getting-historical-weather-data-in-r.html) is overkill, since we just need data from [Islip](http://www.wunderground.com/history/airport/KISP/2013/3/19/MonthlyHistory.html?MR=1) for 16 February 2012--08 July 2012 ([CSV](http://www.wunderground.com/history/airport/KISP/2012/2/16/CustomHistory.html?dayend=8&monthend=7&yearend=2012&req_city=NA&req_state=NA&req_statename=NA&format=1)) and from [Pax River](http://www.wunderground.com/history/airport/KNHK/2012/7/11/CustomHistory.html?dayend=9&monthend=3&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&MR=1) 11 July 2012--22 March 2013 ([CSV](http://www.wunderground.com/history/airport/KNHK/2012/7/11/CustomHistory.html?dayend=22&monthend=3&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1)). I'm using the nearest airport, rather than the nearest volunteer weather station, to get data on cloud cover; cloud cover should be a good proxy for brightness or sunniness. The [`insol`](http://cran.r-project.org/web/packages/insol/index.html) library calculates the length of day, another potentially relevant variable.
~~~{.R}
# gather and clean data
weather1 <- read.csv("http://www.wunderground.com/history/airport/KISP/2012/2/16/CustomHistory.html?dayend=8&monthend=7&yearend=2012&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather2 <- read.csv("http://www.wunderground.com/history/airport/KNHK/2012/7/11/CustomHistory.html?dayend=29&monthend=7&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather3 <- read.csv("http://www.wunderground.com/history/airport/KISP/2013/7/30/CustomHistory.html?dayend=7&monthend=8&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather4 <- read.csv("http://www.wunderground.com/history/airport/KNHK/2013/8/8/CustomHistory.html?dayend=17&monthend=9&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather1$PrecipitationIn <- as.numeric(as.character(weather1$PrecipitationIn)) # delete weird "T" factor; ???
# mood/productivity (MP) sourced from personal log
weather1$MP <- c(1,3,4,4,4,4,3,4,2,3,4,4,2,3,2,3,3,2,3,2,3,4,3,2,3,2,3,1,2,4,3,3,3,3,3,
3,3,2,3,4,3,3,4,4,4,4,2,3,4,3,3,3,4,3,3,1,3,2,3,3,4,3,2,3,2,4,4,4,3,3,
4,4,3,4,3,3,2,3,3,2,2,4,3,2,4,4,4,3,3,2,4,4,4,3,3,3,3,2,3,4,3,2,2,4,3,
3,2,2,2,2,3,4,2,4,4,3,3,3,2,4,2,2,2,2,2,3,3,4,2,1,3,3,2,3,3,4,2,3,2,3,
3,2,4,4)
weather2$MP <- c(2,4,4,4,3,4,3,3,4,3,2,3,4,4,4,3,2,3,3,2,3,3,3,2,2,2,2,2,3,4,3,4,2,4,3,
3,2,2,2,3,3,3,3,4,3,3,3,4,3,3,3,2,4,2,3,3,4,4,3,3,3,4,3,3,4,3,4,2,3,3,
4,4,3,3,4,4,3,4,3,2,3,3,3,4,3,2,3,2,2,2,3,3,3,4,4,3,4,4,3,3,2,3,3,3,3,
4,4,3,4,2,2,2,3,4,3,4,3,4,3,4,4,3,3,2,3,2,4,4,3,4,2,3,4,2,3,3,2,2,2,3,
2,3,3,4,2,3,4,3,4,3,3,2,2,3,4,4,3,4,2,2,3,2,3,2,2,2,4,3,3,4,2,2,3,3,3,
4,4,3,2,3,2,2,2,3,3,3,4,3,4,3,3,3,2,2,3,3,3,4,4,3,2,2,2,3,3,4,3,4,3,4,
3,2,4,4,3,3,2,4,3,3,3,2,3,3,2,3,2,3,3,3,3,2,3,3,3,3,3,3,4,2,3,2,4,3,3,
2,2,3,3,3,2,4,3,3,3,3,4,4,3,3,3,3,2,3,2,2,3,3,2,3,2,3,2,3,3,2,2,3,3,3,
4,4,3,2,2,3,2,3,2,4,4,4,3,3,3,4,3,4,3,4,4,2,2,2,4,4,3,2,2,3,4,3,2,4,2,
3,4,2,4,4,2,3,2,3,3,3,2,2,2,3,3,2,2,3,4,4,2,3,2,2,3,4,4,3,2,2,3,4,4,3,
3,3,3,2,4,3,3,4,3,3,3,4,3,4,4,4,4,3,3,3,2,2,2,4,3,4,4,2,2,3,4,4,3,2)
weather3$MP <- c(3,4,2,3,4,3,3,4,3)
weather4$MP <- c(3,3,2,2,3,2,4,4,4,3,2,2,5,4,3,4,4,3,3,4,4,3,2,2,3,4,3,3,4,4,3,4,3,3,
4,3,4,4,4,2,3)
# compute length of day for each location
library(insol)
weather1$DayLength <- daylength(40,73, julian(as.Date(as.character(weather1$EDT))), 1)[,3]
weather3$DayLength <- daylength(40,73, julian(as.Date(as.character(weather3$EDT))), 1)[,3]
weather2$DayLength <- daylength(38,76, julian(as.Date(as.character(weather2$EDT))), 1)[,3]
weather4$DayLength <- daylength(38,76, julian(as.Date(as.character(weather4$EDT))), 1)[,3]
# combine & start cleaning
weather <- rbind(weather1,weather2)
weather$WindDirDegrees.br... <- sub("", "", weather$WindDirDegrees.br...) weather$WindDirDegrees.br... <- as.integer(weather$WindDirDegrees.br...) weather$Events <- as.integer(weather$Events) weather$Events[weather$Events>1] <- 2 # something of a hack but we'll impute any missing rain precipitation values as 0 weather[is.na(weather)] <- 0 ~~~ # Analysis ## Exploratory After cleanup, the data looks reasonable and as expected, with each variable spanning the ranges one would expect of temperatures and precipitation: ~~~{.R} summary(weather) # EDT Max.TemperatureF Mean.TemperatureF Min.TemperatureF Max.Dew.PointF MeanDew.PointF # 2012-2-16: 1 Min. : 26.0 Min. :21.0 Min. :14.0 Min. :12.0 Min. : 1 # 2012-2-17: 1 1st Qu.: 53.0 1st Qu.:45.0 1st Qu.:37.0 1st Qu.:39.8 1st Qu.:33 # 2012-2-18: 1 Median : 68.0 Median :58.0 Median :50.0 Median :54.0 Median :48 # 2012-2-19: 1 Mean : 66.9 Mean :58.7 Mean :50.6 Mean :52.5 Mean :47 # 2012-2-20: 1 3rd Qu.: 82.0 3rd Qu.:73.0 3rd Qu.:66.0 3rd Qu.:67.0 3rd Qu.:63 # 2012-2-21: 1 Max. :100.0 Max. :87.0 Max. :80.0 Max. :79.0 Max. :74 # (Other) :522 # Min.DewpointF Max.Humidity Mean.Humidity Min.Humidity Max.Sea.Level.PressureIn # Min. :-5.0 Min. : 42.0 Min. :28.0 Min. :12.0 Min. :29.4 # 1st Qu.:25.0 1st Qu.: 82.0 1st Qu.:59.0 1st Qu.:37.0 1st Qu.:30.0 # Median :42.0 Median : 89.0 Median :69.0 Median :48.0 Median :30.1 # Mean :41.2 Mean : 85.7 Mean :67.8 Mean :48.6 Mean :30.1 # 3rd Qu.:58.0 3rd Qu.: 93.0 3rd Qu.:77.2 3rd Qu.:60.0 3rd Qu.:30.2 # Max. :73.0 Max. :100.0 Max. :97.0 Max. :93.0 Max. :30.6 # # Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles Mean.VisibilityMiles # Min. :29.1 Min. :28.7 Min. : 6.00 Min. : 2.0 # 1st Qu.:29.9 1st Qu.:29.8 1st Qu.:10.00 1st Qu.: 9.0 # Median :30.0 Median :29.9 Median :10.00 Median :10.0 # Mean :30.0 Mean :29.9 Mean : 9.97 Mean : 9.1 # 3rd Qu.:30.2 3rd Qu.:30.1 3rd Qu.:10.00 3rd Qu.:10.0 # Max. :30.5 Max. :30.5 Max. :10.00 Max. :10.0 # # Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH PrecipitationIn # Min. : 0.00 Min. : 6.0 Min. : 1.00 Min. : 0.0 Min. :0.000 # 1st Qu.: 3.00 1st Qu.:13.0 1st Qu.: 6.00 1st Qu.: 0.0 1st Qu.:0.000 # Median : 9.00 Median :15.0 Median : 8.00 Median :22.0 Median :0.000 # Mean : 6.88 Mean :16.4 Mean : 8.04 Mean :18.6 Mean :0.124 # 3rd Qu.:10.00 3rd Qu.:20.0 3rd Qu.:10.00 3rd Qu.:28.0 3rd Qu.:0.030 # Max. :10.00 Max. :40.0 Max. :27.00 Max. :56.0 Max. :6.000 # # CloudCover Events WindDirDegrees.br... MP DayLength # Min. :0.0 Min. :1.00 Min. : 1 Min. :1.00 Min. : 9.27 # 1st Qu.:3.0 1st Qu.:1.00 1st Qu.:138 1st Qu.:2.00 1st Qu.:10.31 # Median :4.5 Median :1.00 Median :208 Median :3.00 Median :12.10 # Mean :4.4 Mean :1.47 Mean :202 Mean :2.98 Mean :12.07 # 3rd Qu.:6.0 3rd Qu.:2.00 3rd Qu.:283 3rd Qu.:4.00 3rd Qu.:13.87 # Max. :8.0 Max. :2.00 Max. :360 Max. :4.00 Max. :14.73 ~~~ `CloudCover` being 0--8 looks a bit odd, since one might've expected a decimal or percentage or some quantification of thickness but turns out to be a standard measurement, the "[okta](!Wikipedia)". We might expect some seasonal effects, but graphing a sensitive [LOESS](!Wikipedia) [moving average](!Wikipedia) and then a per-week spineplot suggests just 1 anomaly, that there were some days I rated "1" but never subsequently (this has an explanation: as I began keeping the series, my car caught on fire, was totaled; and I was then put through the wringer with the insurance & junkyard, and felt truly miserable at multiple points): ~~~{.R} par(mfrow=c(2,1)) scatter.smooth(x=weather$EDT, y=weather$MP, span=0.3, col="#CCCCCC", xlab="Days", ylab="MP rating") weeks <- c(seq(from=1,to=length(weather$MP),by=7), length(weather$MP)) spineplot(c(1:(length(weather$MP))), as.factor(weather$MP), breaks=weeks, xlab="Weeks", ylab="MP rating ratios") # we don't use the date past the spineplot & # it causes problems with all the models, so delete it rm(weather$EDT) ~~~ ![MP from February 2012 to March 2013](/images/weather/2013-seasonal.png){.full-width} A formal check for [autocorrelation](!Wikipedia) in MP ratings (most methods assume independence) turns up [little enough](#autocorrelation) that I feel free to ignore them. So the data looks clean and tame; now we can begin real interpretation. The first and most obvious thing to do is to see what the overall correlation matrix with MP looks like: ~~~{.R} # all weather correlations with mood/productivity round(cor(weather)[23,], digits=3) # Max.TemperatureF Mean.TemperatureF Min.TemperatureF # 0.013 0.016 0.022 # Max.Dew.PointF MeanDew.PointF Min.DewpointF # 0.016 0.019 0.016 # Max.Humidity Mean.Humidity Min.Humidity # 0.042 0.026 0.011 # Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn # 0.006 0.004 0.015 # Max.VisibilityMiles Mean.VisibilityMiles Min.VisibilityMiles # 0.066 0.004 -0.016 # Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH # 0.008 0.002 0.018 # PrecipitationIn CloudCover Events # -0.037 0.041 -0.001 # WindDirDegrees.br... MP DayLength # -0.006 1.000 0.036 # tail(sort(abs(round(cor(weather)[23,], digits=3))), 6) # DayLength PrecipitationIn CloudCover Max.Humidity Max.VisibilityMiles # 0.036 0.037 0.041 0.042 0.066 # MP # 1.000 ~~~ All the _r_ values seem very small: >0.1. But some of the more plausible correlations may be statistically-significant, so we'll look at the 5 largest correlations ~~~{.R} sapply(c("DayLength","PrecipitationIn","CloudCover","Max.Humidity","Max.VisibilityMiles"), function(x) cor.test(weather$MP, weather[[x]])) # ... # DayLength PrecipitationIn # p.value 0.4099 0.394 # estimate 0.03594 -0.03717 # # CloudCover Max.Humidity # p.value 0.3445 0.3319 # estimate 0.04122 0.04231 # # Max.VisibilityMiles # p.value 0.1292 # estimate 0.06611 ~~~ These specific variables turn out to be failures too, with large _p_-values. One useful technique is to convert metrics into standardized units (of standard deviations), sum them all into a single composite variable, and then test that; this can reveal influences not obvious if one looked at the metrics individually. (For example, in my [potassium sleep experiments](/zeo/Potassium), where I was interested in an overall measure of reduced sleep quality rather than single metrics like sleep latency.) Perhaps this would work here? ~~~{.R} # construct a z-score for all of them to see if it does any better with(weather, cor.test(MP, (scale(Max.Humidity) + scale(Max.VisibilityMiles) + scale(PrecipitationIn) + scale(CloudCover) + scale(DayLength)))) # ... # t = 1.296, df = 526, p-value = 0.1957 # alternative hypothesis: true correlation is not equal to 0 # 95% confidence interval: # -0.02908 0.14105 # sample estimates: # cor # 0.0564 ~~~ ## Modeling ### Continuous MP #### Linear model

> `Touji:` "Oh, yes. the view of the world that one can have is quite small." \
> `Hikari:` "Yes, you measure things only by your own small measure." \
> `Asuka:` "One sees things with the truth, given by others." \
> `Misato:` "Happy on a sunny day." \
> `Rei:` "Gloomy on a rainy day." \
> `Asuka:` "If you're taught that, you always think so." \
> `Ritsuko:` "But, you can enjoy rainy days."
>
> _Neon Genesis Evangelion_, episode 26 ["The Beast that Shouted 'I' at the Heart of the World"](http://www.oocities.org/gorene/text/Episode26.txt) (Literal Translation Project)

The correlations show individually little value, so we'll move on to building modeling and assessing their fit. Can we accurately predict MP if we use all the parameters? We'll start with a linear model/regression, where we treat the categorical MP variable as a continuous variable for simplicity:
~~~{.R}
model1 <- lm(MP ~ ., data=weather); summary(model1)
# ...Residuals:
# Min 1Q Median 3Q Max
# -2.0255 -0.7815 0.0165 0.8126 1.3730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -4.29e+00 6.38e+00 -0.67 0.50
# Max.TemperatureF 3.76e-02 3.20e-02 1.17 0.24
# Mean.TemperatureF -5.87e-02 6.06e-02 -0.97 0.33
# Min.TemperatureF 4.41e-02 3.24e-02 1.36 0.17
# Max.Dew.PointF -9.52e-03 1.46e-02 -0.65 0.52
# MeanDew.PointF -1.45e-02 2.79e-02 -0.52 0.60
# Min.DewpointF -6.52e-03 1.43e-02 -0.46 0.65
# Max.Humidity 1.10e-02 8.36e-03 1.31 0.19
# Mean.Humidity 1.73e-03 1.33e-02 0.13 0.90
# Min.Humidity 5.75e-03 8.16e-03 0.71 0.48
# Max.Sea.Level.PressureIn 9.92e-01 9.79e-01 1.01 0.31
# Mean.Sea.Level.PressureIn -1.61e+00 1.63e+00 -0.99 0.32
# Min.Sea.Level.PressureIn 7.29e-01 8.39e-01 0.87 0.39
# Max.VisibilityMiles 2.00e-01 1.47e-01 1.36 0.17
# Mean.VisibilityMiles 1.04e-02 3.96e-02 0.26 0.79
# Min.VisibilityMiles -3.94e-03 1.89e-02 -0.21 0.83
# Max.Wind.SpeedMPH 2.31e-03 1.32e-02 0.18 0.86
# Mean.Wind.SpeedMPH -2.28e-03 1.70e-02 -0.13 0.89
# Max.Gust.SpeedMPH 2.71e-03 4.54e-03 0.60 0.55
# PrecipitationIn -1.25e-01 9.29e-02 -1.35 0.18
# CloudCover 2.65e-02 2.53e-02 1.05 0.29
# Events -9.18e-02 9.83e-02 -0.93 0.35
# WindDirDegrees.br... 3.08e-05 4.11e-04 0.07 0.94
# DayLength 5.72e-02 4.78e-02 1.20 0.23
#
# Residual standard error: 0.755 on 504 degrees of freedom
# Multiple R-squared: 0.0282, Adjusted R-squared: -0.0162
# F-statistic: 0.635 on 23 and 504 DF, p-value: 0.905
~~~
The variance explained is extremely minimal, even with 23 different variables in the linear model, and suggests overfitting (as one would expect). By Occam's razor, most of the variables should be scrapped for not carrying their weight; the `step` function uses a [complexity penalty](!Wikipedia "Akaike information criterion") to choose which variable to eliminate while still fitting the data reasonably well. It choose to keep only 4 variables:
~~~{.R}
smodel1 <- step(model1); summary(smodel1)
# ...Residuals:
# Min 1Q Median 3Q Max
# -2.1301 -0.8907 0.0121 0.8990 1.1779
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.02134 1.38025 0.02 0.988
# Min.TemperatureF 0.01429 0.01008 1.42 0.157
# MeanDew.PointF -0.01489 0.01034 -1.44 0.150
# Max.Humidity 0.00958 0.00531 1.81 0.072
# Max.VisibilityMiles 0.21250 0.12682 1.68 0.094
#
# Residual standard error: 0.748 on 523 degrees of freedom
# Multiple R-squared: 0.0109, Adjusted R-squared: 0.00336
# F-statistic: 1.44 on 4 and 523 DF, p-value: 0.218
~~~
We can compare the models' prediction accuracy on the dataset (lower is better in mean-squared-error):
~~~{.R}
mean((weather$MP - weather$MP)^2) # perfect
# [1] 0
mean((weather$MP - model1$fitted.values)^2) # original
# [1] 0.5446
mean((weather$MP - smodel1$fitted.values)^2) # much simpler model
# [1] 0.5543
model2 <- lm(MP ~ 1, data=weather); mean((weather$MP - model2$fitted.values)^2)
# [1] 0.5604
~~~
The model with 4 data variables is much simpler, and appears to fit almost as well, but slightly better than the null model of no parameters
### Random forests regression
Linear modeling having failed to reveal any interesting relationships, we'll take one last crack at it: if there's any interesting predictive power in the weather data at all, a high-powered machine learning technique like [random forests](!Wikipedia) ought to build a model which outperforms the linear model at least a little on mean-squared-error
~~~{.R}
library(randomForest)
rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
# ...
# Type of random forest: regression
# Number of trees: 500
# No. of variables tried at each split: 7
#
# Mean of squared residuals: 0.5692
# % Var explained: -1.57
~~~
The error with all variables is a little higher than the simpler linear model. Troublingly, a random forests with the 4 simpler variables increases the error:
~~~{.R}
srmodel <- randomForest(MP ~ Min.TemperatureF + MeanDew.PointF + Max.Humidity + Max.VisibilityMiles, data=weather, proximity=TRUE); srmodel
# ...
# Type of random forest: regression
# Number of trees: 500
# No. of variables tried at each split: 1
#
# Mean of squared residuals: 0.5814
# % Var explained: -3.76
~~~
Perhaps treating MP as a continuous variable was a bad idea. Let's start over.
### Categorical MP
We turn the MP data into an ordered factor:
~~~{.R}
weather$MP <- ordered(weather$MP)
~~~
#### Logistic model
A straight [logistic regression](!Wikipedia) (`glm(MP ~ ., data=weather, family="binomial")`) is not appropriate because MP is not a binary outcome; we use the [MASS](http://cran.r-project.org/web/packages/MASS/index.html) library to do an ordinal logistic regression; as with the linear model, the coefficients don't seem to differ very much
~~~{.R}
library(MASS)
lmodel <- polr(as.ordered(MP) ~ ., data = weather); summary(lmodel)
# ...
# Coefficients:
# Value Std. Error t value
# Max.TemperatureF 0.086798 0.07916 1.096
# Mean.TemperatureF -0.135828 0.14974 -0.907
# Min.TemperatureF 0.105068 0.08024 1.309
# Max.Dew.PointF -0.023466 0.03624 -0.648
# MeanDew.PointF -0.033449 0.06789 -0.493
# Min.DewpointF -0.018700 0.03424 -0.546
# Max.Humidity 0.029043 0.02036 1.427
# Mean.Humidity 0.004063 0.03263 0.124
# Min.Humidity 0.013462 0.02024 0.665
# Max.Sea.Level.PressureIn 2.554493 0.85951 2.972
# Mean.Sea.Level.PressureIn -4.352709 0.06383 -68.191
# Min.Sea.Level.PressureIn 2.047343 0.91077 2.248
# Max.VisibilityMiles 0.549889 0.37650 1.461
# Mean.VisibilityMiles 0.029195 0.09772 0.299
# Min.VisibilityMiles -0.007733 0.04638 -0.167
# Max.Wind.SpeedMPH 0.003846 0.03219 0.119
# Mean.Wind.SpeedMPH -0.007351 0.04250 -0.173
# Max.Gust.SpeedMPH 0.008375 0.01130 0.741
# PrecipitationIn -0.325188 0.22738 -1.430
# CloudCover 0.067914 0.06000 1.132
# Events -0.219315 0.24050 -0.912
# WindDirDegrees.br... 0.000192 0.00101 0.190
# DayLength 0.144132 0.11804 1.221
#
# Intercepts:
# Value Std. Error t value
# 1|2 13.143 0.021 638.830
# 2|3 17.080 0.500 34.185
# 3|4 19.126 0.509 37.550
#
# Residual Deviance: 1143.49
# AIC: 1195.49
## relative risk or odds; how much difference does each variable make, per its units?
exp(coef(lmodel))
# Max.TemperatureF Mean.TemperatureF Min.TemperatureF
# 1.09068 0.87299 1.11079
# Max.Dew.PointF MeanDew.PointF Min.DewpointF
# 0.97681 0.96710 0.98147
# Max.Humidity Mean.Humidity Min.Humidity
# 1.02947 1.00407 1.01355
# Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
# 12.86477 0.01287 7.74729
# Max.VisibilityMiles Mean.VisibilityMiles Min.VisibilityMiles
# 1.73306 1.02963 0.99230
# Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
# 1.00385 0.99268 1.00841
# PrecipitationIn CloudCover Events
# 0.72239 1.07027 0.80307
# WindDirDegrees.br... DayLength
# 1.00019 1.15504
~~~
Another use of `step`; this time, it builds something even narrower than before:
~~~{.R}
slmodel <- step(lmodel); summary(slmodel)
# ...Coefficients:
# Value Std. Error t value
# Max.VisibilityMiles 0.527 0.329 1.6
#
# Intercepts:
# Value Std. Error t value
# 1|2 0.364 3.293 0.110
# 2|3 4.277 3.282 1.303
# 3|4 6.283 3.287 1.912
#
# Residual Deviance: 1156.46
# AIC: 1164.46
exp(0.527)
# [1] 1.694
~~~
##### Seasonal effects
The previously mentioned life-satisfaction/weather paper, Lucas et al 2013, examined a single variable very similar to my MP:
> Life satisfaction was assessed using a single item that read, "In general, how satisfied are you with your life." Participants responded using a 4-point scale with the options "Very Satisfied," "Satisfied," "Dissatisfied," or "Very Dissatisfied" (responses were scored such that higher scores reflect higher satisfaction)
Rather than simply regress local weather on the score, similar to what I've been doing so far, they attempted a more complex model where the effects could change based on region & time of year:
> In addition, the effect of absolute levels of any weather variable may vary depending on when in the year the weather occurred. A 50 °F day may contribute to positive mood (and hence higher life satisfaction) if it occurred in the middle of winter in a cold climate, whereas this same absolute temperature might contribute to a negative mood (and hence lower life satisfaction) if it occurred in the middle of summer in a warm climate. Thus, seasonal differences must be considered when examining the effects of weather.
While I live in only one region and replicating the rest of the analysis would be a lot of work, the seasonal changes is possible and not too hard to investigate: reset the data to restore the date, create a `Season` variable, populate (since I am lazy, I convert the dates to fiscal quarters, which are similar enough), and then rerun an ordinal logistic regression with all the potential predictors nested under `Season`.
Looking at the coefficients split by season, and looking for coefficients that suddenly become larger or which switch signs from season to season, I see no real candidates:
~~~{.R}
weather$MP <- as.ordered(weather$MP)
library(lubridate)
weather$Season <- quarter(weather$EDT)
library(lme4)
lmr <- lmer(MP ~ (. |Season), data=weather, control = lmerControl(maxfun=20000)); ranef(lmr)
# ...$Season
# (Intercept) MP.L MP.Q MP.C EDT Max.TemperatureF Mean.TemperatureF Min.TemperatureF
# 1 0.05723 1.8727 0.2184 -0.1058 4.704e-05 0.002089 -0.002253 0.0009085
# 2 0.06215 1.8918 0.2303 -0.1042 6.786e-06 0.004284 -0.004817 0.0034466
# 3 0.31806 0.9037 0.9451 -0.4151 3.206e-05 0.003028 -0.002113 0.0019708
# 4 0.31525 0.8986 0.9376 -0.4169 7.582e-07 -0.001217 -0.001045 0.0003388
# Max.Dew.PointF MeanDew.PointF Min.DewpointF Max.Humidity Mean.Humidity Min.Humidity
# 1 -3.416e-04 -0.002731 0.0008853 -1.584e-04 0.0015697 0.0004101
# 2 1.922e-03 -0.004427 0.0001578 6.227e-04 -0.0002743 0.0010211
# 3 2.523e-05 -0.002938 -0.0007302 5.611e-05 0.0009317 0.0004772
# 4 -1.235e-04 0.003389 -0.0012974 6.986e-04 -0.0009484 -0.0004437
# Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles
# 1 0.006218 -0.018023 -0.008160 0.013916
# 2 0.011742 -0.001504 0.001079 -0.002815
# 3 0.002675 0.002821 -0.004279 0.005178
# 4 0.006987 0.003645 0.007091 0.013030
# Mean.VisibilityMiles Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
# 1 -0.002323 0.0028401 0.0028779 0.0009989 -6.320e-04
# 2 0.001611 -0.0023890 0.0014840 -0.0012118 -7.029e-05
# 3 0.002107 0.0007576 -0.0014126 -0.0016919 9.457e-04
# 4 -0.006157 0.0020409 0.0007483 0.0008237 -4.648e-04
# PrecipitationIn CloudCover Events WindDirDegrees.br... DayLength Season.L Season.Q
# 1 -0.025717 0.002471 0.005752 1.576e-05 0.009140 -0.0006296 0.005954
# 2 -0.005913 -0.002430 -0.002981 -1.195e-04 -0.003607 -0.0015844 0.006389
# 3 0.002764 0.003450 -0.006076 6.861e-06 -0.004650 -0.0011383 0.010232
# 4 -0.008745 0.002082 -0.012127 -4.119e-05 0.003976 -0.0010785 0.010584
# Season.C
# 1 -0.0041696
# 2 -0.0047106
# 3 -0.0011607
# 4 -0.0004978
~~~
Variables which are positive in one season tend to be positive in another. There are some cases of flipping sign, like `PrecipitationIn` but in these cases, the coefficients are tiny (sometimes so tiny R lapses into scientific notation). In no case is the difference between season estimates as large as 0.1. Hence, seasonal effects do not seem to be important
##### Correlation size & implications
The `Max.VisibilityMiles` result seems like a reasonable one; the variable may be tapping into some sort of anti-rain measure. We can visualize the coefficients thusly:
~~~{.R}
library(arm)
maxvis <- attr(profile(slmodel), "summary")$coefficients[1]
coefplot(maxvis, varnames=rownames(x), vertical=FALSE, var.las=1, main="95% CIs for the log odds of predictors")
~~~
![Variables with confidence intervals show log odds size and whether the CI excludes zero](/images/weather/2013-coefficients-ci.png)
It doesn't *look* particularly impressive and doesn't exclude 0; but we want to know whether this surviving variable is potentially useful in practice, what its "cash value" might be. We again exponentiate the log odds to get more understandable odds:
~~~{.R}
exp(coef(slmodel))
# Max.VisibilityMiles
# 1.694
~~~
What do these estimates mean in practice? Let's take `Max VisibilityMiles` as a variable that may be manipulable by buying bright lights, with an odds of 1.5; if we could somehow maximize each day's `DayLength`, the model would predict 4 higher days:
~~~{.R}
weatherOdds <- weather # copy data & max out DayLength
weatherOdds$Max.VisibilityMiles <- max(weather$Max.VisibilityMiles)
sum(as.integer(predict(slmodel, newdata=weatherOdds))) - sum(as.integer(predict(slmodel, newdata=weather)))
# [1] 4
~~~
If we wanted to test this potential effect via an experiment (eg. buying a powerful lighting system, and turning it on or off on random days), how many days _n_ should the experiment run? With a bootstrap: we can calculate the effect size of 4 incremented days and then ask for the _n_ required for a self-experiment with good power of 80%; this turns out to be a ~1600 day or 4.4 year long experiment:
~~~{.R}
library(boot)
library(rms)
n <- 1600
weatherPower <- function(dt, indices) {
d <- dt[sample(nrow(dt), n, replace=TRUE), ] # new dataset, possibly larger than the original
lmodel <- lrm(MP ~ Max.VisibilityMiles, data = d)
return(anova(lmodel)[5])
}
bs <- boot(data=weather, statistic=weatherPower, R=100000, parallel="multicore", ncpus=4)
alpha <- 0.05
print(sum(bs$t<=alpha)/length(bs$t))
# [1] 0.8054
~~~
To detect such a subtle effect is not easy, and such a time-consuming experiment is surely not worthwhile.
#### Random forests classification
Random forests can be used to classify (predict categorical outcomes) as well as regressions; having turned the response variable into a factor, the type switches automatically:
~~~{.R}
rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
# ...
# Number of trees: 500
# No. of variables tried at each split: 4
#
# OOB estimate of error rate: 57.39%
# Confusion matrix:
# 1 2 3 4 class.error
# 1 0 1 3 0 1.0000
# 2 0 20 104 16 0.8571
# 3 0 31 179 34 0.2664
# 4 0 20 94 26 0.8143
~~~
We can't `step` through random forests since it doesn't have a complexity measure like AIC to use, so we'll reuse the variable from the simplified ordinal:
~~~{.R}
srmodel <- randomForest(MP ~ Max.VisibilityMiles, data=weather, proximity=TRUE); srmodel
# ...
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 1
#
# OOB estimate of error rate: 53.79%
# Confusion matrix:
# 1 2 3 4 class.error
# 1 0 0 4 0 1.000000
# 2 0 2 138 0 0.985714
# 3 0 1 242 1 0.008197
# 4 0 0 140 0 1.000000
~~~
Much better, and now slightly better than the original linear models.
#### Ordinal vs Random forests
We can now compare the fraction of days that are incorrectly predicted by the constant predictor, the ordinal logistic regression (full & simplified), and the random forests (full & simplified):
~~~{.R}
1 - (sum(weather$MP==3) / length(weather$MP))
# [1] 0.5379
1 - (sum(weather$MP == as.integer(predict(lmodel))) / length(weather$MP))
# [1] 0.5284
1 - (sum(as.integer(weather$MP) == predict(slmodel)) / length(weather$MP))
# [1] 0.5341
1 - (sum(as.integer(weather$MP) == as.integer(predict(rmodel))) / length(weather$MP))
# [1] 0.5739
1 - (sum(as.integer(weather$MP) == as.integer(predict(srmodel))) / length(weather$MP))
# [1] 0.5379
~~~
It would seem that there is no big difference between the models, but the ordinal may be a little bit better than the constant predictor.
### Model checking
#### Error rate
Before concluding that the ordinal logistic regression is better than the constant predictor, it might be a good idea to check how robust this result holds up. The difference in correctly classified days is very small, and it might represent minimal advantage. We'll [bootstrap](!Wikipedia "Bootstrapping (statistics)") a large number of logistic regressions on samples of the full dataset, and see what fraction of them incur a higher classification error rate than the constant predictor:
~~~{.R}
library(boot)
errorRate <- function(dt, indices) {
d <- dt[indices,] # allows boot to select subsample
lmodel <- polr(as.ordered(MP) ~ ., data = d) # train new regression model on subsample
return(1 - (sum(d$MP == as.integer(predict(lmodel))) / length(d$MP)))
}
bs <- boot(data=weather, statistic=errorRate, R=100000, parallel="multicore", ncpus=4); bs
# ...Bootstrap Statistics :
# original bias std. error
# t1* 0.5284 -0.01137 0.02431
boot.ci(bs)
# ...
# Intervals :
# Level Normal Basic
# 95% ( 0.4921, 0.5874 ) ( 0.4905, 0.5871 )
#
# Level Percentile BCa
# 95% ( 0.4697, 0.5663 ) ( 0.4905, 0.5890 )
hist(bs$t, xlab="Error rate", ylab="Number of samples",
main="Bootstrap check of ordinal logistic regression accuracy")
sum(bs$t > (1 - (sum(weather$MP==3) / length(weather$MP)))) / length(bs$t)
# [1] 0.1832
~~~
![Distribution of logistic regression classification rates](/images/weather/2013-bootstrap.png)
There's substantial uncertainty in the classification rate, as evidenced by the potentially wide confidence intervals, but in ~18% of the new logistic regressions, the classification rate was worse than the constant predictor. This is not too bad but the logistic regression is not outperforming random by very much, so the reality of the result is open to question.
# Conclusion
An attack on the data turned up nothing in several ways; the only model that seemed to improve on random guessing does not do so by very much, indicating weak weather effects. I will attempt to follow up in 2 years to see if the results replicate with a longer data series.
# Appendix
## Autocorrelation {.collapse}
A test using `acf` shows no autocorrelation worth mentioning:
~~~{.R}
acf(weather$MP, main="Do days predict subsequent days at various distances?")
~~~
![MP series shows almost no autocorrelation at any timelag](/images/weather/2013-autocorrelation-mp.png)
For comparison, here is `acf` for a data series where one would expect a great deal of autocorrelation---my daily weight 2012--2013---and one does indeed observe it:
~~~{.R}
weight <- c(205,214,216,213,213,218,218,214,215,216,218,216,210,219,217,219,217,215,215,217,219,
216,215,218,215,219,219,219,218,218,220,220,219,219,219,221,220,219,216,220,220,218,
218,220,219,220,215,215,218,218,215,216,216,218,218,220,219,216,217,220,215,215,218,
216,214,213,216,215,214,213,214,216,216,212,209,212,214,213,210,211,210,213,211,215,
211,212,212,216,212,215,216,215,215,212,216,213,212,211,215,214,215,216,214,212,212,
213,212,213,211,214,215,210,211,211,211,212,210,210,212,211,214,213,214,212,215,214,
213,215,211,214,214,216,215,213,215,213,213,212,215,211,212,212,211,212,211,212,210,
211,213,218,217,212,214,216,213,212,211,211,214,212,211,216,NA,NA,NA,NA,NA,218,
216,214,215,216,216,213,216,214,215,219,218,216,215,218,217,219,219,219,219,219,218,
217,216,216,215,218,219,217,216,219,217,216,216,219,216,218,215,216,215,215,213,214,
215,217,216,215,215,216,214,215,215,214,216,211,214,213,214,211,212,211,210,212,211,
212,214,212,211,214,212,211,212,211,212,213,210,212,210,210,211,210,212,210,210,210,
210,210,209,206,206,208,209,207,207,205,205,205,206,206,209,204,206,206,204,204,204,
204,205,203,205,203,205,204,204,203,204,204,205,205,204,205,205,204,204,205,205,205,
205,204,203,205,205,206,205,204,203,204,205,206,207,206,205,205,207,206,210,210,212,
211,208,210,210,209,211,212,204,205,208,208,209,209,208,208,208,209,208,210,209,208,
207,207,209,208,209,207,207,206,207,208,207,209,210,210,208,208,206,208,210,210,209,
209,209,210,209,210,212,212,210,212,213,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,210,
211,210,210,210,211,212,211,210,210,210,212,210,211,210,209,211,209,209,210,210,210,
211,210,210,209,213,210,210,210,214,212,211,210,211,215,215,214,210,214,212)
acf(weight, na.action = na.pass)
~~~
![Weight series showing autocorrelation at every timelag](/images/weather/2013-autocorrelation-weight.png)