# Weather and My Productivity

Rain or shine affect my mood? Not much. (statistics, psychology, R)
created: 19 Mar 2013; modified: 13 Dec 2018; status: in progress; confidence: likely;

Touji: Oh, yes. the view of the world that one can have is quite small.
Hikari: Yes, you measure things only by your own small measure.
Asuka: One sees things with the truth, given by others.
Misato: Happy on a sunny day.
Rei: Gloomy on a rainy day.
Asuka: If you’re taught that, you always think so.
Ritsuko: But, you can enjoy rainy days.1

The scientific literature has found that lighting & temperature & the seasons matter to one’s mood, focus, sleep and other things (but not, apparently, one’s life satisfaction or psychiatric disorders, which may be related to complex features of light (Vyssoki et al 2014)). Certainly, these things having large effects makes a lot of sense - who doesn’t feel gloomy on a rainy day or happier on a sunny day?

But things everyone knows often turn out to be wrong; and in my case, it seems like my morning use of vitamin D and evening melatonin supplementation may be screening off the environmental effects. So one rainy and then sunny day in March 2013, I wondered if I could find an influence of rain or darkness on my own mood or productivity.

As it happens, since 16 February 2012, I have been daily writing down my impression of whether my mood & productivity (MP) that day was average, below-average, or above; this gave me 399 ratings to work with. I didn’t record any weather data, but the nice thing about weather data is that - unlike quotidian but potentially important data like the proverbial what you had for breakfast, which is lost if not immediately recorded - the weather is concrete, objective, easily quantifiable, economically important, and of universal interest. Hence, we would expect weather data to be available online so I can find it, and see what aspects of the weather influence my self-rating (if any).

# Data

A convenient source of data seems to be Weather Underground. Going via API is overkill, since we just need data from Islip for 16 February 2012 - 08 July 2012 (CSV) and from Pax River 11 July 2012 - 22 March 2013 (CSV). I’m using the nearest airport, rather than the nearest volunteer weather station, to get data on cloud cover; cloud cover should be a good proxy for brightness or sunniness. The insol library calculates the length of day, another potentially relevant variable.

# gather and clean data

weather1$PrecipitationIn <- as.numeric(as.character(weather1$PrecipitationIn)) # delete weird "T" factor; ???
# mood/productivity (MP) sourced from personal log
weather1$MP <- c(1,3,4,4,4,4,3,4,2,3,4,4,2,3,2,3,3,2,3,2,3,4,3,2,3,2,3,1,2,4,3,3,3,3,3, 3,3,2,3,4,3,3,4,4,4,4,2,3,4,3,3,3,4,3,3,1,3,2,3,3,4,3,2,3,2,4,4,4,3,3, 4,4,3,4,3,3,2,3,3,2,2,4,3,2,4,4,4,3,3,2,4,4,4,3,3,3,3,2,3,4,3,2,2,4,3, 3,2,2,2,2,3,4,2,4,4,3,3,3,2,4,2,2,2,2,2,3,3,4,2,1,3,3,2,3,3,4,2,3,2,3, 3,2,4,4) weather2$MP <- c(2,4,4,4,3,4,3,3,4,3,2,3,4,4,4,3,2,3,3,2,3,3,3,2,2,2,2,2,3,4,3,4,2,4,3,
3,2,2,2,3,3,3,3,4,3,3,3,4,3,3,3,2,4,2,3,3,4,4,3,3,3,4,3,3,4,3,4,2,3,3,
4,4,3,3,4,4,3,4,3,2,3,3,3,4,3,2,3,2,2,2,3,3,3,4,4,3,4,4,3,3,2,3,3,3,3,
4,4,3,4,2,2,2,3,4,3,4,3,4,3,4,4,3,3,2,3,2,4,4,3,4,2,3,4,2,3,3,2,2,2,3,
2,3,3,4,2,3,4,3,4,3,3,2,2,3,4,4,3,4,2,2,3,2,3,2,2,2,4,3,3,4,2,2,3,3,3,
4,4,3,2,3,2,2,2,3,3,3,4,3,4,3,3,3,2,2,3,3,3,4,4,3,2,2,2,3,3,4,3,4,3,4,
3,2,4,4,3,3,2,4,3,3,3,2,3,3,2,3,2,3,3,3,3,2,3,3,3,3,3,3,4,2,3,2,4,3,3,
2,2,3,3,3,2,4,3,3,3,3,4,4,3,3,3,3,2,3,2,2,3,3,2,3,2,3,2,3,3,2,2,3,3,3,
4,4,3,2,2,3,2,3,2,4,4,4,3,3,3,4,3,4,3,4,4,2,2,2,4,4,3,2,2,3,4,3,2,4,2,
3,4,2,4,4,2,3,2,3,3,3,2,2,2,3,3,2,2,3,4,4,2,3,2,2,3,4,4,3,2,2,3,4,4,3,
3,3,3,2,4,3,3,4,3,3,3,4,3,4,4,4,4,3,3,3,2,2,2,4,3,4,4,2,2,3,4,4,3,2)
weather3$MP <- c(3,4,2,3,4,3,3,4,3) weather4$MP <- c(3,3,2,2,3,2,4,4,4,3,2,2,5,4,3,4,4,3,3,4,4,3,2,2,3,4,3,3,4,4,3,4,3,3,
4,3,4,4,4,2,3)

# compute length of day for each location
library(insol)
weather1$DayLength <- daylength(40,73, julian(as.Date(as.character(weather1$EDT))), 1)[,3]
weather3$DayLength <- daylength(40,73, julian(as.Date(as.character(weather3$EDT))), 1)[,3]
weather2$DayLength <- daylength(38,76, julian(as.Date(as.character(weather2$EDT))), 1)[,3]
weather4$DayLength <- daylength(38,76, julian(as.Date(as.character(weather4$EDT))), 1)[,3]
# combine & start cleaning
weather <- rbind(weather1,weather2)
weather$WindDirDegrees.br... <- sub("<br />", "", weather$WindDirDegrees.br...)
weather$WindDirDegrees.br... <- as.integer(weather$WindDirDegrees.br...)
weather$Events <- as.integer(weather$Events)
weather$Events[weather$Events>1] <- 2
# something of a hack but we'll impute any missing rain precipitation values as 0
weather[is.na(weather)] <- 0

# Analysis

## Exploratory

After cleanup, the data looks reasonable and as expected, with each variable spanning the ranges one would expect of temperatures and precipitation:

summary(weather)
EDT      Max.TemperatureF Mean.TemperatureF Min.TemperatureF Max.Dew.PointF MeanDew.PointF
2012-2-16:  1   Min.   : 26.0    Min.   :21.0      Min.   :14.0     Min.   :12.0   Min.   : 1
2012-2-17:  1   1st Qu.: 53.0    1st Qu.:45.0      1st Qu.:37.0     1st Qu.:39.8   1st Qu.:33
2012-2-18:  1   Median : 68.0    Median :58.0      Median :50.0     Median :54.0   Median :48
2012-2-19:  1   Mean   : 66.9    Mean   :58.7      Mean   :50.6     Mean   :52.5   Mean   :47
2012-2-20:  1   3rd Qu.: 82.0    3rd Qu.:73.0      3rd Qu.:66.0     3rd Qu.:67.0   3rd Qu.:63
2012-2-21:  1   Max.   :100.0    Max.   :87.0      Max.   :80.0     Max.   :79.0   Max.   :74
(Other)  :522
Min.DewpointF   Max.Humidity   Mean.Humidity   Min.Humidity  Max.Sea.Level.PressureIn
Min.   :-5.0   Min.   : 42.0   Min.   :28.0   Min.   :12.0   Min.   :29.4
1st Qu.:25.0   1st Qu.: 82.0   1st Qu.:59.0   1st Qu.:37.0   1st Qu.:30.0
Median :42.0   Median : 89.0   Median :69.0   Median :48.0   Median :30.1
Mean   :41.2   Mean   : 85.7   Mean   :67.8   Mean   :48.6   Mean   :30.1
3rd Qu.:58.0   3rd Qu.: 93.0   3rd Qu.:77.2   3rd Qu.:60.0   3rd Qu.:30.2
Max.   :73.0   Max.   :100.0   Max.   :97.0   Max.   :93.0   Max.   :30.6

Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles Mean.VisibilityMiles
Min.   :29.1              Min.   :28.7             Min.   : 6.00       Min.   : 2.0
1st Qu.:29.9              1st Qu.:29.8             1st Qu.:10.00       1st Qu.: 9.0
Median :30.0              Median :29.9             Median :10.00       Median :10.0
Mean   :30.0              Mean   :29.9             Mean   : 9.97       Mean   : 9.1
3rd Qu.:30.2              3rd Qu.:30.1             3rd Qu.:10.00       3rd Qu.:10.0
Max.   :30.5              Max.   :30.5             Max.   :10.00       Max.   :10.0

Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH PrecipitationIn
Min.   : 0.00       Min.   : 6.0      Min.   : 1.00      Min.   : 0.0      Min.   :0.000
1st Qu.: 3.00       1st Qu.:13.0      1st Qu.: 6.00      1st Qu.: 0.0      1st Qu.:0.000
Median : 9.00       Median :15.0      Median : 8.00      Median :22.0      Median :0.000
Mean   : 6.88       Mean   :16.4      Mean   : 8.04      Mean   :18.6      Mean   :0.124
3rd Qu.:10.00       3rd Qu.:20.0      3rd Qu.:10.00      3rd Qu.:28.0      3rd Qu.:0.030
Max.   :10.00       Max.   :40.0      Max.   :27.00      Max.   :56.0      Max.   :6.000

CloudCover      Events     WindDirDegrees.br...       MP         DayLength
Min.   :0.0   Min.   :1.00   Min.   :  1          Min.   :1.00   Min.   : 9.27
1st Qu.:3.0   1st Qu.:1.00   1st Qu.:138          1st Qu.:2.00   1st Qu.:10.31
Median :4.5   Median :1.00   Median :208          Median :3.00   Median :12.10
Mean   :4.4   Mean   :1.47   Mean   :202          Mean   :2.98   Mean   :12.07
3rd Qu.:6.0   3rd Qu.:2.00   3rd Qu.:283          3rd Qu.:4.00   3rd Qu.:13.87
Max.   :8.0   Max.   :2.00   Max.   :360          Max.   :4.00   Max.   :14.73

CloudCover being 0-8 looks a bit odd, since one might’ve expected a decimal or percentage or some quantification of thickness but turns out to be a standard measurement, the okta. We might expect some seasonal effects, but graphing a sensitive LOESS moving average and then a per-week spineplot suggests just 1 anomaly, that there were some days I rated 1 but never subsequently (this has an explanation: as I began keeping the series, my car caught on fire, was totaled; and I was then put through the wringer with the insurance & junkyard, and felt truly miserable at multiple points):

par(mfrow=c(2,1))
scatter.smooth(x=weather$EDT, y=weather$MP, span=0.3, col="#CCCCCC", xlab="Days", ylab="MP rating")
weeks <- c(seq(from=1,to=length(weather$MP),by=7), length(weather$MP))
spineplot(c(1:(length(weather$MP))), as.factor(weather$MP), breaks=weeks, xlab="Weeks", ylab="MP rating ratios")
# we don't use the date past the spineplot &
# it causes problems with all the models, so delete it
rm(weather$EDT) A formal check for autocorrelation in MP ratings (most methods assume independence) turns up little enough that I feel free to ignore them. So the data looks clean and tame; now we can begin real interpretation. The first and most obvious thing to do is to see what the overall correlation matrix with MP looks like: # all weather correlations with mood/productivity round(cor(weather)[23,], digits=3) Max.TemperatureF Mean.TemperatureF Min.TemperatureF 0.013 0.016 0.022 Max.Dew.PointF MeanDew.PointF Min.DewpointF 0.016 0.019 0.016 Max.Humidity Mean.Humidity Min.Humidity 0.042 0.026 0.011 Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn 0.006 0.004 0.015 Max.VisibilityMiles Mean.VisibilityMiles Min.VisibilityMiles 0.066 0.004 -0.016 Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH 0.008 0.002 0.018 PrecipitationIn CloudCover Events -0.037 0.041 -0.001 WindDirDegrees.br... MP DayLength -0.006 1.000 0.036 tail(sort(abs(round(cor(weather)[23,], digits=3))), 6) DayLength PrecipitationIn CloudCover Max.Humidity Max.VisibilityMiles 0.036 0.037 0.041 0.042 0.066 MP 1.000 All the r values seem very small: >0.1. But some of the more plausible correlations may be statistically-significant, so we’ll look at the 5 largest correlations sapply(c("DayLength","PrecipitationIn","CloudCover","Max.Humidity","Max.VisibilityMiles"), function(x) cor.test(weather$MP, weather[[x]]))
...
DayLength                              PrecipitationIn
p.value     0.4099                                 0.394
estimate    0.03594                                -0.03717

CloudCover                             Max.Humidity
p.value     0.3445                                 0.3319
estimate    0.04122                                0.04231

Max.VisibilityMiles
p.value     0.1292
estimate    0.06611

These specific variables turn out to be failures too, with large p-values.

One useful technique is to convert metrics into standardized units (of standard deviations), sum them all into a single composite variable, and then test that; this can reveal influences not obvious if one looked at the metrics individually. (For example, in my potassium sleep experiments, where I was interested in an overall measure of reduced sleep quality rather than single metrics like sleep latency.) Perhaps this would work here?

# construct a z-score for all of them to see if it does any better
with(weather, cor.test(MP, (scale(Max.Humidity) + scale(Max.VisibilityMiles)
+ scale(PrecipitationIn) + scale(CloudCover) + scale(DayLength))))
...
t = 1.296, df = 526, p-value = 0.1957
alternative hypothesis: true correlation is not equal to 0
95% confidence interval:
-0.02908  0.14105
sample estimates:
cor
0.0564

## Modeling

### Continuous MP

#### Linear model

The correlations show individually little value, so we’ll move on to building modeling and assessing their fit. Can we accurately predict MP if we use all the parameters? We’ll start with a linear model/regression, where we treat the categorical MP variable as a continuous variable for simplicity:

model1 <- lm(MP ~ ., data=weather); summary(model1)
...
Residuals:
Min      1Q  Median      3Q     Max
-2.0255 -0.7815  0.0165  0.8126  1.3730

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)               -4.29e+00   6.38e+00   -0.67     0.50
Max.TemperatureF           3.76e-02   3.20e-02    1.17     0.24
Mean.TemperatureF         -5.87e-02   6.06e-02   -0.97     0.33
Min.TemperatureF           4.41e-02   3.24e-02    1.36     0.17
Max.Dew.PointF            -9.52e-03   1.46e-02   -0.65     0.52
MeanDew.PointF            -1.45e-02   2.79e-02   -0.52     0.60
Min.DewpointF             -6.52e-03   1.43e-02   -0.46     0.65
Max.Humidity               1.10e-02   8.36e-03    1.31     0.19
Mean.Humidity              1.73e-03   1.33e-02    0.13     0.90
Min.Humidity               5.75e-03   8.16e-03    0.71     0.48
Max.Sea.Level.PressureIn   9.92e-01   9.79e-01    1.01     0.31
Mean.Sea.Level.PressureIn -1.61e+00   1.63e+00   -0.99     0.32
Min.Sea.Level.PressureIn   7.29e-01   8.39e-01    0.87     0.39
Max.VisibilityMiles        2.00e-01   1.47e-01    1.36     0.17
Mean.VisibilityMiles       1.04e-02   3.96e-02    0.26     0.79
Min.VisibilityMiles       -3.94e-03   1.89e-02   -0.21     0.83
Max.Wind.SpeedMPH          2.31e-03   1.32e-02    0.18     0.86
Mean.Wind.SpeedMPH        -2.28e-03   1.70e-02   -0.13     0.89
Max.Gust.SpeedMPH          2.71e-03   4.54e-03    0.60     0.55
PrecipitationIn           -1.25e-01   9.29e-02   -1.35     0.18
CloudCover                 2.65e-02   2.53e-02    1.05     0.29
Events                    -9.18e-02   9.83e-02   -0.93     0.35
WindDirDegrees.br...       3.08e-05   4.11e-04    0.07     0.94
DayLength                  5.72e-02   4.78e-02    1.20     0.23

Residual standard error: 0.755 on 504 degrees of freedom
Multiple R-squared:  0.0282,    Adjusted R-squared:  -0.0162
F-statistic: 0.635 on 23 and 504 DF,  p-value: 0.905

The variance explained is extremely minimal, even with 23 different variables in the linear model, and suggests overfitting (as one would expect). By Occam’s razor, most of the variables should be scrapped for not carrying their weight; the step function uses a complexity penalty to choose which variable to eliminate while still fitting the data reasonably well. It choose to keep only 4 variables:

smodel1 <- step(model1); summary(smodel1)
...
Residuals:
Min      1Q  Median      3Q     Max
-2.1301 -0.8907  0.0121  0.8990  1.1779

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.02134    1.38025    0.02    0.988
Min.TemperatureF     0.01429    0.01008    1.42    0.157
MeanDew.PointF      -0.01489    0.01034   -1.44    0.150
Max.Humidity         0.00958    0.00531    1.81    0.072
Max.VisibilityMiles  0.21250    0.12682    1.68    0.094

Residual standard error: 0.748 on 523 degrees of freedom
Multiple R-squared:  0.0109,    Adjusted R-squared:  0.00336
F-statistic: 1.44 on 4 and 523 DF,  p-value: 0.218

We can compare the models’ prediction accuracy on the dataset (lower is better in mean-squared-error):

mean((weather$MP - weather$MP)^2) # perfect
[1] 0
mean((weather$MP - model1$fitted.values)^2) # original
[1] 0.5446
mean((weather$MP - smodel1$fitted.values)^2) # much simpler model
[1] 0.5543
model2 <- lm(MP ~ 1, data=weather); mean((weather$MP - model2$fitted.values)^2)
[1] 0.5604

The model with 4 data variables is much simpler, and appears to fit almost as well, but slightly better than the null model of no parameters

### Random forests regression

Linear modeling having failed to reveal any interesting relationships, we’ll take one last crack at it: if there’s any interesting predictive power in the weather data at all, a high-powered machine learning technique like random forests ought to build a model which outperforms the linear model at least a little on mean-squared-error

library(randomForest)
rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
...
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 7

Mean of squared residuals: 0.5692
% Var explained: -1.57

The error with all variables is a little higher than the simpler linear model. Troublingly, a random forests with the 4 simpler variables increases the error:

srmodel <- randomForest(MP ~ Min.TemperatureF + MeanDew.PointF + Max.Humidity + Max.VisibilityMiles, data=weather, proximity=TRUE); srmodel
...
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1

Mean of squared residuals: 0.5814
% Var explained: -3.76

Perhaps treating MP as a continuous variable was a bad idea. Let’s start over.

### Categorical MP

We turn the MP data into an ordered factor:

weather$MP <- ordered(weather$MP)

#### Logistic model

A straight logistic regression (glm(MP ~ ., data=weather, family="binomial")) is not appropriate because MP is not a binary outcome; we use the MASS library to do an ordinal logistic regression; as with the linear model, the coefficients don’t seem to differ very much

library(MASS)
lmodel <- polr(as.ordered(MP) ~ ., data = weather); summary(lmodel)
...
Coefficients:
Value Std. Error t value
Max.TemperatureF           0.086798    0.07916   1.096
Mean.TemperatureF         -0.135828    0.14974  -0.907
Min.TemperatureF           0.105068    0.08024   1.309
Max.Dew.PointF            -0.023466    0.03624  -0.648
MeanDew.PointF            -0.033449    0.06789  -0.493
Min.DewpointF             -0.018700    0.03424  -0.546
Max.Humidity               0.029043    0.02036   1.427
Mean.Humidity              0.004063    0.03263   0.124
Min.Humidity               0.013462    0.02024   0.665
Max.Sea.Level.PressureIn   2.554493    0.85951   2.972
Mean.Sea.Level.PressureIn -4.352709    0.06383 -68.191
Min.Sea.Level.PressureIn   2.047343    0.91077   2.248
Max.VisibilityMiles        0.549889    0.37650   1.461
Mean.VisibilityMiles       0.029195    0.09772   0.299
Min.VisibilityMiles       -0.007733    0.04638  -0.167
Max.Wind.SpeedMPH          0.003846    0.03219   0.119
Mean.Wind.SpeedMPH        -0.007351    0.04250  -0.173
Max.Gust.SpeedMPH          0.008375    0.01130   0.741
PrecipitationIn           -0.325188    0.22738  -1.430
CloudCover                 0.067914    0.06000   1.132
Events                    -0.219315    0.24050  -0.912
WindDirDegrees.br...       0.000192    0.00101   0.190
DayLength                  0.144132    0.11804   1.221

Intercepts:
Value   Std. Error t value
1|2  13.143   0.021    638.830
2|3  17.080   0.500     34.185
3|4  19.126   0.509     37.550

Residual Deviance: 1143.49
AIC: 1195.49

# relative risk or odds; how much difference does each variable make, per its units?
exp(coef(lmodel))
Max.TemperatureF         Mean.TemperatureF          Min.TemperatureF
1.09068                   0.87299                   1.11079
Max.Dew.PointF            MeanDew.PointF             Min.DewpointF
0.97681                   0.96710                   0.98147
Max.Humidity             Mean.Humidity              Min.Humidity
1.02947                   1.00407                   1.01355
Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn
12.86477                   0.01287                   7.74729
Max.VisibilityMiles      Mean.VisibilityMiles       Min.VisibilityMiles
1.73306                   1.02963                   0.99230
Max.Wind.SpeedMPH        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH
1.00385                   0.99268                   1.00841
PrecipitationIn                CloudCover                    Events
0.72239                   1.07027                   0.80307
WindDirDegrees.br...                 DayLength
1.00019                   1.15504

Another use of step; this time, it builds something even narrower than before:

slmodel <- step(lmodel); summary(slmodel)
...
Coefficients:
Value Std. Error t value
Max.VisibilityMiles 0.527      0.329     1.6

Intercepts:
Value Std. Error t value
1|2 0.364 3.293      0.110
2|3 4.277 3.282      1.303
3|4 6.283 3.287      1.912

Residual Deviance: 1156.46
AIC: 1164.46

exp(0.527)
[1] 1.694
##### Seasonal effects

The previously mentioned life-satisfaction/weather paper, Lucas et al 2013, examined a single variable very similar to my MP:

Life satisfaction was assessed using a single item that read, In general, how satisfied are you with your life. Participants responded using a 4-point scale with the options Very Satisfied, Satisfied, Dissatisfied, or Very Dissatisfied (responses were scored such that higher scores reflect higher satisfaction)

Rather than simply regress local weather on the score, similar to what I’ve been doing so far, they attempted a more complex model where the effects could change based on region & time of year:

In addition, the effect of absolute levels of any weather variable may vary depending on when in the year the weather occurred. A 50 °F day may contribute to positive mood (and hence higher life satisfaction) if it occurred in the middle of winter in a cold climate, whereas this same absolute temperature might contribute to a negative mood (and hence lower life satisfaction) if it occurred in the middle of summer in a warm climate. Thus, seasonal differences must be considered when examining the effects of weather.

While I live in only one region and replicating the rest of the analysis would be a lot of work, the seasonal changes is possible and not too hard to investigate: reset the data to restore the date, create a Season variable, populate (since I am lazy, I convert the dates to fiscal quarters, which are similar enough), and then rerun an ordinal logistic regression with all the potential predictors nested under Season.

Looking at the coefficients split by season, and looking for coefficients that suddenly become larger or which switch signs from season to season, I see no real candidates:

weather$MP <- as.ordered(weather$MP)
library(lubridate)
weather$Season <- quarter(weather$EDT)
library(lme4)
lmr <- lmer(MP ~ (. |Season), data=weather, control = lmerControl(maxfun=20000)); ranef(lmr)
...
$Season (Intercept) MP.L MP.Q MP.C EDT Max.TemperatureF Mean.TemperatureF Min.TemperatureF 1 0.05723 1.8727 0.2184 -0.1058 4.704e-05 0.002089 -0.002253 0.0009085 2 0.06215 1.8918 0.2303 -0.1042 6.786e-06 0.004284 -0.004817 0.0034466 3 0.31806 0.9037 0.9451 -0.4151 3.206e-05 0.003028 -0.002113 0.0019708 4 0.31525 0.8986 0.9376 -0.4169 7.582e-07 -0.001217 -0.001045 0.0003388 Max.Dew.PointF MeanDew.PointF Min.DewpointF Max.Humidity Mean.Humidity Min.Humidity 1 -3.416e-04 -0.002731 0.0008853 -1.584e-04 0.0015697 0.0004101 2 1.922e-03 -0.004427 0.0001578 6.227e-04 -0.0002743 0.0010211 3 2.523e-05 -0.002938 -0.0007302 5.611e-05 0.0009317 0.0004772 4 -1.235e-04 0.003389 -0.0012974 6.986e-04 -0.0009484 -0.0004437 Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles 1 0.006218 -0.018023 -0.008160 0.013916 2 0.011742 -0.001504 0.001079 -0.002815 3 0.002675 0.002821 -0.004279 0.005178 4 0.006987 0.003645 0.007091 0.013030 Mean.VisibilityMiles Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH 1 -0.002323 0.0028401 0.0028779 0.0009989 -6.320e-04 2 0.001611 -0.0023890 0.0014840 -0.0012118 -7.029e-05 3 0.002107 0.0007576 -0.0014126 -0.0016919 9.457e-04 4 -0.006157 0.0020409 0.0007483 0.0008237 -4.648e-04 PrecipitationIn CloudCover Events WindDirDegrees.br... DayLength Season.L Season.Q 1 -0.025717 0.002471 0.005752 1.576e-05 0.009140 -0.0006296 0.005954 2 -0.005913 -0.002430 -0.002981 -1.195e-04 -0.003607 -0.0015844 0.006389 3 0.002764 0.003450 -0.006076 6.861e-06 -0.004650 -0.0011383 0.010232 4 -0.008745 0.002082 -0.012127 -4.119e-05 0.003976 -0.0010785 0.010584 Season.C 1 -0.0041696 2 -0.0047106 3 -0.0011607 4 -0.0004978 Variables which are positive in one season tend to be positive in another. There are some cases of flipping sign, like PrecipitationIn but in these cases, the coefficients are tiny (sometimes so tiny R lapses into scientific notation). In no case is the difference between season estimates as large as 0.1. Hence, seasonal effects do not seem to be important ##### Correlation size & implications The Max.VisibilityMiles result seems like a reasonable one; the variable may be tapping into some sort of anti-rain measure. We can visualize the coefficients thusly: library(arm) maxvis <- attr(profile(slmodel), "summary")$coefficients[1]
coefplot(maxvis, varnames=rownames(x), vertical=FALSE, var.las=1, main="95% CIs for the log odds of predictors")

It doesn’t look particularly impressive and doesn’t exclude 0; but we want to know whether this surviving variable is potentially useful in practice, what its cash value might be. We again exponentiate the log odds to get more understandable odds:

exp(coef(slmodel))
Max.VisibilityMiles
1.694

What do these estimates mean in practice? Let’s take Max VisibilityMiles as a variable that may be manipulable by buying bright lights, with an odds of 1.5; if we could somehow maximize each day’s DayLength, the model would predict 4 higher days:

weatherOdds <- weather # copy data & max out DayLength
weatherOdds$Max.VisibilityMiles <- max(weather$Max.VisibilityMiles)
sum(as.integer(predict(slmodel, newdata=weatherOdds))) - sum(as.integer(predict(slmodel, newdata=weather)))
[1] 4

If we wanted to test this potential effect via an experiment (eg. buying a powerful lighting system, and turning it on or off on random days), how many days n should the experiment run? With a bootstrap: we can calculate the effect size of 4 incremented days and then ask for the n required for a self-experiment with good power of 80%; this turns out to be a ~1600 day or 4.4 year long experiment:

library(boot)
library(rms)
n <- 1600
weatherPower <- function(dt, indices) {
d <- dt[sample(nrow(dt), n, replace=TRUE), ] # new dataset, possibly larger than the original
lmodel <- lrm(MP ~ Max.VisibilityMiles, data = d)
return(anova(lmodel)[5])
}
bs <- boot(data=weather, statistic=weatherPower, R=100000, parallel="multicore", ncpus=4)
alpha <- 0.05
print(sum(bs$t<=alpha)/length(bs$t))
# [1] 0.8054

To detect such a subtle effect is not easy, and such a time-consuming experiment is surely not worthwhile.

#### Random forests classification

Random forests can be used to classify (predict categorical outcomes) as well as regressions; having turned the response variable into a factor, the type switches automatically:

rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
...
Number of trees: 500
No. of variables tried at each split: 4

OOB estimate of  error rate: 57.39%
Confusion matrix:
1  2   3  4 class.error
1 0  1   3  0      1.0000
2 0 20 104 16      0.8571
3 0 31 179 34      0.2664
4 0 20  94 26      0.8143

We can’t step through random forests since it doesn’t have a complexity measure like AIC to use, so we’ll reuse the variable from the simplified ordinal:

srmodel <- randomForest(MP ~ Max.VisibilityMiles, data=weather, proximity=TRUE)
srmodel
...
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1

OOB estimate of  error rate: 53.79%
Confusion matrix:
1 2   3 4 class.error
1 0 0   4 0    1.000000
2 0 2 138 0    0.985714
3 0 1 242 1    0.008197
4 0 0 140 0    1.000000

Much better, and now slightly better than the original linear models.

#### Ordinal vs Random forests

We can now compare the fraction of days that are incorrectly predicted by the constant predictor, the ordinal logistic regression (full & simplified), and the random forests (full & simplified):

1 - (sum(weather$MP==3) / length(weather$MP))
[1] 0.5379

1 - (sum(weather$MP == as.integer(predict(lmodel))) / length(weather$MP))
[1] 0.5284
1 - (sum(as.integer(weather$MP) == predict(slmodel)) / length(weather$MP))
[1] 0.5341

1 - (sum(as.integer(weather$MP) == as.integer(predict(rmodel))) / length(weather$MP))
[1] 0.5739
1 - (sum(as.integer(weather$MP) == as.integer(predict(srmodel))) / length(weather$MP))
[1] 0.5379

It would seem that there is no big difference between the models, but the ordinal may be a little bit better than the constant predictor.

### Model checking

#### Error rate

Before concluding that the ordinal logistic regression is better than the constant predictor, it might be a good idea to check how robust this result holds up. The difference in correctly classified days is very small, and it might represent minimal advantage. We’ll bootstrap a large number of logistic regressions on samples of the full dataset, and see what fraction of them incur a higher classification error rate than the constant predictor:

library(boot)
errorRate <- function(dt, indices) {
d <- dt[indices,] # allows boot to select subsample
lmodel <- polr(as.ordered(MP) ~ ., data = d) # train new regression model on subsample
return(1 - (sum(d$MP == as.integer(predict(lmodel))) / length(d$MP)))
}
bs <- boot(data=weather, statistic=errorRate, R=100000, parallel="multicore", ncpus=4); bs
...
Bootstrap Statistics :
original   bias    std. error
t1*   0.5284 -0.01137     0.02431

boot.ci(bs)
...
Intervals :
Level      Normal              Basic
95%   ( 0.4921,  0.5874 )   ( 0.4905,  0.5871 )

Level     Percentile            BCa
95%   ( 0.4697,  0.5663 )   ( 0.4905,  0.5890 )

hist(bs$t, xlab="Error rate", ylab="Number of samples", main="Bootstrap check of ordinal logistic regression accuracy") sum(bs$t > (1 - (sum(weather$MP==3) / length(weather$MP)))) / length(bs$t) [1] 0.1832 There’s substantial uncertainty in the classification rate, as evidenced by the potentially wide confidence intervals, but in ~18% of the new logistic regressions, the classification rate was worse than the constant predictor. This is not too bad but the logistic regression is not outperforming random by very much, so the reality of the result is open to question. # Conclusion An attack on the data turned up nothing in several ways; the only model that seemed to improve on random guessing does not do so by very much, indicating weak weather effects. I will attempt to follow up in 2 years to see if the results replicate with a longer data series. # Appendix ## Autocorrelation A test using acf shows no autocorrelation worth mentioning: acf(weather$MP, main="Do days predict subsequent days at various distances?")

For comparison, here is acf for a data series where one would expect a great deal of autocorrelation - my daily weight 2012-2013 - and one does indeed observe it:

weight <- c(205,214,216,213,213,218,218,214,215,216,218,216,210,219,217,219,217,215,215,217,219,
216,215,218,215,219,219,219,218,218,220,220,219,219,219,221,220,219,216,220,220,218,
218,220,219,220,215,215,218,218,215,216,216,218,218,220,219,216,217,220,215,215,218,
216,214,213,216,215,214,213,214,216,216,212,209,212,214,213,210,211,210,213,211,215,
211,212,212,216,212,215,216,215,215,212,216,213,212,211,215,214,215,216,214,212,212,
213,212,213,211,214,215,210,211,211,211,212,210,210,212,211,214,213,214,212,215,214,
213,215,211,214,214,216,215,213,215,213,213,212,215,211,212,212,211,212,211,212,210,
211,213,218,217,212,214,216,213,212,211,211,214,212,211,216,NA,NA,NA,NA,NA,218,
216,214,215,216,216,213,216,214,215,219,218,216,215,218,217,219,219,219,219,219,218,
217,216,216,215,218,219,217,216,219,217,216,216,219,216,218,215,216,215,215,213,214,
215,217,216,215,215,216,214,215,215,214,216,211,214,213,214,211,212,211,210,212,211,
212,214,212,211,214,212,211,212,211,212,213,210,212,210,210,211,210,212,210,210,210,
210,210,209,206,206,208,209,207,207,205,205,205,206,206,209,204,206,206,204,204,204,
204,205,203,205,203,205,204,204,203,204,204,205,205,204,205,205,204,204,205,205,205,
205,204,203,205,205,206,205,204,203,204,205,206,207,206,205,205,207,206,210,210,212,
211,208,210,210,209,211,212,204,205,208,208,209,209,208,208,208,209,208,210,209,208,
207,207,209,208,209,207,207,206,207,208,207,209,210,210,208,208,206,208,210,210,209,
209,209,210,209,210,212,212,210,212,213,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,210,
211,210,210,210,211,212,211,210,210,210,212,210,211,210,209,211,209,209,210,210,210,
211,210,210,209,213,210,210,210,214,212,211,210,211,215,215,214,210,214,212)
acf(weight, na.action = na.pass)

1. Neon Genesis Evangelion, episode 26 The Beast that Shouted I at the Heart of the World (Literal Translation Project)