Weather and My Productivity

Does rain or shine affect my mood? Not much.
statistics, psychology, R
2013-03-19–2015-12-18 · in progress · certainty: likely · importance: 8


Weather is often said to affect our mood, and people in sunnier places are said to be happier because of that. Curious about the possible effect (it could be worth controlling for in my future QS analyses, or worth attempting to imitate any benefits inside my house, eg. brighter lighting), I combine my long-term daily self-ratings with logs from the nearest major official weather stations, which offer detailed weather information about temperature, humidity, precipitation, cloud cover, wind speed, brightness etc., and try to correlate them.

In general, despite considerable data, there are essentially no bivariate correlations, nothing in several versions of a linear model, and nothing found by a random forest. It would appear that weather does not correlate with my self-ratings to any detectable degree, much less cause them.

Science has found that lighting & temperature matter to one’s mood, focus, sleep, and other things (but not, apparently, one’s life satisfaction or psychiatric disorders, which may be related to more complex features of light). Certainly, these things having large effects makes a lot of sense—who doesn’t feel gloomy on a rainy day or happier on a sunny day?

But things everyone knows often turn out to be wrong; and in my case, it seems like my morning use of vitamin D and evening supplementation may be screening off the environmental effects. So one rainy and then sunny day in March 2013, I wondered if I could find an influence of rain or darkness on my own mood or productivity.

As it happens, since 2012-02-16 I have been writing down daily my impression of whether my mood & productivity (MP) that day was average, below-average, or above-average; this gave me 399 ratings to work with. I didn’t record any weather data, but the nice thing about weather data is that—unlike quotidian but potentially important data like the proverbial what-you-had-for-breakfast, which is lost if not immediately recorded—the weather is concrete, objective, easily quantifiable, economically important, and of universal interest. Hence, we would expect weather data to be available online, where I can find it and see what aspects of the weather influence my self-rating (if any).

Data

A convenient source of data seems to be Weather Underground. Going via the API is overkill, since we just need data from Islip for 2012-02-16–2012-07-08 (CSV) and from Pax River for 2012-07-11–2013-03-22 (CSV). I’m using the nearest airport, rather than the nearest volunteer weather station, to get data on cloud cover; cloud cover should be a good proxy for brightness or sunniness. The insol library calculates the length of each day, another potentially relevant variable.

# gather and clean data
weather1 <- read.csv("http://www.wunderground.com/history/airport/KISP/2012/2/16/CustomHistory.html?dayend=8&monthend=7&yearend=2012&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather2 <- read.csv("http://www.wunderground.com/history/airport/KNHK/2012/7/11/CustomHistory.html?dayend=29&monthend=7&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather3 <- read.csv("http://www.wunderground.com/history/airport/KISP/2013/7/30/CustomHistory.html?dayend=7&monthend=8&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather4 <- read.csv("http://www.wunderground.com/history/airport/KNHK/2013/8/8/CustomHistory.html?dayend=17&monthend=9&yearend=2013&req_city=NA&req_state=NA&req_statename=NA&format=1")

weather1$PrecipitationIn <- as.numeric(as.character(weather1$PrecipitationIn)) # coerce factor to numeric; the odd "T" ("trace") values become NA & are imputed as 0 below
# mood/productivity (MP) sourced from personal log
weather1$MP <- c(1,3,4,4,4,4,3,4,2,3,4,4,2,3,2,3,3,2,3,2,3,4,3,2,3,2,3,1,2,4,3,3,3,3,3,
                 3,3,2,3,4,3,3,4,4,4,4,2,3,4,3,3,3,4,3,3,1,3,2,3,3,4,3,2,3,2,4,4,4,3,3,
                 4,4,3,4,3,3,2,3,3,2,2,4,3,2,4,4,4,3,3,2,4,4,4,3,3,3,3,2,3,4,3,2,2,4,3,
                 3,2,2,2,2,3,4,2,4,4,3,3,3,2,4,2,2,2,2,2,3,3,4,2,1,3,3,2,3,3,4,2,3,2,3,
                 3,2,4,4)
weather2$MP <- c(2,4,4,4,3,4,3,3,4,3,2,3,4,4,4,3,2,3,3,2,3,3,3,2,2,2,2,2,3,4,3,4,2,4,3,
                 3,2,2,2,3,3,3,3,4,3,3,3,4,3,3,3,2,4,2,3,3,4,4,3,3,3,4,3,3,4,3,4,2,3,3,
                 4,4,3,3,4,4,3,4,3,2,3,3,3,4,3,2,3,2,2,2,3,3,3,4,4,3,4,4,3,3,2,3,3,3,3,
                 4,4,3,4,2,2,2,3,4,3,4,3,4,3,4,4,3,3,2,3,2,4,4,3,4,2,3,4,2,3,3,2,2,2,3,
                 2,3,3,4,2,3,4,3,4,3,3,2,2,3,4,4,3,4,2,2,3,2,3,2,2,2,4,3,3,4,2,2,3,3,3,
                 4,4,3,2,3,2,2,2,3,3,3,4,3,4,3,3,3,2,2,3,3,3,4,4,3,2,2,2,3,3,4,3,4,3,4,
                 3,2,4,4,3,3,2,4,3,3,3,2,3,3,2,3,2,3,3,3,3,2,3,3,3,3,3,3,4,2,3,2,4,3,3,
                 2,2,3,3,3,2,4,3,3,3,3,4,4,3,3,3,3,2,3,2,2,3,3,2,3,2,3,2,3,3,2,2,3,3,3,
                 4,4,3,2,2,3,2,3,2,4,4,4,3,3,3,4,3,4,3,4,4,2,2,2,4,4,3,2,2,3,4,3,2,4,2,
                 3,4,2,4,4,2,3,2,3,3,3,2,2,2,3,3,2,2,3,4,4,2,3,2,2,3,4,4,3,2,2,3,4,4,3,
                 3,3,3,2,4,3,3,4,3,3,3,4,3,4,4,4,4,3,3,3,2,2,2,4,3,4,4,2,2,3,4,4,3,2)
weather3$MP <- c(3,4,2,3,4,3,3,4,3)
weather4$MP <- c(3,3,2,2,3,2,4,4,4,3,2,2,5,4,3,4,4,3,3,4,4,3,2,2,3,4,3,3,4,4,3,4,3,3,
                 4,3,4,4,4,2,3)

# compute length of day for each location
library(insol)
weather1$DayLength <- daylength(40,73, julian(as.Date(as.character(weather1$EDT))), 1)[,3]
weather3$DayLength <- daylength(40,73, julian(as.Date(as.character(weather3$EDT))), 1)[,3]
weather2$DayLength <- daylength(38,76, julian(as.Date(as.character(weather2$EDT))), 1)[,3]
weather4$DayLength <- daylength(38,76, julian(as.Date(as.character(weather4$EDT))), 1)[,3]
# combine & start cleaning
weather <- rbind(weather1,weather2)
weather$WindDirDegrees.br... <- sub("<br />", "", weather$WindDirDegrees.br...)
weather$WindDirDegrees.br... <- as.integer(weather$WindDirDegrees.br...)
weather$Events <- as.integer(weather$Events) # convert the Events factor to integer level codes (1 = no event)
weather$Events[weather$Events>1] <- 2        # collapse: 1 = no weather event, 2 = any event
# something of a hack but we'll impute any missing rain precipitation values as 0
weather[is.na(weather)] <- 0
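In passing, a quick sanity check (a minimal sketch; run it just before the zero-imputation line above) is to count the NAs per column, to confirm that the blanket fill only touches columns where 0 is a sensible default:

# count missing values per column before imputing zeros
colSums(is.na(weather))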

Analysis

Exploratory

After cleanup, the data looks reasonable and as expected, with each variable spanning the ranges one would expect of temperatures and precipitation:

summary(weather)
#         EDT      Max.TemperatureF Mean.TemperatureF Min.TemperatureF Max.Dew.PointF MeanDew.PointF
#  2012-2-16:  1   Min.   : 26.0    Min.   :21.0      Min.   :14.0     Min.   :12.0   Min.   : 1
#  2012-2-17:  1   1st Qu.: 53.0    1st Qu.:45.0      1st Qu.:37.0     1st Qu.:39.8   1st Qu.:33
#  2012-2-18:  1   Median : 68.0    Median :58.0      Median :50.0     Median :54.0   Median :48
#  2012-2-19:  1   Mean   : 66.9    Mean   :58.7      Mean   :50.6     Mean   :52.5   Mean   :47
#  2012-2-20:  1   3rd Qu.: 82.0    3rd Qu.:73.0      3rd Qu.:66.0     3rd Qu.:67.0   3rd Qu.:63
#  2012-2-21:  1   Max.   :100.0    Max.   :87.0      Max.   :80.0     Max.   :79.0   Max.   :74
#  (Other)  :522
#  Min.DewpointF   Max.Humidity   Mean.Humidity   Min.Humidity  Max.Sea.Level.PressureIn
#  Min.   :-5.0   Min.   : 42.0   Min.   :28.0   Min.   :12.0   Min.   :29.4
#  1st Qu.:25.0   1st Qu.: 82.0   1st Qu.:59.0   1st Qu.:37.0   1st Qu.:30.0
#  Median :42.0   Median : 89.0   Median :69.0   Median :48.0   Median :30.1
#  Mean   :41.2   Mean   : 85.7   Mean   :67.8   Mean   :48.6   Mean   :30.1
#  3rd Qu.:58.0   3rd Qu.: 93.0   3rd Qu.:77.2   3rd Qu.:60.0   3rd Qu.:30.2
#  Max.   :73.0   Max.   :100.0   Max.   :97.0   Max.   :93.0   Max.   :30.6
#
#  Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles Mean.VisibilityMiles
#  Min.   :29.1              Min.   :28.7             Min.   : 6.00       Min.   : 2.0
#  1st Qu.:29.9              1st Qu.:29.8             1st Qu.:10.00       1st Qu.: 9.0
#  Median :30.0              Median :29.9             Median :10.00       Median :10.0
#  Mean   :30.0              Mean   :29.9             Mean   : 9.97       Mean   : 9.1
#  3rd Qu.:30.2              3rd Qu.:30.1             3rd Qu.:10.00       3rd Qu.:10.0
#  Max.   :30.5              Max.   :30.5             Max.   :10.00       Max.   :10.0
#
#  Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH PrecipitationIn
#  Min.   : 0.00       Min.   : 6.0      Min.   : 1.00      Min.   : 0.0      Min.   :0.000
#  1st Qu.: 3.00       1st Qu.:13.0      1st Qu.: 6.00      1st Qu.: 0.0      1st Qu.:0.000
#  Median : 9.00       Median :15.0      Median : 8.00      Median :22.0      Median :0.000
#  Mean   : 6.88       Mean   :16.4      Mean   : 8.04      Mean   :18.6      Mean   :0.124
#  3rd Qu.:10.00       3rd Qu.:20.0      3rd Qu.:10.00      3rd Qu.:28.0      3rd Qu.:0.030
#  Max.   :10.00       Max.   :40.0      Max.   :27.00      Max.   :56.0      Max.   :6.000
#
#    CloudCover      Events     WindDirDegrees.br...       MP         DayLength
#  Min.   :0.0   Min.   :1.00   Min.   :  1          Min.   :1.00   Min.   : 9.27
#  1st Qu.:3.0   1st Qu.:1.00   1st Qu.:138          1st Qu.:2.00   1st Qu.:10.31
#  Median :4.5   Median :1.00   Median :208          Median :3.00   Median :12.10
#  Mean   :4.4   Mean   :1.47   Mean   :202          Mean   :2.98   Mean   :12.07
#  3rd Qu.:6.0   3rd Qu.:2.00   3rd Qu.:283          3rd Qu.:4.00   3rd Qu.:13.87
#  Max.   :8.0   Max.   :2.00   Max.   :360          Max.   :4.00   Max.   :14.73

CloudCover being 0–8 looks a bit odd, since one might’ve expected a decimal or percentage or some quantification of thickness, but it turns out to be a standard measurement, the “okta”. We might expect some seasonal effects, but graphing a sensitive loess-smoothed scatterplot and then a per-week spineplot suggests just 1 anomaly: there were some days early on that I rated “1” but never subsequently (this has an explanation: as I began keeping the series, my car caught on fire and was totaled, and I was then put through the wringer with the insurance & junkyard, feeling truly miserable at multiple points):

par(mfrow=c(2,1))
scatter.smooth(x=weather$EDT, y=weather$MP, span=0.3, col="#CCCCCC", xlab="Days", ylab="MP rating")
weeks <- c(seq(from=1,to=length(weather$MP),by=7), length(weather$MP))
spineplot(c(1:(length(weather$MP))), as.factor(weather$MP), breaks=weeks, xlab="Weeks", ylab="MP rating ratios")
# we don't use the date past the spineplot &
# it causes problems with all the models, so delete it
weather$EDT <- NULL # rm() cannot drop a data-frame column; NULL assignment can
MP from February 2012 to March 2013

A formal check for autocorrelation in the MP ratings (most methods assume independence; see the Appendix) turns up little enough that I feel free to ignore it. So the data looks clean and tame; now we can begin real interpretation. The first and most obvious thing to do is to see what the overall correlation matrix with MP looks like:

# all weather correlations with mood/productivity
round(cor(weather)[23,], digits=3)
#          Max.TemperatureF         Mean.TemperatureF          Min.TemperatureF
#                     0.013                     0.016                     0.022
#            Max.Dew.PointF            MeanDew.PointF             Min.DewpointF
#                     0.016                     0.019                     0.016
#              Max.Humidity             Mean.Humidity              Min.Humidity
#                     0.042                     0.026                     0.011
#  Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn
#                     0.006                     0.004                     0.015
#       Max.VisibilityMiles      Mean.VisibilityMiles       Min.VisibilityMiles
#                     0.066                     0.004                    -0.016
#         Max.Wind.SpeedMPH        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH
#                     0.008                     0.002                     0.018
#           PrecipitationIn                CloudCover                    Events
#                    -0.037                     0.041                    -0.001
#      WindDirDegrees.br...                        MP                 DayLength
#                    -0.006                     1.000                     0.036
# tail(sort(abs(round(cor(weather)[23,], digits=3))), 6)
#           DayLength     PrecipitationIn          CloudCover        Max.Humidity Max.VisibilityMiles
#               0.036               0.037               0.041               0.042               0.066
#                  MP
#               1.000

All the r values seem very small: <0.1. But some of the more plausible correlations may be statistically-significant, so we’ll look at the 5 largest correlations:

sapply(c("DayLength","PrecipitationIn","CloudCover","Max.Humidity","Max.VisibilityMiles"),
       function(x) cor.test(weather$MP, weather[[x]]))
# ...
#             DayLength                              PrecipitationIn
# p.value     0.4099                                 0.394
# estimate    0.03594                                -0.03717
#
#             CloudCover                             Max.Humidity
# p.value     0.3445                                 0.3319
# estimate    0.04122                                0.04231
#
#             Max.VisibilityMiles
# p.value     0.1292
# estimate    0.06611

These specific variables turn out to be failures too, with large p-values.

One useful technique is to convert metrics into standardized units (of standard deviations), sum them all into a single composite variable, and then test that; this can reveal influences not obvious if one looked at the metrics individually. (For example, I did this in my sleep experiments, where I was interested in an overall measure of reduced sleep quality rather than single metrics like sleep latency.) Perhaps this would work here?

# construct a z-score for all of them to see if it does any better
with(weather, cor.test(MP, (scale(Max.Humidity) + scale(Max.VisibilityMiles)
                            + scale(PrecipitationIn) + scale(CloudCover) + scale(DayLength))))
# ...
# t = 1.296, df = 526, p-value = 0.1957
# alternative hypothesis: true correlation is not equal to 0
# 95% confidence interval:
#  -0.02908  0.14105
# sample estimates:
#    cor
# 0.0564

Modeling

Continuous MP

Linear model

Touji: “Oh, yes. The view of the world that one can have is quite small.”
Hikari: “Yes, you measure things only by your own small measure.”
Asuka: “One sees things with the truth, given by others.”
Misato: “Happy on a sunny day.”
Rei: “Gloomy on a rainy day.”
Asuka: “If you’re taught that, you always think so.”
Ritsuko: “But, you can enjoy rainy days.”

Neon Genesis Evangelion, episode 26 “The Beast that Shouted ‘I’ at the Heart of the World” (Literal Translation Project)

The correlations individually show little value, so we’ll move on to building models and assessing their fit. Can we accurately predict MP if we use all the parameters? We’ll start with a linear model/regression, where for simplicity we treat the categorical MP variable as a continuous variable:

model1 <- lm(MP ~ ., data=weather); summary(model1)
# ...Residuals:
#     Min      1Q  Median      3Q     Max
# -2.0255 -0.7815  0.0165  0.8126  1.3730
#
# Coefficients:
#                            Estimate Std. Error t value Pr(>|t|)
# (Intercept)               -4.29e+00   6.38e+00   -0.67     0.50
# Max.TemperatureF           3.76e-02   3.20e-02    1.17     0.24
# Mean.TemperatureF         -5.87e-02   6.06e-02   -0.97     0.33
# Min.TemperatureF           4.41e-02   3.24e-02    1.36     0.17
# Max.Dew.PointF            -9.52e-03   1.46e-02   -0.65     0.52
# MeanDew.PointF            -1.45e-02   2.79e-02   -0.52     0.60
# Min.DewpointF             -6.52e-03   1.43e-02   -0.46     0.65
# Max.Humidity               1.10e-02   8.36e-03    1.31     0.19
# Mean.Humidity              1.73e-03   1.33e-02    0.13     0.90
# Min.Humidity               5.75e-03   8.16e-03    0.71     0.48
# Max.Sea.Level.PressureIn   9.92e-01   9.79e-01    1.01     0.31
# Mean.Sea.Level.PressureIn -1.61e+00   1.63e+00   -0.99     0.32
# Min.Sea.Level.PressureIn   7.29e-01   8.39e-01    0.87     0.39
# Max.VisibilityMiles        2.00e-01   1.47e-01    1.36     0.17
# Mean.VisibilityMiles       1.04e-02   3.96e-02    0.26     0.79
# Min.VisibilityMiles       -3.94e-03   1.89e-02   -0.21     0.83
# Max.Wind.SpeedMPH          2.31e-03   1.32e-02    0.18     0.86
# Mean.Wind.SpeedMPH        -2.28e-03   1.70e-02   -0.13     0.89
# Max.Gust.SpeedMPH          2.71e-03   4.54e-03    0.60     0.55
# PrecipitationIn           -1.25e-01   9.29e-02   -1.35     0.18
# CloudCover                 2.65e-02   2.53e-02    1.05     0.29
# Events                    -9.18e-02   9.83e-02   -0.93     0.35
# WindDirDegrees.br...       3.08e-05   4.11e-04    0.07     0.94
# DayLength                  5.72e-02   4.78e-02    1.20     0.23
#
# Residual standard error: 0.755 on 504 degrees of freedom
# Multiple R-squared:  0.0282,    Adjusted R-squared:  -0.0162
# F-statistic: 0.635 on 23 and 504 DF,  p-value: 0.905

The variance explained is extremely minimal even with 23 different variables in the linear model, and suggests overfitting (as one would expect). By Occam’s razor, most of the variables should be scrapped for not carrying their weight; the step function uses a penalized measure of fit (AIC) to choose which variables to eliminate while still fitting the data reasonably well. It chooses to keep only 4 variables:

smodel1 <- step(model1); summary(smodel1)
# ...Residuals:
#     Min      1Q  Median      3Q     Max
# -2.1301 -0.8907  0.0121  0.8990  1.1779
#
# Coefficients:
#                     Estimate Std. Error t value Pr(>|t|)
# (Intercept)          0.02134    1.38025    0.02    0.988
# Min.TemperatureF     0.01429    0.01008    1.42    0.157
# MeanDew.PointF      -0.01489    0.01034   -1.44    0.150
# Max.Humidity         0.00958    0.00531    1.81    0.072
# Max.VisibilityMiles  0.21250    0.12682    1.68    0.094
#
# Residual standard error: 0.748 on 523 degrees of freedom
# Multiple R-squared:  0.0109,    Adjusted R-squared:  0.00336
# F-statistic: 1.44 on 4 and 523 DF,  p-value: 0.218

We can compare the models’ prediction accuracy on the dataset (lower is better in mean-squared-error):

mean((weather$MP - weather$MP)^2) # perfect
# [1] 0
mean((weather$MP - model1$fitted.values)^2) # original
# [1] 0.5446
mean((weather$MP - smodel1$fitted.values)^2) # much simpler model
# [1] 0.5543
model2 <- lm(MP ~ 1, data=weather); mean((weather$MP - model2$fitted.values)^2)
# [1] 0.5604

The model with 4 variables is much simpler and appears to fit almost as well as the full model, while remaining slightly better than the null model of no parameters.
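Since in-sample mean-squared-error always favors bigger models, a quick cross-check (a minimal sketch reusing the models fitted above) is to compare the three by AIC, which penalizes extra parameters:

# compare the full, stepped, and intercept-only models; lower AIC is better
AIC(model1, smodel1, model2)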

Random forests regression

Linear modeling having failed to reveal any interesting relationships, we’ll take one last crack at it: if there’s any interesting predictive power in the weather data at all, a high-powered machine learning technique like random forests ought to build a model which outperforms the linear model at least a little on mean-squared-error:

library(randomForest)
rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
# ...
#                Type of random forest: regression
#                      Number of trees: 500
# No. of variables tried at each split: 7
#
#           Mean of squared residuals: 0.5692
#                     % Var explained: -1.57

The error with all variables included is a little higher than the simpler linear model’s. Troublingly, a random forest restricted to the 4 variables kept by step increases the error further:

srmodel <- randomForest(MP ~ Min.TemperatureF + MeanDew.PointF + Max.Humidity + Max.VisibilityMiles, data=weather, proximity=TRUE); srmodel
# ...
#                Type of random forest: regression
#                      Number of trees: 500
# No. of variables tried at each split: 1
#
#           Mean of squared residuals: 0.5814
#                     % Var explained: -3.76
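As a further check (a minimal sketch using randomForest’s built-in importance measures), we can ask which variables the full forest leaned on; if nothing stands out, that is more evidence there is little signal to find:

# rank variables by how much each increases node purity in the full forest
importance(rmodel)
varImpPlot(rmodel)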

Perhaps treating MP as a continuous variable was a bad idea. Let’s start over.

Categorical MP

We turn the MP data into an ordered factor:

weather$MP <- ordered(weather$MP)

Logistic model

A straight logistic regression (glm(MP ~ ., data=weather, family="binomial")) is not appropriate because MP is not a binary outcome; instead we use the MASS library to do an ordinal logistic regression. As with the linear model, the coefficients don’t seem to amount to much:

library(MASS)
lmodel <- polr(as.ordered(MP) ~ ., data = weather); summary(lmodel)
# ...
# Coefficients:
#                               Value Std. Error t value
# Max.TemperatureF           0.086798    0.07916   1.096
# Mean.TemperatureF         -0.135828    0.14974  -0.907
# Min.TemperatureF           0.105068    0.08024   1.309
# Max.Dew.PointF            -0.023466    0.03624  -0.648
# MeanDew.PointF            -0.033449    0.06789  -0.493
# Min.DewpointF             -0.018700    0.03424  -0.546
# Max.Humidity               0.029043    0.02036   1.427
# Mean.Humidity              0.004063    0.03263   0.124
# Min.Humidity               0.013462    0.02024   0.665
# Max.Sea.Level.PressureIn   2.554493    0.85951   2.972
# Mean.Sea.Level.PressureIn -4.352709    0.06383 -68.191
# Min.Sea.Level.PressureIn   2.047343    0.91077   2.248
# Max.VisibilityMiles        0.549889    0.37650   1.461
# Mean.VisibilityMiles       0.029195    0.09772   0.299
# Min.VisibilityMiles       -0.007733    0.04638  -0.167
# Max.Wind.SpeedMPH          0.003846    0.03219   0.119
# Mean.Wind.SpeedMPH        -0.007351    0.04250  -0.173
# Max.Gust.SpeedMPH          0.008375    0.01130   0.741
# PrecipitationIn           -0.325188    0.22738  -1.430
# CloudCover                 0.067914    0.06000   1.132
# Events                    -0.219315    0.24050  -0.912
# WindDirDegrees.br...       0.000192    0.00101   0.190
# DayLength                  0.144132    0.11804   1.221
#
# Intercepts:
#     Value   Std. Error t value
# 1|2  13.143   0.021    638.830
# 2|3  17.080   0.500     34.185
# 3|4  19.126   0.509     37.550
#
# Residual Deviance: 1143.49
# AIC: 1195.49
## relative risk or odds; how much difference does each variable make, per its units?
exp(coef(lmodel))
#          Max.TemperatureF         Mean.TemperatureF          Min.TemperatureF
#                   1.09068                   0.87299                   1.11079
#            Max.Dew.PointF            MeanDew.PointF             Min.DewpointF
#                   0.97681                   0.96710                   0.98147
#              Max.Humidity             Mean.Humidity              Min.Humidity
#                   1.02947                   1.00407                   1.01355
#  Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn
#                  12.86477                   0.01287                   7.74729
#       Max.VisibilityMiles      Mean.VisibilityMiles       Min.VisibilityMiles
#                   1.73306                   1.02963                   0.99230
#         Max.Wind.SpeedMPH        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH
#                   1.00385                   0.99268                   1.00841
#           PrecipitationIn                CloudCover                    Events
#                   0.72239                   1.07027                   0.80307
#      WindDirDegrees.br...                 DayLength
#                   1.00019                   1.15504

Another use of step; this time, it builds something even narrower than before:

slmodel <- step(lmodel); summary(slmodel)
# ...Coefficients:
#                     Value Std. Error t value
# Max.VisibilityMiles 0.527      0.329     1.6
#
# Intercepts:
#     Value Std. Error t value
# 1|2 0.364 3.293      0.110
# 2|3 4.277 3.282      1.303
# 3|4 6.283 3.287      1.912
#
# Residual Deviance: 1156.46
# AIC: 1164.46
exp(0.527)
# [1] 1.694

Seasonal effects

The previously mentioned life-satisfaction/weather paper, Lucas et al 2013, examined a single variable very similar to my MP:

Life satisfaction was assessed using a single item that read, “In general, how satisfied are you with your life.” Participants responded using a 4-point scale with the options “Very Satisfied,” “Satisfied,” “Dissatisfied,” or “Very Dissatisfied” (responses were scored such that higher scores reflect higher satisfaction).

Rather than simply regress local weather on the score, similar to what I’ve been doing so far, they attempted a more complex model where the effects could change based on region & time of year:

In addition, the effect of absolute levels of any weather variable may vary depending on when in the year the weather occurred. A 50 °F day may contribute to positive mood (and hence higher life satisfaction) if it occurred in the middle of winter in a cold climate, whereas this same absolute temperature might contribute to a negative mood (and hence lower life satisfaction) if it occurred in the middle of summer in a warm climate. Thus, seasonal differences must be considered when examining the effects of weather.

While I live in only one region and replicating the rest of the analysis would be a lot of work, checking for seasonal changes is possible and not too hard: reset the data to restore the date, create a Season variable and populate it (since I am lazy, I convert the dates to fiscal quarters, which are similar enough), and then rerun an ordinal logistic regression with all the potential predictors nested under Season.

Looking at the coefficients split by season, and looking for coefficients that suddenly become larger or which switch signs from season to season, I see no real candidates:

weather$MP <- as.ordered(weather$MP)
library(lubridate)
weather$Season <- quarter(weather$EDT) # assumes EDT was restored by rerunning the data-gathering step
library(lme4)
lmr <- lmer(MP ~ (. |Season), data=weather, control = lmerControl(maxfun=20000)); ranef(lmr)
# ...$Season
#   (Intercept)   MP.L   MP.Q    MP.C       EDT Max.TemperatureF Mean.TemperatureF Min.TemperatureF
# 1     0.05723 1.8727 0.2184 -0.1058 4.704e-05         0.002089         -0.002253        0.0009085
# 2     0.06215 1.8918 0.2303 -0.1042 6.786e-06         0.004284         -0.004817        0.0034466
# 3     0.31806 0.9037 0.9451 -0.4151 3.206e-05         0.003028         -0.002113        0.0019708
# 4     0.31525 0.8986 0.9376 -0.4169 7.582e-07        -0.001217         -0.001045        0.0003388
#   Max.Dew.PointF MeanDew.PointF Min.DewpointF Max.Humidity Mean.Humidity Min.Humidity
# 1     -3.416e-04      -0.002731     0.0008853   -1.584e-04     0.0015697    0.0004101
# 2      1.922e-03      -0.004427     0.0001578    6.227e-04    -0.0002743    0.0010211
# 3      2.523e-05      -0.002938    -0.0007302    5.611e-05     0.0009317    0.0004772
# 4     -1.235e-04       0.003389    -0.0012974    6.986e-04    -0.0009484   -0.0004437
#   Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles
# 1                 0.006218                 -0.018023                -0.008160            0.013916
# 2                 0.011742                 -0.001504                 0.001079           -0.002815
# 3                 0.002675                  0.002821                -0.004279            0.005178
# 4                 0.006987                  0.003645                 0.007091            0.013030
#   Mean.VisibilityMiles Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
# 1            -0.002323           0.0028401         0.0028779          0.0009989        -6.320e-04
# 2             0.001611          -0.0023890         0.0014840         -0.0012118        -7.029e-05
# 3             0.002107           0.0007576        -0.0014126         -0.0016919         9.457e-04
# 4            -0.006157           0.0020409         0.0007483          0.0008237        -4.648e-04
#   PrecipitationIn CloudCover    Events WindDirDegrees.br... DayLength   Season.L Season.Q
# 1       -0.025717   0.002471  0.005752            1.576e-05  0.009140 -0.0006296 0.005954
# 2       -0.005913  -0.002430 -0.002981           -1.195e-04 -0.003607 -0.0015844 0.006389
# 3        0.002764   0.003450 -0.006076            6.861e-06 -0.004650 -0.0011383 0.010232
# 4       -0.008745   0.002082 -0.012127           -4.119e-05  0.003976 -0.0010785 0.010584
#     Season.C
# 1 -0.0041696
# 2 -0.0047106
# 3 -0.0011607
# 4 -0.0004978

Variables which are positive in one season tend to be positive in another. There are some cases of flipped sign, like PrecipitationIn, but in those cases the coefficients are tiny (sometimes so tiny that R lapses into scientific notation). In no case is the difference between season estimates as large as 0.1. Hence, seasonal effects do not seem to be important.
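As a cross-check (a minimal sketch, picking PrecipitationIn as the variable of interest since it flips sign), one can also test a seasonal interaction directly in the ordinal regression and ask whether it earns its keep by AIC:

# does letting PrecipitationIn's effect vary by season improve the fit?
seasonal   <- polr(MP ~ PrecipitationIn * factor(Season), data=weather)
unseasonal <- polr(MP ~ PrecipitationIn, data=weather)
AIC(seasonal, unseasonal)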

Correlation size & implications

The Max.VisibilityMiles result seems like a reasonable one; the variable may be tapping into some sort of anti-rain measure. We can visualize the surviving coefficient thusly:

library(arm)
maxvis <- summary(slmodel)$coefficients # pull the coefficient & standard error of the surviving variable
coefplot(maxvis["Max.VisibilityMiles","Value"], maxvis["Max.VisibilityMiles","Std. Error"],
         varnames="Max.VisibilityMiles", vertical=FALSE, var.las=1,
         main="95% CIs for the log odds of predictors")
Variables with confidence intervals show log odds size and whether the CI excludes zero

It doesn’t look particularly impressive and doesn’t exclude 0; but we want to know whether this surviving variable is potentially useful in practice, what its “cash value” might be. We again exponentiate the log odds to get more understandable odds:

exp(coef(slmodel))
# Max.VisibilityMiles
#              1.694

What do these estimates mean in practice? Let’s take Max.VisibilityMiles as a variable that may be manipulable by buying bright lights, with an odds of ~1.7; if we could somehow maximize each day’s Max.VisibilityMiles, the model would predict 4 more above-average days:

weatherOdds <- weather # copy data & max out Max.VisibilityMiles
weatherOdds$Max.VisibilityMiles <- max(weather$Max.VisibilityMiles)
sum(as.integer(predict(slmodel, newdata=weatherOdds))) - sum(as.integer(predict(slmodel, newdata=weather)))
# [1] 4

If we wanted to test this potential effect via an experiment (eg. buying a powerful lighting system, and turning it on or off on random days), how many days n should the experiment run? With a bootstrap, we can calculate the effect size of 4 incremented days and then ask for the n required for a self-experiment with a good power of 80%; this turns out to be a ~1600-day or 4.4-year-long experiment:

library(boot)
library(rms)
n <- 1600
weatherPower <- function(dt, indices) {
    d <- dt[sample(nrow(dt), n, replace=TRUE), ] # new dataset, possibly larger than the original
    lmodel <- lrm(MP ~ Max.VisibilityMiles, data = d)
    return(anova(lmodel)[5]) # linear index 5 = the p-value for Max.VisibilityMiles
}
bs <- boot(data=weather, statistic=weatherPower, R=100000, parallel="multicore", ncpus=4)
alpha <- 0.05
print(sum(bs$t<=alpha)/length(bs$t))
# [1] 0.8054

To detect such a subtle effect is not easy, and such a time-consuming experiment is surely not worthwhile.
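(The same bootstrap machinery can also trace out a rough power curve over candidate experiment lengths; a minimal sketch, with a small R for speed:)

# estimate power at several candidate ns to see how steeply power rises
powerAtN <- function(n, R=1000) {
    sim <- function(dt, indices) {
        d <- dt[sample(nrow(dt), n, replace=TRUE), ]
        return(anova(lrm(MP ~ Max.VisibilityMiles, data=d))[5]) }
    bs <- boot(data=weather, statistic=sim, R=R)
    return(sum(bs$t <= alpha) / length(bs$t)) }
sapply(c(400, 800, 1600, 3200), powerAtN)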

Random forests classification

Random forests can be used to classify (predict categorical outcomes) as well as regress; having turned the response variable into a factor, the type switches automatically:

rmodel <- randomForest(MP ~ ., data=weather, proximity=TRUE); rmodel
# ...
#                      Number of trees: 500
# No. of variables tried at each split: 4
#
#         OOB estimate of  error rate: 57.39%
# Confusion matrix:
#   1  2   3  4 class.error
# 1 0  1   3  0      1.0000
# 2 0 20 104 16      0.8571
# 3 0 31 179 34      0.2664
# 4 0 20  94 26      0.8143

We can’t step through random forests since they don’t have a complexity measure like AIC to use, so we’ll reuse the variable from the simplified ordinal logistic regression:

srmodel <- randomForest(MP ~ Max.VisibilityMiles, data=weather, proximity=TRUE); srmodel
# ...
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 1
#
#         OOB estimate of  error rate: 53.79%
# Confusion matrix:
#   1 2   3 4 class.error
# 1 0 0   4 0    1.000000
# 2 0 2 138 0    0.985714
# 3 0 1 242 1    0.008197
# 4 0 0 140 0    1.000000

Much better, and now slightly better than the original linear models.

Ordinal vs Random forests

We can now compare the fraction of days that are incorrectly predicted by the constant predictor, the ordinal logistic regression (full & simplified), and the random forests (full & simplified):

1 - (sum(weather$MP==3) / length(weather$MP)) # constant predictor: always guess the modal rating, 3
# [1] 0.5379
1 - (sum(weather$MP == as.integer(predict(lmodel))) / length(weather$MP))
# [1] 0.5284
1 - (sum(as.integer(weather$MP) == predict(slmodel)) / length(weather$MP))
# [1] 0.5341

1 - (sum(as.integer(weather$MP) == as.integer(predict(rmodel))) / length(weather$MP))
# [1] 0.5739
1 - (sum(as.integer(weather$MP) == as.integer(predict(srmodel))) / length(weather$MP))
# [1] 0.5379

It would seem that there is no big difference between the models, but the ordinal regression may be a little bit better than the constant predictor.

Model checking

Error rate

Before concluding that the ordinal logistic regression is better than the constant predictor, it might be a good idea to check how well this result holds up. The difference in correctly classified days is very small, and it might represent a minimal advantage. We’ll bootstrap a large number of ordinal logistic regressions on subsamples of the full dataset, and see what fraction of them incur a higher classification error rate than the constant predictor:

library(boot)
errorRate <- function(dt, indices) {
  d <- dt[indices,] # allows boot to select subsample
  lmodel <- polr(as.ordered(MP) ~ ., data = d) # train new regression model on subsample
  return(1 - (sum(d$MP == as.integer(predict(lmodel))) / length(d$MP)))
}
bs <- boot(data=weather, statistic=errorRate, R=100000, parallel="multicore", ncpus=4); bs
# ...Bootstrap Statistics :
#     original   bias    std. error
# t1*   0.5284 -0.01137     0.02431

boot.ci(bs)
# ...
# Intervals :
# Level      Normal              Basic
# 95%   ( 0.4921,  0.5874 )   ( 0.4905,  0.5871 )
#
# Level     Percentile            BCa
# 95%   ( 0.4697,  0.5663 )   ( 0.4905,  0.5890 )


hist(bs$t, xlab="Error rate", ylab="Number of samples",
           main="Bootstrap check of ordinal logistic regression accuracy")
sum(bs$t > (1 - (sum(weather$MP==3) / length(weather$MP)))) / length(bs$t)
# [1] 0.1832
Distribution of logistic regression classification rates

There’s substantial uncertainty in the classification rate, as evidenced by the fairly wide confidence intervals, and in ~18% of the new logistic regressions the classification rate was worse than the constant predictor’s. This is not too bad, but the logistic regression is not outperforming random by very much, so the reality of the result is open to question.

Conclusion

Attacking the data in several ways turned up nothing; the only model that seemed to improve on random guessing does not do so by very much, indicating at most weak weather effects. I will attempt to follow up in 2 years to see if the results replicate with a longer data series.

Appendix

Autocorrelation

A test using acf shows no autocorrelation worth mentioning:

acf(as.integer(weather$MP), main="Do days predict subsequent days at various distances?") # acf needs a numeric series, not a factor
MP series shows almost no autocorrelation at any timelag
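A formal complement to eyeballing the acf plot (a minimal sketch) is the Ljung-Box portmanteau test on the numeric-coded ratings:

# test for autocorrelation at lags up to 20 in the MP series
Box.test(as.integer(weather$MP), lag=20, type="Ljung-Box")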

For comparison, here is acf for a data series where one would expect a great deal of autocorrelation—my daily weight over 2012–2013—and one does indeed observe it:

weight <- c(205,214,216,213,213,218,218,214,215,216,218,216,210,219,217,219,217,215,215,217,219,
    216,215,218,215,219,219,219,218,218,220,220,219,219,219,221,220,219,216,220,220,218,
    218,220,219,220,215,215,218,218,215,216,216,218,218,220,219,216,217,220,215,215,218,
    216,214,213,216,215,214,213,214,216,216,212,209,212,214,213,210,211,210,213,211,215,
    211,212,212,216,212,215,216,215,215,212,216,213,212,211,215,214,215,216,214,212,212,
    213,212,213,211,214,215,210,211,211,211,212,210,210,212,211,214,213,214,212,215,214,
    213,215,211,214,214,216,215,213,215,213,213,212,215,211,212,212,211,212,211,212,210,
    211,213,218,217,212,214,216,213,212,211,211,214,212,211,216,NA,NA,NA,NA,NA,218,
    216,214,215,216,216,213,216,214,215,219,218,216,215,218,217,219,219,219,219,219,218,
    217,216,216,215,218,219,217,216,219,217,216,216,219,216,218,215,216,215,215,213,214,
    215,217,216,215,215,216,214,215,215,214,216,211,214,213,214,211,212,211,210,212,211,
    212,214,212,211,214,212,211,212,211,212,213,210,212,210,210,211,210,212,210,210,210,
    210,210,209,206,206,208,209,207,207,205,205,205,206,206,209,204,206,206,204,204,204,
    204,205,203,205,203,205,204,204,203,204,204,205,205,204,205,205,204,204,205,205,205,
    205,204,203,205,205,206,205,204,203,204,205,206,207,206,205,205,207,206,210,210,212,
    211,208,210,210,209,211,212,204,205,208,208,209,209,208,208,208,209,208,210,209,208,
    207,207,209,208,209,207,207,206,207,208,207,209,210,210,208,208,206,208,210,210,209,
    209,209,210,209,210,212,212,210,212,213,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,210,
    211,210,210,210,211,212,211,210,210,210,212,210,211,210,209,211,209,209,210,210,210,
    211,210,210,209,213,210,210,210,214,212,211,210,211,215,215,214,210,214,212)
acf(weight, na.action = na.pass)
Weight series showing autocorrelation at every timelag