For self-experimentation purposes, I'm interested in trying to quantify 'personal productivity' in some sense. I thought it would be a latent variable, but a large personal dataset failed to yield any trace, so I'm backtracking. What if 'productivity' is more like 'MP' or 'mana' in a game, where it can be larger or smaller but must be spent on various things in a zero-sum way? This would not yield a simple general factor where 'a rising tide lifts all boats' or necessarily any kind of hierarchical factor model either. I've been unable to find a model which corresponds to this. Does anyone recognize this setup?
In a standard latent-variable model, such as a measurement-error model, an unobserved hidden variable is postulated which increases or decreases a set of observed variables (or other latent variables) to variable-specific degrees. So the variables tend to move together, in what you might call a positive-sum way. Intelligence is the classic example: scores on all tests of cognitive ability tend to increase or decrease together to some degree.
One alternative to a latent variable model is an index variable, where variables are simply added up and it is the sum which affects other things. Index variables aren't discussed much, so it's hard to find guidance on working with them, but the summing appears to be done by fiat, based on theoretical expectations about the relevant scales & weights, and the index is then used in the rest of the model as simply another variable. It's unclear to me how you would infer an index variable if you didn't already know the weights.
Further, the variables might affect each other by 'using up' the latent variable in some stochastic fashion.
Is there any way to infer the distribution of a latent index variable, or whether it exists, or how it's allocated among multiple outputs?
In Quantified Self, when we run experiments or analyses, the question of measurements is often a big one. You can use something like Cogmind or dual n-back to measure effects, but this is rarely what you want: it delivers a precise answer to the wrong question. What you want is something like 'personal productivity'. My feeling has always been that productivity (call it P) waxes and wanes on a roughly daily basis: some days P is high and you get a lot done, and some days it's all you can do to watch cat videos on YouTube. So P looks a lot like g: a latent variable which additively influences a variety of measurements, perhaps to different degrees (different P-loadings). You should be able to measure it by simply collecting a bunch of variables for a year or so and extracting the biggest factor, yielding a reliable global measure of your functioning which would be much more efficient than ad hoc experimenting on loosely-connected narrow variables. Such a P-factor might even explain the often unimpressive effects of things like stimulants, where the objective measures on things like reaction-time are so much less impressive than the subjective impressions.
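To illustrate what I expected to find: if a positive-sum P-factor existed, extracting it would be straightforward. A minimal sketch (with made-up loadings and sample size, using base R's `factanal`, not my actual analysis pipeline):

```r
## Hypothetical sketch: simulate 10 indicators loading on one latent factor
## (arbitrary loadings between 0.4 & 0.9) & check that factanal() recovers it.
set.seed(2018)
n <- 1095                             # ~3 years of daily observations
P <- rnorm(n)                         # the latent factor, standardized
loadings <- runif(10, 0.4, 0.9)       # made-up P-loadings
indicators <- sapply(loadings,
    function(l) { l*P + rnorm(n, sd=sqrt(1 - l^2)) })
f <- factanal(indicators, factors=1)
round(f$loadings[,1], 2)              # all loadings substantial (sign arbitrary)
```

A single factor soaks up most of the shared variance, which is exactly what failed to happen with the real data below.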
So, I collected a bunch of data: number of emails sent/received, window-tracking logs split into 10 categories like IRC vs Emacs vs videos, number of patches in Git repos (a proxy for both coding & writing for me), commandline activity, daily self-ratings, number of Wikipedia edits, entries in a daily checklist of things to do, etc. I cleaned missing data as much as possible, imputed where relevant, transformed everything into normality for stabler fitting, deleted variables of low quality, checked for oddness, and plugged it into lavaan/blavaan and... Nothing. There was no general factor. There were a few factors which picked up clusters of specific related variables (often measuring the same thing, eg time spent in IRC client/# of IRC lines), but I couldn't find anything like the P-factor I expected to account for half or more of variance.
After thinking more about this and introspecting, it occurred to me that on the best days, activity is often skewed. On a good day I might get a bunch of everything done, but on, say, a good writing day, I might well skip going to the gym entirely because I'm in flow; I probably won't be answering too many emails; and I may or may not spend time reading a book. This doesn't look much like a P-factor at all. There might even be negative intercorrelations, simply because there's only so much time/energy in a day, and time spent doing one thing is time spent not doing another. The metaphor which occurs to me is 'mana' or 'MP' in video games like RPGs: you get a total which you then spend on each ability. There is a single variable which constrains all of the abilities, but you might not allocate it evenly; you might deliberately dump almost all of it into a specific ability, while another player focuses on a different ability or subset of abilities. If you were to naively factor-analyze a bunch of such characters, you would conclude that there is a negative correlation between strength and magical ability, that there is no general factor of mana, and so on, all of which would be wrong. So in this model, each day you get a certain random amount of mana, and this mana gets spent on several possible outcomes in an uneven way.
Unfortunately, I can't find anything which really corresponds to this. It's somewhat like index variables in SEMs, but such indexes seem to be assumed by fiat, not inferred or modeled. One can simply standardize & sum up the variables and proceed regardless, but that provides no apparent way to check that the index is meaningful or corresponds to any latent mana in the first place. From a decision-theory perspective, since the end-purpose is optimization, constructing an index variable is optimal: by definition, if your experiments don't increase the things you value, it is of no value that they might be increasing some hypothetical P-factor. But it's hard to assign utilities to indirect measurements like these, the results may be of little value to third parties, and the question of 'what is productivity, why does it vary so much day-to-day, and how can we increase it?' is of interest in its own right. The uneven allocation reminds me of nonparametrics, but of course latent Dirichlet allocation doesn't work here; the Indian & Chinese buffet processes sound more similar, but are still not what I need.
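For concreteness, the 'standardize & sum' index is a one-liner; the equal weights here are a pure assumption (which is exactly the problem):

```r
## Index-by-fiat: standardize each variable & sum with assumed weights.
## Nothing here tests whether the resulting index corresponds to any
## real latent quantity; the weights are taken on faith.
makeIndex <- function(df, weights=rep(1, ncol(df))) {
    as.vector(scale(as.matrix(df)) %*% weights) }
```

Applied to the simulated data below (`makeIndex(df[,2:11])`), this is essentially a standardized version of the `Total` column.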
I can probably write something like the below toy model in Stan (all of it, including the stickbreaking or Dirichlet simplex, should be doable in a differentiable form), and then do Bayesian model comparison against a simple linear model on the observed intercorrelations. That wouldn't really prove the existence of a mana variable, but it would at least suggest the model is no worse than a baseline with no latent variables. Still, that feels very ad hoc, and I'd rather not. So, any ideas?
Below I present a simple generative toy model of how such a zero-sum latent variable model could work. For each day (a row of k observations, with k defaulting to 10 for easy viewing; the number of days defaults to 3 years in my example):
- we draw a latent variable, 'mana', which determines how much total we have to 'spend' (arbitrarily set to N(100,15))
- we do a stickbreaking-like process to decide what fraction of mana each of the k categories of observations gets that day: the first gets ~1/3rd of the mana on average (the mean of a Beta(1,2) draw), the next gets ~1/3rd of whatever is left, and so on. (This simply gives a roughly plausible feel to how many categories are active each day and a nice imbalance; it could also be done differently, with uniform distributions or on a simplex or whatever, without affecting anything important, I think)
- the weights are then randomly permuted, so no category is systematically favored
- each of the k observations gets its fraction of the mana
- for some realism, we add some N(0,5) noise to each observation (a fraction of the original mana variance)
- for some more realism, most productivity measurements are inherently positive (you can't send -1 emails or go to the gym -1 times), so finally we constrain everything to be non-negative
Code below; as expected, it shows minimal intercorrelations between the 10 variables, and the factor analysis yields nonsensical results which essentially pick up all the individual variables, or one variable at random (depending on which of the disagreeing criteria for the number of factors you pick):
set.seed(2018-08-23)
zsSim <- function(k=10, manaMean=100, manaSD=15, verbose=FALSE) {
    mana <- rnorm(1, manaMean, manaSD)
    manaWeights <- rbeta(k, 1, 2) # Beta(1,2): mean 1/3, skewed toward small fractions
    ## stickbreaking-like process: each kth variable gets a fraction of whatever
    ## is left over from the previous k-1 variables
    manaFractions <- numeric(k)
    manaFractions[1] <- manaWeights[1]
    for (i in 2:k) {
        manaFractions[i] <- (1 - sum(manaFractions[1:(i-1)])) * manaWeights[i] }
    ## and shuffle so every variable index has a chance of being the winner:
    manaFractions <- sample(manaFractions)
    manas <- mana * manaFractions
    noise <- rnorm(k, mean=0, sd=5)
    observedK <- pmax(0, noise + manas)
    if (verbose) {
        cat("mana fraction sum =", sum(manaFractions), "\n")
        print(data.frame(mana, manaFractions, manas, noise, observedK)) }
    return(c(mana, observedK)) }
zsSim(verbose=TRUE)
# mana fraction sum = 0.9822376563
# mana manaFractions manas noise observedK
# 1 80.67062075 0.244110724694 19.6925636922 -1.8044969274 17.8880667648
# 2 80.67062075 0.337299009511 27.2101204748 -2.8833943254 24.3267261494
# 3 80.67062075 0.106969587558 8.6293030294 2.4519340066 11.0812370360
# 4 80.67062075 0.038686279033 3.1208461440 -1.8644193958 1.2564267482
# 5 80.67062075 0.009046399910 0.7297786962 -1.2547776674 0.0000000000
# 6 80.67062075 0.008968339185 0.7234814891 -3.5647215933 0.0000000000
# 7 80.67062075 0.011288962516 0.9106876138 5.2570430302 6.1677306440
# 8 80.67062075 0.076767488915 6.1928809840 1.4026349717 7.5955159557
# 9 80.67062075 0.043309055577 3.4937683974 -0.7777781983 2.7159901991
# 10 80.67062075 0.105791809402 8.5342909345 -7.8818836301 0.6524073044
# [1] 80.6706207475 17.8880667648 24.3267261494 11.0812370360 1.2564267482 0.0000000000 0.0000000000 6.1677306440 7.5955159557 2.7159901991 0.6524073044
zsSamples <- function(n=3*365) {
df <- as.data.frame(t(as.matrix(replicate(n, zsSim()))))
df$Total <- rowSums(df[,2:11])
colnames(df)[1] <- "Mana"
return(df) }
df <- zsSamples(); head(df); library(skimr); skim(df)
# Mana V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 Total
# 1 85.95998192 10.217891405 21.302955523 5.7183615417 24.482058824 0.000000000 4.934269306 0.000000000 9.056144993 5.475358537 8.895966577 90.08300671
# 2 97.31998793 4.696202516 28.118715139 0.0000000000 0.000000000 0.000000000 2.831360574 4.646360925 0.000000000 46.966954472 8.419743164 95.67933679
# 3 99.09208823 55.838817955 5.716622158 15.7976175665 1.981887659 7.218705793 0.000000000 0.000000000 13.161102725 0.000000000 19.131188352 118.84594221
# 4 115.84031989 2.692868221 0.000000000 15.4021610995 12.252356943 0.000000000 1.335282060 6.082977984 70.962209257 9.029501884 1.251178728 119.00853618
# 5 102.82656990 5.654245118 0.000000000 0.4575808017 17.677875061 0.000000000 13.739800959 0.000000000 29.584056032 0.000000000 7.379633772 74.49319174
# 6 111.87671714 0.000000000 0.000000000 46.0825241332 2.072741551 3.010552616 0.000000000 51.449334396 8.368490798 3.275699184 6.760038518 121.01938119
# Skim summary statistics
# n obs: 1095
# n variables: 12
#
# Variable type: numeric
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Mana 0 1095 1095 99.72 15.23 52.56 89.98 99.68 109.96 148.42 ▁▂▅▇▇▅▁▁
# Total 0 1095 1095 106.87 19.29 43.27 93.67 107.11 120.38 159.87 ▁▁▃▆▇▆▂▁
# V10 0 1095 1095 11.56 16.11 0 0.082 5.54 14.36 94.81 ▇▂▁▁▁▁▁▁
# V11 0 1095 1095 10.53 15.87 0 0 4.67 12.71 100.88 ▇▂▁▁▁▁▁▁
# V2 0 1095 1095 10.77 15.95 0 0.1 5.37 12.97 107.02 ▇▂▁▁▁▁▁▁
# V3 0 1095 1095 10.78 15.44 0 0.27 5.12 13.15 113.54 ▇▁▁▁▁▁▁▁
# V4 0 1095 1095 11.42 16.28 0 0.33 5.15 15 107.75 ▇▂▁▁▁▁▁▁
# V5 0 1095 1095 9.97 14.84 0 0 4.82 12.12 98.24 ▇▁▁▁▁▁▁▁
# V6 0 1095 1095 9.98 15.64 0 0 4.63 11.77 116.35 ▇▁▁▁▁▁▁▁
# V7 0 1095 1095 9.77 14.21 0 0 4.55 12.26 92.47 ▇▂▁▁▁▁▁▁
# V8 0 1095 1095 10.9 15.88 0 0 4.84 13.3 94.37 ▇▂▁▁▁▁▁▁
# V9 0 1095 1095 11.2 15.98 0 0.06 5.08 13.53 96.24 ▇▁▁▁▁▁▁▁
round(digits=2, cor(df))
# Mana V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 Total
# Mana 1.00 0.09 0.10 0.10 0.07 0.07 0.12 0.09 0.08 0.05 0.13 0.75
# V2 0.09 1.00 -0.08 -0.13 -0.07 -0.11 -0.11 -0.08 -0.12 -0.07 -0.12 0.10
# V3 0.10 -0.08 1.00 -0.12 -0.11 -0.10 -0.05 -0.09 -0.09 -0.13 -0.08 0.11
# V4 0.10 -0.13 -0.12 1.00 -0.07 -0.10 -0.06 -0.10 -0.12 -0.10 -0.05 0.16
# V5 0.07 -0.07 -0.11 -0.07 1.00 -0.11 -0.07 -0.12 -0.07 -0.09 -0.09 0.12
# V6 0.07 -0.11 -0.10 -0.10 -0.11 1.00 -0.10 -0.08 -0.10 -0.06 -0.09 0.13
# V7 0.12 -0.11 -0.05 -0.06 -0.07 -0.10 1.00 -0.07 -0.12 -0.10 -0.05 0.15
# V8 0.09 -0.08 -0.09 -0.10 -0.12 -0.08 -0.07 1.00 -0.10 -0.12 -0.12 0.10
# V9 0.08 -0.12 -0.09 -0.12 -0.07 -0.10 -0.12 -0.10 1.00 -0.09 -0.09 0.10
# V10 0.05 -0.07 -0.13 -0.10 -0.09 -0.06 -0.10 -0.12 -0.09 1.00 -0.12 0.11
# V11 0.13 -0.12 -0.08 -0.05 -0.09 -0.09 -0.05 -0.12 -0.09 -0.12 1.00 0.16
# Total 0.75 0.10 0.11 0.16 0.12 0.13 0.15 0.10 0.10 0.11 0.16 1.00
library(psych)
nfactors(df[,2:11])
# VSS complexity 1 achieves a maximimum of 0.81 with 10 factors
# VSS complexity 2 achieves a maximimum of 0.84 with 10 factors
# The Velicer MAP achieves a minimum of 0.02 with 1 factors
# Empirical BIC achieves a minimum of 294.33 with 5 factors
# Sample Size adjusted BIC achieves a minimum of 661.74 with 5 factors
# ...
fa(nfactors=1, df[,2:11])
# Warning: A Heywood case was detected.
# Standardized loadings (pattern matrix) based upon correlation matrix
# MR1 h2 u2 com
# V2 -0.08 7.1e-03 0.993 1
# V3 -0.04 1.3e-03 0.999 1
# V4 0.00 9.6e-06 1.000 1
# V5 -0.05 2.4e-03 0.998 1
# V6 -0.05 3.0e-03 0.997 1
# V7 -0.01 2.2e-04 1.000 1
# V8 -0.08 6.9e-03 0.993 1
# V9 -0.05 2.3e-03 0.998 1
# V10 -0.08 7.1e-03 0.993 1
# V11 1.01 1.0e+00 -0.012 1
#
# MR1
# SS loadings 1.04
# Proportion Var 0.10
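As a robustness check on the generative model, the Dirichlet-simplex allocation mentioned earlier can be sketched in base R via normalized Gamma draws; the concentration α=0.5 is an arbitrary choice giving sparse allocations, and the qualitative behavior should be the same as the stickbreaking version:

```r
## Dirichlet-simplex variant of the allocation step: fractions sum to exactly 1
## (no mana left unspent). alpha < 1 concentrates mana in a few categories.
zsSimDirichlet <- function(k=10, manaMean=100, manaSD=15, alpha=0.5) {
    mana <- rnorm(1, manaMean, manaSD)
    g <- rgamma(k, shape=alpha)        # normalized Gamma draws ~ Dirichlet(alpha)
    manaFractions <- g / sum(g)
    observedK <- pmax(0, mana * manaFractions + rnorm(k, sd=5))
    return(c(mana, observedK)) }
dfD <- as.data.frame(t(replicate(3*365, zsSimDirichlet())))
```

The resulting correlation matrix should look much like the one above: small negative off-diagonal entries and no general factor, despite every observation being driven by the single mana variable.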