LW anchoring experiment

Do mindless positive/negative comments skew article quality ratings up and down?
statistics, experiments, psychology, R
2012-02-27–2014-02-17 finished certainty: unlikely importance: 1


I do an informal experiment testing whether LessWrong karma scores are susceptible to a form of anchoring based on the first comment posted; a medium-large effect size is found. However, the data does not fit the assumed normal distribution, so there may or may not be any actual anchoring effect.

Do social media websites suffer from Matthew effects or anchoring, where an early intervention can have substantial effects? Most studies or analyses suggest the effects are small.

Problem

2012-02-27 IRC:

Grognor> I've been reading the highest-scoring articles, and I have noticed a pattern
         a MUCH HIGHER PROPORTION of top-scoring articles have "upvoted" in the first two words in the
         first comment
         (standard disclaimer: correlation is not causation blah blah)
         http://lesswrong.com/r/lesswrong/lw/6r6/tendencies_in_reflective_equilibrium/ then I see this,
         one of the follow-ups to one of the top-scoring articles like this. the first comment
         says "not upvoted"
         and it has a much lower score
         while reading it, I was wondering "why is this at only 23? this is one of the best articles
                                           I've ever oh look at that comment"
         I'm definitely hitting on a real phenomenon here, probably some social thing that says
         "hey if he upvoted, I should too" and it seems to be incredibly
         strongly related to the first comment
         the proportion is really quite astounding
         http://lesswrong.com/lw/5n6/i_know_im_biased_but/ compare this to the
         Tendencies in Reflective Equilibrium post. Compared to that, it's awful, but it has
         nearly the same score. Note the distinct lack of a first comment saying "not upvoted"
         (side note: I thought my article, "on saying the obvious", would have a much lower score than
         it did. note the first comment: "good points, all of them.")
         it seems like it's having more of an effect than I would naively predict
...
gwern> Grognor: hm. maybe I should register a sockpuppet and on every future article I write flip a coin
                and write either upvoted or downvoted
quanticle> gwern: Aren't you afraid you'll incur Goodhart's wrath?
gwern> quanticle: no, that would be if I only put in 'upvoted' comments
gwern> Grognor: do these comments tend to include any reasons?
Grognor> gwern: yes

Boxo> you suggesting that the comments cause the upvotes? I'd rather say that the post is just the kind
      of post that makes lots people think as their first reaction
      "hell yeah imma upvote this", makes upvoting salient to them, and then some of that bubbles up to
      the comments
Grognor> Boxo: I'm not suggesting it's entirely that simple, no, but I do think it's obvious now that a
              first comment that says "upvoted for reasons x, y, and z"
              will cause more people to upvote than otherwise would have, and vice versa
Boxo> (ie. you saw X and Y and though X caused Y, but I think there's a Z that causes both X and Y)
ksotala> Every now and then, I catch myself wanting to upvote something because others have upvoted it
         already. It sounds reasonable that having an explicit comment
         declaring "I upvoted" might have an even stronger effect.
ksotala> On the other hand, I usually decide to up/downvote before reading the comments.
gwern> ksotala: you should turn on anti-kibitzing then
rmmh> gwern: maybe karma blinding like HN would help
Boxo> I guess any comment about voting could remind people to vote, in whatever direction. Could test this
      if you had the total number of votes per post.
Grognor> that too. the effect here is multifarious and complicated and the intricate details could not
         possibly be worked out, which is exactly why this proportion of first comments with an 'upvoted'
         note surprises me

Such an anchoring effect resulting in a Matthew effect seems plausible to me. At least one experiment found something similar online, “Social Influence Bias: A Randomized Experiment” (Muchnik et al 2013):

Our society is increasingly relying on the digitized, aggregated opinions of others to make decisions. We therefore designed and analyzed a large-scale randomized experiment on a social news aggregation Web site to investigate whether knowledge of such aggregates distorts decision-making. Prior ratings created significant bias in individual rating behavior, and positive and negative social influences created asymmetric herding effects. Whereas negative social influence inspired users to correct manipulated ratings, positive social influence increased the likelihood of positive ratings by 32% and created accumulating positive herding that increased final ratings by 25% on average. This positive herding was topic-dependent and affected by whether individuals were viewing the opinions of friends or enemies. A mixture of changing opinion and greater turnout under both manipulations together with a natural tendency to up-vote on the site combined to create the herding effects. Such findings will help interpret collective judgment accurately and avoid social influence bias in collective intelligence in the future.

Design

So on the 27th, I registered the account “Rhwawn”¹. I made some quality comments and upvotes to seed the account as a legitimate active account.

Thereafter, whenever I wrote an Article or Discussion, after making it public, I flipped a coin: if heads, I posted a comment as Rhwawn saying “Upvoted”; if tails, a comment saying “Downvoted”, with some additional text (see next section). Needless to say, no actual vote was made. I then made a number of quality comments and votes on other Articles/Discussions to camouflage the experimental intervention. (In no case did I upvote or downvote someone I had already replied to or voted on with my Gwern account.) Finally, I scheduled a reminder on my calendar for 30 days later to record the karma on that Article/Discussion. I don’t post that often, so I decided to stop after 1 year, on 2013-02-27. I wound up breaking this decision: by September 2012 I had ceased to find it an interesting question, it was an unfinished task burdening my mind, and the necessity of making some genuine contributions as Rhwawn to cloak an anchoring comment was a not-so-trivial inconvenience that was stopping me from posting.

To enlarge the sample, I passed Recent Posts through xclip -o | grep '^by ' | cut -d ' ' -f 2 | sort | uniq -c | sort -g, picked everyone with >=6 posts (8 people excluding me), and messaged them with a short message explaining my desire for a large sample and the burden of participation (“It would require perhaps half a minute to a minute of your time every time you post an Article or Discussion for the next year, which is for most of you no more than once a week or month.”)

For those who replied, I sent a copy of this writeup and explained their procedure would be as follows: every time they posted, they would flip a coin and post likewise (the Rhwawn account password having been shared with them); however, as a convenience to them, I would take care of recording the karma a month later. (I subscribed to participants’ post RSS feeds; this would not guarantee that I would learn of their posts in time to add a randomized sock comment - hence the need for their active participation - but I could at least handle the scheduling & karma-checking for them.)

Comment variation

Grognor pointed out that the original comments came with reasons, but unfortunately, if I came up with reasons for either comment, some criticisms or praise would be better than others, and this would be another source of variability; so I added generic reasons.

I know from watching them plummet into oblivion that comments which are just “Upvoted” or “Downvoted” are not a good idea for any anchoring question - they’ll quickly be hidden, so any effect size will be a lot smaller than usual, and it’s possible that hidden comments themselves anchor (my guess: negatively, by making people think “why is this attracting stupid comments?”).

On the other hand, if you go with more carefully rationalized comments, that’s sort of like the XKCD cartoon: quality comments start to draw on the experimenter’s own strengths & weaknesses (I’m sure I could make both quality criticisms and praises of psychology-related articles, but not so much technical decision theory articles).

It’s a “damned if you do, damned if you don’t” sort of dilemma.

I hoped my strategy would be a golden mean: not so trivial as to be downvoted into oblivion, but not so high-quality and individualized that comparability was lost. I think I came close, since as we see in the analysis section, the positive anchoring comments saw only a small negative net downvote, indicating LWers may not have regarded them as good enough to upvote but also not so obviously bad as to merit a downvote.

(Of course, I didn’t expect the positive and negative comments to be treated differently - they’re pretty much the same thing, with a negation. I’m not sure how I would have designed it differently if I had known about the double-standard in advance.)

To avoid issues with some criticisms being accurate and others poor, a fixed list of reasons was used, with minor variation to make them fit in (a hypothetical sketch of mechanizing the choice follows the list):

  1. Negative; “Downvoted; …”

    • “too much weight on n studies”
    • “too many studies cited”
    • “too reliant on personal anecdote”
    • “too heavy on math”
    • “not enough math”
    • “rather obvious”
    • “not very interesting”
  2. Positive; “Upvoted; …”

    • “good use of n studies”
    • “thorough citation of claims”
    • “enjoyable anecdotes”
    • “rigorous use of math”
    • “just enough math”
    • “not at all obvious”
    • “very interesting”
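
(A hypothetical R sketch of mechanizing the coin flip and the draw from the fixed lists; the actual comments were composed by hand, with minor wording variation to fit each post:)

negReasons <- c("too much weight on n studies", "too many studies cited",
                "too reliant on personal anecdote", "too heavy on math",
                "not enough math", "rather obvious", "not very interesting")
posReasons <- c("good use of n studies", "thorough citation of claims",
                "enjoyable anecdotes", "rigorous use of math",
                "just enough math", "not at all obvious", "very interesting")
makeAnchorComment <- function() {
    positive <- rbinom(1, 1, 0.5) == 1   # fair coin flip per new post
    if (positive) {
        paste0("Upvoted; ", sample(posReasons, 1))
    } else {
        paste0("Downvoted; ", sample(negReasons, 1))
    }
}
makeAnchorComment()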

Questions

I will have to do some contemplation of values before I accept or reject. I like getting honest feedback on my posts, I like accumulating karma, and I also like performing experiments.

Randomization suggests that your expected karma value would be 0, unless you expect asymmetry between positive and negative: if a positive anchor adds roughly as much karma as a negative anchor subtracts, a fair coin gives an expected net change of zero per post.

What do you anticipate doing with the data accumulated over the course of the experiment?

Oh, it’d be simple enough. Sort articles into one group of karma scores for the positive anchors, the other group for the negative anchors; feed into a two-sample t-test to see if the means differ and if the difference is significant. I can probably copy the R code straight from my various Zeo-related R sessions.
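
(A sketch of that planned test, using the lw data frame defined in the Analysis section below:)

t.test(Post.karma ~ Anchor, data=lw)   # Welch two-sample t-test, negative vs positive anchors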

If I can hit p < 0.10 or p < 0.05 or so, post an Article triumphantly announcing the finding of bias and an object lesson of why one shouldn’t take karma too seriously; if I don’t, post a Discussion article discussing it and why I thought the results didn’t reach significance. (Not enough articles? Too-weak assumptions in my t-test?)

And the ethics?

The post authors are volunteers, and as already pointed out, the expected karma benefit is 0. So no one is harmed, and as for the deception, it does not seem to me to be a big deal. We are already nudged by countless primes and stimuli and biases, so another one, designed to be neutral in total effect, seems harmless to me.

“What comes before determines what comes after…The thoughts of all men arise from the darkness. If you are the movement of your soul, and the cause of that movement precedes you, then how could you ever call your thoughts your own? How could you be anything other than a slave to the darkness that comes before?…History. Language. Passion. Custom. All these things determine what men say, think, and do. These are the hidden puppet-strings from which all men hang…all men are deceived….So long as what comes before remains shrouded, so long as men are already deceived, what does [deceiving men] matter?”

Kelhus

Data

The results: each post’s karma was recorded 30 days after posting, along with the karma of the anchoring comment itself; the per-post karma scores are reproduced in the lw data frame in the Analysis section below.

Analysis

For the analysis, I have 2 questions:

  1. Is there a difference in karma between posts that received a negative initial comment and those that received a positive initial comment? (Any difference suggests that one or both are having an effect.)
  2. Is there a difference in karma between the two kinds of initial comments (as I began to suspect during the experiment)?

Article effect

Some Bayesian inference using Kruschke’s BEST (“Bayesian Estimation Supersedes the t-test”):

## Anchor: 1 = positive ("Upvoted") initial comment, 0 = negative ("Downvoted")
lw <- data.frame(Anchor     = c( 0, 0, 1, 0, 1, 0, 1,  1, 1, 1, 0, 0, 1, 0, 1, 0, 0),
                 Post.karma = c(11,19,50, 9,19,11,62,120,49,20,16,20,10,45,22,23,33))

source("BEST.R")   # Kruschke's BEST script
neg <- lw[lw$Anchor==0,]$Post.karma
pos <- lw[lw$Anchor==1,]$Post.karma
mcmc <- BESTmcmc(neg, pos)
BESTplot(neg, pos, mcmcChain=mcmc)
#
#            SUMMARY.INFO
# PARAMETER       mean  median     mode   HDIlow HDIhigh pcgtZero
#   mu1        20.1792  20.104  20.0392  10.7631 29.9835       NA
#   mu2        41.9474  41.640  40.4661  11.0307 75.1056       NA
#   muDiff    -21.7682 -21.519 -22.6345 -55.3222 11.2283    8.143
#   sigma1     13.1212  12.264  10.9018   5.8229 22.4381       NA
#   sigma2     40.9768  37.835  33.8565  16.5560 72.6948       NA
#   sigmaDiff -27.8556 -24.995 -21.7802 -60.9420 -0.9855    0.838
#   nu         30.0681  21.230   5.6449   1.0001 86.5698       NA
#   nuLog10     1.2896   1.327   1.4332   0.4332  2.0671       NA
#   effSz      -0.7718  -0.765  -0.7632  -1.8555  0.3322    8.143
Graphical summary of BEST results for the full dataset

The results are heavily skewed by Yvain’s very popular post; we can’t trust any results based on such a high-scoring post. Let’s try omitting Yvain’s datapoint. BEST actually crashes displaying the result, perhaps due to making an assumption about there being at least 8 datapoints or something; and it’s questionable whether we should be using a normality-based test like BEST in the first place: just from graphing we can see it’s definitely not a normal distribution. So we’ll fall back to a distribution-free two-sample test (rather than a t-test):

wilcox.test(Post.karma ~ Anchor, data=lw)
#
#     Wilcoxon rank sum test with continuity correction
#
# data:  Post.karma by Anchor
# W = 19, p-value = 0.1117
# ...
wilcox.test(Post.karma ~ Anchor, data=(lw[lw$Post.karma<100,]))
#
#     Wilcoxon rank sum test with continuity correction
#
# data:  Post.karma by Anchor
# W = 19, p-value = 0.203

Reasonable. To work around the bug, let’s replace Yvain’s datapoint with the mean for that group without him, 33.
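
(A sketch of the imputation, assuming Yvain’s post is the 120-karma datapoint; 33 is the mean of the remaining positive-anchor posts:)

lw2 <- lw
lw2$Post.karma[lw2$Post.karma == 120] <- 33   # swap the outlier for its group's mean
neg <- lw2[lw2$Anchor==0,]$Post.karma
pos <- lw2[lw2$Anchor==1,]$Post.karma
mcmc <- BESTmcmc(neg, pos)
BESTplot(neg, pos, mcmcChain=mcmc)

The new results: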

#            SUMMARY.INFO
# PARAMETER       mean   median     mode  HDIlow HDIhigh pcgtZero
#   mu1        20.2877  20.2002  20.1374  10.863 29.9532       NA
#   mu2        32.7912  32.7664  32.8370  15.609 50.4410       NA
#   muDiff    -12.5035 -12.4802 -12.2682 -32.098  7.3301    9.396
#   sigma1     13.2561  12.3968  10.9385   6.044 22.3085       NA
#   sigma2     22.4574  20.6784  18.3106  10.449 38.5859       NA
#   sigmaDiff  -9.2013  -8.1031  -7.1115 -28.725  7.6973   11.685
#   nu         33.2258  24.5819   8.6726   1.143 91.3693       NA
#   nuLog10     1.3575   1.3906   1.4516   0.555  2.0837       NA
#   effSz      -0.7139  -0.7066  -0.7053  -1.779  0.3607    9.396
Graphical summary of BEST results for the dataset with Yvain replaced by a mean

The difference in means has shrunk but not gone away; the posterior is wide enough that ~10% of the possible effect sizes (of “a negative initial comment rather than positive”) may be zero or actually positive (increase karma) instead. This is a little concerning, but I don’t take it too seriously:

  1. this is not a lot of data
  2. as we’ve seen, there are extreme outliers suggesting that the assumption that karma scores are Gaussian/normal may be badly wrong (a quick check follows this list)
  3. even at face value, ~10 karma points doesn’t seem large enough to have any important real-world consequences (like making people leave LW who should’ve stayed)
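
(A quick informal check of point 2 - a Shapiro-Wilk normality test and a Q-Q plot on the raw karma scores:)

shapiro.test(lw$Post.karma)                    # null hypothesis: scores are normally distributed
qqnorm(lw$Post.karma); qqline(lw$Post.karma)   # visual check for skew & heavy tails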

Pursuing point 2, there are two main options for dealing with the normal distribution assumption of BEST/t-tests being violated: either switch to a test which assumes a distribution more like what we actually see & hope that this new distribution is close enough to true, or use a test which doesn’t rely on distributions at all.

For example, we could try to model it as a Poisson process and see if the anchoring variable is an important predictor of the mean of the process sum, which it seems to be:

summary(glm(Post.karma ~ Anchor, data=lw, family=poisson))
# ...
# Deviance Residuals:
#    Min      1Q  Median      3Q     Max
# -6.194  -2.915  -0.396   0.885   9.423
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.0339     0.0731   41.49   <2e-16
# Anchor        0.7503     0.0905    8.29   <2e-16
# ...
summary(glm(Post.karma ~ Anchor, data=(lw[lw$Post.karma<100,]), family=poisson))
# ...
# Deviance Residuals:
#    Min      1Q  Median      3Q     Max
# -4.724  -2.386  -0.744   2.493   4.594
#
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.0339     0.0731   41.49   <2e-16
# Anchor        0.4669     0.0983    4.75    2e-06

(On a side note, I regard these p-values as evidence for an effect even though they don’t fall under 0.05 or another alpha I defined in advance: with this small sample size and hence low statistical power, to reach p < 0.05, each anchoring comment would have to have a grotesquely large effect on article karma - but anchoring comments having such an effect is highly unlikely! Anchoring, in psychology, isn’t that omnipotent: it’s relatively subtle. So we have a similar problem as before - these are the sort of situations where it’d be nicer to be talking in more relative terms.)
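
(To put the low power in rough perspective: a sketch under the dubious normality assumption, using the smaller group size of 8 and the negative group’s SD of ~13 from the BEST output:)

power.t.test(n=8, sd=13, sig.level=0.05, power=0.8)
## solving for delta: the detectable difference is on the order of 20 karma points,
## comparable to the negative group's entire mean karma - hence "grotesquely large"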

Comment treatment

How did these mindless unsubstantiated comments either praising or criticizing an article get treated by the community? Let’s look at the anchoring comments’ karma:

## Comment.karma - each anchoring comment's own score - was recorded alongside
## the post karma; it is not included in the lw data frame shown above
neg <- lw[lw$Anchor==0,]$Comment.karma
pos <- lw[lw$Anchor==1,]$Comment.karma
mcmc <- BESTmcmc(neg, pos)
BESTplot(neg, pos, mcmcChain=mcmc)
#            SUMMARY.INFO
# PARAMETER      mean  median     mode   HDIlow HDIhigh pcgtZero
#   mu1       -6.4278 -6.4535 -6.55032 -10.5214 -2.2350       NA
#   mu2       -0.2755 -0.2455 -0.01863  -1.3180  0.7239       NA
#   muDiff    -6.1523 -6.1809 -6.25451 -10.3706 -1.8571    0.569
#   sigma1     5.6508  5.2895  4.70143   2.3262  9.7424       NA
#   sigma2     1.2614  1.1822  1.07138   0.2241  2.4755       NA
#   sigmaDiff  4.3893  4.0347  3.53457   1.1012  8.5941   99.836
#   nu        27.4160 18.1596  4.04827   1.0001 83.9648       NA
#   nuLog10    1.2060  1.2591  1.41437   0.2017  2.0491       NA
#   effSz     -1.6750 -1.5931 -1.48805  -3.2757 -0.1889    0.569
Graphical summary of BEST results for the full dataset of how the positive/negative comments were treated

As one would hope, neither group of comments ends up with a net positive mean score, but they’re clearly being treated very differently: the negative comments get downvoted far more than the positive comments. I take this as perhaps implying that LW’s reputation for being negative & hostile is a bit overblown: we’re negative and hostile to poorly-thought-out criticisms and arguments, not fluffy praise.


  1. If you were wondering about the account name: both ‘Rhwawn’ and ‘Gwern’ are character names from the Welsh collection the Mabinogion. They share the distinctions of being short, nearly unique, and obviously pseudonymous to anyone who Googles them, which is why I also used that name as an alternate account on Wikipedia.↩︎