LW anchoring experiment

Do mindless positive/negative comments skew article quality ratings up and down?
statistics, experiments, psychology, R
2012-02-27–2014-02-17 finished certainty: unlikely importance: 1

I do an informal experiment testing whether LessWrong karma scores are susceptible to a form of anchoring based on the first comment posted; a medium-large effect size is found. However, the data does not fit the assumed normal distribution, so there may or may not be any actual anchoring effect.

Do social media websites suffer from Matthew effects or anchoring, where an early intervention can have substantial effects? Most studies or analyses suggest the effects are small.


2012-02-27 IRC:

Grognor> I've been reading the highest-scoring articles, and I have noticed a pattern
         a MUCH HIGHER PROPORTION of top-scoring articles have "upvoted" in the first two words in the
         first comment
         (standard disclaimer: correlation is not causation blah blah)
         http://lesswrong.com/r/lesswrong/lw/6r6/tendencies_in_reflective_equilibrium/ then I see this,
         one of the follow-ups to one of the top-scoring articles like this. the first comment
         says "not upvoted"
         and it has a much lower score
         while reading it, I was wondering "why is this at only 23? this is one of the best articles
                                           I've ever oh look at that comment"
         I'm definitely hitting on a real phenomenon here, probably some social thing that says
         "hey if he upvoted, I should too" and it seems to be incredibly
         strongly related to the first comment
         the proportion is really quite astounding
         http://lesswrong.com/lw/5n6/i_know_im_biased_but/ compare this to the
         Tendencies in Reflective Equilibrium post. Compared to that, it's awful, but it has
         nearly the same score. Note the distinct lack of a first comment saying "not upvoted"
         (side note: I thought my article, "on saying the obvious", would have a much lower score than
         it did. note the first comment: "good points, all of them.")
         it seems like it's having more of an effect than I would naively predict
gwern> Grognor: hm. maybe I should register a sockpuppet and on every future article I write flip a coin
                and write either upvoted or downvoted
quanticle> gwern: Aren't you afraid you'll incur Goodhart's wrath?
gwern> quanticle: no, that would be if I only put in 'upvoted' comments
gwern> Grognor: do these comments tend to include any reasons?
Grognor> gwern: yes

Boxo> you suggesting that the comments cause the upvotes? I'd rather say that the post is just the kind
      of post that makes lots people think as their first reaction
      "hell yeah imma upvote this", makes upvoting salient to them, and then some of that bubbles up to
      the comments
Grognor> Boxo: I'm not suggesting it's entirely that simple, no, but I do think it's obvious now that a
              first comment that says "upvoted for reasons x, y, and z"
              will cause more people to upvote than otherwise would have, and vice versa
Boxo> (ie. you saw X and Y and though X caused Y, but I think there's a Z that causes both X and Y)
ksotala> Every now and then, I catch myself wanting to upvote something because others have upvoted it
         already. It sounds reasonable that having an explicit comment
         declaring "I upvoted" might have an even stronger effect.
ksotala> On the other hand, I usually decide to up/downvote before reading the comments.
gwern> ksotala: you should turn on anti-kibitzing then
rmmh> gwern: maybe karma blinding like HN would help
Boxo> I guess any comment about voting could remind people to vote, in whatever direction. Could test this
      if you had the total number of votes per post.
Grognor> that too. the effect here is multifarious and complicated and the intricate details could not
         possibly be worked out, which is exactly why this proportion of first comments with an 'upvoted'
         note surprises me

Such an anchoring or Matthew effect seems plausible to me. At least one experiment found something similar online: “Social Influence Bias: A Randomized Experiment”:

Our society is increasingly relying on the digitized, aggregated opinions of others to make decisions. We therefore designed and analyzed a large-scale randomized experiment on a social news aggregation Web site to investigate whether knowledge of such aggregates distorts decision-making. Prior ratings created significant bias in individual rating behavior, and positive and negative social influences created asymmetric herding effects. Whereas negative social influence inspired users to correct manipulated ratings, positive social influence increased the likelihood of positive ratings by 32% and created accumulating positive herding that increased final ratings by 25% on average. This positive herding was topic-dependent and affected by whether individuals were viewing the opinions of friends or enemies. A mixture of changing opinion and greater turnout under both manipulations together with a natural tendency to up-vote on the site combined to create the herding effects. Such findings will help interpret collective judgment accurately and avoid social influence bias in collective intelligence in the future.


So on the 27th, I registered the account “Rhwawn”1. I made some quality comments and upvotes to seed the account as a legitimate active account.

Thereafter, whenever I wrote an Article or Discussion, after making it public, I flipped a coin: if heads, I posted a comment as Rhwawn saying “Upvoted”, and if tails, a comment saying “Downvoted”, with some additional text (see next section). Needless to say, no actual vote was made. I then made a number of quality comments and votes on other Articles/Discussions to camouflage the experimental intervention. (In no case did I upvote or downvote someone I had already replied to or voted on with my Gwern account.) Finally, I scheduled a reminder on my calendar for 30 days later to record the karma on that Article/Discussion. I don’t post that often, so I decided to stop after 1 year, on 2013-02-27. I wound up breaking this decision: by September 2012 I had ceased to find it an interesting question, it was an unfinished task burdening my mind, and the necessity of making some genuine contributions as Rhwawn to cloak an anchoring comment was a not-so-trivial inconvenience that was stopping me from posting.

To enlarge the sample, I passed Recent Posts through xclip -o | grep '^by ' | cut -d ' ' -f 2 | sort | uniq -c | sort -g, picked everyone with >=6 posts (8 people excluding me), and messaged them with a short note explaining my desire for a large sample and the burden of participation (“It would require perhaps half a minute to a minute of your time every time you post an Article or Discussion for the next year, which is for most of you no more than once a week or month.”)

For those who replied, I sent a copy of this writeup and explained that their procedure would be as follows: every time they posted, they would flip a coin and post likewise (the Rhwawn account password having been shared with them); however, as a convenience to them, I would take care of recording the karma a month later. (I subscribed to participants’ post RSS feeds; this would not guarantee that I would learn of their posts in time to add a randomized sock comment - hence the need for their active participation - but I could at least handle the scheduling & karma-checking for them.)

Comment variation

Grognor pointed out that the original comments came with reasons; but unfortunately, if I came up with reasons for either comment, some criticisms or praise would be better than others, and this would be another source of variability; so I added generic reasons.

I know from watching them plummet into oblivion that comments which are just “Upvoted” or “Downvoted” are not a good idea for any anchoring question - they’ll quickly be hidden, so any effect size will be a lot smaller than usual, and it’s possible that hidden comments themselves anchor (my guess: negatively, by making people think “why is this attracting stupid comments?”).

But if you go with more carefully rationalized comments, you run into something like the XKCD cartoon: quality comments start to draw on the experimenter’s own strengths & weaknesses (I’m sure I could make both quality criticisms and praise of psychology-related articles, but not so much of technical decision theory articles).

It’s a “damned if you do, damned if you don’t” sort of dilemma.

I hoped my strategy would be a golden mean: not so trivial as to be downvoted into oblivion, but not so high-quality and individualized that comparability was lost. I think I came close, since as we see in the analysis section, the positive anchoring comments saw only a small net downvote, indicating LWers may not have regarded them as good enough to upvote but also not so obviously bad as to merit a downvote.

(Of course, I didn’t expect the positive and negative comments to be treated differently - they’re pretty much the same thing, with a negation. I’m not sure how I would have designed it differently if I had known about the double standard in advance.)

To avoid issues with some criticisms being accurate and others poor, a fixed list of reasons was used, with minor variation to make them fit in:

  1. Negative; “Downvoted; …”

    • “too much weight on n studies”
    • “too many studies cited”
    • “too reliant on personal anecdote”
    • “too heavy on math”
    • “not enough math”
    • “rather obvious”
    • “not very interesting”
  2. Positive; “Upvoted; …”

    • “good use of n studies”
    • “thorough citation of claims”
    • “enjoyable anecdotes”
    • “rigorous use of math”
    • “just enough math”
    • “not at all obvious”
    • “very interesting”
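The coin flip plus the fixed reason lists can be sketched in R (a hypothetical helper - the actual flips were done per-post with a real coin):

```r
# Pick an anchoring comment at random: heads -> positive, tails -> negative.
negative <- c("too much weight on n studies", "too many studies cited",
              "too reliant on personal anecdote", "too heavy on math",
              "not enough math", "rather obvious", "not very interesting")
positive <- c("good use of n studies", "thorough citation of claims",
              "enjoyable anecdotes", "rigorous use of math",
              "just enough math", "not at all obvious", "very interesting")
makeAnchor <- function(heads = rbinom(1, 1, 0.5) == 1) {
    if (heads) { paste0("Upvoted; ", sample(positive, 1)) }
    else       { paste0("Downvoted; ", sample(negative, 1)) }
}
makeAnchor()
```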


I will have to do some contemplation of values before I accept or reject. I like getting honest feedback on my posts, I like accumulating karma, and I also like performing experiments.

Randomization suggests that your expected-karma-value would be 0, unless you expect asymmetry between positive and negative.

What do you anticipate doing with the data accumulated over the course of the experiment?

Oh, it’d be simple enough. Sort articles into one group of karma scores for the positive anchors, the other group for the negative anchors; feed into a two-sample t-test to see if the means differ and if the difference is significant. I can probably copy the R code straight from my various Zeo-related R sessions.
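That planned analysis would look something like the following sketch (with made-up karma scores, since at that point no data had been collected):

```r
# Hypothetical karma scores for illustration only:
pos.karma <- c(25, 40, 18, 33, 29)
neg.karma <- c(12, 22, 15, 30, 19)
# Welch two-sample t-test (R's default): do the group means differ?
t.test(pos.karma, neg.karma)
```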

If I can hit p < 0.10 or p < 0.05 or so, post an Article triumphantly announcing the finding of bias and an object lesson in why one shouldn’t take karma too seriously; if I don’t, post a Discussion article discussing it and why I thought the results didn’t reach significance. (Not enough articles? Too-weak assumptions in my t-test?)

And the ethics?

The post authors are volunteers, and as already pointed out, the expected karma benefit is 0. So no one is harmed, and as for the deception, it does not seem to me to be a big deal. We are already nudged by countless primes and stimuli and biases, so another one, designed to be neutral in total effect, seems harmless to me.

“What comes before determines what comes after…The thoughts of all men arise from the darkness. If you are the movement of your soul, and the cause of that movement precedes you, then how could you ever call your thoughts your own? How could you be anything other than a slave to the darkness that comes before?…History. Language. Passion. Custom. All these things determine what men say, think, and do. These are the hidden puppet-strings from which all men hang…all men are deceived….So long as what comes before remains shrouded, so long as men are already deceived, what does [deceiving men] matter?”



The results:


For the analysis, I have 2 questions:

  1. Is there a difference in karma between posts that received a negative initial comment and those that received a positive initial comment? (Any difference suggests that one or both are having an effect.)
  2. Is there a difference in karma between the two kinds of initial comments (as I began to suspect during the experiment)?

Article effect

Some Bayesian inference using BEST:

lw <- data.frame(Anchor     = c( 0, 0, 1, 0, 1, 0, 1,  1, 1, 1, 0, 0, 1, 0, 1, 0, 0),
                 Post.karma = c(11,19,50, 9,19,11,62,120,49,20,16,20,10,45,22,23,33))

neg <- lw[lw$Anchor==0,]$Post.karma
pos <- lw[lw$Anchor==1,]$Post.karma
mcmc = BESTmcmc(neg, pos)
BESTplot(neg, pos, mcmcChain=mcmc)
#            SUMMARY.INFO
# PARAMETER       mean  median     mode   HDIlow HDIhigh pcgtZero
#   mu1        20.1792  20.104  20.0392  10.7631 29.9835       NA
#   mu2        41.9474  41.640  40.4661  11.0307 75.1056       NA
#   muDiff    -21.7682 -21.519 -22.6345 -55.3222 11.2283    8.143
#   sigma1     13.1212  12.264  10.9018   5.8229 22.4381       NA
#   sigma2     40.9768  37.835  33.8565  16.5560 72.6948       NA
#   sigmaDiff -27.8556 -24.995 -21.7802 -60.9420 -0.9855    0.838
#   nu         30.0681  21.230   5.6449   1.0001 86.5698       NA
#   nuLog10     1.2896   1.327   1.4332   0.4332  2.0671       NA
#   effSz      -0.7718  -0.765  -0.7632  -1.8555  0.3322    8.143
Graphical summary of BEST results for full dataset

The results are heavily skewed by Yvain’s very popular post; we can’t trust any results based on such a high-scoring post. Let’s try omitting Yvain’s datapoint. BEST actually crashes displaying the result, perhaps due to making an assumption about there being at least 8 datapoints or something; and it’s questionable whether we should be using a normal-based test like BEST in the first place: just from graphing we can see it’s definitely not a normal distribution. So we’ll fall back to a distribution-free two-sample test (rather than a t-test):

# wilcox.test(Post.karma ~ Anchor, data=lw)
#     Wilcoxon rank sum test with continuity correction
# data:  Post.karma by Anchor
# W = 19, p-value = 0.1117
# ...
wilcox.test(Post.karma ~ Anchor, data=(lw[lw$Post.karma<100,]))
#     Wilcoxon rank sum test with continuity correction
# data:  Post.karma by Anchor
# W = 19, p-value = 0.203

Reasonable. To work around the bug, let’s replace Yvain by the mean for that group without him, 33; the new results:
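The replacement can be done mechanically (a sketch, re-using the `lw` data frame defined above):

```r
lw <- data.frame(Anchor     = c( 0, 0, 1, 0, 1, 0, 1,  1, 1, 1, 0, 0, 1, 0, 1, 0, 0),
                 Post.karma = c(11,19,50, 9,19,11,62,120,49,20,16,20,10,45,22,23,33))
# The outlier is the only post with >100 karma; replace it with the rounded
# mean of the rest of its (positive-anchor) group: 232/7 = ~33
outlier <- lw$Post.karma > 100
replacement <- round(mean(lw$Post.karma[lw$Anchor == 1 & !outlier]))
lw$Post.karma[outlier] <- replacement
```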

#            SUMMARY.INFO
# PARAMETER       mean   median     mode  HDIlow HDIhigh pcgtZero
#   mu1        20.2877  20.2002  20.1374  10.863 29.9532       NA
#   mu2        32.7912  32.7664  32.8370  15.609 50.4410       NA
#   muDiff    -12.5035 -12.4802 -12.2682 -32.098  7.3301    9.396
#   sigma1     13.2561  12.3968  10.9385   6.044 22.3085       NA
#   sigma2     22.4574  20.6784  18.3106  10.449 38.5859       NA
#   sigmaDiff  -9.2013  -8.1031  -7.1115 -28.725  7.6973   11.685
#   nu         33.2258  24.5819   8.6726   1.143 91.3693       NA
#   nuLog10     1.3575   1.3906   1.4516   0.555  2.0837       NA
#   effSz      -0.7139  -0.7066  -0.7053  -1.779  0.3607    9.396
Graphical summary of BEST results for dataset with Yvain replaced by a mean

The difference in means has shrunk but not gone away; it’s large enough that 10% of the possible effect sizes (of “a negative initial comment rather than positive”) may be zero or actually be positive (increase karma) instead. This is a little concerning, but I don’t take this too seriously:

  1. this is not a lot of data
  2. as we’ve seen, there are extreme outliers suggesting that the assumption that karma scores are Gaussian/normal may be badly wrong
  3. even at face value, 10 karma points doesn’t seem large enough to have any important real-world consequences (like making people leave LW who should’ve stayed)

Pursuing point number two, there are two main options for dealing with the violation of the normal-distribution assumption of BEST/t-tests: either switch to a test which assumes a distribution more like what we actually see & hope that this new distribution is close enough to true, or use a test which doesn’t rely on distributions at all.
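The second, distribution-free option can also be done by brute force as a permutation test: shuffle the anchor labels and see how often a shuffled mean difference is at least as large as the observed one (a sketch using the `lw` data above):

```r
lw <- data.frame(Anchor     = c( 0, 0, 1, 0, 1, 0, 1,  1, 1, 1, 0, 0, 1, 0, 1, 0, 0),
                 Post.karma = c(11,19,50, 9,19,11,62,120,49,20,16,20,10,45,22,23,33))
# Difference in group means under a given labeling of posts:
meanDiff <- function(anchor) {
    mean(lw$Post.karma[anchor == 1]) - mean(lw$Post.karma[anchor == 0])
}
observed <- meanDiff(lw$Anchor)
set.seed(1)
perms <- replicate(10000, meanDiff(sample(lw$Anchor)))
mean(abs(perms) >= abs(observed))   # two-sided permutation p-value
```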

For example, we could try to model karma as a Poisson process and see if the anchoring variable is an important predictor of the mean of the process, which it seems to be:

summary(glm(Post.karma ~ Anchor, data=lw, family=poisson))
# ...
# Deviance Residuals:
#    Min      1Q  Median      3Q     Max
# -6.194  -2.915  -0.396   0.885   9.423
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.0339     0.0731   41.49   <2e-16
# Anchor        0.7503     0.0905    8.29   <2e-16
# ...
summary(glm(Post.karma ~ Anchor, data=(lw[lw$Post.karma<100,]), family=poisson))
# ...
# Deviance Residuals:
#    Min      1Q  Median      3Q     Max
# -4.724  -2.386  -0.744   2.493   4.594
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   3.0339     0.0731   41.49   <2e-16
# Anchor        0.4669     0.0983    4.75    2e-06

(On a side note, I regard these p-values as evidence for an effect even though they don’t fall under 0.05 or another alpha I defined in advance: with this small sample size and hence low statistical power, to reach p < 0.05, each anchoring comment would have to have a grotesquely large effect on article karma - but anchoring comments having such an effect is highly unlikely! Anchoring, in psychology, isn’t that omnipotent: it’s relatively subtle. So we have a similar problem as before - these are the sort of situations where it’d be nicer to be talking in more relative terms.)
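The low-power point can be illustrated by simulation under assumed (hypothetical) parameters - say, two groups of 8 posts with karma standard deviation around 15, roughly matching the data above:

```r
# Fraction of simulated experiments reaching p < 0.05 for a true karma
# difference d between groups (all parameters here are assumptions):
powerSim <- function(d, n = 8, sd = 15, sims = 2000) {
    set.seed(1)
    mean(replicate(sims,
        t.test(rnorm(n, 20, sd), rnorm(n, 20 + d, sd))$p.value < 0.05))
}
powerSim(10)   # a modest ~10-karma effect: power well under 50%
powerSim(30)   # only a very large effect gives high power
```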

Comment treatment

How did these mindless, unsubstantiated comments either praising or criticizing an article get treated by the community? Let’s look at the anchoring comments’ karma:

# (assumes `lw` also includes a Comment.karma column recording each
# anchoring comment's own karma - data not shown above)
neg <- lw[lw$Anchor==0,]$Comment.karma
pos <- lw[lw$Anchor==1,]$Comment.karma
mcmc = BESTmcmc(neg, pos)
BESTplot(neg, pos, mcmcChain=mcmc)
#            SUMMARY.INFO
# PARAMETER      mean  median     mode   HDIlow HDIhigh pcgtZero
#   mu1       -6.4278 -6.4535 -6.55032 -10.5214 -2.2350       NA
#   mu2       -0.2755 -0.2455 -0.01863  -1.3180  0.7239       NA
#   muDiff    -6.1523 -6.1809 -6.25451 -10.3706 -1.8571    0.569
#   sigma1     5.6508  5.2895  4.70143   2.3262  9.7424       NA
#   sigma2     1.2614  1.1822  1.07138   0.2241  2.4755       NA
#   sigmaDiff  4.3893  4.0347  3.53457   1.1012  8.5941   99.836
#   nu        27.4160 18.1596  4.04827   1.0001 83.9648       NA
#   nuLog10    1.2060  1.2591  1.41437   0.2017  2.0491       NA
#   effSz     -1.6750 -1.5931 -1.48805  -3.2757 -0.1889    0.569
Graphical summary of BEST results for full dataset of how the positive/negative comments were treated

As one would hope, neither group of comments ends up with a net positive mean score, but they’re clearly being treated very differently: the negative comments get downvoted far more than the positive comments. I take this as perhaps implying that LW’s reputation for being negative & hostile is a bit overblown: we’re negative and hostile to poorly-thought-out criticisms and arguments, not fluffy praise.

See Also

  1. If you were wondering about the account name: both ‘Rhwawn’ and ‘Gwern’ are character names from the Welsh collection the Mabinogion. They share the distinctions of being short, nearly unique, and obviously pseudonymous to anyone who Googles them, which is why I also used that name as an alternate account on Wikipedia.↩︎