created: 10 Jan 2009; modified: 12 Jul 2016; status: finished; belief: highly likely
Everything old is new again. Wikipedia is the collaboration of amateur gentlemen, writ in countless epistolary IRC or email or talk page messages. And the American public’s untrammeled betting on elections and victories has been reborn as prediction markets.
Wikipedia admirably summarizes the basic idea:
Prediction markets…are speculative markets created for the purpose of making predictions. Assets are created whose final cash value is tied to a particular event (e.g., will the next US president be a Republican) or parameter (e.g., total sales next quarter). The current market prices can be interpreted as predictions of the probability of the event or the expected value of the parameter[1]. Prediction markets are thus structured as betting exchanges, without any risk for the bookmaker.
Emphasis is added on the most important characteristic of a prediction market, the way in which it differs from regular stock markets. The idea is that by tracking accuracy - punishing ignorance & rewarding knowledge in equal measure - a prediction market can elicit one’s true beliefs, and avoid the failure mode of predictions as pundit’s bloviation or wishful thinking or signaling alignment:
“The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at 10 he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred.”[2]
Events, not dividends or sales
Imagine a prediction market in which every day the administrator sells off pairs of shares (he doesn’t want to risk paying out more than he received) for $1 a share, and all the shares say either heads or tails. Then he flips a coin and gives everyone with a ‘right’ share $2. Obviously if people bid up heads to $5, this is crazy and irrational - even if heads wins today, one would still lose. Similarly for any amount greater than $2. But $2 is also crazy: the only way this share price doesn’t lose money is if heads is 100% guaranteed. Of course, it isn’t. It is quite precisely guaranteed to not be the case - 50% not the case. Any price above $1, which implies odds above 50%, is going to lose in the long run.
A smart investor could come into this market, and blindly buy any share whatsoever that was less than $1; they would make money. If their shares were even 99¢, then about half would turn into $2 and half into 0…
This is all elementary and obvious, and it’s how we can convince ourselves that market prices can indeed be interpreted as predictions of expected value. But that’s only because the odds are known in advance! We specified it was a fair coin. If the odds of the event were not known, then things would be much more interesting. No one bets on a coin flip: we bet on whether John is bluffing.
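The coin-flip arithmetic can be checked in a few lines. This is only a sketch under the example's own assumptions (a fair coin, a $2 payout for the winning share); the function names and the sample prices are mine, chosen for illustration.

```python
# Sketch of the coin-flip market: a winning share pays $2 and the coin is fair.
# `expected_profit` and `implied_probability` are illustrative names, not from any library.

def expected_profit(price, p_win=0.5, payout=2.0):
    """Average profit per share bought at `price`, if it pays `payout` with probability `p_win`."""
    return p_win * payout - price


def implied_probability(price, payout=2.0):
    """Probability of winning at which buying at `price` exactly breaks even."""
    return price / payout


# Sample prices (assumed): just under $1, exactly $1, and the two 'crazy' prices.
for price in (0.99, 1.00, 2.00, 5.00):
    print(
        f"price ${price:.2f}: "
        f"expected profit per share {expected_profit(price):+.2f}, "
        f"break-even probability {implied_probability(price):.1%}"
    )
```

A break-even probability above 100% (as at $5) is just another way of saying no outcome can make the purchase pay.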
Real prediction markets famously prefer to make the subject of a share a topic like the party of the victor of the 2008 American presidential elections; a topic with a relatively clear outcome (barring the occasional George W. Bush or coin landing on its edge) and of considerable interest to many.
Interest, I mean, not merely for speculating on, but possibly of real world importance. Advocates for prediction markets as tools, such as Robin Hanson, tirelessly remind us of the possible benefits in ‘aggregating information’. Prediction markets reward clear thinking and insider information, but they focus on topics it’d be difficult to clearly bet for or against on regular financial markets.
Yes, if I thought the financial markets were undervaluing green power stocks because they were weighing Senator John McCain’s presidential candidacy too heavily, then I could do something like short those stocks. But suppose that’s all I know about the green power stocks and the financial markets? It’d be madness to go and trade on that belief alone. I’d be exposing myself to countless risks, countless ways for the price of green stocks to be unconnected to McCain’s odds, countless intermediaries, countless other relations of green stocks which may cancel out my correct appraisal of one factor. Certainly in the long run, weakly related factors will have exactly the effect they deserve to have. But this is a long run in which the investor is quite dead.
Prediction markets offer a way to cut through all the confounding effects of proxies, and bet directly and precisely on that bit of information. If I believe Senator Barack Obama has been unduly discounted, then I can directly buy shares in him instead of casting about for some class of stocks that might be correlated with him - which is a formidable task in and of itself; perhaps oil stocks will rise because Obama’s platform includes withdrawal from Iraq which would render the Middle East less stable, or perhaps green stocks will rise for similar reasons, or perhaps they’ll all fall because people think he’ll be incompetent, or perhaps optimism over a historic election of a half-black man and enthusiasm over his plans will lift all boats…
One will never get a faithful summation of all the information about Obama scattered among hundreds or thousands of traders if one places multiple difficult barriers in front of a trader who wishes to turn his superior knowledge or analysis into money.
Or here’s another example: many of the early uses of prediction markets have been inside corporations, betting on metrics like quarterly sales. Now, all of those metrics are important and will in the long run affect stock prices or dividends. But what employee working down in the R&D department is going to say ‘People are too optimistic about next year’s sales, the prototypes just aren’t working as well as they would need to’ and go short the company’s stock? No one, of course. A small difference in their assessment from everyone else’s is unlikely to make a noticeable price difference, even if the transaction costs of shorting didn’t bar it. And yet, the company wants to know what this employee knows.
How much to bet
There’s something of an efficient market issue with prediction markets, specifically a no-trade theorem. Unlike the regular stock market, trades in prediction markets are usually zero-sum[3], and so lots of traders are going to be net losers. If you don’t have any particular reason to think you are one of the wolves canny enough to make money off the sheep, then you’re one of the sheep, and why trade at all? (I understand poker players have a saying - if you can’t spot the fish at the table, you’re the fish.)
So, the bad-and-self-aware won’t participate. If you are trading in a prediction market, you are either good-and-aware or good-and-ignorant or bad-but-ignorant. Ironically, the latter two can’t tell whether they are the first group or not. It reminds me of the smoking lesion puzzle or “King Solomon’s problem” in decision theory: you may have many cognitive biases such as the overconfidence effect (the lesion), and they may cause you to fail or succeed on the prediction market (get cancer or not) and also want to participate therein. What do you do?
Best of course is to test for the lesion directly - to test whether our predictions are calibrated[4], whether events we confidently predict at 0% do in fact never happen and so on. If we manage to overcome our biases, we can give calibrated probability assessments. We can do this sort of testing with the relevant biases - just knowing about them and introspecting about one’s predictions can improve them. Coming up with the precise reasons one is making a prediction improves one’s predictions[5] and can also help with the hindsight bias[6] or the temptation to falsify your memories based on social feedback, all of which is important to figuring out how well you will do in the future. We can quickly test calibration using our partial ignorance about many factual questions, e.g. the links in “Test Your Calibration!”. My recent practice with thousands of real-world predictions on PredictionBook.com has surely helped my calibration.
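Checking calibration is mostly bookkeeping: record each prediction with the probability you stated, then compare each confidence level against how often those predictions came true. A rough sketch follows. The function name, the 10%-wide buckets, and the sample record are all my own assumptions, not anything PredictionBook.com prescribes.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Map each stated confidence level (rounded to the nearest 10%) to
    (number of predictions, fraction that came true).

    `predictions` is an iterable of (stated_probability, came_true) pairs.
    """
    buckets = defaultdict(list)
    for prob, came_true in predictions:
        bucket = round(prob * 10) / 10  # nearest 10%; exact ties follow Python's round()
        buckets[bucket].append(1 if came_true else 0)
    return {b: (len(hits), sum(hits) / len(hits)) for b, hits in sorted(buckets.items())}


# Made-up record for illustration: ten predictions at 90%, five at 60%.
record = [(0.9, True)] * 9 + [(0.9, False)] + [(0.6, True)] * 3 + [(0.6, False)] * 2

for level, (n, observed) in calibration_table(record).items():
    print(f"stated {level:.0%}: {n} predictions, {observed:.0%} came true")
```

A well-calibrated predictor's observed frequencies track the stated levels; large gaps (say, 90% predictions coming true only 60% of the time) are the overconfidence the text worries about.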
So, how much better are you than your competing traders? What is your edge? This, believe it or not, is pretty much all you need to know to know how much to bet on any contract. The exact fraction of your portfolio to bet given a particular edge is defined by the Kelly criterion, which maximizes the expected long-run growth rate of your portfolio. (But you need to be psychologically tough[7] to use it lest you begin to play irrationally: it’s not a risk averse strategy. And strictly speaking, it doesn’t immediately apply to multiple bets you can choose from, but let’s say that whatever we’re looking at is the bet we feel is the most mispriced and we can do the best on.)
The formula is:

$$x = \frac{o \times e - (1 - e)}{o}$$

- o = odds
- e = your edge
- x = the fraction to invest
To quote the Wikipedia explanation:
As an example, if a gamble has a 60% chance of winning ($p = 0.60$, $q = 0.40$), but the gambler receives 1-to-1 odds on a winning bet ($b = 1$), then the gambler should bet 20% of the bankroll at each opportunity ($f^* = 0.20$), in order to maximize the long-run growth rate of the bankroll.
So, suppose the President’s re-election contract was floating at 50%, but based on his performance and past incumbent re-election rates, you decide the true odds are 60%; you can buy the contract at 50% and if you hold until the election and are right, you get back double your money, so the odds are 1:1. The filled-in equation looks like

$$\frac{1 \times 0.6 - (1 - 0.6)}{1} = 0.2$$

Hence, you ought to put 20% of your portfolio into buying the President’s contract. (If we assume that all bets are double-or-nothing, Wikipedia tells us it simplifies to $2p - 1$, where $p$ is the probability of winning, which in this example would be $2 \times 0.6 - 1$ = $1.2 - 1$ = $0.2$. But usually our contracts in prediction markets won’t be that simple, so the simplification isn’t very useful here.)
It’s not too hard to apply this to more complex situations. Suppose the president were at, say, 10% but you are convinced the unfortunate equine sex scandal will soon be forgotten and the electorate will properly appreciate el Presidente for winning World War III by making his true re-election odds 80%. You can buy in at 10% and you resolve to sell out at 80%, for a reward of 70% or 7 times your initial stake (7:1). And we’ll again say you’re right 60% of the time. So your Kelly criterion looks like:

$$\frac{7 \times 0.6 - (1 - 0.6)}{7} = \frac{3.8}{7} \approx 0.54$$

Wow! We’re supposed to bet more than half our portfolio despite knowing we’ll lose the bet 40% of the time? Well, yes. With an upside like 7x, we can lose several bets in a row and eventually make up our loss. And if we win the first time, we just won huge.
It goes both ways, of course. If we have a market/true-odds of 80%/90% and we do the same thing, we have a return of 12.5% (9/8) rather than 100%, and for that little return ought to risk only:
As one would expect, with a smaller reward but equal risk compared to our first example, the KC recommends a smaller than 0.2 fraction invested.
If one doesn’t enjoy calculating the KC, one could always write a program to do so; Russell O’Connor has a nice Haskell blog post, “Implementing the Kelly Criterion” (he also has an interesting post on the KC and the lottery).
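For the simple single-bet case used here, such a program is only a few lines. The sketch below is my own minimal Python version (not O’Connor’s Haskell), using the o/e/x notation from the list above; the function name `kelly_fraction` is made up for illustration.

```python
def kelly_fraction(odds, edge):
    """Fraction x of the portfolio to stake on a single bet.

    odds: o, net winnings per unit staked (1 for a 1:1 bet, 7 for 7:1)
    edge: e, the probability that the bet wins
    Implements x = (o * e - (1 - e)) / o.
    A negative result means the bet is unfavourable at those odds: stake nothing.
    """
    if odds <= 0:
        raise ValueError("odds must be positive")
    if not 0 <= edge <= 1:
        raise ValueError("edge must be a probability between 0 and 1")
    return (odds * edge - (1 - edge)) / odds


# The two worked examples from the text, both with a 60% chance of being right.
print(f"1:1 odds: stake {kelly_fraction(1, 0.6):.1%} of the portfolio")
print(f"7:1 odds: stake {kelly_fraction(7, 0.6):.1%} of the portfolio")
```

As the caveat earlier notes, this covers one bet at a time; choosing among several simultaneous contracts needs more machinery than this.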
So once we are interested in prediction markets and would like to try them out, we need to pick one. There are several. I generally ignore the ‘play money’ markets like the Hollywood Stock Exchange, despite their similar levels of accuracy to the real money markets; I just have a prejudice that if I make a killing, then I ought to have a real reward like a nice steak dinner and not just increment some bits on a computer. The primary markets to consider are:
- Betfair and BETDAQ are probably the 2 largest prediction markets, but unfortunately, it is difficult for Americans to make use of them - Betfair bans them outright.
- Intrade is another European prediction market, similar to Betfair and BETDAQ, but it does not go out of its way to bar Americans, and thus is likely the most popular market in the United States. (Its sister site TradeSports was sport-only, and is now defunct.)
- HedgeStreet is some sort of hybrid of derivatives and predictions. I know little about it.
- The Iowa Electronic Markets (IEM) is an old prediction market, and one of the better covered in the American press. It’s a research prediction market, so it handles only small quantities of money and trades and has only a few traders[8]. Accounts max out at $500, a major factor in limiting the depth & liquidity of its markets.
I didn’t want to wager too much money on what was only a lark, and the IEM has the favorable distinction of being clearly legal in the USA. So I chose them.
In 2003, I sent in a check for $20. A given market’s contracts in the IEM are supposed to sum to $1, so $20 would let me buy around 40 shares - enough to play around with.
My IEM trading
“Like all weak men he laid an exaggerated stress on not changing one’s mind.”[9]
Prediction markets are known to have a number of biases. Some of these biases are shared with other betting exchanges; horse-racing is plagued with a ‘long-shot favoritism’ just like prediction markets are. (An example of long-shot favoritism would be Intrade and IEM shares for libertarian Ron Paul winning the 2008 Republican nomination trading at ludicrous valuations like 10¢, or Al Gore - who wasn’t even running - for the Democratic nomination at 5¢.) The financial structure of markets also seems to make shorting of such low-value (but still over-valued) shares more difficult. They can be manipulated, consciously or unconsciously, due to not being very good markets (“They are thin, trading volumes are anemic, and the dollar amounts at risk are pitifully small”), and that’s where they aren’t reflecting the prejudices of their users (one can’t help but suspect Ron Paul shares were overpriced because he has so many fans among techies).
I began experimenting with some small trades on IEM’s Federal Reserve interest rate market; I had a theory that there was a ‘favorites bias’ (the inverse of long-shot favoritism, where traders buck the conventional wisdom despite it being more correct). I simply based my trades on what I read in the New York Times. It worked fairly well. In 2005, I also dabbled in the markets on Microsoft and Apple share prices, but I didn’t find any values I liked.
2004 was, of course, a presidential election year. I couldn’t resist, and traded heavily. I avoided Democratic nominations, reasoning that I was too ignorant that year - which was true, I did not expect John Kerry to eventually win the nomination - and focused on the party-victory market. The traders there were far too optimistic about a Democratic victory; I knew ‘Bush is a war-time president’ (in addition to the incumbency!) as people said, and that this mattered a lot to the half of the electorate that voted for him in 2000. Giving him a re-election probability of under 40% was too foolish for words.
I did well on these trades, and then in October, I closed out all my trades, sold my Republican/Bush shares, and bought Kerry. I thought the debates had gone well for Kerry and was confident the Swift Boating wouldn’t do much in the end, and certainly couldn’t compensate for the albatross of Iraq.
As you know, I was quite wrong in this strategy. Bush did win, and won more than in 2000. And I lost $5-10. (Between a quarter and a half of my initial capital. Ouch! I was glad I hadn’t invested some more substantial sum like $200.) I had profited early on from people who had confused what they wanted to happen with what would, but then I had succumbed to the same thing. Yes, everyone around me (I live in a liberal state) was sure Kerry would win, but that’s no excuse for starting off with a correct assessment and then choosing a false one. It was a valuable lesson for me; this experience makes me sometimes wonder whether ‘personal’ prediction markets, if you will, could be a useful tool.
In 2005 & 2006, I did minimal interesting trading. I largely continued my earlier strategies in the interest rate markets. Slowly, I made up for my failures in 2004.
In 2007, the presidential markets started back up! I surveyed the markets and the political field with great excitement. As anyone remembers, it was the most interesting election in a very long time, with such memorable characters (Hillary Clinton, Ron Paul, Barack Obama, John McCain, Sarah Palin) and unexpected twists.
As in 2004, the odds of an ultimate Republican victory were far too low - hovering in the range of 30-40%. This is obviously wrong on purely historical considerations (Democrats don’t win the presidency that often), and seems particularly wrong when we consider that George W. Bush won in 2004. Anyone arguing that GWB poisoned the well for a succeeding Republican administration faces the difficult task of explaining (at least) 2 things:
- How association with GWB would be so damaging when GWB himself was re-elected in 2004 with a larger percentage of votes than 2000
- How association with GWB policies like Iraq would be so damaging when the daily security situation in Iraq has clearly improved since 2004.
- And in general: how a fresh Republican face (with the same old policies) could do any worse than GWB did, given that he will possess all the benefits of GWB’s policies and none of the personal animus against GWB.
The key to Republican betting was figuring out who was hopeless, and working from there by essentially short selling them. As time passed, one could sharpen one’s bets and begin betting for a candidate rather than against. My list ultimately looked like this:
- Ron Paul was so obviously not going to win. He appealed to only a small minority of the Republican party, had views idiosyncratic where they weren’t offensive, and wanted to destroy important Republican constituencies. If the Internets were America, perhaps he could’ve won.
- Rudy Giuliani was another easy candidate to bet against. He had multiple strikes: he was far too skeevy, questionable ethically (the investigations of Bernard Kerik were well underway at this point), had made himself a parody, had few qualifications, and a campaign strategy that was as ruinous as it was perplexing. He was unacceptable culturally, what with his divorces, loose living, humorous cross-dressing, and New York ways. He would not play well in Peoria.
- Fred Thompson was undone by being a bad version of Reagan. He didn’t campaign nearly as industriously as he needed to. The death knell, as far as I was concerned, was when national publications began mentioning the “lazy like a fox” joke as an old joke. No special appeal, no special resources, no conventional ability…
- Mitt Romney had 2 problems: he was slick and seemed inauthentic, and people focused too much on his being Mormon and his Massachusetts governorship (a position that would’ve been a great aid - if it hadn’t been in that disgustingly liberal state). I was less confident about striking him off, but I decided his odds of 20% or so were too generous.
- Mike Huckabee struck me as not having the resources to make it to the nomination. I was even less sure about this one than Mitt, but I lucked out - the supporters of Huckabee began infighting with Romney supporters.
This didn’t leave very many candidates for consideration. By this process of elimination, I was in fact left with only John McCain as a serious Republican contender. If you remember the early days, this was in fact a very strange result to reach: John McCain appeared tired, a beaten man from 2000 making one last pro forma try, his campaign inept and riven by infighting, and he was just in general - old, old, old.
But hey, his shares were trading in the 5-15% range. They were the best bargain going in the market. I held them for a long time and ultimately would sell them at 94-99¢ for a roughly 900% gain. (I sold them instead of waiting for the Republican convention because I was forgoing minimal gains, and I was concerned by reports on his health.)
A similar process obtained for the Democrats. A certain dislike of Hillary Clinton led me to think that her status as the heir presumptive (reflected in share prices) would be damaged at some point. All of the other candidates struck me as flakes and hopeless causes, with the exception of John Edwards and Barack Obama.
I eventually ruled out John Edwards as having no compelling characteristics and smacking of phoniness (much like Romney). I was never tempted to change my mind on him, and the adultery and hair flaps turned out to be waiting in the wings for him. So I could get rid of Edwards as a choice.
Is it any surprise I lighted on Obama? He had impressed me (and just about everyone else) with his 2004 convention speech, his campaign seemed quite competent and well-funded, the media clearly loved him, and so on. Best of all, his shares were relatively low (30-40%) and I had money left after the Republicans. So I bought Obama and sold Clinton. I eventually sold out of Obama at the quite respectable 78¢.
By the end of the election, I had made a killing on my Obama and McCain shares. My account balance stood at $38; so over the 3 or 4 years of trading I had nearly doubled my investment. $18 is perhaps enough for a steak dinner.
Further, I had learned a valuable lesson in 2004 about my own political biases and irrationality, and had earned the right in 2008 to be smug about foreseeing a McCain and Obama match-up when the majority of pundits were trying to figure out whether Hillary would be running against Huckabee or Romney.
And finally, I’ve concluded that my few observations aside, prediction markets are pretty accurate. I often use them to sanity-check myself by asking ‘If I disagree, what special knowledge do I have?’ Often I have none.
When I got out of the IEM, I reflected on my trades: I learned some valuable lessons, I had a good experience, and I came out a believer. I resolved that one day I’d like to try out a more substantial and varied market, like Intrade.
The following is an edited IEM trading history for me, removing many limit positions and other expired or canceled trades:
|Order date||O.time||Market||Contract||Order||#||Unit price||Expiry||Resolution type||R.#||R.price|
In 2010, I signed up for Intrade since the IEM was too small and had too few contracts to maintain my interest.
Paying Intrade, as a foreign company in Ireland, was a little tricky. I first looked into paying via debit card, but Intrade demanded considerable documentation, so I abandoned that approach. I then tried a bank transfer since that would be quick; but my credit union failed me and said Intrade had not provided enough information (which seemed unlikely to me, and Intrade’s customer service agreed) - and even if they had, they would charge me $10! Finally, I decided to just snail-mail them a check. I was pleasantly surprised to see that postage to Ireland was ~$1, and it made it there without a problem. But very slowly: perhaps 15 days or so before the check finally cleared and my initial $200 was deposited.
My Intrade trading
Intrade has a considerably less usable system than IEM. In IEM, selling short is very easy: you purchase a pair of contracts (yes/no) which sum to $1, and then you sell off the opposite. If I think DEM08 is too high compared to REP08, I get 1 share of each and sell the DEM08. Intrade, on the other hand, requires you to ‘sell’ a share. I don’t entirely understand it, but it seems to be equivalent.
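To see why the IEM pair trick amounts to a short, it helps to write out the cash flows. The sketch below assumes a two-contract winner-take-all market where the pair costs $1 and the winning contract pays $1. The function name and the 60¢ sale price are hypothetical, and fees are ignored.

```python
def bundle_short_payoff(sell_price, sold_contract_wins, bundle_cost=1.0):
    """Net profit from buying one pair of contracts for `bundle_cost`,
    selling one of them at `sell_price`, and holding the other to expiry.

    Assumes exactly one contract of the pair pays out `bundle_cost` at expiry.
    """
    cash_after_sale = sell_price - bundle_cost            # paid for the pair, got sell_price back
    kept_contract_payout = 0.0 if sold_contract_wins else bundle_cost
    return cash_after_sale + kept_contract_payout


# Hypothetical: buy a DEM08/REP08 pair for $1, sell the DEM08 at 60 cents.
print(f"if DEM08 wins: {bundle_short_payoff(0.60, True):+.2f}")
print(f"if REP08 wins: {bundle_short_payoff(0.60, False):+.2f}")
```

The outcomes are the same as a short sale of the sold contract at 60¢: you keep the sale price if it expires worthless, and lose the difference to $1 if it wins.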
I wanted to sell short some of the more crazy probabilities such as on Japan going nuclear or the USA attacking North Korea or Iran, but it turned out that to make even small profits on them, I would have to hold them a long time and because their probabilities were so low already, Intrade was demanding large margins - to buy 4 or 5 shorts would lock up half my account![10]
My first trade was to sell short the Intrade contract on California Proposition 19 (2010), which would legalize non-medical marijuana possession. I reasoned that California recently banned gay marriage at the polls, and medical marijuana is well-known as a joke (lessening the incentive to pass Prop 19), and that its true probability of passing was more like 30% - well below its current price. The contract would expire in just 2 months, making it even more attractive.
It was at 49 when I shorted it. I put around 20% of my portfolio (or ~$40) after consulting with the Kelly criterion. 2 days later, the price had increased to 53.3, and on 4 October, it had spiked all the way to 76%. I began to seriously consider how confident I was in my prediction, and whether I was faced with a choice between losing the full $40 I had invested or buying shares at 76% (to fulfill my shorting contracts) and eating the loss of ~$20. I meditated, and reasoned that there wasn’t that much liquidity and I had found no germane information online (like a poll registering strong public support), and decided to hold onto my shares. As of 27 October, the price had plummeted all the way to 27%, and continued to bounce around the 25-35% price range. I had at the beginning decided that the true probability was in the 30% decile, and if anything, it was now underpriced. Given that, I was running a risk holding onto my shorts. So on 30 October, I bought 10 shares at 26%, closing out my shorts, and netting me $75.83, for a return of $25.83, or 50% over the month I held it.
My second trade dipped into the highly liquid 2012 US presidential elections. The partisan contracts were trading at ~36% for the Republicans and ~73% for the Democrats. I would agree that the true odds are >50% for the Democrats since presidents are usually re-elected and the Republicans have few good-looking candidates compared to Obama, who has accomplished quite a bit in office. However, I think 73% is overstated, and further, that the markets always panic during an election and squish the ratio to around 50:50. So I sold Democrat and bought Republican. (I wound up purchasing more Republican contracts than selling Democrat contracts because of the aforementioned margin issues.)
I bought 5 Reps at 39, and shorted 1 Dem at 60.8. 2 days later, they had changed to 37.5 and 62.8 respectively. By 26 November 2010, it was 42 and 56.4. By 1 January 2011, Republicans were at 39.8 and Democrats at 56.8.
Finally, I decided that Sarah Palin has next to no chance at the Republican nomination since she blew a major hole in her credentials by her bizarre resignation as governor, and her shares at 18% were just crazy.
I shorted 10 at 18% since I thought the true odds are more like 10%. 2 days later, they had risen to 19%. By 26 November, they were still at 19%, but the odds of her announcing a candidacy had risen to 75%. I’d put the odds of her announcing a run at ~90% (a mistake, given that she ultimately decided against running in October 2011), but I don’t have any spare cash to buy contracts. I could sell out of the anti-nomination contracts and put that money into announcement, but I’m not sure this is a good idea - the announcement is very volatile, and I dislike eating the fees. She hasn’t done too well as the Tea Party eminence grise, but maybe she prefers it to the hard work of a national campaign?
By 1 January 2011, the nominee odds were still stuck at 18% but the announcement had fallen to 62%. The latter is dramatic enough that I’m wondering whether my 90% odds really are correct (they probably weren’t). By June, I’ve begun to think that Palin knows she has little chance of winning either the nomination or presidency, and is just milking the speculation for all it’s worth. Checking on 8 June, I see that the odds of an announcement have fallen from 62% to 33% and a nomination from 18% to 5.9% - so I would have made out very nicely on the nomination contract had I held the short, but been mauled if I had made any shorts on the announcement. I am not sure what lesson to draw from this observation; probably that I am better at assessing outcomes based on a great many people (like a nomination) than outcomes based on a single individual person’s psychology (like whether to announce a run or not).
In January 2011[11], Intrade announced a new fee structure - instead of paying a few cents per trade, one has free trading but your account is charged $5 every month or $60 a year (see also the forum announcement). Fees have been a problem with Intrade in the past due to the small amounts usually wagered - see for example financial journalist Felix Salmon’s 2008 complaints.
Initially, the new changes didn’t seem so bad to me, but then I compared the annual cost of this fee to my trading stake, ~$200. I would have to earn a return of 30% just to cover the fee! (This is also pointed out by many in the forum thread above.)
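The break-even figure is just the annual fee divided by the stake. Here is a small sketch of that arithmetic, with function names of my own choosing and a hypothetical 20% gross return plugged in for comparison:

```python
def breakeven_return(monthly_fee, stake, months=12):
    """Annual return needed on `stake` merely to cover a flat monthly account fee."""
    return months * monthly_fee / stake


def net_profit(stake, gross_return, monthly_fee, months=12):
    """Profit on `stake` after a year of flat fees, given a gross annual return."""
    return stake * gross_return - months * monthly_fee


# $5 a month on a ~$200 stake, as in the text.
print(f"break-even return: {breakeven_return(5, 200):.0%}")
# Hypothetical: a good year with a 20% gross return still ends in the red.
print(f"net profit at a 20% gross return: ${net_profit(200, 0.20, 5):+.2f}")
```

The flat fee scales inversely with the stake, so an account ten times larger would need only a tenth of that return to cover it; that is why the change bites small accounts hardest.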
I don’t trade very often since I think I’m best at spotting mispricings over the long-term (the CA Proposition 19 contract being a case in point; despite being ultimately correct, I could have been mauled by some of the spikes if I had tried only short-term trades). If this fee had been in place since I joined, I would be down by $30 or $40.
I’m confident that I can earn a good return like 10 or 20%, but I can’t do >30% without taking tremendous risks and wiping myself out.
And more generally, assuming that this isn’t raiding accounts[12] as a prelude to shutting down (as a number of forumers claim), Intrade is no longer useful for LessWrongers like me as it is heavily penalizing small long-term bets like the ones we are usually concerned with - bets intended to be educational or informative. It may be time to investigate other prediction markets like Betfair, or just resign ourselves to non-monetary/play-money sites like PredictionBook.com.
Fortunately for my decision to cash out (I didn’t see anything I wanted to risk holding for more than a few weeks), prices had moved enough that I didn’t have to take any losses on any positions[13], and I wound up with $223.32. The $5 for January had already been assessed, and there is a 5 euro fee for a check withdrawal, so my check will actually be for something more like $217, a net profit of $17.
I requested my account be closed on 5 January and the check arrived 16 January; the fee for withdrawal was $5.16 and my sum total $218.16 (a little higher than the $217 I had guessed).
In May-June 2011, Bitcoin, an online currency, underwent approximately 5-6 doublings of its exchange rate against the US dollar, drawing the interest of much of the tech world and myself. (I had first heard of it when it was at 50 cents to the dollar, but had written it off as not worth my time to investigate in detail.)
During the first doubling, when it hit parity with the dollar, I began reading up on it and acquired a Bitcoin of my own - a donation from Kiba on #lesswrong to try out Witcoin, which was a social news site where votes are worth fractions of bitcoins. I then gave my thoughts on LessWrong when the topic came up:
After thinking about it and looking at the current community and the surprising amount of activity being conducted in bitcoins, I estimate that bitcoin has somewhere between 0 and 0.1% chance of eventually replacing a decent size fiat currency, which would put the value of a bitcoin at anywhere upwards of $10,000 a bitcoin. (Match the existing outstanding number of whatever currency to 21m bitcoins. Many currencies have billions or trillions outstanding.) Cut that in half to $5000, and call the probability an even 0.05% (average of 0 and 0.1%), and my expected utility/value for possessing a coin is $25 a bitcoin (5000 × 0.005).
I was more than a little surprised that by June, my expected value had already been surpassed by the market value of bitcoins. Which leads to a tricky question: should I sell now? If Bitcoin is a bubble as frequently argued, then I would be foolish not to sell my 5 bitcoins for a cool $130 (excluding transaction costs). But… I had not expected Bitcoin to rise so much, and if Bitcoin did better than I expected, doesn’t it follow that I should no longer believe the probability of success is merely 0.05%? Shouldn’t it have increased a bit? Even if it increased only to 0.07%, that would make the EV more like $35 and so I would continue to hold bitcoins.
The stakes are high. It is a curious problem, but it’s also a prediction market. One is simply predicting what the ultimate price of bitcoins will be. Will they be worthless, or a global currency? The current price is the probability, against an unknown payoff. To predict the latter, one simply holds bitcoins. To predict the former, one simply sells bitcoins. Bitcoins are not commodities in any sense. Buying a cow is not a prediction market on beef because the value of beef can’t drop to literally 0: you can always eat it. You can’t eat bitcoins or do anything at all with them. They are even more purely money than fiat money (the US government having perpetual problems with the zinc or nickel or copper in its coins being worth more as metal than as coins, and dollars are a tough linen fabric).
Mencius Moldbug turns out to have a similar analysis of the situation:
If Bitcoin becomes the new global monetary system, one bitcoin purchased today (for 90 cents, last time I checked) will make you a very wealthy individual. You are essentially buying Manhattan for a quarter. There are only 21 million bitcoins (including those not yet minted). (In my design, this was a far more elegant 2^64, with quantities in exponential notation. Just sayin’.) Mapped to $100 trillion of global money, to pull a random number out of the air, you become a millionaire. Wow!
So even if the probability of Bitcoin succeeding is epsilon, a million to one, it’s still worthwhile for anyone to buy at least a few bitcoins now. The currency thus derives an initial value from this probability, and boots itself into existence from pure worthlessness - becoming a viable repository of savings. If a very strange, dangerous and unstable one. I think the probability of Bitcoin succeeding is very low. I would not put it at a million to one, though, so I recommend that you go out and buy a few bitcoins if you have the technical chops. My financial advice is to not buy more than ten14, which should be F-U money if Bitcoin wins.
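Moldbug’s arithmetic can be checked in a few lines. This is a sketch under his own stated assumptions ($100 trillion of global money, 21 million coins, a 90-cent price) with "a million to one" approximated as a probability of one in a million; the variable names are illustrative.

```python
GLOBAL_MONEY_USD = 100e12   # Moldbug's "random number out of the air"
TOTAL_BITCOINS = 21e6       # including coins not yet minted
PRICE_USD = 0.90            # price quoted in the passage
P_SUCCESS = 1e-6            # "a million to one", approximated as 1/1,000,000

# Value of one coin if Bitcoin absorbed all global money.
value_per_coin = GLOBAL_MONEY_USD / TOTAL_BITCOINS

# Expected value of one coin at the pessimistic odds.
expected_value = value_per_coin * P_SUCCESS

print(f"value per coin on success: ${value_per_coin:,.0f}")
print(f"expected value: ${expected_value:.2f} vs price ${PRICE_USD:.2f}")
```

Even at those odds the expected value per coin comes out above the 90-cent price, which is the sense in which the currency "derives an initial value from this probability".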
Bitcoin cumulatively represents my largest ever wager in a prediction market; at stake was >$130 in losses (if bitcoins go to zero), or indefinite thousands. It will be very interesting to see what happens. By 5 August 2011, Bitcoin had worked its way down to around $10/฿, making my net worth $26; I did spend several bitcoins on the Silk Road, though. By 23 November 2011, it had trended down to $2.35/฿, but due to a large donation of 20 bitcoins, I spent most of my balance at the Silk Road, leaving me with 4.7 bitcoins. Overall, not a good start.
By July 2012, donations brought my stock up to ฿12.5 with prices trading at $5-7. After an unexpected spike on 17 July to $9, I did some reading and learned that “pirateat40” (the operator of a possible Ponzi scheme) was boasting in #bitcoin (Reddit discussion) of using the funds to manipulate the market in an apparent pump and dump scheme and also mocking the ignorance of most buyers and sellers for not paying attention to the Bitcoin forums or IRC channel. pirateat40’s manipulation and insinuation of future plans soured me on holding many bitcoins, and I resolved to sell if the price on MtGox went quickly back up to >$9; it did so the next day (18 July), and I sold at $9.17. Withdrawing from MtGox turned out to be a major pain, with Dwolla withdrawal requiring documentation like a passport and a bank transfer costing $25. I ultimately used the #bitcoin-otc channel to arrange a swap with “nanotube” of my $115 MtGox dollars for an equivalent donation to my PayPal account. The next day, the price had fallen to $7.77; demonstrating why I don’t try to time markets, by 11 August, the price had jumped to $11.50. This was a little worrisome for my long-term views that there’s a good chance the Ponzi scheme will be used in market manipulation or collapse, but there’s still much time left. A few days later, the price had spiked as high as $15, and I felt like quite a fool; but that’s the marvelous thing about markets, one day you are a genius and the next you are a fool.
Unexpectedly, pirateat40 announced the dissolution of his BTCST. Was it a Ponzi or not? No one knew. Perhaps on fears of that, or perhaps because pirateat40 was fleeing with the funds, on 18-19 August, the price began dropping, and kept dropping, all the way through $10, then $9, then $8. Watching this, I resolved to buy back in. It was very difficult to find anyone who would accept PayPal on #bitcoin-otc, but ultimately Namegduf agreed to a MtGox voucher swap, and I got $60 which I then spent at $7.8 for ฿7.6.
In late February 2013, Bitcoin was almost at its all-time high of $31, and I happened to also need cash badly; I had received additional donations, so I sold out my ฿5.79 at $31.5 even as the price reached $32 - I just wanted to be out of what might have been another bubble. I then watched slackjawed as the bubble failed to pop, failed to keep its price-level, but instead doubled to $60, doubled again to $120, hit $159 on 7 April 2013, having quintupled since I decided to sell out, and finally peaked at $266 2 days later before falling back down to a steady-state of ~$100. That sale was not a great testament to my market timing skills, and prompted me to rethink my opinions about Bitcoin.
At various points through August 2013, I sold on #bitcoin-otc ฿0.5 for $52, ฿0.28 for $50, & ฿1.15 for $120, ฿0.5 for $66 & $64, ฿0.25 for $32, ฿0.1 for $13, and ฿1.0 for $127 & $129 - leaving me uncomfortably exposed at ฿18 (having had difficulty finding trustworthy buyers). On 2 October 2013, the news burst that Silk Road had been busted & DPR arrested & charged; Bitcoin immediately began dropping by $20-$40 from ~$127 (depending on exchange), so I purchased ฿2.7 for $105 each.
(One might wonder why I don’t use the fairly active Bets of Bitcoin prediction market; that is because the payout rules are insane and I have no idea how to translate the “total weighted bets” into actual probabilities, and betting blind is never a good idea. And I have no interest in ever using BitBet as they brazenly steal from users.)
A research paper (overview) introduced zero-knowledge proofs for the destruction of coins in a hypothetical Bitcoin variant (Zerocoin); this allowed the creation of new coins out of nothing while still keeping total coins constant (simply require a proof that for every new coin, an older coin was destroyed). In other words, truly anonymous coins rather than the pseudonymity and trackability of Bitcoin. Existing coin mixes are not guaranteed to work & to not steal your coins, so this scheme could be useful to Bitcoin users and worth adding. Efficiency concerns meant that the original version was impossible to add, but the researchers/developers kept working on it and shrunk the proofs to the point where they should be feasible to use. But they also announced they were looking into launching the functionality into an altcoin.
This raises a question: would this potential “Zerocoin” altcoin be worth possessing? That is, might it be more than simply a testbed for the zero-knowledge proofs to see how they perform before merging into Bitcoin proper?
I am generally extremely cynical about altcoins as being pump-and-dump schemes like Litecoin; I except Namecoin because distributed domain names are an interesting application of the global ledger, and the proof-of-stake altcoins as interesting experiments on alternatives to Bitcoin’s proof-of-work solution. Anonymity seems to me to be even more important than Namecoin’s DNS functionality - witness the willingness of people to pay the fees to laundries like Bitcoin Fog without even a guarantee they will receive safe bitcoins back (or look at the Tor network itself). So I see basically a few possible long-term outcomes:
- Zerocoin fizzles out and the network disintegrates because no one cares
- Zerocoin core functionality is captured in Bitcoin and it disintegrates because it is now redundant
- Zerocoin survives as an anonymity layer: people buy zerocoins with tainted bitcoins, then sell the zerocoins for unlinked bitcoins
- Zerocoin replaces Bitcoin
Probability-wise, I’d rank outcome #1 as the most likely, #2 is likely but not very likely because the Bitcoin Foundation seems increasingly beholden to corporate and government overseers and even if not actively opposed, will engage in motivated reasoning looking for reasons to reject Zerocoin functionality and avoid rocking its boat; #3 seems a little less likely since people can use the laundries or alternative tumbling solutions like CoinJoin but still fairly probable; #4 very improbable, like 1%.
To elaborate a little more on the reasoning for believing #2 unlikely: my belief that the Foundation & core developers are not keen on Zerocoin is based on my personal intuition about a number of things:
- the decision by the Zerocoin developers to pursue an altcoin at all, which is a massive waste of effort if they had no reason to expect it to be hard to merge it in (or if the barriers to Zerocoin use were purely technical); the altcoin is a very recent decision, and they were clear upfront that “Zerocoin is not intended as a replacement for Bitcoin” (written 11 April 2013).
- the iron law of oligarchy, which suggests that the Foundation & core developers may be gradually shifting into an accommodationist mode of thought - attending government hearings to defend Bitcoin, repeatedly stating Bitcoin is not anonymous but pseudonymous and so is no threat to the status quo (which is misleading and, even technically interpreted, would be torpedoed by Zerocoin), and discussing whitelisting addresses. To put it crudely, we may be in the early stages of them “selling out”: moderating their positions and cooperating with the Powers That Be to avoid rocking the boat and achieve things they value more like mainstream acceptance & praise. (I believe something very similar happened to Wikipedia’s Wikimedia Foundation after the Seigenthaler incident.)
- the lack of any really positive statements about Zerocoin, despite the technical implications: the holy grail achieved - truly anonymous decentralized digital cash! With Zerocoin added in, the impossible will have become possible. It says a lot about how far from the libertarian cryptopunk roots Bitcoin has drifted that Zerocoin is not a top priority.
Price-wise, #1 and #2 mean zerocoins go to zero, but on the plus side mining or buying at least signals support and may have positive effects on the Foundation or Bitcoin community. Outcome #4 (replacing Bitcoin) means obviously ludicrous profits as Zerocoin goes from pennies or a dollar each to $500+ (assuming for convenience Zerocoin also sets 21m coins). Interestingly, outcome #3 (anonymity layer) also means substantial profits, because the price of zerocoins will be more than pennies due to the float from Bitcoin users washing coins. Imagine that there are 1m zerocoins actively traded, and Bitcoin users want to launder $10m of bitcoins a year, and it on average takes a day for each Bitcoin user to finish moving in and out of zerocoins; then each day there’s $27,378 locked up in zerocoins and spread over the 1m zerocoins, so from the float alone, each zerocoin must be worth ~3¢ (which is a nice profit for anyone who, say, bought zerocoins at 1¢ after the Zerocoin genesis block).
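The float estimate can be reproduced directly. A minimal sketch, assuming a 365.25-day year (which matches the daily figure quoted in the text); the function and parameter names are illustrative.

```python
DAYS_PER_YEAR = 365.25  # assumption; reproduces the daily figure in the text

def float_per_coin(annual_volume_usd, holding_days, coins_outstanding):
    """Return (dollars locked up at any moment, implied value per coin)."""
    locked_usd = annual_volume_usd * holding_days / DAYS_PER_YEAR
    return locked_usd, locked_usd / coins_outstanding

# $10m laundered per year, 1 day in transit, 1m zerocoins actively traded.
locked_usd, per_coin_usd = float_per_coin(10_000_000, 1, 1_000_000)

print(f"locked up on an average day: ${locked_usd:,.2f}")
print(f"implied value per zerocoin: {per_coin_usd * 100:.2f} cents")
```

The per-coin value scales linearly with the laundering volume and the holding time, and inversely with the number of coins in circulation, so a longer average transit time would raise the floor proportionally.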
I personally think Bitcoin should incorporate Zerocoin if the resource requirements are not too severe, and supporting Zerocoin may help this. And if it doesn’t, then it may well be profitable. In either case, I benefit. So if/when the Zerocoin genesis block is released, I will consider trying to mine it or establishing a price floor (eg publicly committing $100 to buying zerocoins at 1¢ from any and all comers).
- Zerocoin as functioning altcoin network within a year: 65%
- Zerocoin market cap >$7,700,000,000 within 5 years (conditional on launch): 1%
- Zerocoin market cap >$7,000,000 within 5 years (conditional on launch): 7%
- Zerocoin functionality incorporated into Bitcoin within 1 year: 33%
- Zerocoin functionality incorporated into Bitcoin within 5 years: 45%
Overall, I am for betting because I am against bullshit. Bullshit is polluting our discourse and drowning the facts. A bet costs the bullshitter more than the non-bullshitter so the willingness to bet signals honest belief. A bet is a tax on bullshit; and it is a just tax, tribute paid by the bullshitters to those with genuine knowledge.15
Besides prediction markets, one can make person-to-person bets. These are not common because they require a degree of trust due to the issue of who will judge a bet & counterparty risk, and I have not found many people online that I would be willing to bet with or vice versa. Below is a list of attempts:
| Person | Bet | Accepted | Date offered | Expiration | Theirs | My $ | My P | Bet Position | Result | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| mostlyacoustic | Entrance fee/RSVP required at NYU lecture. | No | 3 March 2011 | 2 days | $5 | $100 | <5% | Against | Win | LW discussion |
| Eliezer Yudkowsky | HP MoR will win Hugo for Best Novel 2013-2017 | Yes | 12 April 2012 | 5 Sep 2017 | $5 | $100 | 5% | Against | Win | LW discussion |
| Filipe | Cosma Shalizi believes that P=NP | Yes | 4 June 2012 | 1 week | $100 | $100 | 1% | Against | Win | I forgave the amount due to his personal circumstances. |
| mtaran | Kim Suozzi’s donation solicitations not a scam | No | 19 August 2012 | 1 Jan 2013 | $10 | $100 | 90% | Against | Win | LW discussion; in negotiating the details, mtaran didn’t seem to understand betting, so the bet fell through. |
| chaosmosis | Mitt Romney lose 2012 Presidential election | No | 15 Oct 2012 | 3 Nov 2013 | $30 | $20 | 70% | For | Win | |
| David Lee | >1m people using Google Glass-style HUD in 10 years. | No | 8 June 2013 | 10 years | ? | ? | 50% | Against | | Fortune discussion; Lee’s cavalier acceptance of 100:1 odds indicated he was not serious, so I declined. |
| chaosmosis | HP MoR: the dead character Hermione to reappear as ghost | No | 30 June 2013 | 1 year | ? | $25 | 30% | Against | Win | Reddit discussion |
| jacoblyles | MIRI/CFAR to evolve into terrorist organizations | No | 18 Oct 2012 | 30 years | ? | <$1000 | <1% | Against | | LW discussion |
| Patrick Robotham | Whether could prove took economics course to third party | Yes | 20 Sep 2013 | immediate | $50 | $10 | 50% | Against | Loss | |
| Mparaiso | >30 Silk Road-related arrests in the year after the bust | No | 8 Oct 2013 | 1 Oct 2014 | $20 | $100 | 20% | Against | | offer, PB.com |
| qwertyoruiop | Bitcoin ≤$50/฿ between October & December 2013 | Yes | 19 Oct 2013 | 19 Dec 2013 | ฿0.1 | ฿0.1 | 5% | Against | Win | PB.com; signed contract; qwertyoruiop paid early as once Bitcoin reached a peak of $900, it was obviously not going to be ≤$50 again, as indeed it was not. |
| everyone | Sheep Marketplace to shut down in 6 months | No | 30 Oct 2013 | 30 Apr 2014 | ฿2.3 | ฿1.0 | 40% | For | Loss | Reddit post |
| * | Sheep Marketplace to shut down in 12 months | No | 30 Oct 2013 | 30 Oct 2014 | ฿0.66 | ฿1.0 | 50% | For | Win | * |
| * | BlackMarket Reloaded to shut down in 6 months | No | 30 Oct 2013 | 30 Apr 2014 | ฿3.0 | ฿1.0 | 35% | For | Win? | * |
| * | BlackMarket Reloaded to shut down in 12 months | No | 30 Oct 2013 | 30 Oct 2014 | ฿1.5 | ฿1.0 | 50% | For | Win? | * |
| Delerrar | Nanotube is providing escrow for the 4 BMR/Sheep bets | No | 30 Oct 2013 | 31 Oct 2013 | ฿0.1 | ฿0.1 | <5% | For | Win | Offer on Reddit |
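Whether a given row is a favorable bet can be checked from the stakes and the probability. A minimal sketch, assuming the columns read as follows: “Theirs” is what I win, “My $” is what I lose, and “My P” is my probability that the proposition is true (so a bet “Against” wins with probability 1 − P). The function names are illustrative, and the example uses the chaosmosis row on the 2012 election.

```python
def win_probability(p_true: float, position: str) -> float:
    """Probability of winning, given my probability that the proposition is true."""
    return p_true if position == "For" else 1.0 - p_true

def bet_expected_value(p_win: float, their_stake: float, my_stake: float) -> float:
    """Expected profit: win their stake with p_win, lose mine otherwise."""
    return p_win * their_stake - (1.0 - p_win) * my_stake

def break_even_probability(their_stake: float, my_stake: float) -> float:
    """Win probability at which the bet has exactly zero expectation."""
    return my_stake / (their_stake + my_stake)

# chaosmosis row: Theirs $30, My $20, My P 70%, position For.
p_win = win_probability(0.70, "For")
ev = bet_expected_value(p_win, their_stake=30, my_stake=20)
threshold = break_even_probability(their_stake=30, my_stake=20)

print(f"EV = ${ev:.2f}; break-even win probability = {threshold:.0%}")
```

A bet is worth offering only when one's win probability exceeds the break-even threshold, which is why lopsided stakes (such as $100 against $5) demand very high confidence.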
“I recall, for example, suggesting to a regular loser at a weekly poker game that he keep a record of his winnings and losses. His response was that he used to do so but had given up because it proved to be unlucky.” –Ken Binmore, Rational Decisions
Markets teach humility to all except those who have very good or very poor memories. Writing down precise predictions is like spaced repetition: it’s brutal to do because it is almost a paradigmatic long-term activity, being wrong is physically unpleasant16, and it requires 2 skills, formulating precise predictions and then actually predicting. (For spaced repetition, writing good flashcards and then actually regularly reviewing.) There are lots of exercises to try (such as calibrating yourself using trivia questions on obscure historical events, geography, etc.), but they only take you so far; it’s the real world near term and long term predictions that give you the most food for thought, and those require a year or three at minimum. I’ve used PB heavily for 11 months now, and I used prediction markets for years before PB, and only now do I begin to feel like I am getting a grasp on predicting. We’ll look at these alternatives.
“The best salve for failure - to have quite a lot else going on.”17
Besides the specific mechanism of prediction markets, one can just make and keep track of predictions oneself. They are much cheaper than prediction markets or informal betting and correspondingly tend to elicit many more responses18.
There are a number of relevant websites I have a little experience with; some aspire to be like David Brin’s proposed prediction registries, some do not:
PredictionBook (PB) is a general-purpose free-form prediction site. PB is a site intended for personal use and small groups registering predictions; the hope was that LessWrongers would use it whenever they made predictions about things (as they ought to in order to keep their theories grounded in reality). It hasn’t seen much uptake, though not for the lack of my trying. I personally use it heavily and have input somewhere around 1000 predictions, of which around 300 have been judged. (I apparently am rather underconfident.) A good way to get started is to go to the list of upcoming predictions and start entering in your own assessment; this will give you feedback quickly.
Long Bets: forcing people to put up money has kept real-money prediction markets pretty small in both participants and volume; and how much more so when all proceeds go to charity? No wonder that half a decade or more later, there’s only a few hundred money-bets going, even with prominent participants like Warren Buffett. Non-money markets or prediction registries can work in the higher volumes necessary for learning to predict better. Single-handedly on PB I have made 10 times the number of predictions on all of Long Bets. Where will I learn & improve more, Long Bets or PB? (It was easy for me to borrow all the decent predictions and register them on PB.)
FutureTimeline is a maintained list of projected technological milestones, events like the Olympics, and mega-construction deadlines. FutureTimeline does not assign any probabilities and doesn’t attempt to track which came true; hence, it’s more of a list of suggestions than predictions. I have copied over many of the more falsifiable ones to PB.
WrongTomorrow: a site that was devoted solely to registering and judging predictions made by pundits (such as the infamous Tom Friedman).
Unfortunately, WT was moderated and when WT didn’t see a sudden massive surge in contributions, moderation fell behind badly until eventually the server was just turned off for the author’s other projects. I still managed to copy a number of predictions off it into PB, however. WT is an example of a general failure mode for collections of predictions: no follow-through. Predictions are the paradigmatic Long Content, and WT will probably not be the last site to learn this the hard way.
And the last site demonstrates why Brin’s prediction registries have not come into existence. One of the few approximations to a prediction registry is the ongoing study discussed in Philip Tetlock’s justly famous 2005 book Expert Political Judgment: How Good Is It? How Can We Know?, which has tracked >28000 predictions by >284 experts, and it shows why: experts are not accurate and can be outperformed by embarrassingly simple models, and they do not learn from their experience, attempting to retroactively justify their predictions with reference to counterfactuals. (If wishes were fishes… Predictions are about the real world, and in the real world, hacks and bubbles are normal expected phenomena. A verse I saw somewhere runs: “Since the beginning / not one unusual thing has happened”. If your predictions can’t handle normal exogenous events, then they are still wrong. Tetlock identifies this as a common failure mode of hedgehog-style experts: “I was actually right! but for X Y Z…”) And looking around, I think I agree with Eliezer Yudkowsky that when the vast majority of people make a prediction, it is not an actual prediction to be judged right or wrong but an entertaining performative utterance intended to signal partisan loyalties.
Another feature worth mentioning is that prediction sites do not generally allow retrospective predictions, because that is easily abused even by the honest (who may be suffering confirmation bias). Prediction markets, needless to say, universally ban retrospective predictions. So, predicting generally doesn’t give fast feedback - intrinsically, you can’t learn very much from short-term predictions because either there’s serious randomness involved such that it takes hundreds of predictions to begin to improve, or the predictions are so badly over-determined by available information that one learns little from the successes.
A short list of sites which make it easy to find newly-created predictions or (for quicker gratification & calibration) predictions which are about to reach their due dates:
IARPA: The Good Judgment Project
In 2011, the Intelligence Advanced Research Projects Activity agency (IARPA) began the Aggregative Contingent Estimation (ACE) Program, pitting 5 research teams against each other to investigate and improve prediction of geopolitical events. One team, the Good Judgment Project (see the Wired interview with Philip Tetlock), solicited college graduates for the 4 year time period of ACE to register predictions on selected events, for a $150 honorarium. A last-minute notice was posted on LessWrong, and I immediately signed up and was accepted as I predicted.
The initial survey upon my acceptance was long and detailed (calibration on geopolitics, finance, and religion; personality surveys with a lot of fox/hedgehog questions; basic probability; a critical thinking test, the CRT; educational test scores; and then what looked like a full matrix IQ test - we were allowed to see some of our own results, like the season 2 calibration test19). The final results will no doubt turn up many interesting correlations or lack of correlation. I look forward to completing the study. At the very least, they will supply a few hundred predictions I can put on PredictionBook.com - formulating a quality prediction (falsifiable, objective, and interesting) can be the hardest part of predicting.
Season 1 results
My initial batch of short-term predictions did well; even though I made a major mistake when I fumble-fingered a prediction about Mugabe (I bet that he would fall from office in a month, when I believed the opposite), I was still up by $700 in its play-money. I have, naturally, been copying my predictions onto PredictionBook.com the entire time.
- Your total earnings for 84 out of 85 closed forecasts is 15,744.
- You were ranked 28 among the 204 forecasters in Group 3c.
Not too shabby; I was actually under the impression I was doing a lot worse than that. Hopefully I can do better in 2012 - I seem fairly accurate, so I ought to make my bets larger.
Season 2 results
Naturally I signed up for Season 2. But it took the GJP months to actually send us the honorarium, and for Season 2, they switched to a much harder to use prediction-market interface which I did not like at all. I used up my initial allotment of money, but I’m not sure how actively I will participate: there’s still some novelty but the UI was bad enough that all the fun is gone. The later addition of ‘trading agents’ where one could just specify one’s probability and it would make appropriate trades automatically lured me back in for some trading, but as one would expect from my disengagement, my final results were far worse than for season 1: I ranked 184 out of 245 (~75th percentile).
I might as well stick around for season 3. Maybe I will try harder this time.
Season 3 results
For some reason, I never saw my season 3 results; searching my emails turns up no mentions of the official results being released (only some issues with a few controversial contract decisions). I don’t recall my season 3 results being unusually good or bad.
Season 4 results
My results in 2014-2015 (the final season of the IARPA competition) ranked me 41 of 343 in my experimental group. This was much better than season 2, and slightly better than season 1 (~12th percentile vs 13th).
“The best lack all conviction, while the worst are full of passionate intensity.” –Yeats, “The Second Coming”
Faster even than making one’s own predictions is the procedure of calibrating yourself. Simply put, instead of buying shares or not, you give a direct probability: your 10% predictions should come true 10% of the time, your 20% predictions true 20% of the time, etc. This is not so much about figuring out the true probability of the event or fact in the real world but rather about your own ignorance. It is as much about learning humility and avoiding hubris as it is about accuracy. You can be well-calibrated even making predictions about topics you are completely ignorant of - simply flip a coin to choose between 2 possibilities. You are still better than someone who is equally ignorant but arrogantly tries to pick the right answers anyway and fails - he will be revealed as miscalibrated. If they are ignorant and don’t know it, they will come out overconfident; and if they are knowledgeable and don’t realize it, they will come out underconfident. (Note that learning of your overconfidence is less painful than in a prediction market, where you lose your money.)
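Checking calibration amounts to bucketing predictions by stated probability and comparing each bucket with the observed frequency. A minimal sketch follows; the Brier score (mean squared error between stated probability and outcome) is a standard scoring rule brought in here for illustration and is not mentioned in the text above, and the sample data are invented. It reproduces the coin-flip point: a forecaster who knows nothing and says 50% scores better than one who is right equally often but claims 90%.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Map each stated probability to (count, observed frequency of 'true').

    `predictions` is an iterable of (stated_probability, came_true) pairs.
    """
    buckets = defaultdict(list)
    for p, came_true in predictions:
        buckets[p].append(1 if came_true else 0)
    return {p: (len(hits), sum(hits) / len(hits))
            for p, hits in sorted(buckets.items())}

def brier_score(predictions):
    """Mean squared error of stated probabilities; lower is better."""
    preds = list(predictions)
    return sum((p - (1 if came_true else 0)) ** 2
               for p, came_true in preds) / len(preds)

# Invented data: ten guesses, half of which turn out right.
outcomes = [True, False] * 5
humble = [(0.5, o) for o in outcomes]    # ignorant, and says so
arrogant = [(0.9, o) for o in outcomes]  # equally ignorant, claims 90%

print("humble:  ", calibration_table(humble), brier_score(humble))
print("arrogant:", calibration_table(arrogant), brier_score(arrogant))
```

In the table for the second forecaster, the 90% bucket comes true only half the time, which is exactly the overconfidence a calibration quiz is meant to expose.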
Thus, one can simply compile a trivia list and test people on their calibration; there are at least 4 such online quizzes along with the board game Wits & Wagers. (Consultant Douglas Hubbard has a book How to Measure Anything: Finding the Value of “Intangibles” in Business which is principally on the topic of applying a combination of calibration and Fermi estimates to many business problems, which I found imaginative & interesting.) These tests are also useful for occasional independent checks on whether you easily succumb to bias or miscalibration in other domains; I personally seem to do reasonably well21.
Some professional groups do much better on forecasting than others. Two of the key factors found by Armstrong and other forecasting researchers are that the better groups have fast and clear feedback22, and conversely, Tetlock’s “hedgehogs” were characterized by constant attempts to rationalize unexpected outcomes and refrain from falsifying their cherished world-view. Trivia questions, and to a lesser extent the predictions on PredictionBook.com, offer both factors.
1001 PredictionBook Nights
I explain what I’ve learned from creating and judging thousands of predictions on personal and real-world matters: the challenges of maintenance, the limitations of prediction markets, the interesting applications to my other essays, skepticism about pundits and unreflective persons’ opinions, my own biases like optimism & planning fallacy, 3 very useful heuristics/approaches, and the costs of these activities in general. (Plus an extremely geeky parody of Fate/Stay Night.)
(Initial discussion on LessWrong.)
I am the core of my mind.
Belief is my body and choice is my blood.
I have recorded over a thousand predictions,
Unaware of fear
Nor aware of hope
Have withstood pain to update many times
Waiting for truth’s arrival.
This is the one uncertain path.
My whole life has been…
Unlimited Bayes Works!23
In October 2009, the site PredictionBook.com was announced on LW. I signed up in July 2010, as tracking free-form predictions was the logical endpoint of my dabbling in prediction markets, and I had recently withdrawn from Intrade due to fee changes. Since then I have been the principal user of PB.com, and a while ago, I registered my 1001st prediction. (I am currently up to >1628 predictions, with >383 judged; PB total has >4258 predictions.) I had to write and research most of them myself and they represent a large time investment. To what use have I put the site, and what have I gotten out of the predictions?
“Our errors are surely not such awfully solemn things. In a world where we are so certain to incur them in spite of all our caution, a certain lightness of heart seems healthier than this excessive nervousness on their behalf.” –William James, “The Will to Believe”, section VII
Using PredictionBook taught me two things as far as such sites go:
- I learned the value of centralizing (and backing up) predictions of interest to me. I ransacked LongBets.org, WrongTomorrow.com, Intrade, FutureTimeline.net, and various collections of predictions like Arthur C. Clarke’s list, LessWrong’s own annual prediction threads (2010, 2011), or simply random comments on LW (sometimes Reddit too). This makes searching for previous predictions easier, graphs all my registered predictions, and makes backups a little simpler. WrongTomorrow promptly vindicated my paranoia by dying without notice. I now have a reply to David Brin’s oft-repeated plea for a “predictions registry”: no one cares, so if you want one, you need to do it yourself.
- I realized that using prediction markets had narrowed my appreciation of what predictions are good for. IEM & Intrade had taught me contempt for certain pundits (and respect for Nate Silver) because they would mammer on about issues where I knew better from the relevant market; but there are very few liquid markets in either site, and so I learned this for only a few things like the US Presidential elections. Prediction markets will be flawed for the foreseeable future, with individual contracts subject to long-shot bias24 or simply bizarre claims due to illiquidity25; for these things, one must go elsewhere or not go at all.
At worst, this fixation on prediction markets - and real-money prediction markets - may lead one to engage in epic yak-shaving in striving to change US laws to permit prediction markets! I am reminded of Thoreau:
This spending of the best part of one’s life earning money in order to enjoy a questionable liberty during the least valuable part of it reminds me of the Englishman who went to India to make a fortune first, in order that he might return to England and live the life of a poet. He should have gone up the garret at once.
“Robert Morris has a very unusual quality: he’s never wrong. It might seem this would require you to be omniscient, but actually it’s surprisingly easy. Don’t say anything unless you’re fairly sure of it. If you’re not omniscient, you just don’t end up saying much. More precisely, the trick is to pay careful attention to how you qualify what you say…He has an almost superhuman integrity. He’s not just generally correct, but also correct about how correct he is. You’d think it would be such a great thing never to be wrong that everyone would do this. It doesn’t seem like that much extra work to pay as much attention to the error on an idea as to the idea itself. And yet practically no one does.” –Paul Graham
Do any particular sets of predictions come to my mind? Yes:
- My largest outstanding collection is the >207 predictions about the unreleased Evangelion movies & manga; I regard their upcoming releases as excellent chances to test my theories about Evangelion interpretation in a way that is usually impossible when it comes to literary interpretation
- For my personal Adderall double-blind trial, I recorded 16 predictions about a trial (guessing whether it was placebo or Adderall) to try to see how strong an effect I could diagnose, in addition to whether there was one at all. (I also did one for modafinil & LSD microdosing)
- During the big Bitcoin bubble, I recorded a number of predictions on Reddit & LW and followed up on a number of them; I believe this was educational for those involved - at the least, I think I tempered my own enthusiasm by noting the regular failure of the most optimistic predictions and the very low Outside View probability of a take-off
- I made qualitative predictions in Haskell Summer of Code for 2010 & 2011, but I’ve refrained from recording them because I’ve been accused of being subjective in my evaluations; for 2012 & 2013, I bit the bullet.
- For my modeling & predictions of when Google will kill its various products, I registered my own adjustments to the final set of 5-year survival predictions so as to compare my performance with the model’s performance 5 years later
Benefits from making predictions
Day ends, market closes up or down, reporter looks for good or bad news respectively, and writes that the market was up on news of Intel’s earnings, or down on fears of instability in the Middle East. Suppose we could somehow feed these reporters false information about market closes, but give them all the other news intact. Does anyone believe they would notice the anomaly, and not simply write that stocks were up (or down) on whatever good (or bad) news there was that day? That they would say, “hey, wait a minute, how can stocks be up with all this unrest in the Middle East?”26
When I do use predictions, I’ve noticed some direct benefits:
Giving probabilities can make an analysis clearer (how do I know what I think until I see what I predict?); when I speculated on the identity of Mike Darwin’s patron (above, ‘Notes’), the very low probabilities I assigned in the conclusion to any particular billionaire make clear that I repose no real confidence in any of my guesses and that this is more of a Fermi problem puzzle or exercise than anything else. (And indeed, none of them were correct.) I believe that sharpening my analyses has also made me better at spotting political bloviation and pundits pontificating:
“Don’t ask whether predictions are made, ask whether predictions are implied.” –Steven Kaas
Going on the record with time-stamps can turn sour-grapes into a small victory. If one read my Silk Road article and saw a footnote to the effect that the Bitcoin forum administrators were censors who removed any discussion of the Silk Road, such an accusation is rather less convincing than a footnote linking to a prediction that a particular thread would be removed and noting that as the reader can verify for themselves, said thread was indeed subsequently deleted.
One of the things I hoped would make my site unusual was regularly employing prediction; I haven’t been able to do it as often as I hoped, but I’ve still used it in 19 pages:
- About: projections about finishing writing/research projects, and site pageviews
- Choosing Software: whether I will continue to use certain software tools chosen in accordance with its principles
- Haskell Summer of Code: success of the 2012 projects
- In Defense Of Inclusionism: predicting the WMF’s half-hearted efforts at editor retention will fail; predictions about informal experiments I’ve carried out
- Mistakes: computer Go
- Nootropics: checking the success of blinding Adderall, day-time modafinil, iodine, and nicotine experiments (see above);
- Zeo: checking blinding of 2 vitamin D experiments
- Notes: predictions on Steve Jobs’s lack of charity, correctness of speculative analysis
- Wikipedia and Knol: in my description of the failure of Knol as a Wikipedia or blog competitor, I naturally registered several estimates of when I expected it to die; I was correct to expect it to die quickly, in 2012 or 2013, but not that the content would remain public. This experience was part of the motivation for my later Google shutdowns analysis.
- Evangelion predictions: see above
- Harry Potter and the Methods of Rationality predictions (an exercise similar to the Evangelion predictions)
- Prediction markets: political predictions, Intrade failure predictions, GJP acceptance
- Silk Road: prediction of censorship on main Bitcoin forums (see above), and of no legal repercussions
- Slowing Moore’s Law: asserts semiconductor manufacturing is fragile and hence Kryder’s law has been permanently set back by 2011 Thai floods
- The Notenki Memoirs: Hiroyuki Yamaga’s perpetually in-planning movie Aoki Uru will not be released.
- Modafinil: correctly predicted tolerance for a particularly frequent user
- Death Note script: I registered predictions on what replies I expected from Parlapanides, asking about whether he wrote the leaked script being analyzed, to forestall accusations of hindsight bias
- “The Crypto-Currency: Bitcoin and its mysterious inventor”, The New Yorker 2011; mentioned my own failed prediction of a government crackdown
- Google shutdowns: as part of my statistical modeling of the likely lifetimes of Google products, I took the final model’s predictions of 5-year survival (to 2018) and adjusted them to what I felt intuitively was more right.
- LSD microdosing: blinding index/guessing whether active or placebo
- Jed McCaleb interview: email interview with Bitcoin exchange MtGox founder Jed McCaleb, predicting whether the origin story was true (I guessed it was a leprechaun/urban-legend but it turned out to be sort of true)
- Blackmail: guessing about whether an anonymous extortionist would reply to my refusal or try a second time (he did neither)
“We should not be upset that others hide the truth from us, when we hide it so often from ourselves.” –François de La Rochefoucauld, Maximes 11
- I knew (to quote Julius Caesar) that “What we wish, we readily believe, and what we ourselves think, we imagine others think also.” or (to quote Orwell), “Politics…is a sort of sub-atomic or non-Euclidean world where it is quite easy for the part to be greater than the whole or for two objects to be in the same place simultaneously.”27, but it wasn’t until I was sure that George Bush would not be re-elected in 2004, that I knew that I could succumb to that even in abstract issues which I had read enormous quantities of information & speculation on.
- while I am weak in areas close to me, in other areas I am underconfident, which is a sin and as much to be remedied as overconfidence. (Specifically, it seemed I was initially overconfident on 95%+ predictions and underconfident in the 60-90% regime; I think I’ve learned my lesson, but by the nature of these things, my recorded calibration will take many predictions to recover in the extreme ranges.)
- I am too optimistic and not cynical enough; the cardinal example, personally, would be the five-year XiXiDu prediction which was falsified in one month. The Outside View heavily militated against it, as did my fellow predictors, and if it had been formulated as something socially disapproved of like alcohol or smoking, I would probably have gone with 10 or 20% like JoshuaZ; but because it was a fellow LessWronger trying to get his life straight…
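The calibration check behind the over/underconfidence point above can be sketched in a few lines. This is only an illustration, not how PB.com computes its statistics: the `calibration_table` helper, the bucket edges, and the sample records are all hypothetical.

```python
from collections import defaultdict

def calibration_table(records, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0)):
    """Group (stated_probability, came_true) pairs into buckets and compare
    the average stated probability with the observed frequency of success."""
    buckets = defaultdict(list)
    for prob, outcome in records:
        # Place each prediction in the first bucket whose upper edge covers it.
        upper = next(e for e in edges if prob <= e)
        buckets[upper].append((prob, outcome))
    table = {}
    for upper, items in sorted(buckets.items()):
        stated = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, o in items if o) / len(items)
        table[upper] = (len(items), stated, observed)
    return table

# Hypothetical records: (stated probability, did it come true?)
records = [(0.97, True), (0.97, False), (0.96, True), (0.96, True),
           (0.70, True), (0.70, True), (0.65, True), (0.65, True)]

for upper, (n, stated, observed) in calibration_table(records).items():
    print(f"<= {upper:.2f}: n={n}, stated={stated:.2f}, observed={observed:.2f}")
```

A bucket where the observed frequency falls below the stated probability indicates overconfidence, and the reverse indicates underconfidence; with only a handful of predictions per bucket the comparison is very noisy, which is why the extreme ranges take many predictions to recover.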
I am considerably more skeptical of op-eds and other punditry, after tracking the rare clear predictions they made. (I was already wary due to Tetlock, and a more recent study of major pundits but not enough, it seems.) The rareness of such predictions has instilled in me an appreciation of Hansonian signaling theories of politics - it is so hard to get falsifiable predictions out of writings even when they look clear; for example, leading up to the 2011 US Federal debt crisis and ratings downgrade, everyone prognosticated furiously - but did they mean any rating agency, or all of them, or just a majority?
I respect fundamental trends more; they are powerful predictors indeed, and like Philip Tetlock’s experts, I find that it’s hard to out-perform the past in predicting. I no longer expect much of politicians, who are as trapped as the rest of us. This could be seen as more use of base rates as the prior, or as moving towards more of an Outside View. I am frequently reminded of the power of reductionism and analysis - pace MoR Quirrel’s question to Harry28, what states of the world would a prediction coming true imply had become more likely? Sometimes when I record predictions, I see someone who has clearly not considered what his predictions coming true implies about the current state of the world; I sigh and reflect on how you just can’t get there from here.
- Merely contemplating seriously my predictions over years and decades makes the future much more concrete to me; I will live most of my life there, so I should take a longer-term perspective.
Making thousands of predictions has helped me gain detachment from particular positions and ideas (which made it easier for me to write my Mistakes essay and publicly admit them - after so many ‘failures’ on PB.com, what were a few described in more detail?) To quote Alain de Botton:
The best salve for failure – to have quite a lot else going on.
- Raw probabilities are more intuitive; I can’t describe this much better than the poker article, “This is what 5% feels like.”
Planning fallacy: I knew it perfectly well, but still committed it until I tracked predictions; this is true both of my own mundane activities like writing, and larger more global events (recently, running out the clock on the Palestinian nationhood UN vote). This was interesting because it’s so easy to make excuses - ‘I would’ve succeeded if not for X!’ The question (in the classic study) is whether students could predict their projects’ actual completion time; they’re not trying to predict project completion time given a hypothetical version of themselves which didn’t procrastinate. If they aren’t self-aware enough to know they procrastinate and to take that into account - their predictions are still bad, no matter why they’re bad. (And someone on the outside who is told that in the past the students had finished -1 days before the due date will just shrug and say: ‘regardless of whether they took so long because of procrastination, or because of Parkinson’s law, or because of a 3rd reason, I have no reason to believe they’ll finish early this time.’ And they’d be absolutely correct.) It’s like a fellow who predicts he won’t fall off a cliff, but falls off anyway. ‘If only that cliff hadn’t been there, I wouldn’t’ve fallen!’ Well, duh. But you still fell. How can you correct this until you stop making excuses?
Less hindsight bias; when I have my previous opinions written down, it’s harder to claim I knew it all along (when I didn’t), and as Arkes et al 1988 indicated, writing down my reasons (even in Twitter-sized comments) helped prevent it.
Example: I had put the 2011 S&P downgrade at 5%, and reminded of my skepticism, I can see the double-standards being applied by pundits - all of a sudden they remember how the ratings agencies failed in the housing bubble and how the academic literature has proven they are inferior to the CDS markets and how they are a bad government-granted monopoly, even though they were happy to cite the AAA rating beforehand and are still happy to cite the other ratings agencies… In short, while base rates are powerful indeed, there are still many exogenous events and multiplicities of low probability events.
I think, but am not sure, that I really have internalized these lessons; they simply seem… obvious to me, now. I was surprised when I looked up my earliest work and saw it was only around 14 months ago - I felt like I’d been recording predictions for far longer.
Making predictions has been personally costly; while some predictions have been total time investments of a score of seconds, other predictions required considerable research, and thinking carefully is no picnic, as we’ve all noticed. I justify the invested time as a learning experience which would hopefully pay off for others as well, who can free-ride off the many predictions (eg. the soon-to-expire predictions) I have laboriously added to PB.com. (Only a fool learns from his mistakes only.)
What I have not noticed? It was suggested that predictions might help me in resolutions based on some experimental evidence30; I did not notice anything, but I didn’t carefully track it or put in predictions about many routine tasks. Making predictions seems to be largely effective for improving one’s epistemic rationality; I make no promises or implied warranties as to whether it is instrumentally rational.
How I make predictions
A prediction can be broken up into 3 steps:
- The specification
- The due-date
- The probability
The first issue is simply formulating the prediction. The goal is to make a statement on an objective and easily checkable fact; imagine that the other people predicting are yourself if you had been raised in some completely opposite fashion like an evangelical Republican household, and they are quite as suspicious of you as you are of them, and believe you to be suffering from as many partisan and self-serving biases as you believe them to. Wording is important as words frame how we think about things and can directly bias us (eg. push polls)31. The prediction should be so clear that they would expose themselves to mockery even among their own kind if they were to seriously disagree about the judgment32. For example, ‘Obama will be the next President’ is perfectly precise - everyone knows and understands what it is to be President and how one would decide - and so there’s no need to do any more; it would be risible to try to deny it. On the other hand, ‘the globe will increase 1 degree Fahrenheit’ may initially sound good, but your dark counterpart immediately objects: ‘what if it’s colder in Russia? When is this increase going to happen? Is this exactly 1 degree or are you going to try to claim as success only 0.9 degrees too? Who’s deciding this anyway?’ A good resolution might be ‘OK, global temperatures will increase >=1.0 degrees Fahrenheit on average according to the next IPCC report’.
Deciding the due-date of a prediction is usually trivial and not worth discussing; when making open-ended predictions about people (eg. ‘X will receive a Nobel Prize’), I find it helpful to consult life tables like Social Security’s table to figure out their average life expectancy and then set the due-date to that. (This both minimizes the number of changes to the due date and helps calibrate us by pointing out what time spans we’re really dealing with.)
When we begin deciding what probability to give the prediction, we can employ a number of heuristics (partially drawn from “Techniques for probability estimates”):
What does the prediction about the future world imply about the present world?
Every prediction one makes is also a retrodiction: you are claiming that the world is now and in the past on a course towards the future you have picked out of all the possibilities (or not on that course), and on that course to the degree you specified. What does your claim imply about the world as it is now? The world has to be in a state which can progress of its own internal logic to the future state, and so we can work backwards to figure out what that implies about the present or past. (You can think of this as a kind of proof by contradiction: assuming prediction X is true, what can we infer from X about the present world which is absurd?)
In our first example, Miller predicted 15% for “Within ten years either genetic manipulation or embryo selection will have been used on at least 50% of Chinese babies to increase the babies’ expected intelligence”. This initially seems reasonable: China is a big place with known interests in eugenics. But then we start working backwards - this prediction implies handling >=9 million pregnancies annually, which entails hundreds of thousands of gynecologists, geneticists, lab technicians etc., which all have lead-times measured in years or decades. (It takes a long time to train a doctor even if your standards are low.) And the program must be set up with hundreds of thousands of employees, policies experimented with and implemented, and so on. As matters stand, even in the United States mere SNP genotyping couldn’t be done for 9 million people annually, and genetic sequencing is much more expensive & difficult, and genetic modification is even hairier. If we work backwards, we would expect to see such a program already begun and active as it frantically tries to scale up to handle those millions of cases a year in order to hit Miller’s deadline. But as far as I know, all the pieces are absent in China as of the day it was predicted; hence, it’s already too late. And then there are the politics; it is a deeply doubtful assertion that the Chinese population would countenance this, given the stress over the One Child policy and the continuing selective abortion crisis. Even if the prediction comes true eventually, it definitely will not come true in time. (The same logic applies to “Within ten years the SAT testing service will require students to take a blood test to prove they are not on cognitive enhancing drugs.”; ~1.65 million test-takers implies scores of thousands of phlebotomists, who do not exist, although in theory they could be trained in under a year - but whence the trainers?)
A second example would be a series of predictions on anti-aging/life-extension registered in November 2011. The first and earliest prediction - “By 2025 there will be at least one confirmed person who has lived to 130” - initially seems at least possible (I am optimistic about the approaches suggested by SENS), and so I assigned it a reasonable probability of 3%. But I felt troubled - something about it seemed wrong. So I applied this heuristic: what does the existence of a 130-year-old in 2025 imply about people in 2011? Well, if someone is 130 in 2025, then that implies that they are now 116 years old (130 - (2025 - 2011) = 116). Then I looked up the then-oldest person in the world: Besse Cooper, aged 115 years old. Oops. It’s impossible for the prediction to come true, but because we didn’t think about what it coming true implied about the present world, we made an absurdly high prediction. We can do this for all the other anti-aging predictions; for example “By 2085 there will be at least one confirmed person who has lived to 150” can be rephrased as ‘someone aged 76 now will live to 2085’, which seems implausible except with a technological singularity of some sort (“Hmm, phrased in that context, my estimate has to go down”). This can be applied to financial or economic questions, too, since under even the weakest version of efficient markets, the markets are smarter than you - Tyler Cowen asks why we don’t see investors piling into solar power if it’s following an exponential curve downwards and is such a great idea (Robin Hanson appeals to discount rates and purblind investors). The idea of ‘rephrasing’ leads directly into the next heuristic.
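The age arithmetic in the longevity example is simple enough to mechanize. The sketch below is only an illustration; the function name `implied_current_age` is made up here, and the figures are the ones quoted in the text.

```python
def implied_current_age(target_age, target_year, current_year):
    """Age someone must already be in current_year to reach target_age by target_year."""
    return target_age - (target_year - current_year)

# "By 2025 there will be at least one confirmed person who has lived to 130"
print(implied_current_age(130, 2025, 2011))  # 116

# "By 2085 there will be at least one confirmed person who has lived to 150"
print(implied_current_age(150, 2085, 2011))  # 76

# The oldest living person in 2011 (Besse Cooper) is given in the text as 115.
oldest_living_2011 = 115
print(implied_current_age(130, 2025, 2011) > oldest_living_2011)  # True
```

When the implied present-day age exceeds the age of the oldest living person, the prediction is already ruled out, whatever one thinks of future technology.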
Base rates. Already discussed, but base rates should be your mental starting point for every prediction, before you take into account any other opinion or belief.
Base rates are easily expressed in terms of frequencies: “of the last Y years, X happened only once, so I will start with 1/Y%”. (“There are 10 candidates for the 2012 Republican nominee, so I will assume 10% until I’ve looked at each candidate more closely.”) Frequencies have a long history in the academic literature of making suboptimal or fallacious performance just disappear33, and there’s no reason to think that is not true for your predictions as well. This works for personal predictions as well - focus on what sort of person you are, how you’ve done in similar cases over years, and you’ll improve your predictions34. An example: “A Level 7 (Chernobyl/2011 Japan level) nuclear accident will take place by end of 2020”. One’s gut impression is a very bad place to start because Fukushima and Chernobyl - mentioned in the very prediction! - are such vivid and mentally available examples. 60%? 50%? Read the coverage of Fukushima and many people give every impression of expecting fresh disasters in coming years. (Look at Germany quickly announcing the shutdown of its nuclear reactors, despite tsunamis not being a frequent problem in northern Europe, shall we say.) But if we start with base rates and look up nuclear accidents, we realize something interesting: Chernobyl and Fukushima come to mind readily in part because they are - literally - the only such level-7 accidents over the past >40 years. So the frequency would be 1 in ~20 years, which puts a different face on a prediction spanning 9 years. This gives us a base rate more like ~40%. This is our starting point for asking how much does the rate go down because Fukushima has prompted additional safety improvements or closure of older plants (Fukushima’s equally-outdated sibling nuclear plants will have a harder time getting stays in execution) and how much the rate goes up due to global warming or aging nuclear plants.
But from here we can hope to arrive at a sensible answer and not be spooked by a recent incident.
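As a rough check on the nuclear-accident base rate, here is a minimal sketch that turns "2 events in about 40 years" into a probability over a 9-year window. It assumes a constant accident rate and independence between years, which are simplifications of mine rather than claims from the text; `prob_at_least_one` is a hypothetical helper.

```python
import math

def prob_at_least_one(events, observed_years, horizon_years):
    """Probability of at least one event within horizon_years, estimated two ways."""
    rate = events / observed_years  # events per year
    # Treat each year as an independent trial with probability `rate`.
    per_year = 1 - (1 - rate) ** horizon_years
    # Treat events as a Poisson process with the same average rate.
    poisson = 1 - math.exp(-rate * horizon_years)
    return per_year, poisson

per_year, poisson = prob_at_least_one(events=2, observed_years=40, horizon_years=9)
print(f"per-year model: {per_year:.2f}, Poisson model: {poisson:.2f}")
```

Both versions come out a little under 40%, in line with the rough figure above; the point is the order of magnitude as a starting estimate, not the second decimal place.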
Breaking predictions down into conjunctions
Similar to heuristic #1, we may not realize what a prediction implies internally and so wind up giving high probability to a vivid or interesting scenario.
“Hillary Clinton will become President in 2016” is specific, easily dateable, implies things about the present world like rumors of Clinton running and strong political connections (as do exist), and yet this prediction is still easy to mess up for someone in 2012. Why? Because becoming President is actually the outcome of a long series of steps, every one of which must be successful and every one of which is doubtful: Hillary must resign from the White House where she was then Secretary of State, she must announce a run, she must become Democratic nominee (out of several candidates), and she must actually win. It’s the exceptional nominee who ever has >50% odds, so we start with a coin flip and work our way down to perhaps a few percent. This is more plausible than most national-level Democrats, but not as plausible as pundits might lead you to believe.
We can see a particularly striking failure to analyze in the prediction “Obama gets reelected and during that time Hillary Clinton brokers the middle east peace deal between Israel and Palestine for the two state solution. This secures her presidency in 2016.”, where the predictor gave it a flabbergasting 80%; before clicking through, the reader is invited to assign probabilities to the following events (and then multiply them to obtain the probability that they will all come true):
- Barack Obama is re-elected
- A Middle East peace deal is brokered
- The peace deal is for a two state solution
- Hillary Clinton runs in 2016
- Hillary Clinton is the 2016 Democratic nominee
- Hillary Clinton is elected
(Sometimes the examples are even more extreme than 6 clauses.) This heuristic is not perfect, as it works best on step-by-step processes where every step must happen. If this is not true, the heuristic will be overly pessimistic. Worse, it is possible to lie to ourselves by simply breaking down the steps into ever tinier steps and giving them relatively small probabilities like 99%: the opposite of the good heuristic is the bad Subadditivity effect, where if we then multiply out each of our exaggerated sub-steps, we wind up being absurdly skeptical. Steven Kaas furnishes an example:
Walking requires dozens of different muscles working together, so if you think you can walk you’re just committing the conjunction fallacy.
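Both the heuristic and its failure mode can be seen by multiplying things out. In the sketch below the per-clause probabilities are placeholders invented purely for illustration (they are not estimates from the text), and each is meant as the probability of that clause given that the earlier clauses happened.

```python
from math import prod

def conjunction(probabilities):
    """Probability that every step happens, where each number is the
    probability of that step given that all earlier steps happened."""
    return prod(probabilities)

# Illustrative placeholder numbers only; substitute your own estimates.
clauses = {
    "Barack Obama is re-elected": 0.6,
    "A Middle East peace deal is brokered": 0.1,
    "The peace deal is for a two state solution": 0.8,
    "Hillary Clinton runs in 2016": 0.5,
    "Hillary Clinton is the 2016 Democratic nominee": 0.5,
    "Hillary Clinton is elected": 0.5,
}
print(f"six clauses together: {conjunction(clauses.values()):.4f}")

# The opposite pitfall: chop one near-certain event into 100 sub-steps
# and give each a 'generous' 99%.
print(f"100 steps at 99% each: {conjunction([0.99] * 100):.3f}")
```

With these placeholder inputs the six-clause scenario lands well under 1%, nowhere near 80%. The second calculation shows how slicing finely enough drags even a near-certainty down to roughly a third, which is the trap Kaas is mocking.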
Building predictions up into disjunctions
One of the problems with non-frequency information is that we’re not always good at an ‘absolute pitch’ for probability - we may have intuitive probabilities but they are fuzzy. On the other hand, comparisons are much easier: I may not be able to say that Obama had a 52.5% chance of election vs McCain at 47.3%, but I can tell you which guy was on the happier side of 50%. This suggests we pit predictions against each other: I pit my intuition about Obama against my intuition about McCain and I see Obama comes out on top. The more predictions you can pit against each other the better, which ultimately leads to an exhaustive list of outcomes, a full disjunction: “either Obama (52.5%) or McCain (47.3%) or Nader (0.2%) will win”.
Surprised to see Ralph Nader there? He ran too, you know. This is one of the pitfalls of disjunctive reasoning (as overstated conditionality and floors on percentages are a pitfall of conjunctive reasoning), the pitfall of the possibilities you forgot to list and make room for. Nader is pretty trivial, but imagine you were discussing Middle Eastern politics and your interlocutor immediately goes “either Israel will aerially attack Iran or Israel will launch covert ops or the US will aerially attack Iran or…” If you dutifully begin assigning probabilities (“let’s see, 15% sounds reasonable, and covert ops is a lot less probable so we’ll give that just 5%, and then the US is just as likely to attack Iran so that’s 15% too, and…”), you find you have somehow concluded Iran will be attacked, 35%+, when no prediction market remotely agrees with you! What happened? You read about one disjunct (“Iran will be attacked, period”) divided up into fine detail, anchored on it, and ignored how many possibilities were also being tucked away under “Iran will not be attacked, period”. If you had constructed your own disjunction before listening to the other guy, you might have instead said that no-attack was 80%+ probable, and then correctly divvied up the remaining percentage among the various attack options. Even domain-experts have problems when the tree of categories or outcomes is presented to them with modifications, unfortunately35.
Sets of predictions must be consistent: a full set of disjunctions must add to 100%, the probability something will happen and will not happen must also sum to 100%, etc.36 It’s surprising how often people mess this up.
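A consistency check of this kind is easy to automate. The sketch below uses the percentages from the election and Iran examples; the helper names are hypothetical, and it assumes the listed outcomes are mutually exclusive, as the text does when it adds them up.

```python
def check_exhaustive(outcomes, tolerance=1e-6):
    """Return (is_consistent, total) for a dict of outcome -> percent."""
    total = sum(outcomes.values())
    return abs(total - 100.0) <= tolerance, total

def rescale_to_remaining(weights, remaining):
    """Share `remaining` percentage points among options in proportion to their weights."""
    total = sum(weights.values())
    return {name: remaining * w / total for name, w in weights.items()}

election = {"Obama": 52.5, "McCain": 47.3, "Nader": 0.2}
print(check_exhaustive(election))

# Iran example: settle on no-attack at 80% first, then divide the remaining
# 20 points using the same relative weights (15 : 5 : 15) as the naive guesses.
attack_weights = {"Israel air strike": 15, "Israel covert ops": 5, "US air strike": 15}
attacks = rescale_to_remaining(attack_weights, remaining=20.0)
full = {"No attack": 80.0, **attacks}
print(check_exhaustive(full))
```

Fixing the "nothing happens" branch first and only then dividing the remainder keeps the fine-grained options from inflating the total.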
Modus tollens vs modus ponens
For an explanation of this aphorism, see “Knowing your argumentative limitations, OR one rationalist’s modus ponens is another’s modus tollens.”; a modern version is George Moore’s here is a hand argument, and it is related to the Duhem-Quine thesis. It can be considered a flaw in uses of proofs by contradiction or the reductio ad absurdum - how does one know the conclusion really is absurd and to reject one of the premises instead of perhaps “biting the bullet”, as in a mock review of smoking’s benefits? A fun Islamic version goes (Imam al-Haddad, The Sublime Treasures):
Nothing can be soundly understood
If daylight itself needs proof.
When Roberts argues that one’s subjective memories about sleep conflict with the sleep data recorded by one’s Zeo EEG, does that constitute a disproof of the Zeo’s accuracy? No: establishing contradictions between one’s memories/subjective impressions and the Zeo merely tells us that one (or both) are wrong; it doesn’t tell us that the Zeo is wrong unless you have additional data or arguments which say that the Zeo is less reliable than the memories. One could take the Zeo contradicting memories as just proof of the fallibility of sleep-related memories37! (The fundamental question of epistemology: “What do you believe, and why do you believe it?”)
For example, if someone is caught on camera sleep-walking, and denies strenuously that he was sleep-walking, do you take modus ponens and say his memories prove he was not sleep-walking and reject the camera footage; or modus tollens and say that the claim his sleep memories are reliable implies he could not have been caught on camera, but he was, therefore we can reject the claim his sleep memories imply no walking? But extraordinary claims require extraordinary evidence, so obviously you choose to take modus tollens - because you have priors which say that memories are malleable and untrustworthy, while camera footage is much harder to fake. Before the discovery of the timing error, the 2011 FTL neutrinos result was an excellent place to apply this ‘I defy (that particular) data’ reasoning, as are such errors in general. Steven Kaas puts it nicely:
According to [the 2009 blog post] [“A New Challenge to Einstein”](http://blogs.discovermagazine.com/cosmicvariance/2009/10/12/a-new-challenge-to-einstein/), General Relativity has been refuted at 98% confidence. I wonder if it wouldn’t be more accurate to say that, actually, 98% confidence has been refuted at General Relativity.
Similarly, if I read Sturrock 2013 on the Shakespeare authorship question where he argues that Shakespeare’s plays were not written by Shakespeare but by De Vere, and gives an example analysis concluding that the odds Shakespeare wrote his plays is 1 in 10^13 (10 trillion), then I am apt to think that the available evidence on the issue could never afford us an extraordinary level of certainty, that we do not have this level of certainty in things like the theory of relativity, and that Sturrock has instead proven his analysis to be 1 in trillions likely to be a valid analysis! When any method claims to have reached such an extraordinary level of certainty, it has certainly disproven itself. (This can be applied on a much smaller scale as a statistical power analysis: asking to what extent the employed data could ever support the conclusion reached by a regular analysis.)
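One way to make this concrete is to weight the claimed result by the chance that the analysis itself is sound. In the sketch below only the 10^-13 figure comes from the text; the 99% trust in the analysis and the 50% fallback prior are assumptions picked for illustration, and `probability_allowing_for_error` is a hypothetical helper.

```python
def probability_allowing_for_error(p_if_valid, p_valid, p_if_invalid):
    """Law of total probability over 'the analysis is valid' vs. 'it is flawed'."""
    return p_if_valid * p_valid + p_if_invalid * (1 - p_valid)

claimed = 1e-13  # the analysis's own answer, taken at face value

# Assumed for illustration: 99% trust in the analysis, and a 50% fallback
# probability if the analysis turns out to be flawed.
p = probability_allowing_for_error(claimed, p_valid=0.99, p_if_invalid=0.5)
print(p)
```

Under these assumed inputs the answer is about 0.005, set almost entirely by the 1% chance that the analysis is wrong; the 10^-13 term contributes nothing visible. Any method reporting certainty far beyond our confidence in the method itself mostly tells us about the method.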
This point isn’t always appreciated: when you have 2 contradicting claims or arguments, only 1 can be correct but the contradiction doesn’t tell you which one is correct. You need to step outside the argument and find additional data or perspectives. From Gary Drescher’s Good and Real: Demystifying Paradoxes from Physics to Ethics:
A paradox arises when two seemingly airtight arguments lead to contradictory conclusions - conclusions that cannot possibly both be true. It’s similar to adding a set of numbers in a two-dimensional array and getting different answers depending on whether you sum up the rows first or the columns. Since the correct total must be the same either way, the difference shows that an error must have been made in at least one of the two sets of calculations. But it remains to discover at which step (or steps) an erroneous calculation occurred in either or both of the running sums. There are two ways to rebut an argument. We might call them countering and invalidating.
- To counter an argument is to provide another argument that establishes the opposite conclusion.
- To invalidate an argument, we show that there is some step in that argument that simply does not follow from what precedes it (or we show that the argument’s premises - the initial steps - are themselves false).
If an argument starts with true premises, and if every step in the argument does follow, then the argument’s conclusion must be true. However, invalidating an argument - identifying an incorrect step somewhere - does not show that the argument’s conclusion must be false. Rather, the invalidation merely removes that argument itself as a reason to think the conclusion true; the conclusion might still be true for other reasons. Therefore, to firmly rebut an argument whose conclusion is false, we must both invalidate the argument and also present a counterargument for the opposite conclusion.
In the case of a paradox, invalidating is especially important. Whichever of the contradictory conclusions is incorrect, we’ve already got an argument to counter it - that’s what makes the matter a paradox in the first place! Piling on additional counterarguments may (or may not) lead to helpful insights, but the counterarguments themselves cannot suffice to resolve the paradox. What we must also do is invalidate the argument for the false conclusion - that is, we must show how that argument contains one or more steps that do not follow.
Failing to recognize the need for invalidation can lead to frustratingly circular exchanges between proponents of the conflicting positions. One side responds to the other’s argument with a counterargument, thinking it a sufficient rebuttal. The other side responds with a counter-counterargument - perhaps even a repetition of the original argument - thinking it an adequate rebuttal of the rebuttal. This cycle may persist indefinitely. With due attention to the need to invalidate as well as counter, we can interrupt the cycle and achieve a more productive discussion.
An example from mathematics by Timothy Gowers (“Vividness in Mathematics and Narrative”, in Circles Disturbed: The Interplay of Mathematics and Narrative):
…a suggestion was made that proofs by contradiction are the mathematician’s version of irony. I’m not sure I agree with that: when we give a proof by contradiction, we make it very clear that we are discussing a counterfactual, so our words are intended to be taken at face value. But perhaps this is not necessary. Consider the following passage.
There are those who would believe that every polynomial equation with integer coefficients has a rational solution, a view that leads to some intriguing new ideas. For example, take the equation x² - 2 = 0. Let p/q be a rational solution. Then (p/q)² - 2 = 0, from which it follows that p² = 2q². The highest power of 2 that divides p² is obviously an even power, since if 2ᵏ is the highest power of 2 that divides p, then 2²ᵏ is the highest power of 2 that divides p². Similarly, the highest power of 2 that divides 2q² is an odd power, since it is greater by 1 than the highest power that divides q². Since p² and 2q² are equal, there must exist a positive integer that is both even and odd. Integers with this remarkable property are quite unlike the integers we are familiar with: as such, they are surely worthy of further study.
I find that it conveys the irrationality of √2 rather forcefully. But could mathematicians afford to use this literary device? How would a reader be able to tell the difference in intent between what I have just written and the following superficially similar passage?
There are those who would believe that every polynomial equation has a solution, a view that leads to some intriguing new ideas. For example, take the equation x² + 1 = 0. Let i be a solution of this equation. Then i² + 1 = 0, from which it follows that i² = -1. We know that i cannot be positive, since then i² would be positive. Similarly, i cannot be negative, since i² would again be positive (because the product of two negative numbers is always positive). And i cannot be 0, since 0² = 0. It follows that we have found a number that is not positive, not negative, and not zero. Numbers with this remarkable property are quite unlike the numbers we are familiar with: as such, they are surely worthy of further study.
Indeed, how would a reader show the difference - why do we apply modus tollens when we accept √2 must be irrational but then apply modus ponens and accept i as being real in some sense? Do we simply appeal to the utility of using i, and say with Wittgenstein, “If a contradiction were now actually found in arithmetic - that would only prove that an arithmetic with such a contradiction in it could render very good service; and it would be better for us to modify our concept of the certainty required, than to say it would really not yet have been a proper arithmetic.” But such use of priors may lead us to fanaticism:
“An atheist familiar with biology and medicine has no reason to believe the biblical story of the resurrection. But a Christian who believes it by faith should not, according to Plantinga, be dissuaded by general biological evidence. Plantinga compares the difference in justified beliefs to a case where you are accused of a crime on the basis of very convincing evidence, but you know that you didn’t do it. For you, the immediate evidence of your memory is not defeated by the public evidence against you, even though your memory is not available to others. Likewise, the Christian’s faith in the truth of the gospels, though unavailable to the atheist, is not defeated by the secular evidence against the possibility of resurrection. Of course sometimes contrary evidence may be strong enough to persuade you that your memory is deceiving you. Something analogous can occasionally happen with beliefs based on faith, but it will typically take the form, according to Plantinga, of a change in interpretation of what the Bible means. This tradition of interpreting scripture in light of scientific knowledge goes back to Augustine, who applied it to the “days” of creation. But Plantinga even suggests in a footnote that those whose faith includes, as his does not, the conviction that the biblical chronology of creation is to be taken literally can for that reason regard the evidence to the contrary as systematically misleading. One would think that this is a consequence of his epistemological views that he would hope to avoid.” –Thomas Nagel, “A Philosopher Defends Religion”
Characteristic of this philosophical use, we will often find instances of the disagreement anytime foundational issues like methodology come up in a field; an example from sociology is provided by the classic paper “The Iron Law Of Evaluation And Other Metallic Rules” (emphasis added):
A possibility that deserves very serious consideration is that there is something radically wrong with the ways in which we go about conducting evaluations. Indeed, this argument is the foundation of a revisionist school of evaluation, composed of evaluators who are intent on calling into question the main body of methodological procedures used in evaluation research, especially those that emphasize quantitative and particularly experimental approaches to the estimation of net impacts. The revisionists include such persons as Michael Patton (1980) and Egon Guba (1981). Some of the revisionists are reformed number crunchers who have seen the errors of their ways and have been reborn as qualitative researchers. Others have come from social science disciplines in which qualitative ethnographic field methods have been dominant. Although the issue of the appropriateness of social science methodology is an important one, so far the revisionist arguments fall far short of being fully convincing. At the root of the revisionist argument appears to be that the revisionists find it difficult to accept the findings that most social programs, when evaluated for impact assessment by rigorous quantitative evaluation procedures, fail to register main effects: hence the defects must be in the method of making the estimates. This argument per se is an interesting one, and deserves attention: all procedures need to be continually re-evaluated. There are some obvious deficiencies in most evaluations, some of which are inherent in the procedures employed. For example, a program that is constantly changing and evolving cannot ordinarily be rigorously evaluated since the treatment to be evaluated cannot be clearly defined. Such programs either require new evaluation procedures or should not be evaluated at all.
Other sections - most notably Mr. Flynn’s attack on the death penalty - are also tainted by serious left-wing bias. He is eager to argue that the “competent” murderers of 1960 were “mentally retarded” by modern standards. You could just as easily conclude, however, that the “mentally retarded” murderers of today are “competent” by the standards of 1960. Mr. Flynn briefly considers this argument, and objects that while IQ has risen, “practical intelligence” - the “ability to live autonomous lives” - hasn’t. If so, the “Flynn effect” has no effect on this debate: Can’t we simply use unadjusted IQ scores as a proxy for practical intelligence?
And later commented on the issue of “common sense” & burdens of proof:
Hasn’t common sense been wrong before? Of course. But how do people show that a common sense view is wrong? By demonstrating a conflict with other views even more firmly grounded in common sense. The strongest scientific evidence can always be rejected if you’re willing to say, “Our senses deceive us” or “Memory is never reliable” or “All the scientists have conspired to trick us.” The only problem with these foolproof intellectual defenses is… that… they’re… absurd.
Bayesian E.T. Jaynes, in “Chapter 5: Queer uses for probability theory”, discusses the probabilistic generalization of the reasoning we are engaged in when we choose whether to modus ponens or modus tollens:
What probability would you assign to the hypothesis that Mr. Smith has perfect extrasensory perception (ESP)? He can guess right every time which number you have written down. To say zero is too dogmatic…We take this man who says he has extrasensory perception, and we will write down some numbers from 1 to 10 on a piece of paper and ask him to guess which numbers we’ve written down. We’ll take the usual precautions to make sure against other ways of finding out. If he guesses the first number correctly, of course we will all say “you’re a very lucky person, but I don’t believe it.” And if he guesses two numbers correctly, we’ll still say “you’re a very lucky person, but I don’t believe it.” By the time he’s guessed four numbers correctly - well, I still wouldn’t believe it. So my state of belief is certainly lower than −40 db. How many numbers would he have to guess correctly before you would really seriously consider the hypothesis that he has extrasensory perception? In my own case, I think somewhere around 10. My personal state of belief is, therefore, about −100 db. You could talk me into a ±10 change, and perhaps as much as ±30, but not much more than that. But on further thought we see that, although this result is correct, it is far from the whole story. In fact, if he guessed 1000 numbers correctly, I still would not believe that he has ESP, for an extension of the same reason that we noted in Chapter 4 when we first encountered the phenomenon of resurrection of dead hypotheses. An hypothesis A that starts out down at −100 db can hardly ever come to be believed whatever the data, because there are almost sure to be alternative hypotheses above it, perhaps down at −60 db. Then when we get astonishing data that might have resurrected A, the alternatives will be resurrected instead. Let us illustrate this by two famous examples, involving telepathy and the discovery of Neptune.
…on the basis of such a result [as Mrs. Stewart’s experimental results], ESP researchers would proclaim a virtual certainty that ESP is real. …it hardly matters what these prior probabilities are; in the view of an ESP researcher who does not consider the prior probability $P_f = P(H_f \mid X)$ particularly small, $P(H_f \mid DX)$ is so close to unity that its decimal expression starts with over a hundred 9’s. He will then react with anger and dismay when, in spite of what he considers this overwhelming evidence, we persist in not believing in ESP. Why are we, as he sees it, so perversely illogical and unscientific? The trouble is that the above calculations (5-9) and (5-12) represent a very naïve application of probability theory, in that they consider only $H_p$ and $H_f$; and no other hypotheses. If we really knew that $H_p$ and $H_f$ were the only possible ways the data (or more precisely, the observable report of the experiment and data) could be generated, then the conclusions that follow from (5-9) and (5-12) would be perfectly all right. But in the real world, our intuition is taking into account some additional possibilities that they ignore.
…When we are dealing with some extremely implausible hypothesis, recognition of a seemingly trivial alternative possibility can make orders of magnitude difference in the conclusions. Taking note of this, let us show how a more sophisticated application of probability theory explains and justifies our intuitive doubts.
Let $H_p$, $H_f$, and $L_p$, $L_f$, $P_p$, $P_f$ be as above; but now we introduce some new hypotheses about how this report of the experiment and data might have come about, which will surely be entertained by the readers of the report even if they are discounted by its writers. These new hypotheses range all the way from innocent possibilities such as unintentional error in the record keeping, through frivolous ones (perhaps Mrs. Stewart was having fun with those foolish people, with the aid of a little mirror that they did not notice), to less innocent possibilities such as selection of the data (not reporting the days when Mrs. Stewart was not at her best), to deliberate falsification of the whole experiment for wholly reprehensible motives. Let us call them all, simply, “deception”. For our purposes it does not matter whether it is we or the researchers who are being deceived, or whether the deception was accidental or deliberate. Let the deception hypotheses have likelihoods and prior probabilities $L_i$, $P_i$. There are, perhaps, 100 different deception hypotheses that we could think of and are not too far-fetched to consider, although a single one would suffice to make our point. In this new logical environment, what is the posterior probability of the hypothesis $H_f$ that was supported so overwhelmingly before? Probability theory now tells us:
$$P(H_f \mid DX) = \frac{P_f L_f}{P_f L_f + P_p L_p + \sum_i P_i L_i} \qquad \text{(5-13)}$$
Introduction of the deception hypotheses has changed the calculation greatly; in order for $P(H_f \mid DX)$ to come anywhere near unity it is now necessary that:
$$\sum_i P_i L_i \ll P_f L_f \qquad \text{(5-14)}$$
From (5-7), $P_p L_p$ is completely negligible so (5-14) is not greatly different from:
$$\sum_i P_i \ll P_f \qquad \text{(5-15)}$$
But each of the deception hypotheses is, in my judgment, more likely than $H_f$, so there is not the remotest possibility that inequality (5-15) could ever be satisfied. Therefore, this kind of experiment can never convince me of the reality of Mrs. Stewart’s ESP; not because I assert $P_f = 0$ dogmatically at the start, but because the verifiable facts can be accounted for by many alternative hypotheses. Indeed, the very evidence which the ESP’ers throw at us to convince us, has the opposite effect on our state of belief; issuing reports of sensational data defeats its own purpose. For if the prior probability of deception is greater than that of ESP, then the more improbable the alleged data are on the null hypothesis of no deception and no ESP, the more strongly we are led to believe, not in ESP, but in deception. For this reason, the advocates of ESP (or any other marvel) will never succeed in persuading scientists that their phenomenon is real, until they learn how to eliminate the possibility of deception in the mind of the reader.
It is interesting that Laplace perceived this phenomenon long ago. His Essai Philosophique sur les probabilités (1819) has a long chapter on the “Probabilities of Testimonies”, in which he calls attention to “the immense weight of testimonies necessary to admit a suspension of natural laws”. He notes that those who make recitals of miracles, “decrease rather than augment the belief which they wish to inspire; for then those recitals render very probable the error or the falsehood of their authors. But that which diminishes the belief of educated men often increases that of the uneducated, always avid for the marvelous.” We observe the same phenomenon at work today, not only in the ESP enthusiast, but in the astrologer, reincarnationist, exorcist, fundamentalist preacher or cultist of any sort, who attracts a loyal following among the uneducated by claiming all kinds of miracles; but has zero success in converting educated people to his teachings. Educated people, taught to believe that a cause-effect relation requires a physical mechanism to bring it about, are scornful of arguments which invoke miracles; but the uneducated seem actually to prefer them. [see also David Hume’s “Of Miracles”]
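Jaynes’s argument is easy to check numerically. The following is only a rough sketch in Python: the priors (ESP at about −100 db, a single lumped “deception” hypothesis at about −60 db) are taken loosely from the figures he mentions, and the likelihoods are my own simplifying assumptions (ESP and deception both predict a run of correct guesses perfectly, chance predicts each guess with probability 1/10). None of the function or variable names come from Jaynes.

```python
import math

def db(odds):
    """Evidence in decibels: 10 * log10 of an odds or likelihood ratio."""
    return 10 * math.log10(odds)

def posterior(priors, likelihoods):
    """Bayes' theorem over a set of mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# One correct guess of a number from 1 to 10 is ten times likelier under
# perfect ESP than under chance, so each hit adds 10 db in favor of ESP.
evidence_per_guess = db(1.0 / 0.1)

# Illustrative priors (assumptions, not Jaynes's exact numbers).
priors = {"esp": 1e-10, "deception": 1e-6}
priors["chance"] = 1.0 - sum(priors.values())

def after_n_correct(n):
    """Posterior after n correct guesses in a row."""
    # Assumption: deception explains the hits exactly as well as ESP does.
    likelihoods = {"esp": 1.0, "deception": 1.0, "chance": 0.1 ** n}
    return posterior(priors, likelihoods)

for n in (0, 4, 10, 1000):
    post = after_n_correct(n)
    print(n, f"esp={post['esp']:.3g}", f"deception={post['deception']:.3g}")
```

Under these assumptions, the probability of ESP never rises above the prior ratio of ESP to deception, however long the run of correct guesses: the astonishing data resurrect the deception hypothesis instead.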
See also “Inherited Improbabilities: Transferring the Burden of Proof” on application to the Amanda Knox case.
Case Study: Testing Confirmation Bias
Original LessWrong discussion
Confirmation bias is one of the most common cognitive biases: it is putting excess weight on evidence that confirms your belief, and even ignoring evidence falsifying your belief. It’s particularly dreadful because at no point does one say something that is actually wrong - you can build up an immaculate scientific case for a wrong position by cherry-picking evidence. (A funny example: “Cigarette smoking: an underused tool in high-performance endurance training”.)
Confirmation bias is an issue in self-experimentation/Quantified Self because one is already at a disadvantage in evaluating the results - humans don’t weigh them very well; satt points out that (via the Bienaymé formula) “An RCT with a sample size of e.g. 400 would still be 10 times better than 4 self-experiments by this metric.” If one encountered 4 immaculately run self-experiments, I suspect they would feel like more evidence than 1/10th that RCT. When you toss in any selection effects (due to confirmation bias), the value of those 4 trials plunges even further.
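The “10 times” figure can be reproduced in a couple of lines. This sketch assumes that the metric satt has in mind is the standard error of the mean, and that every measurement is independent with the same noise level `sigma` (an arbitrary placeholder value); the Bienaymé formula then gives a variance of `sigma**2 / n` for the mean of `n` measurements.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean of n uncorrelated measurements.

    By the Bienaymé formula, the variance of a sum of uncorrelated
    variables is the sum of their variances, so the variance of the
    mean is sigma**2 / n.
    """
    return sigma / math.sqrt(n)

sigma = 1.0  # assumed per-measurement noise, arbitrary units

se_self_experiments = standard_error(sigma, 4)   # 4 self-experiments
se_rct = standard_error(sigma, 400)              # RCT with n = 400

ratio = se_self_experiments / se_rct
print(f"RCT estimate is about {ratio:.1f}x more precise")
```

The ratio depends only on the square root of the sample sizes, sqrt(400 / 4), so the choice of `sigma` does not matter.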
Fortunately, just as there is a somewhat easy way to test for status quo bias, there’s also a somewhat easy way to test for confirmation bias: simply present a high-quality result - with the reverse of the true outcome. Or present the initial data about the setup and whatnot, but hide the results. (The savvy will recognize this as similar to Robin Hanson’s proposal for conclusion-blind peer review, which is a variant on result-blind peer review involving more deception.) Or better yet, present the same study with both egosyntonic and egodystonic results. If the subject rates them differently, well, the only varying factor was the outcome… (One of the tests on YourMorals.org does just this by rewording a ‘study’ on gun control.) This is a strategy similar but not identical to a Sokal affair, since the subjects in the Sokal affair could plausibly claim that they didn’t understand the pseudo-physics and were trusting the word of a physics & mathematics professor in good standing - in a confirmation bias test, the subject must understand the material and reject it because it conflicts with his prior beliefs.
Amateur Science -
I do what I must,
because, I can.
For the good of all of us.
Except the ones who were tricked.
But there’s no sense crying over the missing frills,
You just keep on trying until you run out of pills.
And the Science38 was fun,
And you get neat posts done
For the people who are, still alive.
Seth Roberts is a psychology professor and blogger, famous for his unconventional diet proposed in his book The Shangri-La Diet; I’ve read his blog since ~2010, and found it filled with interesting self-experiment suggestions. He has also been mentioned positively repeatedly on LessWrong, for example in Eliezer Yudkowsky’s post “The Unfinished Mystery of the Shangri-La Diet”. Roberts posts much material critical of mainstream psychology & medicine. Fair enough; I’m not a huge fan either. But he also posts many anecdotes/interviews, most of which are unremittingly positive, and is willing to use sources - apparently not humorously - that I regard as utter cesspools. (For example, Ayurvedic medicine, which you may remember as being keen on heavy metal poisoning.) All this began to make me wonder. Roberts has many publications and theories, so one might just read through them carefully and see how many were borne out; but unfortunately, few to no real trials have been done as far as I am able to tell. (When asked about the absence of trials in March 2010, Roberts pointed to 20 unpublished “case series” of his Shangri-La diet by a “professor at SUNY Upstate Medical Center”.) Since it’s not easy to directly check his work, one would have to resort to much less reliable & indirect methods.
The First Experiment
Vitamin D is Roberts’s latest theory, which he began posting on often around December 2011. If you look through the vitamin D category on his blog, you will see that most/all of the anecdotes are non-randomized retrospective anecdotes over <7 days about generalities like wake-time or mood, with no real analysis. This is unfortunate as sleep data is very noisy, and self-reported unrecorded data even worse. But I found the idea interesting enough (and completely uncovered in the academic literature I searched) to drop my modafinil experiments and start setting up some decent self-experiments.
While I was still running my first vitamin D & sleep experiment, I emailed with Roberts about it beginning 24 January 2012; back in June, he had written favorably on Zeos. He seemed interested and asked for help interpreting the Zeo data set (which I had provided as a CSV export). His main reaction was that I was only testing that vitamin D in the evening was damaging my sleep, and this was not interesting to him since he was suggesting that vitamin D in the morning would help sleep. I was randomizing days, which meant there could be multi-day effects that contaminated the results. My blinding was too complicated, specifically my attempt to keep vitamin D consumption constant (so as to not confound levels of vitamin D with timing). He said he had done a t-test on the data I had posted so far, and the effect was there but not statistically-significant; he also said in that email that he didn’t trust the Zeo summary score (ZQ) of sleep length/awakenings/composition. When I finished and posted my analysis that the damage had indeed reached significance for multiple metrics, Roberts said it was interesting work and he’d link it.
The Second Experiment
I had intended to stop there and go back to modafinil, but it got posted to Hacker News and apparently people were interested in it and what vitamin D in the morning would do. So I emailed Roberts again, after re-reading his criticisms, and proposed a design for a morning experiment: 5-day blocks, randomized as before, recorded the morning after, for a full 50 days. After some back and forth, we settled on 7-day paired blocks. He would have preferred that every pair of weeks be blinded, but I didn’t have that many Tupperwares on hand; he didn’t mention any criticism of Zeo data.
So I created the 50 active & placebo pills, and started the experiment. This time I did not post any data publicly.
Things proceeded reasonably well, and Friday (28 April 2012) was the last day. On Saturday, I uploaded the Zeo data, annotated the days by active/placebo, put in the 1-5 Mood ranking, and ran the same R functions. To my considerable surprise, of the 9 metrics, only 1 reached significance (‘Morning Feel’ - how I feel when I wake up in the morning, cruddy or refreshed, 1-5), but it was very statistically-significant (p=0.005, survived multiple-comparison correction) and had a strong effect size (d=0.7). By a remarkable coincidence, Roberts posted his own results that day, and found little effect but this one:
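To make the “survived correction” and “effect size” claims concrete, here is a small Python sketch. It assumes a plain Bonferroni correction at the conventional α = 0.05 (the text does not say which correction the R analysis used), and the two lists of ratings are made-up placeholders to show how Cohen’s d is computed, not the real ‘Morning Feel’ data.

```python
import math
import statistics

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * statistics.variance(treatment)
                  + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
    return (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(pooled_var)

alpha = 0.05            # assumed significance level
n_metrics = 9           # number of metrics tested, from the text
p_morning_feel = 0.005  # reported p-value, from the text

threshold = bonferroni_threshold(alpha, n_metrics)
survives = p_morning_feel <= threshold
print(f"threshold={threshold:.4f}, survives correction: {survives}")

# Hypothetical 1-5 ratings, for illustration only.
active_days = [4, 3, 4, 3, 4, 3, 4, 4]
placebo_days = [3, 3, 3, 4, 3, 2, 3, 3]
print(f"d={cohens_d(active_days, placebo_days):.2f}")
```

With 9 tests the Bonferroni threshold is 0.05 / 9 ≈ 0.0056, so a p-value of 0.005 narrowly survives even this conservative correction.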
When I woke up in the morning I rated how rested I felt on a 0-100 scale, where 0 = not rested at all and 100 = completely rested. I’d been using this scale for years. Here are the results (means and standard errors):
Vitamin D3 had a clear effect, but the necessary dose was more than 2000 IU. If Vitamin D3 acts like sunlight, you might think that taking it in the morning would make me wake up earlier. Here are the results for the time I woke up:…There was no clear effect of dosage on when I got up. Shifting the time from 8 am to 9 am may have had an effect (I wish I had 3 more days at 9 am).
I take 5000 IU. ‘rested’ seems pretty much identical to ‘Morning Feel’. Very promising!
But I had kept my data private for a reason: so I could edit it. I tampered with the high marks in ‘Morning Feel’, and left it with a non-significant small increase. I published the supposed results to gwern.net, and sent an email to Seth Roberts, principally reading:
I was surprised to see http://blog.sethroberts.net/2012/04/28/effect-of-vitamin-d3-on-my-sleep/ today, but the timing is remarkably apt - yesterday was the last data day for my morning experiment (you remember helping me design it, I hope!), and I was in the middle of processing and then analyzing my results: http://www.gwern.net/Zeo#morning-analysis (If you don’t see the analysis, you may need to force-reload.)
So not to mince words, the upshot is that none of the metrics showed any significance. The best p-value is like 0.3. Given that it is a pretty good quality data set and the Zeo is a lot more reliable than writing things down or whatever your informants were doing, and I followed your suggestions on experimental design, I’d say this suggests that improvements from vitamin D are just noise/selection effects or alternately, reflect the very genuine improvement caused by not taking vitamin D in the evening and messing with your sleep. Anyway, can I assume you will post a link to it on your blog? I’d like to see what the other commentators make of it.
Then, I registered a prediction of 70% on PredictionBook.com for the next 3 days: “Seth Roberts will not post a blog post on my (supposedly) null result for taking vitamin D in the morning.”
Roberts replied within the hour, asking why I trusted Zeo data, when he thought his own Zeo misreported when he went to sleep and there were negative reviews on Amazon. (I am not quoting him because I asked later whether I could, and he declined, since he didn’t want to hurt a struggling company like Zeo Inc.; Zeo would shut down in 2013.)
I can’t say I was surprised that he tried to deny the results (see above), but I was disappointed. There were many possible replies to my report - null results are expected in n=1 experiments, for example, given that many people’s responses will be idiosyncratic or their self-experiments underpowered. I mentioned my car being totaled and ruining that week (and the next), which was a plausible reason as well. And so on. But instead, he went after the Zeo. (I suppose I should be glad he didn’t resort to ad hominems like ‘you must have screwed something up’, although ironically, that was actually the right class of explanations for the data!) I replied:
[why I trust the Zeo data:] Principally the papers comparing the Zeo data to polysomnography, which is of course the gold standard. They give the Zeo something like 75% accuracy in the sense of giving the same state classification per time unit.
I’ve seen the Amazon page, and what I would ask is what are they comparing the Zeo to? “One man’s modus ponens is another man’s modus tollens”, as we philosophy types like to say. The study of sleep is replete with subjective illusions where people are simply flat out wrong about sleep: they forget long intervals before going to sleep, they underestimate number of wakings [among other things], false awakening is a real phenomenon, sense of time is distorted*, hypnagogic illusions abound, etc.
(I’d compare it to saying ‘well, horoscopes work for me’, except that’s a little unfair to the Zeo critics - as I said, there is a real mistake rate and this is one reason to use datasets like 50 days rather than 1 or 2.)
* lucid dreaming offers the best example. Some of LaBerge’s experiments involved the subject when lucid moving his eyes - externally detectable on EEG - and waiting a fixed time and then moving his eyes again. The times were considerably wrong, which makes sense since we’ve all had dreams where one ‘experiences’ hours or years, even though dreams/REM intervals only last a few minutes.
Roberts dismissed the papers as upper bounds and designed to produce best possible results (eg. use of new headbands - not that I noticed degradation of data after months of use, and as it happens, I replaced my headband shortly before starting); he would take it seriously only if the ZQ correlated with how he felt on waking or during the day. (ZQ was only one of the 9 metrics, and 2 of the metrics did correspond to those two examples.) This was a little surprising because he had previously seemed positive about Zeo results and had not criticized it - indeed, as of 1 June 2013, he still has yet to post or publish anything I am aware of listing criticisms of Zeo-based data or why one would not trust it.
I decided to drop the conversation there: I had heard enough. All I wanted now was permission to quote him, which he denied (see above).
As far as I know, before posting this my fake results were never linked or discussed publicly by Roberts. He has mentioned no links, I have seen nothing relevant in my RSS subscription, and Google Analytics reports no referrals. I was prepared to immediately correct the page if there was any activity, but there wasn’t39. He had posted multiple fresh blog posts since, such as another anecdotal interview on the Shangri-La Diet (based on a forum posting 2 days previously).
(One might wonder whether my experiment is itself an example of confirmation bias - had Roberts linked it, would I still be posting it? I hope so; besides the private PredictionBook.com prediction, I gave a hash pre-commitment to AngryParsley, told another LWer about my plan months ago, discussed it in #lesswrong, and alluded to it in some of my LW comments.)
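For readers unfamiliar with hash pre-commitments, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the procedure actually used: the choice of SHA-256, the random nonce, the `commit`/`verify` helper names, and the example statement are all assumptions of mine.

```python
import hashlib
import hmac
import secrets

def _digest(statement, nonce):
    """SHA-256 of the nonce and statement, as a hex string."""
    return hashlib.sha256(f"{nonce}:{statement}".encode("utf-8")).hexdigest()

def commit(statement):
    """Return (digest, nonce). Publish the digest now; keep the rest private."""
    # The random nonce stops anyone from guessing a short statement
    # by hashing likely candidates.
    nonce = secrets.token_hex(16)
    return _digest(statement, nonce), nonce

def verify(digest, statement, nonce):
    """Check a later-revealed statement and nonce against the digest."""
    return hmac.compare_digest(_digest(statement, nonce), digest)

# Hypothetical statement, for illustration only.
plan = "I will publish the write-up whether or not the fake result gets linked."
digest, nonce = commit(plan)

print(digest)                                # what the third party holds
print(verify(digest, plan, nonce))           # honest reveal
print(verify(digest, plan + " (edited)", nonce))  # altered statement fails
```

The third party learns nothing from the digest alone, but once the statement and nonce are revealed, anyone can confirm that the plan was fixed before the outcome was known.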
In 5 emails:
it’s really bad behavior, after I say I don’t want to be quoted, to quote and paraphrase me.
why I haven’t yet commented on your results: because I haven’t yet studied them. Not because they were negative.
I collected my own data in November. It took me four months to post it. So forgive me for not commenting on yours within 3 days.
I never looked at your (fake) data. I simply reacted to your email. Please note the absence of Zeo results in my own data (and I have a Zeo). I analyzed my own data – ignoring the Zeo stuff – long before your email.
You write as if as soon as I heard your (fake) results were negative, I dismissed them by questioning your Zeo. Actually, my email to you, reacting to your (fake) Vitamin D results, began “before I make a longer comment”. Those were its very first words. You failed to wait for that longer comment before passing judgment. Nor do you say anything about it in your description of what I did.
although you told me (August 2011) that your one-legged standing data supported my claims I have not yet blogged about it. I have not yet analyzed your data, either. This does not support your idea that if I don’t analyze or report some data, it must be due to confirmation bias.
You really won’t take down information that I said you couldn’t quote? After I ask you to?
(I have pointed out to him that I have scrupulously not quoted him, but only paraphrased him.)
So, Roberts is not very good about reporting results contradicting his theories. This is useful to keep in mind: he has a lot of ideas, a lot of which will be false, and one should treat them with due caution. This is also interesting because I believe Roberts has acknowledged as much in his papers defending self-experimentation (defending them as acceptable because they can be iterated on quickly and the end results will be valuable) - which is just another demonstration of the old observation that you can know all about a bias, and still succumb anyway.
I doubt I will do many more confirmation tests in the future, if any at all:
- This little test has used up a good deal of my time, and delayed me from posting my real results.
- Roberts is deeply offended by this, so I can forget about getting advice on setting up self-experiments from him in the future. (I made sure to get advice on my pending lithium self-experiment before I finished, but what about future self-experiments?)
- And testing confirmation bias in this fashion is intrinsically deceptive, so I probably have damaged my online reputation as well.
But if I were? My experiment has been fiercely criticized by Roberts and the LessWrong commenters, and they have given me multiple ways I could have done better:
The 3 day waiting period was too short
Besides simple impatience, I waited only 3 days because the wrong data was posted to my public page of sleep results in order to look as genuine as possible. Readers of that page were being misled along with Roberts, and their numbers were not trivial: Google Analytics reports 710 unique page views during April 2012 (this was in the absence of any submissions to social media like Hacker News etc), for ~24 people a day reading the page - and 72 people over the 3 day period.
I was aware of the visitors being duped along with Roberts and wished to minimize the number. I thought this was an acceptable design because while Roberts makes it sound like he was going to do an in-depth analysis before discussing my data and so the time interval is ridiculously short, I don’t believe this: if you look at the vitamin D category, you see he posts plenty of people’s reports without formally analyzing their data but just briefly describing it or quoting them, and he had time to post something like 3 blog posts before I published this, one of which was a link roundup perfect for linking my results (and he does so frequently - that was one roundup out of hundreds).
I didn’t realize that people would see the 3 day waiting period as so questionable. Thinking about it some more, I realize now what I should have done: I should have created a separate page on my site just for the fake results, and sent the subject that but linked it nowhere else. The subject would have no reason to be suspicious, the page would indeed be public, but it would not actually get traffic from any but the nosiest readers; hence, I could leave the fake page up for months. (At some point I could even put up the real results on the main page (for the normal readers), since it would be unlikely for the subject to just randomly visit the page and notice the discrepancy, especially if they had looked at it before.)
As a follow-up experiment, I had an acquaintance send Roberts links to full-text of 2 new studies & papers on vitamin D & sleep which I knew he would be interested in. He replied within hours, but >9 days later he had not posted anything at all on them to his blog. Since he had no reason to be suspicious and both papers supported the link (eliminating the concern about confirmation bias), this demonstrates that 3 days was probably too short a time period.
I should have tested multiple null results on Roberts. One of the commenters, Tetronian:
One interaction is simply not enough to gauge how biased a person is. There are many confounding variables here, and this is an n = 1 sample. Roberts’ response may have been affected by his mood, how much he ate for breakfast, and any number of other things. As you said in the post, n = 1 experiments just aren’t good sources of data.
As matters stand, I have sent him 3 results: one-legged standing, vitamin D in the evening, and vitamin D in the morning. The first was a weak positive finding but the experiment was very low-quality by my standards so it did not surprise me that he was not very interested - I included it mostly for completeness. The second I could have reversed the results of if I had been thinking about confirmation bias at that point, but it would not have been a direct challenge to his theory - to the extent it matters to the theory at all, his theory speculates that it perhaps will damage sleep in the evening since it’s influencing circadian rhythms and it isn’t a mere matter of vitamin D deficiency. But since there is no known specific mechanism for vitamin D to affect sleep at all, it’s not a real test of confirmation bias.
A possible followup is the 19 August 2012 write up by Chris L using a year of Zeo sleep data where he argues a clear trend exists with vitamin D supplementation reducing his deep sleep; Roberts commented on the post, but as of 23 May 2013, Roberts has not posted about Chris L on his blog (though he posted before and after on 3 positive anecdotes).
wait for credit
Obviously he wrote no blog post discussing my findings; in addition, Roberts has before and since written papers on his results & methodology, covering at length his vitamin D findings and the anecdote of his inspiration, Tara Grant - no mention is made of either of my superior experiments. Of course he has the right to write his papers as he pleases (as long as he’s throwing out evidence in his favor, of course; if I had instead refuted the vitamin D findings, I would regard his omission as dishonest), but this turns out to bother me more than I thought it would (if indeed I considered that at all beforehand). I would have done better to not publish this little experiment at all.
As is true of every short description, this is a little over-simplified. People are risk-averse and fundamentally uncertain, so their beliefs about the true probability won’t directly translate into the percentage/price they will buy at, and one can’t even average out and say ‘this is what the market believes the probability is’. See economist Rajiv Sethi’s “On the Interpretation of Prediction Market Data” & “From Order Books to Belief Distributions”; for more rigor, see Wolfers & Zitzewitz’s paper, “Interpreting Prediction Market Prices as Probabilities”↩
Or negative-sum, when you consider the costs of running the prediction market and the various fees that might be assessed on participants - the house needs a cut. In some circumstances, prediction markets can be positive-sum for traders: if some party benefits from the information and will subsidize it to encourage trading. For example, when companies run internal prediction markets they tend to subsidize the markets.
Public prediction market subsidies are much rarer - the only instance I know of is Peter McCluskey subsidizing 2008 Intrade markets (announcement). As far as he could tell in November 2008, his subsidies did not do much. I emailed him in May 2012, and he said:
I was somewhat disappointed with the results.
I don’t expect a small number of subsidized markets is enough to accomplish much. I suspect it would require many donors (or a billionaire donor) to create the markets needed for me to consider them successful. I see no hint that my efforts encouraged anyone else to subsidize such markets.
When one restricts the exchange of information among panelists so severely and denies them the chance to explain the rationales behind their estimates, it is no surprise that feedback loses its potency (indeed, the statistical information may encourage the sort of group pressures that Delphi was designed to pre-empt). We (Rowe and Wright 1996) compared a simple iteration condition (with no feedback) to a condition involving the feedback of statistical information (means and medians) and to a condition involving the feedback of reasons (with no averages) and found that the greatest degree of improvement in accuracy over rounds occurred in the “reasons” condition. Furthermore, we found that, although subjects were less inclined to change their forecasts as a result of receiving reasons feedback than they were if they received either “statistical” feedback or no feedback at all, when “reasons” condition subjects did change their forecasts they tended to change towards more accurate responses. Although panelists tended to make greater changes to their forecasts under the “iteration” and “statistical” conditions than those under the ‘reasons’ condition, these changes did not tend to be toward more accurate predictions. This suggests that informational influence is a less compelling force for opinion change than normative influence, but that it is a more effective force. Best (1974) has also provided some evidence that feedback of reasons (in addition to averages) can lead to more accurate judgments than feedback of averages (e.g., medians) alone.
It may be a stretch to generalize this to a single person predicting on their own, though many tools involve groups or you could view predicting as a Delphi method involving temporally separated selves. (If multiple selves works for Ainslie in explaining addiction, why not predicting?)↩
On 27 January 2008, the IEM sent out an email which accidentally listed all recipients in the CC; the listed addresses totaled 292. Given that many of these traders (like myself) are surely inactive or infrequent, and only a fraction will be active at a given time, this means the 10 or so markets are thinly inhabited.↩
The problem is that if a contract is at 10%, and you buy 10 contracts, then if the contract actually pays off, you have to come up with 100% to pay the other people their winnings. Intrade, to guarantee them payment, will make you pay the full 10%, and then freeze the 90% in your account.↩
This section first appeared on LessWrong.com as “2011 Intrade fee changes, or, Intrade considered no longer useful for LessWrongers” and includes some discussion.↩
When I submitted my withdrawal request for my balance, I received an email offering to instead set my account to ‘inactive’ status such that I could not trade but would not be charged the fee; if I wanted to trade, I would simply be charged that month’s $5. I declined the offer, but I couldn’t help but wonder - why didn’t they simply set all accounts to ‘inactive’ and then let people opt in to the new fee structure? Or at least set ‘inactive’ all accounts which have not engaged in any transactions within X months?
Regardless, here are my probabilities for Intrade ending in the next few years:
- Intrade will close/merge/be sold by 2012: 5%
- Intrade will close/merge/be sold by 2013: 8%
- Intrade will close/merge/be sold by 2015: 18%
- Intrade will not be open for business in 2020: 35%
In March 2013 (relevant events post-dating my predictions include the US CFTC attacking Intrade), Intrade announced it was shutting down trading and liquidating all positions. I probably was far too optimistic.↩
I made $0.31 on DEM.2012, $3.65 on REP.2012, and $1.40 on 2012.REP.NOM.PALIN for a total profit of $5.36.↩
An aside: there’s not much point in accumulating more than, say, 1000 bitcoins. It’s generally believed that Bitcoin’s ultimate fate will be victory or failure - it’d be very strange if Bitcoin leveled off as a stable permanent alternative currency for only part of the Internet. In such a situation, the difference between 1000 bitcoins and 1500 bitcoins is like the difference to Bill Gates between $60 billion and $65 billion; it matters in some abstract sense, but not even a tiny fraction as much as the difference between $1 and $100 million. Money is logarithmic in utility, as the saying goes.↩
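The "logarithmic in utility" point can be checked with a few lines of arithmetic. This is only a sketch: it assumes utility is exactly the natural log of wealth, and the helper name `log_utility_gain` is made up for illustration.

```python
import math

def log_utility_gain(before: float, after: float) -> float:
    """Utility gained by moving from `before` to `after`, assuming u(w) = ln(w)."""
    return math.log(after / before)

# The two comparisons from the aside above.
gates = log_utility_gain(60e9, 65e9)   # $60 billion -> $65 billion
pauper = log_utility_gain(1, 100e6)    # $1 -> $100 million

# Ratio of the two gains: how little the extra $5 billion matters in comparison.
ratio = gates / pauper

print(f"billionaire gain: {gates:.3f}")
print(f"pauper gain:      {pauper:.3f}")
print(f"ratio:            {ratio:.4f}")
```

Under this assumption the extra $5 billion is worth well under 1% of the jump from $1 to $100 million, which is the sense in which the marginal bitcoins barely matter.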
The famous neurotransmitter dopamine is intimately involved with feelings of happiness and pleasure (which is why dopamine is affected by most addictions or addictive drugs). It also is involved in learning - make an error and no dopamine for you; “Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal” (Bayer & Glimcher 2005, Neuron):
The midbrain dopamine neurons are hypothesized to provide a physiological correlate of the reward prediction error signal required by current models of reinforcement learning. We examined the activity of single dopamine neurons during a task in which subjects learned by trial and error when to make an eye movement for a juice reward. We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected. Thus, the firing rate of midbrain dopamine neurons is quantitatively predicted by theoretical descriptions of the reward prediction error signal used in reinforcement learning models for circumstances in which this signal has a positive value. We also found that the dopamine system continued to compute the reward prediction error even when the behavioral policy of the animal was only weakly influenced by this computation.
Unfortunately they don’t give any population statistics so it’s hard for me to interpret my results:
Your calibration score is -3. Calibration is defined as the difference between the percentage average confidence rating and the percentage of correct answers. A score of zero is perfect calibration. Positive numbers indicate overconfidence and can go up to 100. Negative numbers represent under-confidence and can go down to -100.
Your discrimination score is 4.48. Discrimination is defined as the difference between the percentage average confidence rating for the correct items and the percentage average confidence rating for the incorrect items. Higher positive numbers indicate greater discrimination and are better scores.
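The two definitions quoted above are simple enough to compute by hand. The following is a minimal sketch of them, not the test site's actual code; the function names and the five-question quiz are invented for illustration.

```python
from statistics import mean

def calibration_score(confidences, correct):
    """Average confidence (in %) minus the percentage of correct answers.

    Zero is perfect; positive means overconfident, negative underconfident.
    """
    pct_correct = 100 * sum(correct) / len(correct)
    return mean(confidences) - pct_correct

def discrimination_score(confidences, correct):
    """Average confidence on correct items minus average confidence on incorrect items.

    Needs at least one correct and one incorrect answer to be defined.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    return mean(right) - mean(wrong)

# Hypothetical quiz: stated confidence (%) and whether each answer was right.
confidences = [90, 80, 70, 60, 50]
correct = [True, True, True, False, False]

cal = calibration_score(confidences, correct)
disc = discrimination_score(confidences, correct)
print(cal, disc)
```

In this made-up quiz the average confidence is 70% against 60% correct, so the calibration score is positive (overconfident), while the gap between confidence on right and wrong answers gives the discrimination score.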
Specifically, prediction #1007. In its preface to the results page, GJP told us:
Question 1007 (the “lethal confrontation” question) illustrates this point. Many of our best forecasters got ‘burned’ on this question because a Chinese fishing captain killed a South Korean Coast Guard officer late in the forecasting window - an outcome that the tournament’s sponsors deemed to satisfy the criteria for resolving the question as ‘yes’, but one that had little geopolitical significance (it did not signify a more assertive Chinese naval policy). These forecasters had followed our advice (or their own common sense) by lowering their estimated likelihood of a lethal confrontation as time elapsed and made their betting decisions based on this assumption.
For example, in the YourMorals.org tests dealing with calibration/bias, I usually do well above average, even for LessWrongers; see:
- “an experimental investigation of how people evaluate research evidence that either supports or opposes their pre-existing beliefs”
- “Over-claiming Technique”
- “Balanced Inventory of Desirable Responding”
- “Marlowe-Crowne Social Desirability Scale”
- “This scale is designed to measure the better-than-average effect, which is also known as the illusory superiority bias.”
The 2001 anthology of reviews and papers, Principles of Forecasting, is invaluable, although many of the papers are highly technical. Excerpts from Dylan Evans’s Risk Intelligence (in the Wall Street Journal) may be more readable:
Psychologists have tended to assume that such biases are universal and virtually impossible to avoid. But certain groups of people-such as meteorologists and professional gamblers-have managed to overcome these biases and are thus able to estimate probabilities much more accurately than the rest of us. Are they doing something the rest of us can learn? Can we improve our risk intelligence?
Sarah Lichtenstein, an expert in the field of decision science, points to several characteristics of groups that exhibit high intelligence with respect to risk. First, they tend to be comfortable assigning numerical probabilities to possible outcomes. Starting in 1965, for instance, U.S. National Weather Service forecasters have been required to say not just whether or not it will rain the next day, but how likely they think it is in percentage terms. Sure enough, when researchers measured the risk intelligence of American forecasters a decade later, they found that it ranked among the highest ever recorded, according to a study in the Journal of the Royal Statistical Society.
It helps, too, if the group makes predictions only on a narrow range of topics. The question for weather forecasters, for example, is always roughly the same: Will it rain or not? Doctors, on the other hand, must consider all sorts of different questions: Is this rib broken? Is this growth malignant? Will this drug cocktail work? Studies have found that doctors score rather poorly on tests of risk intelligence.
Finally, groups with high risk intelligence tend to get prompt and well-defined feedback, which increases the chance that they will incorporate new information into their understanding. For weather forecasters, it either rains or it doesn’t. For battlefield commanders, targets are either disabled or not. For doctors, on the other hand, patients may not come back, or they may be referred elsewhere. Diagnoses may remain uncertain.
…Royal Dutch Shell introduced just such a program in the 1970s. Senior executives had noticed that when newly hired geologists predicted oil strikes at four out of 10 new wells, only one or two actually produced. This overconfidence cost Royal Dutch Shell millions of dollars. In the training program, the company gave geologists details of previous explorations and asked them for numerical estimates of the chances of finding oil. The inexperienced geologists were then given feedback on the number of oil strikes that had actually been made. By the end of the program, their estimates roughly matched the actual number of oil strikes.
…Just by becoming aware of our tendency to be overconfident or underconfident in our estimates, we can go a long way toward correcting for our most common errors. Doctors, for instance, could provide numerical estimates of probability when making diagnoses and then get data about which ones turned out to be right. As for the rest of us, we could estimate the likelihood of various events in a given week, record our estimates in numerical terms, review them the next week and thus measure our risk intelligence in everyday life. A similar technique is used by many successful gamblers: They keep accurate and detailed records of their earnings and their losses and regularly review their strategies in order to learn from their mistakes.
Long-shot bias is the overvaluing of events in the 0-5% range or so; it plagues even heavily traded markets on Intrade. Ron Paul and Michele Bachmann are 2 cases in point - they are covered by the heavily-traded US Presidential contracts, yet they are priced too high, and this has been noted by many:
In fact, the price differences implied a (small) arbitrage opportunity that persisted for most of summer 2003 and has reappeared in 2004. Similar patterns existed for Tradesports securities on other financial variables like crude oil, gold prices and exchange rates. This finding is consistent with the long-shot bias being more pronounced on smaller-scale exchanges.
This is apparently due in part to the short-term pressure on prediction market traders; Robin Hanson says:
“Intrade and IEM don’t usually pay interest on deposits, so for long term bets you can win the bet and still lose overall. The obvious solution is for them to pay such interest, but then they’d lose a hidden tax many customers don’t notice.”
Another reason to use a free-form site like PB.com - you can make (and I have made) predictions about decades or centuries into the far future without worrying about how to earn returns of thousands of percent.↩
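Hanson's point about interest can be made concrete with a small calculation. This is a rough sketch under loud assumptions: the 5% market price, the 1% "true" probability, the 2-year horizon, and the 3% interest rate are all hypothetical, and `annualized_return` is an illustrative helper that annualizes the expected payoff rather than modeling the full distribution of outcomes.

```python
def annualized_return(cost: float, expected_payoff: float, years: float) -> float:
    """Annualized growth rate turning `cost` into `expected_payoff` over `years`."""
    return (expected_payoff / cost) ** (1 / years) - 1

# Hypothetical long-shot contract paying $1 if the event happens.
market_price = 0.05   # assumed: market prices the long shot at 5%
true_prob = 0.01      # assumed: the trader's own estimate is 1%
years = 2.0           # assumed: time until the contract settles
risk_free = 0.03      # assumed: yearly interest available elsewhere

# Betting against the long shot locks up 95 cents to collect $1 with 99% probability.
cost = 1 - market_price
expected_payoff = 1 - true_prob

bet = annualized_return(cost, expected_payoff, years)
worth_it = bet > risk_free
print(f"expected annualized return: {bet:.2%}, beats interest: {worth_it}")
```

With these numbers the correction trade earns roughly 2% a year in expectation, less than the assumed interest rate, so nobody has an incentive to push the overpriced long shot back down.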
Going through Intrade to copy over predictions to PB.com, I was struck by how non-liquid markets could be left at hilarious prices, prices that make no rational sense since they can’t even represent someone hedging against that outcome because so few shares have been sold; example contracts include:↩
To see what is in front of one’s nose needs a constant struggle. One thing that helps toward it is to keep a diary, or, at any rate, to keep some kind of record of one’s opinions about important events. Otherwise, when some particularly absurd belief is exploded by events, one may simply forget that one ever held it. Political predictions are usually wrong. But even when one makes a correct one, to discover why one was right can be very illuminating. In general, one is only right when either wish or fear coincides with reality. If one recognizes this, one cannot, of course, get rid of one’s subjective feelings, but one can to some extent insulate them from one’s thinking and make predictions cold-bloodedly, by the book of arithmetic. In private life most people are fairly realistic. When one is making out one’s weekly budget, two and two invariably make four. Politics, on the other hand, is a sort of sub-atomic or non-Euclidean world where it is quite easy for the part to be greater than the whole or for two objects to be in the same place simultaneously. Hence the contradictions and absurdities I have chronicled above, all finally traceable to a secret belief that one’s political opinions, unlike the weekly budget, will not have to be tested against solid reality.
…while I suppose it is barely possible that perfectly good people exist even though I have never met one, it is nonetheless improbable that someone would be beaten for fifteen minutes and then stand up and feel a great surge of kindly forgiveness for his attackers. On the other hand it is less improbable that a young child would imagine this as the role to play in order to convince his teacher and classmates that he is not the next Dark Lord.
The import of an act lies not in what that act resembles on the surface, Mr. Potter, but in the states of mind which make that act more or less probable.
Can people consistently attempt to falsify, that is, search for refuting evidence, when testing the truth of hypotheses? Experimental evidence indicates that people tend to search for confirming evidence. We report two novel experiments that show that people can consistently falsify when it is the only helpful strategy. Experiment 1 showed that participants readily falsified somebody else’s hypothesis. Their task was to test a hypothesis belonging to an ‘imaginary participant’ and they knew it was a low quality hypothesis. Experiment 2 showed that participants were able to falsify a low quality hypothesis belonging to an imaginary participant more readily than their own low quality hypothesis. The results have important implications for theories of hypothesis testing and human rationality.
One line of thought in evolutionary psychology is that our minds are not evolved for truth-seeking per se, but rather are split between heuristics and effective procedures like that, and argumentation to try to deceive & persuade others; eg. “Why do humans reason? Arguments for an argumentative theory” (Mercier & Sperber 2011). This ties in well with why we are better at falsifying the theories of others - you don’t convince anyone by falsifying your own theories, but you do by falsifying others’ theories.↩
Half of participants were assigned randomly to a “self-prediction” intervention, asking them to predict their future acceptance of HBV vaccination. The main outcome measure was subsequent vaccination behavior. Other measures included perceived barriers to HBV vaccination, measured prior to the intervention. Results: There was a [statistically-]significant interaction between the intervention and vaccination barriers, indicating the effect of the intervention differed depending on perceived vaccination barriers. Among high-barriers patients, the intervention [statistically-]significantly increased vaccination acceptance. Among low-barriers patients, the intervention did not influence vaccination acceptance. Conclusions: The self-prediction intervention [statistically-]significantly increased vaccination acceptance among “high-barriers” patients, who typically have very low vaccination rates.
Rowe & Wright 2001:
In phrasing questions, use clear and succinct definitions and avoid emotive terms.
How a question is worded can lead to [substantial] response biases. By changing words or emphasis, one can induce respondents to give dramatically different answers to a question. For example, Hauser (1975) describes a 1940 survey in which 96% of people answered yes to the question “do you believe in freedom of speech?” and yet only 22% answered yes to the question “do you believe in freedom of speech to the extent of allowing radicals to hold meetings and express their views to the community?” The second question is consistent with the first; it simply entails a fuller definition of the concept of freedom of speech. One might therefore ask which of these answers more clearly reflects the views of the sample. Arguably, the more apt representation comes from the question that includes a clearer definition of the concept of interest, because this should ensure that the respondents are all answering the same question. Researchers on Delphi per se have shown little empirical interest in question wording. Salancik, Wenger and Helfer (1971) provide the only example of which we are aware; they studied the effect of question length on initial panelist consensus and found that one could apparently obtain greater consensus by using questions that were neither “too short” nor “too long.” This is a generally accepted principle for wording items on surveys: they should be long enough to define the question adequately so that respondents do not interpret it differently, yet they should not be so long and complicated that they result in information overload, or so precisely define a problem that they demand a particular answer. Also, questions should not contain emotive words or phrases: the use of the term “radicals” in the second version of the freedom-of-speech question, with its potentially negative connotations, might lead to emotional rather than reasoned responses.
Frame questions in a balanced manner.
Tversky and Kahneman (1974, 1981) provide a second example of the way in which question framing may bias responses. They posed a hypothetical situation to subjects in which human lives would be lost: if subjects were to choose one option, a certain number of people would definitely die, but if they chose a second option, then there was a probability that more would die, but also a chance that less would die. Tversky and Kahneman found that the proportion of subjects choosing each of the two options changed when they phrased the options in terms of people surviving instead of in terms of dying (i.e., subjects responded differently to an option worded “60% will survive” than to one worded “40% will die,” even though these are logically identical statements). The best way to phrase such questions might be to clearly state both death and survival rates (balanced), rather than leave half of the consequences implicit. Phrasing a question in terms of a single perspective, or numerical figure, may provide an anchor point as the focus of attention, so biasing responses.
For example, the famous and replicated failure of doctors to correctly apply Bayes’ theorem to cancer rates is reduced when the percentages are translated into frequencies. Rowe & Wright 2001 give this advice:
- When possible, give estimates of uncertainty as frequencies rather than probabilities or odds.
Many applications of Delphi require panelists to make either numerical estimates of the probability of an event happening in a specified time period, or to assess their confidence in the accuracy of their predictions. Researchers on behavioral decision making have examined the adequacy of such numerical judgments. Results from these findings, summarized by Goodwin and Wright (1998), show that sometimes judgments from direct assessments (what is the probability that…?) are inconsistent with those from indirect methods. In one example of an indirect method, subjects might be asked to imagine an urn filled with 1,000 colored balls (say, 400 red and 600 blue). They would then be asked to choose between betting on the event in question happening, or betting on a red ball being drawn from the urn (both bets offering the same reward). The ratio of red to blue balls would then be varied until a subject was indifferent between the two bets, at which point the required probability could be inferred. Indirect methods of eliciting subjective probabilities have the advantage that subjects do not have to verbalize numerical probabilities. Direct estimates of odds (such as 25 to 1, or 1,000 to 1), perhaps because they have no upper or lower limit, tend to be more extreme than direct estimates of probabilities (which must lie between zero and one). If probability estimates derived by different methods for the same event are inconsistent, which method should one take as the true index of degree of belief? One way to answer this question is to use a single method of assessment that provides the most consistent results in repeated trials. In other words, the subjective probabilities provided at different times by a single assessor for the same event should show a high degree of agreement, given that the assessor’s knowledge of the event is unchanged. Unfortunately, little research has been done on this important problem. 
Beach and Phillips (1967) evaluated the results of several studies using direct estimation methods. Test-retest correlations were all above 0.88, except for one study using students assessing odds, where the reliability was 0.66.
Gigerenzer (1994) provided empirical evidence that the untrained mind is not equipped to reason about uncertainty using subjective probabilities but is able to reason successfully about uncertainty using frequencies. Consider a gambler betting on the spin of a roulette wheel. If the wheel has stopped on red for the last 10 spins, the gambler may feel subjectively that it has a greater probability of stopping on black on the next spin than on red. However, ask the same gambler the relative frequency of red to black on spins of the wheel and he or she may well answer 50-50. Since the roulette ball has no memory, it follows that for each spin of the wheel, the gambler should use the latter, relative frequency assessment (50-50) in betting. Kahneman and Lovallo (1993) have argued that forecasters tend to see forecasting problems as unique when they should think of them as instances of a broader class of events. They claim that people’s natural tendency in thinking about a particular issue, such as the likely success of a new business venture, is to take an “inside” rather than an “outside” view. Forecasters tend to pay particular attention to the distinguishing features of the particular event to be forecast (e.g., the personal characteristics of the entrepreneur) and reject analogies to other instances of the same general type as superficial. Kahneman and Lovallo cite a study by Cooper, Woo, and Dunkelberger (1988), which showed that 80% of entrepreneurs who were interviewed about their chances of business success described this as 70% or better, while the overall survival rate for new business is as low as 33 percent. Gigerenzer’s advice, in this context, would be to ask the individual entrepreneurs to estimate the proportion of new businesses that survive (as they might make accurate estimates of this relative frequency) and use this as an estimate of their own businesses surviving. 
Research has shown that such interventions to change the required response mode from subjective probability to relative frequency improve the predictive accuracy of elicited judgments. For example, Sniezek and Buckley (1991) gave students a series of general knowledge questions with two alternative answers for each, one of which was correct. They asked students to select the answer they thought was correct and then estimate the probability that it was correct. Their results showed the same general overconfidence that Arkes (2001) discusses. However, when Sniezek and Buckley asked respondents to state how many of the questions they had answered correctly of the total number of questions, their frequency estimates were accurate. This was despite the fact that the same individuals were generally overconfident in their subjective probability assessments for individual questions. Goodwin and Wright (1998) discuss the usefulness of distinguishing between single-event probabilities and frequencies. If a reference class of historic frequencies is not obvious, perhaps because the event to be forecast is truly unique, then the only way to assess the likelihood of the event is to use a subjective probability produced by judgmental heuristics. Such heuristics can lead to judgmental overconfidence, as Arkes (2001) documents.
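The difference between the probability and frequency formats can be seen on the diagnostic-test problem mentioned earlier. The sketch below is illustrative only: the 1% prevalence, 80% sensitivity, and 9.6% false-positive rate are commonly used textbook figures assumed here, not numbers from Rowe & Wright, and both function names are invented.

```python
def posterior_from_probabilities(prior, sensitivity, false_positive_rate):
    """Bayes' theorem in probability format: P(disease | positive test)."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

def posterior_from_frequencies(population, prior, sensitivity, false_positive_rate):
    """The same question posed as whole-number counts of people."""
    sick = round(population * prior)
    healthy = population - sick
    sick_positive = round(sick * sensitivity)
    healthy_positive = round(healthy * false_positive_rate)
    share = sick_positive / (sick_positive + healthy_positive)
    return sick_positive, healthy_positive, share

# Assumed textbook numbers (hypothetical, for illustration only).
p = posterior_from_probabilities(0.01, 0.80, 0.096)
sick_pos, healthy_pos, f = posterior_from_frequencies(1000, 0.01, 0.80, 0.096)

print(f"probability format: {p:.1%}")
print(f"frequency format: {sick_pos} of {sick_pos + healthy_pos} positives are sick ({f:.1%})")
```

Both routes give about the same answer, but the frequency version reads as "8 of the 103 people who test positive are actually sick", which is much harder to misjudge than three abstract percentages.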
Osberg and Shrauger (1986) determined prediction accuracy by scoring an item as a hit if the respondents predicted the event definitely or probably would occur and it did, or if the respondent predicted that the event definitely or probably would not occur and it did not. Respondents who were instructed to focus on their own personal dispositions predicted [statistically-]significantly more of the 55 items correctly (74%) than did respondents in the control condition who did not receive instructions (69%). Respondents whose instructions were to focus on personal base rates had higher accuracy (72%) and respondents whose instructions were to focus on population base rates had lower accuracy (66%) than control respondents, although these differences were not statistically-significant.
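The hit-scoring rule described above is easy to state in code. This is a minimal sketch: the four response labels and the sample items are assumptions made for illustration, not the study's actual questionnaire.

```python
# Assumed response labels; the study's exact wording may differ.
WILL = {"definitely will", "probably will"}
WILL_NOT = {"definitely will not", "probably will not"}

def is_hit(prediction: str, occurred: bool) -> bool:
    """A hit is a 'will' prediction that occurred or a 'will not' that did not."""
    if prediction in WILL:
        return occurred
    if prediction in WILL_NOT:
        return not occurred
    raise ValueError(f"unknown response: {prediction!r}")

def hit_rate(items) -> float:
    """Fraction of (prediction, occurred) pairs scored as hits."""
    return sum(is_hit(p, o) for p, o in items) / len(items)

# Hypothetical respondent: four self-predictions and what actually happened.
items = [
    ("definitely will", True),       # hit
    ("probably will", False),        # miss
    ("probably will not", False),    # hit
    ("definitely will not", True),   # miss
]
rate = hit_rate(items)
print(f"accuracy: {rate:.0%}")
```

Note that this rule ignores the difference between "definitely" and "probably", so it measures only direction, not calibration.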
Ideally, intelligence analysts should be able to recognize what relevant evidence is lacking and factor this into their calculations. They should also be able to estimate the potential impact of the missing data and to adjust confidence in their judgment accordingly. Unfortunately, this ideal does not appear to be the norm. Experiments suggest that “out of sight, out of mind” is a better description of the impact of gaps in the evidence.
This problem has been demonstrated using fault trees, which are schematic drawings showing all the things that might go wrong with any endeavor. Fault trees are often used to study the fallibility of complex systems such as a nuclear reactor or space capsule.
A fault tree showing all the reasons why a car might not start was shown to several groups of experienced mechanics.96 The tree had seven major branches–insufficient battery charge, defective starting system, defective ignition system, defective fuel system, other engine problems, mischievous acts or vandalism, and all other problems–and a number of subcategories under each branch. One group was shown the full tree and asked to imagine 100 cases in which a car won’t start. Members of this group were then asked to estimate how many of the 100 cases were attributable to each of the seven major branches of the tree. A second group of mechanics was shown only an incomplete version of the tree: three major branches were omitted in order to test how sensitive the test subjects were to what was left out.
If the mechanics’ judgment had been fully sensitive to the missing information, then the number of cases of failure that would normally be attributed to the omitted branches should have been added to the “Other Problems” category. In practice, however, the “Other Problems” category was increased only half as much as it should have been. This indicated that the mechanics shown the incomplete tree were unable to fully recognize and incorporate into their judgments the fact that some of the causes for a car not starting were missing. When the same experiment was run with non-mechanics, the effect of the missing branches was much greater.
As compared with most questions of intelligence analysis, the “car won’t start” experiment involved rather simple analytical judgments based on information that was presented in a well-organized manner. That the presentation of relevant variables in the abbreviated fault tree was incomplete could and should have been recognized by the experienced mechanics selected as test subjects. Intelligence analysts often have similar problems. Missing data is normal in intelligence problems, but it is probably more difficult to recognize that important information is absent and to incorporate this fact into judgments on intelligence questions than in the more concrete “car won’t start” experiment.
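The arithmetic behind the fault-tree result is easy to lose in prose, so here is a small Python sketch of it. The case counts, the choice of which three branches are pruned, and the function names are all assumptions made up for illustration; the excerpt does not report the actual figures.

```python
OTHER = "all other problems"

def expected_other(full_tree, omitted):
    """Cases 'other' should receive if judges fully accounted for pruned branches."""
    return full_tree[OTHER] + sum(full_tree[branch] for branch in omitted)

def sensitivity(full_tree, omitted, observed_other):
    """Fraction of the warranted increase in 'other' that judges actually made."""
    baseline = full_tree[OTHER]
    warranted_increase = expected_other(full_tree, omitted) - baseline
    return (observed_other - baseline) / warranted_increase

# Hypothetical estimates (out of 100 no-start cases) from the full-tree group.
full_tree = {
    "insufficient battery charge": 25,
    "defective starting system": 20,
    "defective ignition system": 15,
    "defective fuel system": 15,
    "other engine problems": 10,
    "mischievous acts or vandalism": 5,
    OTHER: 10,
}

# Assume these three branches were hidden from the second group.
omitted = [
    "defective starting system",
    "defective ignition system",
    "mischievous acts or vandalism",
]

should_be = expected_other(full_tree, omitted)   # 10 + 40 hidden cases
observed = 30                                    # hypothetical second-group answer
ratio = sensitivity(full_tree, omitted, observed)
```

With these made-up numbers, "other" should have risen from 10 to 50 but rose only to 30, so the judges captured half of the missing probability mass, matching the "only half as much as it should have been" description.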
Rowe & Wright 2001:
- Use coherence checks when eliciting estimates of probabilities.
Assessed probabilities are sometimes incoherent. One useful coherence check is to elicit from the forecaster not only the probability (or confidence) that an event will occur, but also the probability that it will not occur. The two probabilities should sum to one. A variant of this technique is to decompose the probability of the event not occurring into the occurrence of other possible events. If the events are mutually exclusive and exhaustive, then the addition rule can be applied, since the sum of the assessed probabilities should be one. Wright and Whalley (1983) found that most untrained probability assessors followed the additivity axiom in simple two-outcome assessments involving the probabilities of an event happening and not happening. However, as the number of mutually exclusive and exhaustive events in a set increased, more forecasters became supra-additive, and to a greater extent, in that their assessed probabilities added up to more than one. Other coherence checks can be used when events are interdependent (Goodwin and Wright 1998; Wright, et al. 1994).
There is a debate in the literature as to whether decomposing analytically complex assessments into analytically more simple marginal and conditional assessments of probability is worthwhile as a means of simplifying the assessment task. This debate is currently unresolved (Wright, Saunders and Ayton 1988; Wright et al. 1994). Our view is that the best solution to problems of inconsistency and incoherence in probability assessment is for the pollster to show forecasters the results of such checks and then allow interactive resolution between them of departures from consistency and coherence. MacGregor (2001) concludes his review of decomposition approaches with similar advice.
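The additivity check described above is simple enough to sketch in Python. This is only an illustration under my own assumptions: the function name, the tolerance, and the example probabilities are hypothetical, and the sketch merely reports incoherence so that it can be shown to the forecaster, as the authors recommend, rather than silently renormalizing.

```python
import math

def coherence_report(probabilities, tol=1e-9):
    """Sum the probabilities assessed for mutually exclusive, exhaustive events.

    Returns (total, verdict), where verdict is 'additive', 'supra-additive'
    (total above one) or 'sub-additive' (total below one).
    """
    total = math.fsum(probabilities.values())
    if math.isclose(total, 1.0, rel_tol=0.0, abs_tol=tol):
        verdict = "additive"
    elif total > 1.0:
        verdict = "supra-additive"
    else:
        verdict = "sub-additive"
    return total, verdict

# Two-outcome assessment: event happens vs. does not happen (made-up numbers).
two_way = {"occurs": 0.7, "does not occur": 0.3}

# Larger outcome set, where the excerpt says supra-additivity becomes more common.
four_way = {"candidate A": 0.4, "candidate B": 0.35,
            "candidate C": 0.3, "someone else": 0.15}

two_total, two_verdict = coherence_report(two_way)
four_total, four_verdict = coherence_report(four_way)
```

Here the two-outcome assessment passes, while the four-outcome one sums to about 1.2 and would be flagged for the forecaster to revise.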
When assessing probability distributions (e.g., for the forecast range within which an uncertain quantity will lie), individuals tend to be overconfident in that they forecast too narrow a range. Some response modes fail to counteract this tendency. For example, if one asks a forecaster initially for the median value of the distribution (the value the forecaster perceives as having a 50% chance of being exceeded), this can act as an anchor. Tversky and Kahneman (1974) were the first to show that people are unlikely to make sufficient adjustments from this anchor when assessing other values in the distribution. To counter this bias, Goodwin and Wright (1998) describe the “probability method” for eliciting probability distributions, an assessment method that de-emphasizes the use of the median as a response anchor. McClelland and Bolger (1994) discuss overconfidence in the assessment of probability distributions and point probabilities. Wright and Ayton (1994) provide a general overview of psychological research on subjective probability. Arkes (2001) lists a number of principles to help forecasters to counteract overconfidence.
That sleep affects consciousness & memory is an uncontroversial claim; eg in the WSJ:
One little known aspect of insomnia is that the seemingly sleep-deprived often underestimate (or overestimate) how much shut-eye they’re getting, says Matt Bianchi, director of the sleep division at Massachusetts General Hospital in Boston. “They could sleep seven hours in the sleep lab and they would say they didn’t sleep one minute,” says Bianchi, adding that many patients also wake up multiple times without remembering it. That disconnect has sparked numerous apps and gadgets that offer to help people gauge how much sleep they’re getting.
I was monitoring my email and RSS closely the night I sent the email. I had already written and proofread the real version, and written an email explaining I had discovered a mistake; so to change my site was just a matter of issuing a single revision-control command (darcs rollback) & re-syncing my site, and then sending the email. I don’t think the bad version would have been up for more than 10 or 20 minutes past him posting or replying clearly that he would post it.↩
You might get the opposite impression reading articles like this New York Times article, but consider the flip side of large percentage growth in philanthropy - they must be starting off from a small absolute base!↩
Charles Simonyi is actually the first person to come to mind when I think about ‘weird wealthy American technologist interested in old and long-term information who has already demonstrated philanthropy on a large scale’↩
Walker was the second, due to his library. Information on his net wealth isn’t too easy to come by; he had ~$10 billion in 2000 but fell to $300 million, and Business Insider claims “Although he never recovered financially, Walker had enough money - barely - to complete an expensive dream home in Connecticut, which includes a fantastic personal library (featured by Wired in 2008).”↩
I initially couldn’t find anything on charitable giving by Jobs. Eventually I found an interview with Jobs in The Times where the reporter says “Jobs had volunteered himself as an advisor to John Kerry’s unsuccessful campaign for the White House. He and his wife, Lauren, had given hundreds of thousands of dollars to Democratic causes over the last few years.” Ars Technica mentions a few others, but conflates Jobs with Apple. A large $150m donation speculated to be from Jobs has been confirmed to not be from him, although rumors about a $50m donation to a hospital continue to circulate. (In 2004, Fortune estimated Jobs’s fortune at $2.1 billion.) And in general, absence of evidence is evidence of absence. Bono, meanwhile, praises Apple for its charity:
Through the sale of (RED) products, Apple has been (RED)’s largest contributor to the Global Fund to Fight AIDS, Tuberculosis and Malaria - giving tens of millions of dollars that have transformed the lives of more than two million Africans through H.I.V. testing, treatment and counseling. This is serious and significant. And Apple’s involvement has encouraged other companies to step up.
Isaacson’s 2011 Steve Jobs biography (finished before he died and so includes nothing on Jobs’s will) does occasionally discuss Jobs’s few acts of philanthropy and offers a different version of the (RED) contributions:
He was not particularly philanthropic. He briefly set up a foundation, but he discovered that it was annoying to have to deal with the person he had hired to run it, who kept talking about “venture” philanthropy and how to “leverage” giving. Jobs became contemptuous of people who made a display of philanthropy or thinking they could reinvent it. Earlier he had quietly sent in a $5,000 check to help launch Larry Brilliant’s Seva Foundation to fight diseases of poverty, and he even agreed to join the board. But when Brilliant brought some board members, including Wavy Gravy and Jerry Garcia, to Apple right after its IPO to solicit a donation, Jobs was not forthcoming. He instead worked on finding ways that a donated Apple II and a VisiCalc program could make it easier for the foundation to do a survey it was planning on blindness in Nepal….His biggest personal gift was to his parents, Paul and Clara Jobs, to whom he gave about $750,000 worth of stock. They sold some to pay off the mortgage on their Los Altos home, and their son came over for the little celebration. “It was the first time in their lives they didn’t have a mortgage,” Jobs recalled. “They had a handful of their friends over for the party, and it was really nice.” Still, they didn’t consider buying a nicer house. “They weren’t interested in that,” Jobs said. “They had a life they were happy with.” Their only splurge was to take a Princess cruise each year. 
The one through the Panama Canal “was the big one for my dad,” according to Jobs, because it reminded him of when his Coast Guard ship went through on its way to San Francisco to be decommissioned…[Mona Simpson’s novel] depicts Jobs’s quiet generosity to, and purchase of a special car for, a brilliant friend who had degenerative bone disease, and it accurately describes many unflattering aspects of his relationship with Lisa, including his original denial of paternity…Bono got Jobs to do another deal with him in 2006, this one for his Product Red campaign that raised money and awareness to fight AIDS in Africa. Jobs was never much interested in philanthropy, but he agreed to do a special red iPod as part of Bono’s campaign. It was not a wholehearted commitment. He balked, for example, at using the campaign’s signature treatment of putting the name of the company in parentheses with the word “red” in superscript after it, as in (APPLE)RED. “I don’t want Apple in parentheses,” Jobs insisted. Bono replied, “But Steve, that’s how we show unity for our cause.” The conversation got heated - to the F-you stage - before they agreed to sleep on it. Finally Jobs compromised, sort of. Bono could do what he wanted in his ads, but Jobs would never put Apple in parentheses on any of his products or in any of his stores. The iPod was labeled (PRODUCT)RED, not (APPLE)RED.
Inside Apple: How America’s Most Admired - and Secretive - Company Really Works (Lashinsky 2012) reportedly includes the following choice anecdote:
A highlight of the Top 100 for attendees was an extended Q&A between Jobs and his executives. One asked why Jobs himself wasn’t more philanthropic. He responded that he thought giving away money was a waste of time.
Cook’s second decision was to start a charity program, matching donations of up to $10,000, dollar for dollar annually. This too was widely embraced: The lack of an Apple corporate-matching program had long been a sore point for many employees. Jobs had considered matching programs particularly ineffective because the contributions would never amount to enough to make a difference. Some of his friends believed that Jobs would have taken up some causes once he had more time, but Jobs used to say that he was contributing to society more meaningfully by building a good company and creating jobs. Cook believed firmly in charity. “My objective - one day - is to totally help others,” he said. “To me, that’s real success, when you can say, ‘I don’t need it anymore. I’m going to do something else.’”
Of course, if we really want to rescue Jobs’s reputation, we can still do so. It could be the case that Jobs was very charitable but gave completely anonymously, or perhaps that he preferred to reinvest his wealth in gaining more wealth and to donate only after his death; a Buffett-like strategy that - ex post - would seem to be a very wise one given the stock performance of AAPL. Jobs’s death in October 2011 means that this theory is falsifiable sooner than I had expected while writing this essay. Based on Jobs’s previous charitable giving, and the general impression I have from the hagiographic press coverage that Apple itself is Jobs’s charitable gift to the world (which I can’t help but suspect either influenced or was influenced by the man himself), my own general expectation is that he will definitely not donate ~99% of his wealth to charity like Buffett or Gates (80%), probably not >50% (70%), and more likely somewhere in the 0-10% range (60%). As of 1 January 2013, my predictions have been borne out. If any philanthropy comes of Jobs’s Pixar billions, I expect it to be at the behest of his widow, Laurene Powell Jobs, who has long been involved in non-profits; to quote Isaacson again:
Jobs’s relationship with his wife was sometimes complicated but always loyal. Savvy and compassionate, Laurene Powell was a stabilizing influence and an example of his ability to compensate for some of his selfish impulses by surrounding himself with strong-willed and sensible people. She weighed in quietly on business issues, firmly on family concerns, and fiercely on medical matters. Early in their marriage, she cofounded and launched College Track, a national after-school program that helps disadvantaged kids graduate from high school and get into college. Since then she had become a leading force in the education reform movement. Jobs professed an admiration for his wife’s work: “What she’s done with College Track really impresses me.” But he tended to be generally dismissive of philanthropic endeavors and never visited her after-school centers.
If you total up in your mind all of the philanthropic investments that Laurene has made that the public knows about, that is probably a fraction of 1% of what she actually does, and that’s the most I can say.
Last year the founder of the Stanford Social Innovation Review called Apple one of “America’s Least Philanthropic Companies.” Jobs had terminated all of Apple’s long-standing corporate philanthropy programs within weeks after returning to Apple in 1997, citing the need to cut costs until profitability rebounded. But the programs have never been restored.
Unlike Bill Gates - the tech world’s other towering figure - Jobs has not shown much inclination to hand over the reins of his company to create a different kind of personal legacy. While his wife is deeply involved in an array of charitable projects, Jobs’ only serious foray into personal philanthropy was short-lived. In January 1987, after launching Next, he also, without fanfare or public notice, incorporated the Steven P. Jobs Foundation. “He was very interested in food and health issues and vegetarianism,” recalls Mark Vermilion, the community affairs executive Jobs hired to run it. Vermilion persuaded Jobs to focus on “social entrepreneurship” instead. But the Jobs foundation never did much of anything, besides hiring famed graphic designer Paul Rand to design its logo. (Explains Vermilion: “He wanted a logo worthy of his expectations.”) Jobs shut down the foundation after less than 15 months.